July 25, 2014

Testing MySQL column stores

Recently I had the opportunity to do some testing on a large data set against two MySQL column-store storage engines. I’d like to note that this effort was sponsored by Infobright, but this analysis reflects my independent testing from an objective viewpoint.

I performed two different types of testing. The first part focused on core functionality and compatibility of ICE (Infobright Community Edition) compared with MyISAM on a small data set. The second part compared the performance and accuracy of ICE with InfiniDB Community Edition on a 950GB data set.

The first part of my analysis focused on testing specific MySQL functionality with Infobright's storage engine. A lot of my tests involved corner or edge cases, since I have experience developing such cases for MySQL column-based storage engines in the past. I reported any bugs that I found and contributed my test cases to ICE. In fact, some of the issues have already been addressed in the most recent software release. An example of such a problem would be "select avg(char_column) from table" where the column contains a mix of ASCII data, such as names, and numeric data. This is an example of implicit casting that probably wouldn't happen in a real application.
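To make that concrete, here is a minimal sketch of the kind of implicit-cast corner case I mean (the table, column, and data below are invented for illustration and are not from my actual test suite):

    -- A CHAR column holding a mix of names and numbers
    CREATE TABLE people (name CHAR(20));
    INSERT INTO people VALUES ('Alice'), ('Bob'), ('42');

    -- MySQL implicitly casts each string to a number; non-numeric
    -- strings cast to 0, so this returns (0 + 0 + 42) / 3 = 14
    SELECT AVG(name) FROM people;

A column-store engine has to match MySQL's casting behavior exactly in order to return the same answer, which is why these edge cases are worth testing even though few real applications depend on them.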

Importantly, I didn’t find significant defects in Infobright that would be “show stoppers” for typical OLAP analysis queries. These tests were intended to approximate what would happen if you ported a MyISAM OLAP application to ICE. My results suggest that, with some basic testing, an application could be ported to ICE with a good chance of success. Of course, a good test environment is something that I think every operations team should insist on.

The second part of my analysis focused on testing a total of 29 queries on the large data set. I compared a number of different factors between the two databases, including:

  1. Ease of installation
  2. Loading capability and speed of loading
  3. Accuracy of results of queries over the large data set
  4. Speed of results of queries over the large data set
  5. Basic security aspects

The report that I produced may be found here.

It should be noted that the second set of tests included 29 different queries, some of which were provided by Infobright and others that I contributed. InfiniDB does not support as many data types and aggregate functions as ICE, and therefore it could not run some of the queries. ICE supports almost all of the MySQL aggregate functions; notably, GROUP_CONCAT is not supported, which is something I hope they rectify in a future version.

In addition, I was not able to get accurate results for all of the queries on InfiniDB. In particular, the query "select count(*) from carsales.sales_fact" returned a higher count on InfiniDB than it should have, and several GROUP BY queries returned unexpected results as well. I did not change any "out of the box" settings for Infobright; for InfiniDB, even after I modified the configuration settings on a 16GB box, one query did not have enough memory to complete.
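For reference, the row-count check is as simple as it sounds, and both engines loaded exactly the same input files, so the numbers should agree (carsales.sales_fact is the fact table from the test data set; the GROUP BY below only sketches the shape of the queries that misbehaved, using illustrative column names):

    -- Sanity check: both engines should report the same row count
    SELECT COUNT(*) FROM carsales.sales_fact;

    -- Shape of the GROUP BY queries that returned unexpected results
    -- (dimension_col and measure_col are illustrative names)
    SELECT dimension_col, COUNT(*), SUM(measure_col)
      FROM carsales.sales_fact
     GROUP BY dimension_col;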

Overall, I would say that ICE is more "ready for prime time" than InfiniDB, given the inconsistencies that I encountered. I will try to distill the problems I found on InfiniDB into easily reproducible test cases, which I may contribute to them, but this is difficult given the size of the data set involved. ICE was able to execute the queries quickly, and with accurate results.

Comments

  1. Allan WInd says:

    Do you, by any chance, have query times against mysql on similar hardware?

  2. Michael Peters says:

    The big question (at least for me) is how do these storage engines compare to MyISAM and InnoDB/XtraDB, performance- and size-wise?

  3. Michael,

    MyISAM and InnoDB are best used for OLTP-type apps, while Infobright is specifically designed for analytic queries. See a performance comparison for analytics that another company did at http://www.slideshare.net/truenorthgb/using-infobright-community-edition-for-analytics

    As far as size goes, Infobright scales up to tens of terabytes of data and has a typical compression ratio of 10:1 to 40:1, sometimes higher.

  4. Justin Swanhart says:

    Hi Michael,

    The Infobright storage engine (ICE) is a column store. MyISAM and InnoDB are row stores. Because ICE stores data differently, it can apply RLE (run-length encoding) compression to the database, something that MyISAM and InnoDB can’t do. This means that data stored in ICE will probably be significantly smaller than that stored in other storage engines, even those that support compression.

    Keep in mind that Infobright is designed for analytical type queries. Like any column store, it works best when a small number of columns needs to be accessed from a table that has many columns.

    As for a MyISAM/InnoDB comparison, it does not make sense to run the same queries on MyISAM/InnoDB, since most of them would take significantly longer. Even just loading the data would take much longer, because MyISAM/InnoDB would need indexes to run the queries. Neither ICE nor InfiniDB needs to create indexes; in fact, neither database even supports them.
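    To make the porting point concrete, here is a rough sketch (the table and column names are invented for illustration; if I recall correctly, ICE registers its engine as BRIGHTHOUSE):

        -- MyISAM version: indexes are required to make analytic
        -- queries on a large fact table usable
        CREATE TABLE fact_table (
          sale_date   DATE,
          customer_id INT,
          amount      DECIMAL(10,2),
          KEY (sale_date),
          KEY (customer_id)
        ) ENGINE=MyISAM;

        -- ICE version: the same table with the index definitions
        -- dropped, since ICE neither needs nor supports them
        CREATE TABLE fact_table (
          sale_date   DATE,
          customer_id INT,
          amount      DECIMAL(10,2)
        ) ENGINE=BRIGHTHOUSE;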

  5. Nice write-up Justin, and props to Percona and Infobright for allowing publication. I am surprised that InfiniDB had *so many* errors.

  6. Justin Swanhart says:

    The biggest reason there are so many problems is that aggregate functions outside of SUM/MIN/MAX/AVG are not supported. Many of the tests use VAR_SAMP, VAR_POP, STDDEV, etc., and those queries can't run on InfiniDB yet. Hopefully InfiniDB adds support for the rest of the functions in a future release.
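    For example, a query of this shape runs on ICE but can't run on InfiniDB today (the table and column names here are just illustrative, not taken from the actual test queries):

        -- Statistical aggregates like these appear in many of the
        -- test queries; InfiniDB currently rejects them
        SELECT customer_id,
               STDDEV(amount),
               VAR_POP(amount),
               VAR_SAMP(amount)
          FROM fact_table
         GROUP BY customer_id;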

  7. erin says:

    Cool write-up. I'm using Infobright in production (and have been for over a year now). Our schema is all INTs & DECIMALs, so our compression rate is very high. The thing I've discovered is that some functions are fully optimized for ICE (COUNT, AVG, SUM, etc.) while others are only "sort of optimized". The "sort of optimized" functions seem to slow down as you add more rows. I'm working on having the engineers pull that logic out of the query (SELECT CASE blah WHEN ### THEN field = # END field2, field3 FROM mytable GROUP BY field6, field2, field3). The CASE stuff is starting to slow down, as are the queries that return 50% of the columns.
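    In runnable form, the slow ones look roughly like this (table and column names changed, obviously):

        -- The CASE expression is evaluated per row before the GROUP BY,
        -- and this is the part that seems only 'sort of optimized'
        SELECT CASE status_code WHEN 1 THEN 'new' ELSE 'other' END AS field2,
               field3
          FROM mytable
         GROUP BY field6, field2, field3;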

    ICE sure is sweet, though it does have its problem areas, which one just has to code around.

  8. Hi Justin,

    Interesting document, thanks for sharing useful info with the community!

    One concern I have is that you used both a data set and queries provided by one of the test participants, which makes it possibly a bit biased.

    Is the test data available somewhere? 900GB is a lot, but hey, it's the 21st century :) We would be curious to test it on our Ingres VectorWise product. Perhaps you would be interested in working with us on it? I know in the past you did play with non-MySQL products.

    Regards,
    Marcin

  9. Justin Swanhart says:

    Hi Marcin,

    Some of the queries were provided by Infobright, and I contributed some as well. I have plans for follow-on testing which will include a data generator, so that the results can be compared with other third-party engines more easily. The testing between ICE and InfiniDB was done using 'mysqltest', which made it easy to run, time, and compare the results between the engines. A different testing system will be needed for third-party databases.
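    A mysqltest case is just a file of SQL (plus a recorded result file), so comparing engines mostly means pointing the same .test file at each server. A trimmed-down sketch of what one looks like (the query body here is illustrative, not one of the actual 29 queries):

        # q01.test -- run with the mysqltest client against each engine;
        # the output is diffed against the recorded .result file
        --sorted_result
        SELECT dimension_col, COUNT(*), SUM(measure_col)
          FROM carsales.sales_fact
         GROUP BY dimension_col;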

    I’d definitely like to look more closely at Vectorwise.

  10. erin says:

    Justin — I use mysqlslap on ICE …. I haven’t looked at mysqltest … but now I will!

  11. Hi Justin,

    Thanks for the clarification.

    Good to hear you plan more tests, we’ll keep watching your blog. It’s always good to have more test cases. If you need any help with getting started with VectorWise, feel free to contact me :)

    Best,
    Marcin

  12. Bob Zurek says:

    Full disclosure: I'm a tech evangelist at Infobright. Always good to see the feedback and comments on this and other great topics on this site. This blog is always great for lively discussions, especially when users point out their successes in real-world deployments (thanks, Erin).
    Database benchmarks have traditionally been used to test and compare the performance attributes of a system, and they can be quite valuable. The bad news is that if a benchmark doesn't test key functional capabilities, then no matter how "fast" your solution is, it doesn't matter: if it can't handle the functional part of the requirement, "fast" is irrelevant. The real power comes when the system is not only fast (in proven real-world solutions) but also functionally rich. Infobright has made tremendous progress over the past year in expanding not only the functionality of the engine but also its performance. I encourage you to take a look at our YouTube channel at http://www.youtube.com/infobrightdb to learn more about the database.
    Customer scenarios change. We have a great fit in web, mobile, IT, and financial analytics, which is where 85% of the market is right now with their data (gigabytes to tens of TB). There are always going to be extreme edge cases (think smaller, and think big, big like petabytes), and that is NOT where we focus. For those big petabyte-size systems, you have some good choices.
    I appreciate the opportunity to participate in this discussion, and we thank the team at Percona for working with us on this project. Percona's expertise and objectivity were stellar.

  13. Gor says:

    It is a very interesting report. However, I am a bit concerned that "this effort was sponsored by Infobright" and that the "tests included 29 different queries, some of which were provided by Infobright". Is there any chance you can let us know how many queries, and specifically which ones, were contributed by Infobright, just so we can see how objective this report is? Many thanks.

  14. Daniel,Wu says:

    That's interesting, but in the following link most queries run faster on InfiniDB:
    http://www.percona.com/docs/wiki/benchmark:ssb:start

    Why is the performance different in your test cases, and which test is correct?

  15. Serge says:

    Does anyone have any experience or benchmarks using Infobright or InfiniDB for high-frequency row selections with minimal aggregate functions? Specifically, this type of query, run thousands of times per second: select float, datetime, datetime where int=A and datetime<B and datetime<C order by datetime, datetime limit 0,1. Do column designs have an advantage over MyISAM for this type of query?
