This is next post in series
Analyzing air traffic performance with InfoBright and MonetDB
Air traffic queries in LucidDB
Air traffic queries in InfiniDB: early alpha

Let me explain the reason of choosing these engines. After initial three posts I am often asked “What is baseline ? Can we compare results with standard MySQL engines ?”. So there come MyISAM to consider it as base point to see how column-oriented-analytic engines are better here.

However, take into account, that for MyISAM we need to choose proper indexes to execute queries effectively, and there is pain coming with indexes: – load of data is getting slower; – to design proper indexes is additional research, especially when MySQL optimizer is not smart in picking best one.

The really nice thing about MonetDB, InfoBright, InfiniDB is that they do not need indexes, so you may not worry about maintaining them and picking best one. I am not sure about LucidDB, I was told indexes are needed, but creating new index was really fast even on full database, so I guess, it’s not B-Tree indexes. So this my reflexion on indexes turned me onto TokuDB direction.

What is so special about TokuDB ? There two things: indexes have special structure and are “cheap”, by “cheap” I mean the maintenance cost is constant and independent on datasize. With regular B-Tree indexes cost grows exponentially on datasize (Bradley Kuszmaul from Tokutek will correct me if I am wrong in this statement). Another point with TokuDB, it uses compression, so I expect less size of loaded data and less IO operations during query execution.

So what indexes we need for queries. To recall you details, the schema is available in this post
https://www.percona.com/blog/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/, and
queries I posted on sheet “Queries” in my summary Spreadsheet.

With Bradley’s help we chose next indexes:

And I measured load time for both MyISAM and TokuDB in empty table with created indexes.

Load time for MyISAM: 16608 sec
For TokuDB: 19131 sec

Datasize (including indexes)

MyISAM: 36.7GB
TokuDB: 6.7GB

I am a bit surprised that TokuDB is slower loading data, but my guess it is related to compression, and I expect with bigger amount of data TokuDB will be faster MyISAM.

Now to queries. Bradley pointed me that query Q5

can be rewritten as

( I name it as Query Q5i)

The summary table with queries execution time (in sec, less is better):

QueryMyISAMTokuDB
Q072.8450.25
Q161.0355.01
Q298.1258.36
Q3123.0466.87
Q46.926.91
Q513.6111.86
Q5i7.686.96
Q6123.8469.03
Q7187.22159.62
Q8 (1y)8.757.59
Q8 (2y)102.1764.95
Q8 (3y)104.769.76
Q8 (4y)107.0570.46
Q8 (10y)119.5484.64
Q969.0547.67

For reference I used 5.1.36-Tokutek-2.1.0 for both MyISAM and TokuDB tests.

And if you are interested to compare MyISAM with previous engines:

QueryMyISAMMonetDBInfoBrightLucidDBInfiniDB
Q072.8429.94.19103.21NA
Q161.037.912.1349.176.79
Q298.120.96.7327.134.59
Q3123.041.77.2927.664.96
Q46.920.270.992.340.75
Q513.610.52.927.35NA
Q6123.8412.521.8378.42NA
Q7187.2227.98.59106.37NA
Q8 (1y)8.750.551.746.768.13
Q8 (2y)102.171.13.6828.8216.54
Q8 (3y)104.71.695.4435.3724.46
Q8 (4y)107.052.127.2241.6632.49
Q8 (10y)119.5429.1417.4272.6770.35
Q969.056.30.3176.129.54

The all results are available in summary Spreadsheet

I especially do not put TokuDB in the same table with analytic oriented databases, to highlight TokuDB is OLTP engine for general purposes.
As you see it is doing better than MyISAM in all queries.

26 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Pawel

Hi,
what about innodb? Could you please benchmark this storage engine?

Alexander Mikhailian

Er… Benchmarking InnoDB is probably just pointless as it is very much OLTP-oriented, unlike MyISAM. Expect the results to be worse, and the datasize to be bigger.

Igor

Great series! Very useful for many people who are considering different DB alternatives.
So it seems that only Infobright and MonetDB have good speeds – and I suspect with larger RAM (like 16GB or more) MonetDB would do even better on some queries (Q8 – 10yrs almost for sure).
Are you planning to test GreenPlum singleNode – ideally on a real server (8 cores/16GB+ RAM etc) along with Infobright and MonetDB to see how they can use better hardware?

miata

Really enjoyed reading these articles as I am also in the process of evaluating the performance of column based databases on large data tables in the order of billions of rows. What I can find about this on most web sites sound more like marketing materials without real numbers, your experiments give good insight into the scale of performance gains that can be achieved.

Having installed Infobright ICE on a 8 core server, it seems to be that each query can only use one CPU core for processing. I wonder in your testing, can any of these columnar database engines take advantage of multi core CPU for performance gains?

Igor

GreenPlum is very much like PostgreSQL (but tuned for DW queries) – so data should be loaded with COPY command ( see for example
http://wiki.postgresql.org/wiki/How_to_use_PostgreSQL's_COPY_function_effectively ) and I suspect all queries should run without changes.
Installation on Linux from RPMs (http://www.greenplum.com/products/single-node/) seems very straightforward.
As to the tuning – maybe GreenPlum (company) who just released it would be willing to provide correct indexes and params?

Igor

And GreenPlum implemented 2 kinds of compression in latest version – row-based and column-based.
I suspect that fastest compression option (similar to gzip -1) should be used to save CPU time during loading.
Don’t know which one is better (row-based or column-based) but it seems really easy to convert between two.

Steven Roussey

What is a good database for tracking live stats? Not saving for rolling up later type stats, but live stats like page views and unique visitors that are displayed on the same pages that increment these counters.

Steven Roussey

Yes, though I mean a 50% read 50% write scenario. Every read is preceded by a write. I think TokuDB might be useful here, but I’m not sure of the others. I can post an idea for a schema.

Vadim — I get that you’re saying that InnoDB isn’t suitable, but wouldn’t showing the numbers prove it? It’s just confusing to me, because your last paragraph says:

“I especially do not put TokuDB in the same table with analytic oriented databases, to highlight TokuDB is OLTP engine for general purposes.
As you see it is doing better than MyISAM in all queries.”

But MyISAM isn’t known for being good at OLTP in general purposes. InnoDB is. It’s confusing because I expect InnoDB to be better than MyISAM at general OLTP, so why not benchmark TokuDB against InnoDB? Or was the last paragraph meant to say “even though TokuDB is better at this benchmark of an OLAP workload, it does OLTP well too”…?

Vlad Rodionov

Another good test would be on Database X (you know what DB I am referring to) 🙂 But we are waiting for GreenPlum.

Bradley C. Kuszmaul

Hi Vadim,

This experiment makes MyISAM look *much* better than it would be in
practice. There are two things happening.

1) The index build time is faster in this experiment than in practice.
When MyISAM loads data into an empty table it builds the index using
an I/O-efficient algorithm. But when loading data into an existing
table with indexes, MyISAM is much slower. Try the following
experiment to see:
* Load up half the data.
* Then load the other half of the data separately.
(Or load up 95% the data first, then load 5% more data)
I would expect the second load to be at least an order of magnitude
faster for TokuDB than MyISAM.

2) Given that the indexes were built with a single bulk load, the
MyISAM indexes are not fragmented. That makes the MyISAM indexes
nearly as good as the TokuDB indexes. If you were to load up the data
one month at a time, then the MyISAM insertions would not do so well,
and the MyISAM indexes would be fragmented, and would run an order of
magnitude slower. The longer MyISAM runs, the worse the fragmentation
becomes. In contrast, trickle-loaded TokuDB indexes exhibit no more
fragmentation than bulk-loaded indexes.

I haven’t seen many workloads where the system loads data
once-and-for-all, and then performs queries. Most workloads involve
quite a few insertions or updates. For MyISAM, incremental insertions
are slow and produce fragmented indexes. TokuDB can handle the
worst-case insertions without fragmentation or aging.

Saying that B-tree insertion performance falls exponentially with data
size may be overstating it, but it’s pretty bad. This experiment
doesn’t seem to support this assertion, however, and your readers
might wonder how to reconcile the assertion with the experiment. The
key is to distinguish between the best case and the worst case. In
the worst case, B-tree insertion performance becomes very bad. In the
best case, B-trees are pretty good. Whereas in in production, many
databases are performing close to worst-case insertion workloads, this
experiment is testing the best-case B-tree insertion workload.

-Bradley

Peter Zaitsev

Vadim,

I would note you loaded the data for all engines all at once so you need to compare apples to apples. I would expect things may change for everyone (at various extent) if data is loaded incrementally.

I also would suggest looking at larger data set some time in the future – in the current ran the compression has 2 effects, it is not only much less data to look at but also in memory fit may cause virtually in-memory workload for well compressed data. On larger data sets when you get say 10% instead of 1% of in memory fit things may be a bit different.

Martin Kersten

Dear Vadim

Before looking into Greenplum, a run against Postgresql would be insightful.
It would demonstrate later the relative gains over Greenplum.

regards, Martin

Greg Smith

I don’t really agree with Bradley’s characterization that “in production, many databases are performing close to worst-case insertion workloads”. Sure, there are some of those. But a good percentage of the data I see in production comes from sources that are time based, and that data tends to be loaded in well clustered clumps that are closer to best-case rather than worst. This is particularly true if you have enough RAM in the system that a decent chunk of index blocks can stay in there and not have to be updated on disk every time they’re touched, which is how a performance-oriented database should be provisioned. It’s only the case where you have a workload with really random UPDATE/INSERTs to data already in the index that really tend toward the worst of the B-tree behaviors, which was the case in Vadim’s extreme fragmentation example. I suspect that the BTS air traffic data won’t degrade anywhere near worst-case even if loaded in chunks, because that data is organized into blocks by time in its original files. I’d guess most of the work will be building new little index sub-trees and attaching them, rather than the worst-case behavior where you’re touching things all across existing index blocks.

Vadim

Something wrong with comments again.
I see some more comments to this post in my email, but not here.

There are comments from Sheeri K. Cabral (Pythian) and Alexander Mikhailian.

I will try to manage them to appear here…

I agree — InnoDB should do much worse than MyISAM…but I’d like to see how much worse. It would be good to be able to compare TokuDB to InnoDB in the case of an OLAP workload….TokuDB is supposed to be good for OLTP too, and you benchmarked that.

I just don’t see that it would take so much time to do — seems simple, no? (I have some benchmarks coming out on TokuDB very soon, so if you can’t do it that’s OK too.)

Vadim — great! (in general if people are asking for something, even if it seems silly, I’d rather just do it and show them how silly it is, if it doesn’t take me too much time.)

And getting the difference between how fast compiled-in InnoDB and InnoDB-plugin is a win too — most people only consider upgrading or changing big things like compiled vs. plugin if their db is slow 🙂