Air traffic queries in MyISAM and Tokutek (TokuDB)

This is next post in series
Analyzing air traffic performance with InfoBright and MonetDB
Air traffic queries in LucidDB
Air traffic queries in InfiniDB: early alpha

Let me explain the reason of choosing these engines. After initial three posts I am often asked “What is baseline ? Can we compare results with standard MySQL engines ?”. So there come MyISAM to consider it as base point to see how column-oriented-analytic engines are better here.

However, take into account, that for MyISAM we need to choose proper indexes to execute queries effectively, and there is pain coming with indexes: – load of data is getting slower; – to design proper indexes is additional research, especially when MySQL optimizer is not smart in picking best one.

The really nice thing about MonetDB, InfoBright, InfiniDB is that they do not need indexes, so you may not worry about maintaining them and picking best one. I am not sure about LucidDB, I was told indexes are needed, but creating new index was really fast even on full database, so I guess, it’s not B-Tree indexes. So this my reflexion on indexes turned me onto TokuDB direction.

What is so special about TokuDB ? There two things: indexes have special structure and are “cheap”, by “cheap” I mean the maintenance cost is constant and independent on datasize. With regular B-Tree indexes cost grows exponentially on datasize (Bradley Kuszmaul from Tokutek will correct me if I am wrong in this statement). Another point with TokuDB, it uses compression, so I expect less size of loaded data and less IO operations during query execution.

So what indexes we need for queries. To recall you details, the schema is available in this post
https://www.percona.com/blog/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/, and
queries I posted on sheet “Queries” in my summary Spreadsheet.

With Bradley’s help we chose next indexes:

  KEY `Year` (`Year`,`Month`),
  KEY `Year_2` (`Year`,`DayOfWeek`),
  KEY `DayOfWeek` (`DayOfWeek`,`Year`,`DepDelay`),
  KEY `DestCityName` (`DestCityName`,`OriginCityName`,`Year`),
  KEY `Year_3` (`Year`,`DestCityName`,`OriginCityName`),
  KEY `Year_4` (`Year`,`Carrier`,`DepDelay`),
  KEY `Origin` (`Origin`,`Year`,`DepDelay`)

KEY `Year` (`Year`,`Month`),

KEY `Year_2` (`Year`,`DayOfWeek`),

KEY `DayOfWeek` (`DayOfWeek`,`Year`,`DepDelay`),

KEY `DestCityName` (`DestCityName`,`OriginCityName`,`Year`),

KEY `Year_3` (`Year`,`DestCityName`,`OriginCityName`),

KEY `Year_4` (`Year`,`Carrier`,`DepDelay`),

KEY `Origin` (`Origin`,`Year`,`DepDelay`)

And I measured load time for both MyISAM and TokuDB in empty table with created indexes.

Load time for MyISAM: 16608 sec
For TokuDB: 19131 sec

Datasize (including indexes)

MyISAM: 36.7GB
TokuDB: 6.7GB

I am a bit surprised that TokuDB is slower loading data, but my guess it is related to compression, and I expect with bigger amount of data TokuDB will be faster MyISAM.

Now to queries. Bradley pointed me that query Q5

SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier,
count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY
carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE
Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3

SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier,

count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY

carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE

Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3

can be rewritten as

SELECT carrier,totalflights,ndelayed,ndelayed*1000/totalflights as c3 FROM (SELECT carrier,count(*) as totalflights,sum(if(depdelay>10,1,0)) as ndelayed from ontime where year=2007 group by carrier) t order by c3 desc;

1	SELECT carrier,totalflights,ndelayed,ndelayed1000/totalflights as c3 FROM (SELECT carrier,count() as totalflights,sum(if(depdelay>10,1,0)) as ndelayed from ontime where year=2007 group by carrier) t order by c3 desc;

( I name it as Query Q5i)

The summary table with queries execution time (in sec, less is better):

Query	MyISAM	TokuDB
Q0	72.84	50.25
Q1	61.03	55.01
Q2	98.12	58.36
Q3	123.04	66.87
Q4	6.92	6.91
Q5	13.61	11.86
Q5i	7.68	6.96
Q6	123.84	69.03
Q7	187.22	159.62
Q8 (1y)	8.75	7.59
Q8 (2y)	102.17	64.95
Q8 (3y)	104.7	69.76
Q8 (4y)	107.05	70.46
Q8 (10y)	119.54	84.64
Q9	69.05	47.67

For reference I used 5.1.36-Tokutek-2.1.0 for both MyISAM and TokuDB tests.

And if you are interested to compare MyISAM with previous engines:

Query	MyISAM	MonetDB	InfoBright	LucidDB	InfiniDB
Q0	72.84	29.9	4.19	103.21	NA
Q1	61.03	7.9	12.13	49.17	6.79
Q2	98.12	0.9	6.73	27.13	4.59
Q3	123.04	1.7	7.29	27.66	4.96
Q4	6.92	0.27	0.99	2.34	0.75
Q5	13.61	0.5	2.92	7.35	NA
Q6	123.84	12.5	21.83	78.42	NA
Q7	187.22	27.9	8.59	106.37	NA
Q8 (1y)	8.75	0.55	1.74	6.76	8.13
Q8 (2y)	102.17	1.1	3.68	28.82	16.54
Q8 (3y)	104.7	1.69	5.44	35.37	24.46
Q8 (4y)	107.05	2.12	7.22	41.66	32.49
Q8 (10y)	119.54	29.14	17.42	72.67	70.35
Q9	69.05	6.3	0.31	76.12	9.54

The all results are available in summary Spreadsheet

I especially do not put TokuDB in the same table with analytic oriented databases, to highlight TokuDB is OLTP engine for general purposes.
As you see it is doing better than MyISAM in all queries.

26 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Pawel

14 years ago

Hi,
what about innodb? Could you please benchmark this storage engine?

Alexander Mikhailian

14 years ago

Er… Benchmarking InnoDB is probably just pointless as it is very much OLTP-oriented, unlike MyISAM. Expect the results to be worse, and the datasize to be bigger.

Igor

14 years ago

Great series! Very useful for many people who are considering different DB alternatives.
So it seems that only Infobright and MonetDB have good speeds – and I suspect with larger RAM (like 16GB or more) MonetDB would do even better on some queries (Q8 – 10yrs almost for sure).
Are you planning to test GreenPlum singleNode – ideally on a real server (8 cores/16GB+ RAM etc) along with Infobright and MonetDB to see how they can use better hardware?

Vadim Tkachenko

Author

14 years ago

Pawel,

I can benchmark InnoDB, but I am all time trying to say InnoDB is not suitable for this task and these queries. The same valid and for MyISAM, I only posted numbers to show that standard MySQL is not good in handling analytic queries.

Vadim Tkachenko

Author

14 years ago

Igor,

I am going to run Star Schema Benchmarks on more powerful box 16GB, 8 cores, will see how it works.

On GreenPlum, I’d like to try it, but it seems it may take some time for me to figure out how to install and load data. If one provide me instructions I will be happy to run.

miata

14 years ago

Really enjoyed reading these articles as I am also in the process of evaluating the performance of column based databases on large data tables in the order of billions of rows. What I can find about this on most web sites sound more like marketing materials without real numbers, your experiments give good insight into the scale of performance gains that can be achieved.

Having installed Infobright ICE on a 8 core server, it seems to be that each query can only use one CPU core for processing. I wonder in your testing, can any of these columnar database engines take advantage of multi core CPU for performance gains?

Igor

14 years ago

GreenPlum is very much like PostgreSQL (but tuned for DW queries) – so data should be loaded with COPY command ( see for example
http://wiki.postgresql.org/wiki/How_to_use_PostgreSQL's_COPY_function_effectively ) and I suspect all queries should run without changes.
Installation on Linux from RPMs (http://www.greenplum.com/products/single-node/) seems very straightforward.
As to the tuning – maybe GreenPlum (company) who just released it would be willing to provide correct indexes and params?

Igor

14 years ago

And GreenPlum implemented 2 kinds of compression in latest version – row-based and column-based.
I suspect that fastest compression option (similar to gzip -1) should be used to save CPU time during loading.
Don’t know which one is better (row-based or column-based) but it seems really easy to convert between two.

Steven Roussey

14 years ago

What is a good database for tracking live stats? Not saving for rolling up later type stats, but live stats like page views and unique visitors that are displayed on the same pages that increment these counters.

Vadim Tkachenko

Author

14 years ago

Igor,

I will take look on GreenPlum, though I do not promise it will be soon. I am going to ask on their forum if there is interest to help me.

Vadim Tkachenko

Author

14 years ago

Steven,

That’s good area, and actually the point of this series is to figure it out. If you look on air traffic records in big picture it is not different from records for web site visitors.
Which database is the best – I show data, you have a choice.
I will probably take schema with real web site visitors, it will try it also.

Steven Roussey

14 years ago

Yes, though I mean a 50% read 50% write scenario. Every read is preceded by a write. I think TokuDB might be useful here, but I’m not sure of the others. I can post an idea for a schema.

Vadim Tkachenko

Author

14 years ago

Steven,

That would be good. Do you have enough data btw ?
I am thinking to have about 100GB data, but I can’t find logs that I can allow public access.

Sheeri K. Cabral (Pythian)

14 years ago

Vadim — I get that you’re saying that InnoDB isn’t suitable, but wouldn’t showing the numbers prove it? It’s just confusing to me, because your last paragraph says:

“I especially do not put TokuDB in the same table with analytic oriented databases, to highlight TokuDB is OLTP engine for general purposes.
As you see it is doing better than MyISAM in all queries.”

But MyISAM isn’t known for being good at OLTP in general purposes. InnoDB is. It’s confusing because I expect InnoDB to be better than MyISAM at general OLTP, so why not benchmark TokuDB against InnoDB? Or was the last paragraph meant to say “even though TokuDB is better at this benchmark of an OLAP workload, it does OLTP well too”…?

Vlad Rodionov

14 years ago

Another good test would be on Database X (you know what DB I am referring to) 🙂 But we are waiting for GreenPlum.

Bradley C. Kuszmaul

14 years ago

Hi Vadim,

This experiment makes MyISAM look *much* better than it would be in
practice. There are two things happening.

1) The index build time is faster in this experiment than in practice.
When MyISAM loads data into an empty table it builds the index using
an I/O-efficient algorithm. But when loading data into an existing
table with indexes, MyISAM is much slower. Try the following
experiment to see:
* Load up half the data.
* Then load the other half of the data separately.
(Or load up 95% the data first, then load 5% more data)
I would expect the second load to be at least an order of magnitude
faster for TokuDB than MyISAM.

2) Given that the indexes were built with a single bulk load, the
MyISAM indexes are not fragmented. That makes the MyISAM indexes
nearly as good as the TokuDB indexes. If you were to load up the data
one month at a time, then the MyISAM insertions would not do so well,
and the MyISAM indexes would be fragmented, and would run an order of
magnitude slower. The longer MyISAM runs, the worse the fragmentation
becomes. In contrast, trickle-loaded TokuDB indexes exhibit no more
fragmentation than bulk-loaded indexes.

I haven’t seen many workloads where the system loads data
once-and-for-all, and then performs queries. Most workloads involve
quite a few insertions or updates. For MyISAM, incremental insertions
are slow and produce fragmented indexes. TokuDB can handle the
worst-case insertions without fragmentation or aging.

Saying that B-tree insertion performance falls exponentially with data
size may be overstating it, but it’s pretty bad. This experiment
doesn’t seem to support this assertion, however, and your readers
might wonder how to reconcile the assertion with the experiment. The
key is to distinguish between the best case and the worst case. In
the worst case, B-tree insertion performance becomes very bad. In the
best case, B-trees are pretty good. Whereas in in production, many
databases are performing close to worst-case insertion workloads, this
experiment is testing the best-case B-tree insertion workload.

-Bradley

Vadim Tkachenko

Author

14 years ago

Bradley,

I agree that this is best case scenario for MyISAM , and with fragmented indexes the result may be much worse for load time and for query execution. For queries yesterday I posted similar problem with InnoDB http://www.mysqlperformanceblog.com/2009/11/05/innodb-look-after-fragmentation/

I will re-run load with chunk-per-year to compare how it affects load time.

Though I should say the scenario I show here is not fully invalid. I’ve seen DW solution bases on MyISAM, when data is load in periodically. And say if we load data once per months, and full load time is only couple hours, it is worth to re-load data entirely to get optimal index structure.

Peter Zaitsev

Admin

14 years ago

Vadim,

I would note you loaded the data for all engines all at once so you need to compare apples to apples. I would expect things may change for everyone (at various extent) if data is loaded incrementally.

I also would suggest looking at larger data set some time in the future – in the current ran the compression has 2 effects, it is not only much less data to look at but also in memory fit may cause virtually in-memory workload for well compressed data. On larger data sets when you get say 10% instead of 1% of in memory fit things may be a bit different.

Vadim Tkachenko

Author

14 years ago

Peter,

I am open for bigger dataset, though I have hard time finding it. Air traffic data is the biggest one (well, there is data take from telescopes, but I do not see how that fits into analytic databases) available.
For sure I can use (and will) synthetic generated data, but it is less interesting for me.

Martin Kersten

14 years ago

Dear Vadim

Before looking into Greenplum, a run against Postgresql would be insightful.
It would demonstrate later the relative gains over Greenplum.

regards, Martin

Greg Smith

14 years ago

I don’t really agree with Bradley’s characterization that “in production, many databases are performing close to worst-case insertion workloads”. Sure, there are some of those. But a good percentage of the data I see in production comes from sources that are time based, and that data tends to be loaded in well clustered clumps that are closer to best-case rather than worst. This is particularly true if you have enough RAM in the system that a decent chunk of index blocks can stay in there and not have to be updated on disk every time they’re touched, which is how a performance-oriented database should be provisioned. It’s only the case where you have a workload with really random UPDATE/INSERTs to data already in the index that really tend toward the worst of the B-tree behaviors, which was the case in Vadim’s extreme fragmentation example. I suspect that the BTS air traffic data won’t degrade anywhere near worst-case even if loaded in chunks, because that data is organized into blocks by time in its original files. I’d guess most of the work will be building new little index sub-trees and attaching them, rather than the worst-case behavior where you’re touching things all across existing index blocks.

Vadim

14 years ago

Something wrong with comments again.
I see some more comments to this post in my email, but not here.

There are comments from Sheeri K. Cabral (Pythian) and Alexander Mikhailian.

I will try to manage them to appear here…

Vadim Tkachenko

Author

14 years ago

Hmm, Sheeri, your comment came to my mailbox only today, but WordPress says it was posted Nov-6,

Let me comment it now.

I think I explained why I took TokuDB. It has interesting index structure which may make it more
suitable than MyISAM and InnoDB, and you actually can see that from queries time.

Load time is not good for TokuDB, but as Bradley comments it, it depends how you do load.

I made another run where I loaded data in year-per-chunk way (in contrast all years-at-once as in original post), it took over 9h for MyISAM, but for TokuDB the same 5.5h

Sheeri K. Cabral (Pythian)

14 years ago

I agree — InnoDB should do much worse than MyISAM…but I’d like to see how much worse. It would be good to be able to compare TokuDB to InnoDB in the case of an OLAP workload….TokuDB is supposed to be good for OLTP too, and you benchmarked that.

I just don’t see that it would take so much time to do — seems simple, no? (I have some benchmarks coming out on TokuDB very soon, so if you can’t do it that’s OK too.)

Vadim Tkachenko

Author

14 years ago

Sheeri,

Ok, I am sold 🙂 I will try to manage InnoDB setup before Portland, if not – then after that.
Would is really interesting here for me how fast indexes from InnoDB-plugin will work here.

Sheeri K. Cabral (Pythian)

14 years ago

Vadim — great! (in general if people are asking for something, even if it seems silly, I’d rather just do it and show them how silly it is, if it doesn’t take me too much time.)

And getting the difference between how fast compiled-in InnoDB and InnoDB-plugin is a win too — most people only consider upgrading or changing big things like compiled vs. plugin if their db is slow 🙂

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Air traffic queries in MyISAM and Tokutek (TokuDB)

Related

Related Blog Articles

RECOMMENDED ARTICLES

Securing Your MySQL Database: Essential Best Practices

MySQL 8.4 First Peek

Why MariaDB Is “Better” Than MySQL

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Air traffic queries in MyISAM and Tokutek (TokuDB)

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Securing Your MySQL Database: Essential Best Practices

MySQL 8.4 First Peek

Why MariaDB Is “Better” Than MySQL

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation