(Note: Review was done as part of our consulting practice, but is totally independent and fully reflects our opinion)

I had a chance to take look TokuDB (the name of the Tokutek storage engine), and run some benchmarks. Tuning of TokuDB is much easier than InnoDB, there only few parameters to change, and actually out-of-box things running pretty well.

There are some rumors circulating that TokuDB is ”.. only an in memory or read-only engine, and that’s why inserts are so fast”. This is not actually the case, as TokuDB is a disk-based, read-write transactional storage engine that is based on special “fractal tree indexes”. Fractal Trees are a drop-in-replacement for a B-tree (based on current research in data structures by professors at Stony Brook, Rutgers, and MIT). I can’t say exactly how it is improved, because the engine itself is closed source.

Along with its “fractal tree indexes”, TokuDB also uses compression, which significantly ( Graph 1) reduces dataset and decreases the amount of IO operations. The benefit of small size is also that TokuDB can keep in memory much more records then InnoDB / MyISAM. Actually in internal cache records are stored in uncompressed form, but OS Cache can keep compressed pages in memory. For the data set we tested, TokuDB used 6.2x less disk space than InnoDB, and 5.5x less disk space than MyISAM.

For tests I used Dell PowerEdge R900, with RAID 10 on 8 disks (2.5″ SAS disks, 15K RPMS) and 32GB of physical RAM, but restricted on kernel level to 4GB to emulate case B-Tree does not fit into memory.
As benchmark software I tried iiBench, which you can take there https://launchpad.net/mysql-patch/mytools

What makes fractal indexes so interesting is the amount of IO operations to update index tree is significantly less than for usual B-Tree index. It’s as if Fractal Trees turn random IO into sequential IO. This is why you see the results that you do in iiBench test ( Graph 2), and the number of inserts/sec is almost linear even when table size bigger than available memory. For the last 10M rows inserted, InnoDB averaged 1,555 rows/sec while TokuDB averaged 16,437 rows/sec – about 10.6x faster. One consequence of having such fast indexes, is that you can maintain a richer set of indexes at a given incoming data rate, enabling much higher query performance.

Beside iiBench we run benchmarks of SELECT queries again one of our click analyzing schema, in two modes – 1. data size is much more then memory and 2. data fits into memory.

As you see in IO-bound case TokuDB outperforms InnoDB 1.4-2.5x times, but CPU-bound is not so good. I think there we meet one of current restrictions of TokuDB – SERIALIZABLE isolation level for transactions.

Speaking about restrictions, the current problems I see are:
– Transactions only support the SERIALIZABLE isolation level. Beside it TokuDB does not scale well on multi-cores even in only SELECT queries. What this practically means it that you can’t get benefit of multi-core boxes running concurrent threads. Tokutek plans to fix this in one of the next releases.
– We did not tested wide range of queries, but by design expect there may be not good results for some kind of queries, i.e. point select queries, as in this case TokuDB has to read and decompress big portion of data.
– Despite Inserts and Deletes are fast, updates are not expected to show the same performance gain, as to update we need to read data, and in this case – read previous comment.
– The version we tested did not yet support recovery logs. The code for it is ready, and will be available in a release soon.
– The ways to do a backup is mysqldump/mysqlhotcopy. It is not fully transparent backup, as it applied TABLE LOCK on copying table. When recovery logs are supported, I guess it will be possible to run LVM backup. Actually I would say backup is only partially the problem of storage engine. The biggest problem is that MySQL does not yet provide an interface for that. This is going to be fixed in MySQL 6.0, but I can’t yet say how it will work with mix of storage engines.
– The Tokutek engine I tested comes in binary form and mysqld binary does not contain InnoDB. Tokutek tells me that InnoDB will be included in a future release.

With all the given advantages and drawbacks, I see a good practical usage of TokuDB for log analyzing and log reporting queries. By log analyzing I mean any kind of log producing application, it can be from simple apache logs put into mysql, application performance logs to more complex log like clicks, user movements and actions on site, visits tracking etc. While it may sound like an easy and trivial task, it is not at all. The more logs there are, the more space they take, and we have had setups where logs are 80% of total database size. Also there is the problem of being able to run custom reporting queries on logs. To do this, you often need many, often complex indexes which gives us the problem of random IO, waste of RAM memory and slow inserts. This is where I think Tokutek appears to be positioned to do quite good at. There are operation issues which make things more complex and, probably, I would not put yet TokuDB on customer production boxes, but it may good fit to non-critical slave where you can run analyzing queries.

21 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Peter Zaitsev

Thanks,

As it just came to my mind the performance of TokuDB at some extent resembles one of google bigtable (or SStables design) – they also have very fast inserts avoiding sequential writes, compression and rather fast scans. Same as TokuDB they excel in fast insertion reads and ordered scans though random point lookups and updates are slow.

As we know great applications built on the BigTable which seconds to the fact there can be a lot of application patterns built using TokuDB characteristics to advantage.

The compression use also allows to fit more data in the same size which should allow storing multiple “sort orders” in the same table allowing to access all columns while scanning.

It is also interesting to see how TokuDB compares to Infobright and similar engines – the focus as I understand is different as TokuDB being row store engine focuses on application with real time inserts while InfoBright being column store engine focuses on analytic queries.

One thing I would also highlight about the insertion benchmarks – this is single thread run which is especially bad on RAID10 – single thread does not allow to use all hard drive efficiently with Innodb.

Antony

Perhaps some useful links about the “Fractal tree indexes” would be useful to people…

Try using Google searching for “fpB+-Tree”… A couple of links I have found this way are:
http://www.pdl.cmu.edu/ftp/Database/fpbtree.ps
http://docs.google.com/Presentation?docid=dgm5bc3m_0qpsnmqc9

Interesting stuff…

Antony

Bah… the second link is bogus… seems on my clicking around, the original PowerPoint presentation was converted into a Google Docs presentation…. anyways. Hunt around starting with the “fpB+-Tree” search term and you’ll find it.

You didn’t compare it with XtraDB?

Vadim

Unfortunately XtraDB is not yet so popular as InnoDB, and actually in this case I do not expect big difference, as B-tree on disk is B-tree on disk.

Guillaume Theoret

If you want more info on fractal trees you should read the papers here: http://supertech.csail.mit.edu/cacheObliviousBTree.html

Henrik Ingo

From my limited experience: While the fractal tree index thing sounds sexy, and may contributo to why insert performance is good, I expect the high compression rate is actually the biggest reason TokuDB performs so well on an insert or io-bound test.

For instance, Infobright will give you the same benefits. Also there the reason is a combination of indexes and compression. (But infobright would perform poorly on selects, so not relevant here from that p.o.v.)

Mark Callaghan

@Henrik — My guess is that the benefit from compression is a constant factor and the bigger benefit is from the fractal tree. Two of the technical founders worked through the math with me at the conference for a cache oblivious algorithm. We left compression out to keep it simple. The benefit is real and random IO has been removed from the insert part of the workload. We need more test results to show how this performs on queries.

The fractal tree isn’t the only way to do this but it appears to be the best way — log structured merge trees can also be used and Google Bigtable uses something like that, only they have not provided the algorithm analysis.

Henrik Ingo

@Mark: Like you keep saying, math is hard, so until I really have a need to understand this, I’ll take your word for it 🙂

Mark Callaghan

I think we should agitate until Tokutek provides a white paper suitable for people who understand algorithms but are not experts in cache oblivious algorithms. The purpose is to explain that this is possible, not give away secrets.

Baron Schwartz

While reviewing a draft of this blog post, I asked them for some academic sources we could cite, and they warned me I’d never be able to understand it. I haven’t even opened up the papers they referred me to. I think they could really help the cause by drawing diagrams of how it works, and seeing if they can explain the concepts to ordinary mortals. I could dust off my advanced math and give it a shot, but I have a gut feeling it’s possible to understand without that, at least at an intuitive level. And some people would be satisfied with an intuitive understanding — enough to buy, I’d guess.

Guillaume Theoret

Could we please have this list of papers they referenced?

Martin Farach-Colton

Mark & Baron, thanks for the suggestion of a white paper. I’ll get to that ASAP.

In the mean time, her are links to the papers (any difficulty in reading them is certainly the fault of our academic writing rather than of the understandability of the math!):

http://www.cs.sunysb.edu/~bender/pub/sicomp05-BenderDeFa.ps
http://www.cs.sunysb.edu/~bender/pub/FOCS03-co-searching.ps
http://www.cs.sunysb.edu/~bender/pub/locality-full.ps
http://www.cs.sunysb.edu/~bender/pub/BenderHu-TODS07.pdf
http://supertech.csail.mit.edu/cacheObliviousBTree.html

FYI: there’s a totally different index structure called Fractal Trees (see e.g. http://www.pdl.cmu.edu/ftp/Database/fpbtree.ps). This is unrelated to our work.

Martin Farach-Colton

ps. The fpb+-tree, also called a Fractal Tree, is a different beast entirely.

Victoria Eastwood

Hi Vadim,

Nice analysis of TokuDB. In regard to Infobright, we are designed for analytics rather than transaction processing. Our technology is a column oriented database based on MySQL, that uses what we call the “knowledge grid”, which eliminates the need to create any indexes or partition data while achieving great query performance. We also sport very high compression (better than 10:1, sometimes as high as 40:1). You can think of the knowledge grid as a metadata layer which describes at a high level the content of the entire database. It is used by our optimizer to eliminate as much data as possible when processing a query and minimize the need for decompression. Since we don’t have indexes, our load speed is one the fastest you’ll find. We have quite a few clients that are successfully using Infobright in production for web log analytics. We are also open-source so I invite you to check it out at http://www.infobright.org.

Cheers –V

Baron Schwartz

Martin, sorry for the hassle — your comments got spammed due to the links. Thanks. – Baron

Peter Zaitsev

Thanks Victoria,

I think Tokutek and Infobright have some overlap of simple analytic queries but their focus is very different.

Peter Zaitsev

Mark,

Good to see you mentioning log structured trees as well. It would be very interesting to understand how they compare to Fractal trees by Tokutek. Look very similar in terms of advantages offered to me. The good thing about them – as I understand they are not patented (at least they are implemented in number of technologies)

Surat

It would be nice to compare InnoDB compressed row format with TokuDB.

Bernhard

Is there a comparison of a maximum delayed insert rate of small records on Tokutek/MyISAM on the same platform?

Benjamin Tan

Is it fit using TokuDB engine when the record average length is 3k?
Maybe we should check the iiBench test case.

http://stackoverflow.com/questions/9424540/is-it-fit-using-tokudb-engine-when-the-record-average-length-is-3k