July 20, 2009

XtraDB storage engine release 1.0.3-6

Posted by Aleksandr Kuzminsky

Dear community,

Today we are pleased to announce release 6 of XtraDB – the result of two months of hard work.

The release includes the following new features:

  • MySQL 5.1.36 as a base release
  • New patch innodb_recovery_patches.patch
  • Experimental adaptive checkpoint method estimate
  • innodb_stats – the implementation of the fix for MySQL Bug#30423
  • expand-import Support for importing InnoDB / XtraDB tables from another server
  • split-bufpool-mutex-3 New patch to split buffer pool mutex
  • g-style-io-thread Google’s fixes to InnoDB IO threads
  • dict-size-limit Limit of internal data dictionary

Fixed bugs:

The builds for RedHat 4, 5 and Debian are located at http://www.percona.com/mysql/xtradb/5.1.36-6/
You can find the latest source code of XtraDB, including the development branch, on Launchpad.

Please report any bugs you find on Bugs in Percona XtraDB Storage Engine for MySQL.
For general questions use our Percona-discussions group, and for development questions use the Percona-dev group.

For support, commercial and sponsorship inquiries contact Percona

June 11, 2009

The feature I love in TokuDB

Posted by Vadim

Playing with TokuDB updates, I noticed a State value in SHOW PROCESSLIST that is unusual for MySQL.

CODE:
  mysql> show processlist;
  +----+------+-----------+--------+---------+------+---------------------------+-----------------------------+
  | Id | User | Host      | db     | Command | Time | State                     | Info                        |
  +----+------+-----------+--------+---------+------+---------------------------+-----------------------------+
  |  3 | root | localhost | sbtest | Query   |   30 | Updated about 764000 rows | update sbtest set email=zip |
  ...
  mysql> show processlist;
  +----+------+-----------+--------+---------+------+----------------------------+-----------------------------+
  | Id | User | Host      | db     | Command | Time | State                      | Info                        |
  +----+------+-----------+--------+---------+------+----------------------------+-----------------------------+
  |  3 | root | localhost | sbtest | Query   |   79 | Updated about 1900000 rows | update sbtest set email=zip |
  ...

(Ignore the silly UPDATE query; it's just for testing :) )

So by looking at SHOW PROCESSLIST you can see the progress of query execution.
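As a rough illustration of what this progress information allows, the two samples above can be turned into an average update rate. This is only a sketch in Python: the function names are made up for the example, and the regular expression assumes the State text always has the form shown in the output above.

```python
import re

# Two samples copied from the SHOW PROCESSLIST output above: (Time in seconds, State text).
samples = [
    (30, "Updated about 764000 rows"),
    (79, "Updated about 1900000 rows"),
]

def rows_done(state):
    """Extract the row count from a State string like the ones above, or None if absent."""
    match = re.search(r"about (\d+) rows", state)
    return int(match.group(1)) if match else None

def rows_per_second(first, second):
    """Average update rate between two (time, state) samples."""
    (t1, s1), (t2, s2) = first, second
    return (rows_done(s2) - rows_done(s1)) / (t2 - t1)

rate = rows_per_second(samples[0], samples[1])
print(f"{rate:.0f} rows/s")  # 23184 rows/s
```

Dividing the number of rows still to be updated by this rate would give a rough time-to-completion estimate.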

I would like to see this in standard MySQL and InnoDB more than all these triggers and stored routines! We will probably implement this in XtraDB.

April 28, 2009

Detailed review of Tokutek storage engine

Posted by Vadim

(Note: This review was done as part of our consulting practice, but it is totally independent and fully reflects our opinion.)

I had a chance to take a look at TokuDB (the name of the Tokutek storage engine) and run some benchmarks. Tuning TokuDB is much easier than tuning InnoDB: there are only a few parameters to change, and things actually run pretty well out of the box.

There are some rumors circulating that TokuDB is “... only an in-memory or read-only engine, and that's why inserts are so fast”. This is not actually the case: TokuDB is a disk-based, read-write transactional storage engine that is based on special “fractal tree indexes”. Fractal Trees are a drop-in replacement for a B-tree (based on current research in data structures by professors at Stony Brook, Rutgers, and MIT). I can't say exactly how it is improved, because the engine itself is closed source.
[read more...]

April 15, 2009

How to decrease InnoDB shutdown times

Posted by Baron Schwartz

Sometimes a MySQL server running InnoDB takes a long time to shut down. The usual culprit is flushing dirty pages from the buffer pool. These are pages that have been modified in memory, but not on disk.

If you kill the server before it finishes this process, it will just go through the recovery phase on startup, which can be even slower in stock InnoDB than the shutdown process, for a variety of reasons.

[read more...]

April 8, 2009

XtraDB storage engine release 1.0.3-4 codename Sakura

Posted by Evgeniy Stepchenko

Today we are glad to announce release 1.0.3-4 of our XtraDB storage engine.

Here is a list of enhancements in this release:

Percona XtraDB 1.0.3-4 (Sakura) is available in source and several binary packages.

XtraDB is compatible with existing InnoDB tables (unless you used innodb_extra_undoslots), and we are going to keep compatibility in future releases. We are open to feature requests for the new engine and ready to accept community patches. You can monitor Percona’s current tasks and further plans on the Percona XtraDB Launchpad project. You can also request features and report bugs there. We have also set up two mailing lists, for General discussions and for Development related questions.

April 6, 2009

MySQL and IBM

Posted by Vadim

No, this is not about Sun and IBM :) This is about MySQL. If you download the latest 5.1.33 source code, you may find a storage/ibmdb2i directory there, which obviously is IBM DB2 related. Interestingly, there is no mention of the new engine in the announcement http://dev.mysql.com/doc/refman/5.1/en/news-5-1-33.html.
A quick look into the source code shows:

CODE:
  MYSQL_STORAGE_ENGINE([ibmdb2i], [], [IBM DB2 for i Storage Engine],
          [IBM DB2 for i Storage Engine], [max,max-no-ndb])
  MYSQL_PLUGIN_DYNAMIC([ibmdb2i], [ha_ibmdb2i.la])

Also interesting is that the license of the added files is not GPL, but

CODE:
  /*
  Licensed Materials - Property of IBM
  DB2 Storage Engine Enablement
  Copyright IBM Corporation 2007,2008
  All rights reserved
  Redistribution and use in source and binary forms, with or without modification,
  are permitted provided that the following conditions are met:
  (a) Redistributions of source code must retain this list of conditions, the
       copyright notice in section {d} below, and the disclaimer following this
       list of conditions.
  (b) Redistributions in binary form must reproduce this list of conditions, the
       copyright notice in section (d) below, and the disclaimer following this
       list of conditions, in the documentation and/or other materials provided
       with the distribution.
  (c) The name of IBM may not be used to endorse or promote products derived from
       this software without specific prior written permission.
  (d) The text of the required copyright notice is:
         Licensed Materials - Property of IBM
         DB2 Storage Engine Enablement
         Copyright IBM Corporation 2007,2008
         All rights reserved
  THIS SOFTWARE IS PROVIDED BY IBM CORPORATION "AS IS" AND ANY EXPRESS OR IMPLIED
  WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
  MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
  SHALL IBM CORPORATION BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
  EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
  OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
  CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
  IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
  OF SUCH DAMAGE.
  */

I think this is the outcome of the 2-year-old press release
"MySQL AB and IBM Announce Open Source Database Support for the IBM System i Platform"; it just took a while to put it into the source tree. I wonder what happened to the policy of not accepting significant changes into a production release.

April 28, 2008

The MySQL optimizer, the OS cache, and sequential versus random I/O

Posted by Baron Schwartz

In my post on estimating query completion time, I wrote about how I measured the performance on a join between a few tables in a typical star schema data warehousing scenario.

In short, a query that could take several days to run with one join order takes an hour with another, and the optimizer chose the poorer of the two join orders. Why is one join order so much slower than the other, and why did the optimizer not choose the faster one? That's what this post is about.

Let's start with the MySQL query optimizer. The optimizer tries to choose the best join order based on its cost metric; it tries to estimate the cost for a query, then choose the query plan that has the lowest cost. The unit of cost for the MySQL query optimizer is a single random 4k data page read. In general, it's a pretty good metric, but it has one major weakness: the server doesn't know whether a read will be satisfied from the operating system cache, or whether it'll have to go to disk. (This distinction is abstracted away by the storage engine; the optimizer doesn't know how a given storage engine stores its data).

I'll try to omit the details and keep this clean. Let's take a look at the tables.

SQL:
  mysql> SHOW TABLE STATUS LIKE 'fact'\G
  *************************** 1. row ***************************
  Name: fact
  Engine: MyISAM
  Rows: 147045493
  Avg_row_length: 117
  Data_length: 17217646764
  Index_length: 11993816064

  mysql> SHOW TABLE STATUS LIKE 'dim1'\G
  *************************** 1. row ***************************
  Name: dim1
  Engine: MyISAM
  Rows: 453193
  Avg_row_length: 122
  Data_length: 55605116
  Index_length: 93812736

  mysql> SHOW TABLE STATUS LIKE 'dim2'\G
  *************************** 1. row ***************************
  Name: dim2
  Engine: MyISAM
  Rows: 811
  Avg_row_length: 105
  Data_length: 85368
  Index_length: 154624

It's a big fact table and two fairly small dimension tables, which is normal. Here is the query:

SQL:
  SELECT fact.col1, min(fact.col2) AS min_col2
  FROM fact, dim1, dim2
  WHERE fact.col4 = dim1.col4
  AND dim1.col3 <> 'hello world'
  AND dim2.col5 = 1
  AND fact.dim2_id = dim2.dim2_id
  AND fact.col2 > some_const
  GROUP BY fact.col1

There are indexes on all the columns in all the ways you'd expect: all the dimension columns are indexed on every table, and there's a separate index on every column in the WHERE clause. Here's the query plan initially.

SQL:
  *************************** 1. row ***************************
  table: dim1
  type: range
  key_len: 195
  rows: 18790
  Extra: Using where; Using temporary; Using filesort
  *************************** 2. row ***************************
  table: fact
  type: ref
  key_len: 4
  rows: 606
  Extra: Using where
  *************************** 3. row ***************************
  table: dim2
  type: eq_ref
  key_len: 2
  rows: 1
  Extra: Using where

This query would run for days. No one ever let it finish to see how long it would actually take.

How do I know it will run for days? Here's my train of thought:

  • It's performing index lookups into the fact table, which is big.
  • An index lookup is a random I/O.
  • A modern disk can do about 100 random I/Os per second, as a rule of thumb.
  • If you do the math with the rows column in EXPLAIN, you realize that this equates to about 18790 * 606 = 11386740 I/O operations, assuming that the indexes are fully in memory.
  • When you divide this by 100 I/O operations per second, and then divide that by 86400 seconds in a day, you get about 1.3 days. Allowing two random I/Os per lookup instead of one doubles that to about 2.6 days.
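The rule-of-thumb estimate above can be reproduced in a few lines. This is a sketch only: the function name and the `ios_per_row` parameter are illustrative assumptions, not something taken from the original measurements. With one random read per joined row the estimate comes to about 1.3 days, and with two random reads per row it comes to about 2.6 days.

```python
SECONDS_PER_DAY = 86400

def random_io_days(outer_rows, rows_per_lookup, ios_per_second=100, ios_per_row=1):
    """Days needed if every joined row costs `ios_per_row` random reads.

    `ios_per_row` is a hypothetical knob added for this example.
    """
    total_ios = outer_rows * rows_per_lookup * ios_per_row
    return total_ios / ios_per_second / SECONDS_PER_DAY

# rows estimates taken from the EXPLAIN output above (dim1: 18790, fact: 606)
print(18790 * 606)                                          # 11386740
print(round(random_io_days(18790, 606), 1))                 # 1.3
print(round(random_io_days(18790, 606, ios_per_row=2), 1))  # 2.6
```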

Ouch! That's slow.

Now let's look at the alternative: table-scan the fact table, and do index lookups in the two dimension tables. MySQL doesn't want to choose this join order, so we'll force it with STRAIGHT_JOIN:

SQL:
  EXPLAIN SELECT STRAIGHT_JOIN  ....
  +-------+-----------+-----------+---------------------------------+
  | table | type      | rows      | Extra                           |
  +-------+-----------+-----------+---------------------------------+
  | fact  | ALL       | 147367284 | Using temporary; Using filesort |
  | dim1  | eq_ref    | 1         | Using where                     |
  | dim2  | eq_ref    | 1         | Using where                     |
  +-------+-----------+-----------+---------------------------------+

As we saw in the previous post, which I linked at the top of this post, we can scan the fact table in less than an hour. And it turns out that joining to the dimension tables doesn't slow the query perceptibly, because these tables are small and they stay in memory, in the OS cache. (They don't get evicted from memory by the cache's LRU policy, because they are frequently used -- once per row in the fact table. The LRU policy evicts old blocks from the fact table instead, which is perfect -- these blocks are used only once and not needed again, so they can be replaced).
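The caching behaviour described here can be illustrated with a toy simulation. This is a minimal sketch of a plain LRU block cache, not a model of any real operating system's page cache. The block counts and cache capacity are arbitrary assumptions chosen to keep the example small.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU block cache that counts hits and misses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as most recently used
            self.hits += 1
            return
        self.misses += 1
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block

def simulate(fact_blocks=10_000, dim_blocks=8, capacity=64):
    """Scan the fact table once, doing one dimension lookup per fact block."""
    cache = LRUCache(capacity)
    dim_misses = 0
    for i in range(fact_blocks):
        cache.read(("fact", i))                 # sequential scan: each block is read once
        before = cache.misses
        cache.read(("dim", i % dim_blocks))     # lookup into the small dimension table
        dim_misses += cache.misses - before
    return dim_misses, cache.hits

dim_misses, hits = simulate()
print(dim_misses, hits)  # 8 9992
```

Each dimension block misses only on its first read. After that it is touched often enough to stay near the front of the LRU order, so the blocks evicted are always old fact-table blocks.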

The difference between the two queries -- 55 minutes and 2.6 days -- is basically the difference between scanning data sequentially on disk and random disk I/O.

So now you know why one join order is faster than the other. But why didn't the optimizer know this, too? The optimizer does know that random access is slower than sequential access, but it doesn't know that the dimension tables will stay in memory, and this is an important distinction.

Let's put ourselves into the mindset of the optimizer. We'll assume that every join to the dimension tables will go to disk instead of being read from cache. Now the STRAIGHT_JOIN becomes a table scan of about 4.3 million sequential reads (150 million rows * 117 bytes per row / 4096 bytes per read), plus about 150 million random I/Os for the first dimension table, plus 150 million random I/Os for the second dimension table. That's 300 million random I/O operations.

In contrast, the optimizer chose a plan that it thought would cause only 11.4 million random I/O operations.

The optimizer was being smart, given its lack of knowledge about the OS cache. This is why an expert is sometimes needed to provide the missing information. If the MySQL optimizer were right and each of these had to go to disk, our STRAIGHT_JOIN plan would take more than a month to complete! Good thing we know the difference between cache and disk!
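The optimizer's-eye comparison can be worked through numerically. The sketch below uses the rounded figures from the text (150 million rows, 117 bytes per row, 4096-byte reads, 100 random I/Os per second). The helper names are invented for the example. Multiplying the row count by the row length and dividing by the read size gives roughly 4.3 million sequential reads for the table scan.

```python
PAGE_BYTES = 4096
SECONDS_PER_DAY = 86400

def scan_reads(rows, avg_row_bytes, page_bytes=PAGE_BYTES):
    """Sequential page reads needed to scan a table once."""
    return rows * avg_row_bytes // page_bytes

def days_of_random_io(ios, ios_per_second=100):
    """Days spent if every I/O is a random disk read."""
    return ios / ios_per_second / SECONDS_PER_DAY

fact_rows = 150_000_000
straight_join_random = 2 * fact_rows    # one lookup per row into each of dim1 and dim2
chosen_plan_random = 18790 * 606        # rows estimates from the optimizer's own plan

print(scan_reads(fact_rows, 117))                          # 4284667
print(straight_join_random, chosen_plan_random)            # 300000000 11386740
print(round(days_of_random_io(straight_join_random), 1))   # 34.7
```

Under the assumption that nothing is cached, the STRAIGHT_JOIN plan costs about 26 times as many random I/Os as the plan the optimizer picked, which is why it looked like the worse choice.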

December 19, 2007

MVCC: Transaction IDs, Log Sequence numbers and Snapshots

Posted by peter

MySQL storage engines implementing Multi-Version Concurrency Control have several internal identifiers related to MVCC. I see a lot of people being confused about what they are and why they are needed, so I decided to take the time to explain it a bit. This is a general explanation; it does not correspond to InnoDB in particular, and some implementations may differ, but I hope it will help you understand MVCC a bit better.
[read more...]

November 26, 2007

Data Recovery Toolkit for InnoDB Version 0.1 Released

Posted by Alexey Kovyrin

As Peter mentioned in one of his previous posts, we've done a huge amount of work developing robust strategies for InnoDB data recovery to provide our customers with effective data recovery services, and one of the major parts of these strategies is our toolkit for InnoDB data recovery. Today I'm proud to announce its first public release, which has been used to help some of our customers recover 95-100% of their deleted data.

This release already has a pretty decent set of features:

  • Supports both REDUNDANT (pre-MySQL 5.0) and COMPACT (MySQL 5.0+) versions of tablespaces
  • Works with single tablespaces and file-per-table tablespaces
  • Able to recover data even when processed InnoDB page has been reassigned to another table and/or was partially destroyed
  • Supports all MySQL data types except BLOBs, SETs and BITs (will be implemented in next releases)
  • Has a really great set of data filters to define data ranges (for numbers), field lengths (for variable-length fields), character sets (for strings), date periods (for dates), etc.
  • Ships with an easy-to-use tool that creates InnoDB table definitions based on CREATE TABLE clauses, so you don't need to write table definitions yourself - you just need to add data filters and get your data back (well, in most cases)
  • Results are presented in CSV file format, which can be loaded with MySQL's LOAD DATA INFILE statement

So, if you are intrigued enough and would like to check it out, welcome to the Google Code page of the project, where you can find the latest version of the toolset code and more links to information resources related to InnoDB data structures and recovery procedures.

May 26, 2006

MyISAM mmap feature (5.1)

Posted by Vadim

As you know, MyISAM does not cache data, only indexes. MyISAM assumes the OS cache is good enough and uses the pread/pwrite system calls for reading and writing datafiles. However, the OS is not always good at this task; my benchmarks show that Linux and Solaris do not scale under intensive pread calls (I believe the same is true for Windows, but I did not test it).
In 5.1 I implemented a new feature: memory mapping for the datafiles. It can be enabled with the --myisam_use_mmap=1 startup option.
In this case MyISAM uses the memcpy function instead of system calls. There is a memory addressing limit of 2GB on 32-bit platforms, so datafiles over 2GB are accessed the old way, with the pread/pwrite functions. Mmap is available on all POSIX-compatible platforms. It works faster for SELECT/UPDATE/INSERT inside the file, with no performance gain (maybe a bit slower) for INSERT at the end of the file. For INSERT at the end of the file we have to use a remap technique: resizing the memory-mapped area to the new extended size. Currently we call remap once per 1000 inserts at the end of the file and on an exclusive operation (DELETE/UPDATE/INSERT inside the file); to work with the non-mmapped area we use the pread/pwrite functions.
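The difference between the two access paths can be sketched outside of MySQL. The Python example below is only an analogy for what MyISAM does in C. The record layout, the `RECORD` constant and the helper names are all hypothetical. It reads the same record once with `os.pread` (one system call per read, on POSIX platforms) and once by slicing a memory-mapped view of the file (a memory copy).

```python
import mmap
import os
import tempfile

RECORD = 16  # hypothetical fixed record length, for illustration only

def read_with_pread(fd, offset, length):
    """One system call per read, like the default pread-based access path."""
    return os.pread(fd, length, offset)

def read_with_mmap(mapped, offset, length):
    """Copy bytes out of the mapped region; no read system call is issued here."""
    return mapped[offset:offset + length]

def demo(record_no=42):
    """Write 1000 fixed-length records, then read one of them both ways."""
    with tempfile.TemporaryFile() as f:
        f.write(b"".join(b"row%013d" % i for i in range(1000)))
        f.flush()
        fd = f.fileno()
        offset = record_no * RECORD
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
            return (read_with_pread(fd, offset, RECORD),
                    read_with_mmap(mapped, offset, RECORD))

via_pread, via_mmap = demo()
print(via_pread == via_mmap)  # True
```

The sketch does not cover the remap step for inserts at the end of the file; it only shows that both paths return the same bytes while taking different routes to them.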
To demonstrate the effectiveness of memory mapping, here are several benchmarks:

[read more...]