May 24, 2007

Predicting how long data load would take

Posted by peter

I had this question asked many times during last week, and there is really no easy answer. There are just way too many variables to consider especially while loading large schemas with a lot of different table structures. So which variables affect the load speed:

Table Structure This one is tricky. The shorter rows you have the faster load will normally be, if you measure rows, but if you measure in amount of loaded data longer rows are faster.

Indexes The more indexes you have the longer load would take. Details depend on storage engines, for example MyISAM can build non-unique indexes by sort which is very fast and Innodb can use insert buffer for them.

Data Insert Order It is well known it is important to load data in primary key order into Innodb tables for optimal performance, but effect does not end here. If indexes are built by inserts in Btree order in which entries are inserted in the index can affect performance dramatically.

Table Size This is the worst trouble maker. Typically load starts fast but as it progresses it may slow down dramatically, typically if Index BTREE has to be built by insertion and data does not fit in memory any more. I’ve seen cases when load speed starts at 30.000 rows/sec and then goes down to less than 100 rows/sec which can kill your initial estimates and can cause much longer downtime than planned.

Hardware Configuration If you need to load data quick do it on decent hardware. Especially memory size is important. If you need database on some small box it may be faster to load it on the more powerful box and copy it back in a binary form.

MySQL Configuration Different storage engines have different settings which need to be set for optimal load speed. Depending on load type MyISAM may benefit from bulk_insert_tree_size increase myisam_sort_buffer_size or key_buffer_size increase. Innodb typically needs large innodb_buffer_pool_size and large innodb_log_file_size to perform load effectively

Load Options There two main ways to load data in MySQL you can use Multiple value insert (standard mysqldump output) or LOAD DATA INFILE (–tab mysqldump output). Generally LOAD DATA can be optimized better and a bit faster because of easier parsing. Mysqldump however takes care of special options to make multi value inserts as fast as possible as well, such as disabling indexes before the load and enabling them back - helpful to make sure MyISAM build indexes via sorting not by BTREE insertion. For Innodb SET UNIQUE_CHECKS=0 may be used to speed up load to tables with unique keys.

So as you can see there are many variables which affect load speed which makes it extremely hard to predict. Especially fact load can slow down dramatically as amount of data loaded in the table increased and the fact this depends not only on table structure but the data itself (it defines the insertion order for the indexes) cause the challenges.

If you have some particular history about loading similar data of similar size on similar hardware use that as estimate. If not use your best guess based on what other cases were “similar”

As the data load goes I usually tend to do something like “du -sh; sleep 3600; du -sh” in the database data directory to see how much data is loaded per hour (assuming your Innodb tablespace did not have free space or you’re using innodb_file_per_table=1). Do not expect the value to be constant though. During our recent data load speed we for example could see data load speed of 60GB during one hour and 10GB during another depending on the tables which were loaded.

During our recent data load we loaded almost 1TB of Innodb data within 24 hours. It was decent speed because we had tables partitioned and so no table was more than 15GB in size (allowing to fit all BTREE indexes in memory) plus we loaded data in parallel and our rows were relatively long. In most cases I would expect load speed to be a bit slower, sometimes much slower.
I’ve seen cases when loading 100-200GB of data in single table was taking a week due to amount of indexes and the fact they poorly fit in memory.

Innodb Recovery Update - The tricks what failed.

Posted by peter

As I wrote we had been recovering corrupted Innodb Tablespace and it is finally done now.

As this was over than 1TB worth of data we really tried to avoid dumping the data and find some other way to recovery.
Examining Innodb page content and crash information we figured out it should be page of insert buffer itself, rather than page belonging to some of the pages touched by insert buffer.

If that would be just the table it should have been easy - dropping the table with corrupted page would remove all appropriate insert buffer records and be able just to load that table back. Unfortunately we were not that lucky.

If the page corrupted in insert buffer itself there seems to be no way to make Innodb avoid touching it. Even when we dropped all Innodb tables (for test purposes) Innodb still was crashing if we attempted to enable insert buffer.

I think it would be nice for Innodb to have some additional recovery options for example for skipping over insert buffer or undo buffer. True this would leave some tables corrupted but at least you could get by by dumping in reimporting couple of tables rather than full dump and restore.

The bug which we found during attempt to dump tables with innodb_force_recovery=4 was confirmed to be added in 5.0.33, meaning you could not really recover corrupted Innodb tables this way for about 6 months.

What does this mean ? I guess it could mean all sorts of different things:

Versions later than MySQL 5.0.33 are not frequently used True many distributions still have versions before this one which could have some impact.

Innodb tables rarely become corrupt Indeed even if you leave out crashes in my experience Innodb tables become corrupted more seldom than MyISAM. Thanks to checksums and crashing in case of corruptions and a lot of assertions most bugs were cleared out.

People finding the bug do not report it This is also probably true. Myself I probably report half of the bugs I find, the once I find most annoying or find easy to provide repeatable example for so they can be fixed.

Corruption usually happens in secondary index, so simple ALTER TABLE fixes it This also seems to be the case. In my experience probably 80-90% of Innodb corruptions are fixed this way, especially if they are not caused by faulty hardware.
Primary keys are typically simple integer keys in Innodb while secondary keys may have very complex structure.

Compex corruptions are resolved by dump and restore from backup. I think this is the main reason. In fact dump and reload process can be so slow restoring even week old backup and running roll forward recovery using binary log can be a lot faster.

MySQL Geek Job Openings

Posted by peter

The consulting load keeps increasing so we’re looking for some help.

This job would be perfect for someone interested in high performance and scaling with decent knowledge of MySQL and eagerness to learn more.

We do encourage people from all countries to apply.


This page was found by: innodb 1tb database ...