April 23, 2014

Predicting how long data load would take

I had this question asked many times during last week, and there is really no easy answer. There are just way too many variables to consider especially while loading large schemas with a lot of different table structures. So which variables affect the load speed:

Table Structure This one is tricky. The shorter rows you have the faster load will normally be, if you measure rows, but if you measure in amount of loaded data longer rows are faster.

Indexes The more indexes you have the longer load would take. Details depend on storage engines, for example MyISAM can build non-unique indexes by sort which is very fast and Innodb can use insert buffer for them.

Data Insert Order It is well known it is important to load data in primary key order into Innodb tables for optimal performance, but effect does not end here. If indexes are built by inserts in Btree order in which entries are inserted in the index can affect performance dramatically.

Table Size This is the worst trouble maker. Typically load starts fast but as it progresses it may slow down dramatically, typically if Index BTREE has to be built by insertion and data does not fit in memory any more. I’ve seen cases when load speed starts at 30.000 rows/sec and then goes down to less than 100 rows/sec which can kill your initial estimates and can cause much longer downtime than planned.

Hardware Configuration If you need to load data quick do it on decent hardware. Especially memory size is important. If you need database on some small box it may be faster to load it on the more powerful box and copy it back in a binary form.

MySQL Configuration Different storage engines have different settings which need to be set for optimal load speed. Depending on load type MyISAM may benefit from bulk_insert_tree_size increase myisam_sort_buffer_size or key_buffer_size increase. Innodb typically needs large innodb_buffer_pool_size and large innodb_log_file_size to perform load effectively

Load Options There two main ways to load data in MySQL you can use Multiple value insert (standard mysqldump output) or LOAD DATA INFILE (–tab mysqldump output). Generally LOAD DATA can be optimized better and a bit faster because of easier parsing. Mysqldump however takes care of special options to make multi value inserts as fast as possible as well, such as disabling indexes before the load and enabling them back – helpful to make sure MyISAM build indexes via sorting not by BTREE insertion. For Innodb SET UNIQUE_CHECKS=0 may be used to speed up load to tables with unique keys.

So as you can see there are many variables which affect load speed which makes it extremely hard to predict. Especially fact load can slow down dramatically as amount of data loaded in the table increased and the fact this depends not only on table structure but the data itself (it defines the insertion order for the indexes) cause the challenges.

If you have some particular history about loading similar data of similar size on similar hardware use that as estimate. If not use your best guess based on what other cases were “similar”

As the data load goes I usually tend to do something like “du -sh; sleep 3600; du -sh” in the database data directory to see how much data is loaded per hour (assuming your Innodb tablespace did not have free space or you’re using innodb_file_per_table=1). Do not expect the value to be constant though. During our recent data load speed we for example could see data load speed of 60GB during one hour and 10GB during another depending on the tables which were loaded.

During our recent data load we loaded almost 1TB of Innodb data within 24 hours. It was decent speed because we had tables partitioned and so no table was more than 15GB in size (allowing to fit all BTREE indexes in memory) plus we loaded data in parallel and our rows were relatively long. In most cases I would expect load speed to be a bit slower, sometimes much slower.
I’ve seen cases when loading 100-200GB of data in single table was taking a week due to amount of indexes and the fact they poorly fit in memory.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. shinguz says:

    What is your opinion/experience about preloading indexes and tables (MyISAM and InnoDB)?

  2. Jason Wang says:

    Peter,

    Thanks for this wonder blog on the load performance. It is so timely. I was trying to evaluate MySql performance to see if it could fulfill our requirement. I found the data insertioin speed slows down drammatically (about 300 times slower) when the table size reaches to a certain number, just like you mentioned in this blog. I tuned on the paramerters like innodb_buffer_pool_size, large innodb_log_file_size, and a few other ones to the max values my machine can take. It does increase the speed and also push the threshold of the break point to a much larger table size. but I can not avoid the break down of the insertion speed completely. It will happen sooner or later. My question is that is this type of insertion speed drammtic slow down scenario can never be avoided no matter how large the CPU and RAM are? The result of larger CPU and RAM can only push the break down further on a larger table size, but can completely elimiate it?

    I did this test on MySql 5.0 and my laptop has 2.6 GHZ CPU and 2GB Memory. I found the insertion speed slow down at about 1GB table size. I use JDBC to insert 50,000 records with one insertion. the insertion time is 3 second. when it reach Number 100 insertion, the insertion time eventually reaches 900 seconds and still increase.

    Thank you very much with your earliest help!

    Jason

  3. peter says:

    Jason,

    The slowdown happens because BTREE (which tends to get a lot of random IO) becomes too large to fit in cache.
    The solution is simple – keep BTREE small enough so it does not happen.

    In MySQL 5.1 you can use partitions to use just this, in MySQL 5.0 and below you can just use multiple tables.

  4. Jason Wang says:

    Peter,

    Thanks do much for the prompt reply!

    I tried the table partitioning, but it did not seem to work. I might not partition on the right values. but again, how do I find out a table is partitioned? “desc table_name” dose no show any information on the partitioning.

    thanks,

    Jason

  5. Jason Wang says:

    Peter,

    I am sorry that I spoke too soon. I tried it again, and it seemed working. thanks Peter!

    Jason

  6. peter says:

    Good :)

  7. Mike says:

    So which is faster, if I have 200,000+ records coming from a text file, using php and doing inserts or doing a bulk data import? Is there a significant difference?

  8. peter says:

    Mike,

    Using direct data load will be faster. How much faster depends on a a lot of factors – data types and size, amount of indexes etc.

  9. f00 says:

    engine=innodb

    one CSV datafile 100 million rows sorted in primary key order (order is important to improve bulk loading times – remember innodb clustered primary keys)

    truncate ;

    set autocommit = 0;
    load data infile into …
    commit;

    runtime 10-15 mins depending on your hardware.

    other optimisations as mentioned above if applicable:

    unqiue_checks = 0;
    foreign_key_checks = 0;

  10. Anil says:

    >> “If indexes are built by inserts in Btree order in which entries are inserted in the index can affect performance dramatically”

    How do I read this?

  11. Foobarista says:

    One trick for really huge, hash-partitioned InnoDB tables is to sort by the partition rule, then by the primary key. The “partition rule” is usually mod(partition_key, ), so we found a good ordering is

    insert into target-tab …
    select from source_tab …
    order by mod(target_partition_key, ), target_primary_key;

    This will load your partitions one-by-one, and within the partition, load in PK order. This will result in very nice buffer pool behavior. We loaded a 1.5 billion row table in “only” 30 hours using this approach.

  12. Foobarista says:

    In my post above, the second argument to mod() is n_partitions.

  13. Chris says:

    I’ve got a 900mb sql file that’s taking forever to import. Going on 2 hours now. I’ve never had one take that long, so I have no idea what the issue is, but it would be nice to have a faster way.

  14. Sheyda says:

    Does it help the speed of load if splitting the outfile into multiple csv files? I have 28 Gig of data.

Speak Your Mind

*