July 28, 2014

MySQL 5.6.7-RC in tpcc-mysql benchmark

MySQL 5.6.7 RC is there, so I decided to test how it performs in tpcc-mysql workload from both performance and stability standpoints.
I can’t say that my experience was totally flawless, I bumped into two bugs:

But at the end, is not this why RC for? And Oracle asked for a feedback, so I do my part.

  • Benchmark date: Oct-2012
  • Benchmark goal: Test how MySQL 5.6.7 performs
  • Hardware specification
    • Server: Dell PowerEdge R710
    • CPU: 2x Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
    • Memory: 192GB
    • Storage: Very Fast PCIe Flash Card
    • Filesystem: ext4
  • Software
    • OS: CentOS 6.3
    • MySQL Version: 5.6.7-RC
  • Benchmark specification
    • Benchmark name: tpcc-mysql
    • Scale factor: 2500W (~250GB of data)
    • Benchmark length: 4000 sec but the result is taken only for last 2000 sec to remove warm-up phase. Measurements are taken every second.
  • Parameters to vary: we vary innodb_buffer_pool_size:13, 25, 50, 75, 100, 125GB to have different memory/data ration. We vary innodb_buffer_pool_instances: 1 and 8, and innodb_log_file_size: 2x4GB and 2x8GB.
  • Results

    The first result is 2x4GB innodb logfiles.

    We can see that innodb_buffer_pool_instances=8 makes a big difference on small buffer_pool sizes, while on bigger buffer_pool, innodb_buffer_pool_instances=1 is more preferable.

    The the results on big buffer_pool is quite unstable, and the reason is that InnoDB falls into asynchronous flushing mode, the problem which was supposed to be fixed in new InnoDB flushing mechanism. However Dimitry told me that we may need a bigger innodb logfiles to get more stable results.

    So there it is with 2x4GB vs 2x8GB innodb logfiles.

    Obviously the result is quite better with bigger logs, so size does matter.

    Conclusion
    innodb_buffer_pool_instances parameter may change the result significantly, especially in intensive IO workloads.
    In MySQL 5.6 it is finally possible to achieve stable throughput without dips, but an adaptive flushing still requires big log files.

    MySQL configuration:

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Clould Tools, the Percona Server, Percona XraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Vadim, very informative and helpful to see what is coming in MySQL 5.6. Have you run this experiment further than the 4000 second runs to see if there are performance dips later in the benchmark?

  2. Vadim,

    As log files impact recovery speed I would not call “need larger log files” complete solution. We still have a problem of very unstable performance in certain configurations, such as with smaller log files. The proper solution should maintain reasonably stable performance even with smaller logs…. yet of course it is expected to be lower.

  3. Jestep says:

    Not clear on why there would be greater stability and a higher average throughput on single instance 100Gb vs 125Gb. Just guessing, but it looks like it could average 20% higher throughput with 20% less memory.

  4. James Day says:

    Thanks for testing! :) Nice job on the presentation of results as well.

    Expectations for innodb_log_file_size should be changing. Simon Mudd using 32GB of redo log file on a 192G box are the way things are going, particularly for people who want to maximise SSD life. Yet those with spinning disks and only a few tens of megabytes may be fine with 4G or less.

    The optimal value for innodb_buffer_pool_instances depends in part on the distribution of hot pages. A workload using large numbers of tables is likely to do well with a smaller value and many warm pages than one with the heat concentrated in only a few pages. Did you keep the scale factor the same? I think yes, just asking to aid in interpreting the results by being certain. Even so, you may have been spreading the heat around if the buffer pool hit rate increased significantly. Not as much as if you’d been increasing the number of tables, though.

    Views are my own, for an official Oracle opinion consult a PR person.

    James Day, MySQL Senior Principal Support Engineer, Oracle.

  5. James Day says:

    A “few tens of megabytes” should have been “a few tens of gigabytes of buffer pool” in my last comment.

  6. Tim,

    I’ve run up to 3 hours, the result seem quite stable even in long run.

  7. 2x8G log files means that you are more likely to get RMW stalls on the log because 16G is less likely to stay in the OS buffer cache, and keeping it all in the OS buffer cache might not be a good use for 16G of RAM. The alternative is to use the Percona option for O_DIRECT redo log writes, but that can increase contention on the transaction log mutex because it is released after write and before fsync on the redo log, but writes will be slower when using O_DIRECT.

    So, I think bigger redo logs is a workaround for now but I doubt we should rely on it.

  8. So, I was asked how long takes a recovery with such big logs.
    I ran with 125GB buffer_pool and 2x8GB logs and crashed mysqld at the peak of throughput.
    It took 9min 30sec for recovery procedure in this case.

  9. Dimitri says:

    Hi Vadim,

    thank you for your tests!! – it clarified some points, but I’m continuing my analyzes of this issue as well, as the most critical point here is to understand for what reasons InnoDB page flushing is unable to follow the REDO logs write speed in your case (as the storage level power is even not used in half).. – don’t know if you got a time to collect some stats I’ve asked from your server..

    Mark:
    – agree and not agree to consider huge REDO logs as a “workaround”.. – in fact it may help only if workload is generating a repetitive changes on some pages (or your REDO space was really small) — because if pages are modified widely, your huge REDO may be never fully used, as you’ll hit dirty pages pct limit, LRU flushing, etc. — but it’s true that on a huge BP size things may be different, so I’m looking on it right now..

    – regarding Read-On-Write issue with huge REDO logs: we tried to use O_DIRECT on it, and it was not better.. – in fact the REDO writes are designed in the following way: lock, write(), unlock, fsync() — so, if O_DIRECT is used, you’re keeping the “lock” time longer than a buffered write(). As a short fix, I’ve advised to have a “pre-fetch” here rather O_DIRECT, but it was to short to implement the final REDO “redesign” within MySQL 5.6..

    Peter: yes, the “proper solution” will work even with very short REDO logs, but 5.6 is already has a huge amount of performance improvements, so it’ll be pity to not ship it asap already ;-)

    BTW, do you remember the good old days with 128MB redo logs?.. – I’m pretty sure you’ll have a very stable performance with it ;-)) — very low, but very stable, as the server will be in loop with short sync flushes..

    Rgds,
    -Dimitri

  10. Dimitri – how do you define “better” when testing O_DIRECT? For some of us better means not spending IOPS doing random disk reads as part of the RMW when the redo log falls out of the OS buffer cache. Response time on writes isn’t the only metric. Amount of memory used (and wasted) by the redo log in the OS filesystem cache is another metric. Unfortunately there is no way to optimize for all three given the current implementation. Prefetching is another workaround, it improves latency and reduces memory used, but the cost is more random IOPS for disk reads.

  11. Dimitri says:

    Mark,

    in the test I’ve made, we simply got a worse QPS result when O_DIRECT was used for REDO logs. Don’t forget also that I/O operations on REDO files are mostly sequential, so there is no random reads involved, so just reading few MBs ahead or using fadvise() feature should avoid the Read-On-Write issue, while still keeping the lock contention within REDO code as low as before..

    From the other hand, if any kind of I/O reads are making troubles – then yes, O_DIRECT may help here — but it’ll depend on a workload, and that’s why I’ve posted a call for test workloads since this year to be able to cover as many cases as possible ;-)

    As well, when the load become very REDO writes centric, moving REDO files to SSD is helping a lot too! – but it’d be better adapted with 4K writes (means it’ll involve several re-writes of the same 4K block before to write the next one, comparing to the current sequential 512 bytes writes in REDO) — so all this stuff was too short to be changed within 5.6 time frame, and was pushed for 5.7

    Rgds,
    -Dimitri

  12. There are random reads. Domas and I have written about this elsewhere. Redo log writes are a multiple of 512 bytes. When writing the first fraction of a page and the page is not in the OS filesystem cache, then that page will be read into the OS filesystem cache so it can be written. This is the R in RMW and I have seen it consume 100+ IOPS on some servers.

  13. Intermittent read-ahead reads mixed in with the rest of the IO workload will look a lot like random reads. Sorry, but this is a workaround.

  14. Dimitri says:

    Mark,

    seems to me we analyzed different workloads.. – in my case having longer locks in REDO code was more critical than I/O reads. So, propose your own test case and I’ll analyze it with a pleasure ;-)

    Rgds,
    -Dimitri

  15. Does the PS instrument the time to do redo log writes? We have stats for that in the FB patch and it makes it easy to see the impact as high latency writes are the ones that do disk reads.

  16. Mark Leith says:

    Yep it does Mark, take a look for the wait/io/file/innodb/innodb_log_file event..

  17. James Day says:

    Vadim, thanks for reporting the crash recovery time, should reassure anyone who was wondering if it might be too slow to be practical – at least on this fast hardware.

    Dimitri, it’s perhaps worth noting that in practice, log activities are usually random, not sequential. This is because log files are typically kept on the same drives as the data. So you get random i/o between them, flushing and foreground work. This is in part due to limits on the number of drive bays that have been present in servers, which has tended to make it not beneficial to put logs on different drives compared to using the same drives to speed up other things as well. It’s also because of the desire to do snapshot backups, which have tended to only work if everything is on one volume. At the highest end this can become less true but at the low and more common end it’s normal and not very modifiable.

    Mark’s also right about the read, modify, write aspect and OS caching and it’s one reason why it’s important to be able to efficiently use much of the log space, not have large amounts of it unusable but still using RAM and disk space.

    For common production systems, the goal tends to be to increase the log file size to increase dirty page percentage because the log space tends to be the limiting factor. Hitting dirty page percentage caps is relatively uncommon. There are some workloads where it happens, though, and it can happen in data loading even when it doesn’t usually happen. More workloads around where high dirty page percentages isn’t the ideal balance between pages needed for read and reducing write i/o work.

  18. Dimitri says:

    James,

    it’s the recommendation #1 for HW setup to separate REDO logs from DATA and put them to a different device (same for binlog).. – but well, rethinking about all this stuff again, I think the most optimal will be to have all these options implemented within InnoDB code to leave a choice to a final user and be able to be adaptive according a given workload or HW capacity constraints..

    Rgds,
    -Dimitri

  19. James Day says:

    Sure, it’s the recommendation sometimes the real world constraints get in the way. Yes, options are good because things do vary.

  20. David says:

    Who of you guys wrote this article?
    Vadim Tkachenko – http://www.mysqlperformanceblog.com/2012/10/08/mysql-5-6-7-rc-in-tpcc-mysql-benchmark/
    or
    Peter Zaitsev – http://architects.dzone.com/articles/dealing-couple-bugs-mysql-567

    or is duplicate for SEO :)

    Many thanks for the useful test results.

  21. justdb says:

    Thanks Vadim!

    What about mysql5.6 on windows – can you test it on windows server?

    Can mysql 5.6 be a reliable production environment choice or still its only for development environment?

    Thank you again for great article!

  22. Nikola says:

    Hy,
    I’m new to TPC-C benchmarking and i would like to know how and what are the steps for testing MySQL with TPC-C benchmark.
    My goal is to compare VoltDB and MySQL under TPC-C benchmark.

    Thanks

  23. WilliamYao says:

    innodb_io_capacity = 20000 This parameter will get huge Dsik TPS ,when it is set by 200, TPS is half of 20000 and compare mysql_read wirte cpu is the same ,I can’t understand this, can you explain it ?

  24. innodb_io_capacity and innodb_lru_scan_depth are rate limits on the amount of dirty page writes that InnoDB will do per second per buffer pool instance. innodb_lru_scan_depth is new in 5.6 and the behavior for innodb_io_capacity has changed in 5.6:
    1) Innodb is much more likely to respect this limit, previously it could greatly exceed the limit
    2) Most of the writes are now scheduled by a background thread, the page cleaner thread. Previously user threads could trigger many writes for dirty pages at the LRU tail when the free list was empty. They are less likely to do that in 5.6

    You might need to set both in 5.6 and to larger values than used for innodb_io_capacity prior to 5.6

  25. Tyler says:

    These tests were done with Oracle’s MySQL Community, have they been repeated at all with Percona-Server? A previous post (http://www.mysqlperformanceblog.com/2010/12/20/mysql-5-5-8-and-percona-server-being-adaptive/) suggested that buffer_pool_instances=1 made more sense due to other Percona specific optimizations.

Speak Your Mind

*