I’m often asked how one can evaluate IO subsystem (hard drive RAID or SAN) performance for MySQL needs, so I’ve decided to write up some simple steps you can take to get a good feel for it. It is not perfect, but it usually tells you quite a lot about what to expect from the system.
What I usually look at for MySQL is random read and random write performance. Sequential reads and writes are rarely the problem for OLTP workloads, so we will not look at them.
I also prefer to look at performance with the O_DIRECT flag set, to bypass the OS cache. This may exercise a separate code path in the kernel and so has a somewhat different performance pattern compared to buffered IO (even buffered IO followed by regular fsync calls), but it easily bypasses the OS cache for both reads and writes, so it does not require creating large working sets on boxes with significant amounts of memory (or reducing the amount of usable memory).
The system I’m testing this on has a 256MB BBU (battery backed up cache) on the RAID controller, so we will test two workload sizes: the first one small enough to fit in the cache, and the second large enough that it does not. This lets us see both RAID-cache-backed and uncached IO performance.
We’re interested in cached reads because they show the latency to the RAID cache if the RAID read cache is enabled. If it is disabled, you may be reading from the drives’ caches or even the drives themselves, which affects performance significantly. In any case, especially if you play with the data size a bit, you will learn how your cache behaves.
We’re even more interested in cached writes, which show us how many cached writes per second we can do. This is important for log writes, which are synchronous IO operations critical to database performance and typically have rather close data locality.
We’re also interested in uncached reads and writes because they correspond to the general database workload.
It is worth running the test with 1 thread and with some higher number (say 64) to see how things scale.
The tool I’m using for this is SysBench, which was designed by my team when I still worked for MySQL; we specifically implemented a bunch of tests to ease hardware evaluation for the things that are important for MySQL.
To prepare a small 128MB single-file working set we can use the following command:
[root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000 --file-num=1 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=128M prepare
And then we can run the test:
./sysbench --test=fileio --max-time=60 --max-requests=1000000 --file-num=1 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=128M --file-test-mode=rndrd run
sysbench v0.4.8:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 16384
1 files, 128Mb each
128Mb total file size
Block size 16Kb
Number of random requests for random IO: 1000000
Read/Write ratio for combined random IO test: 1.50
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random read test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  773835 Read, 0 Write, 0 Other = 773835 Total
Read 11.808Gb  Written 0b  Total transferred 11.808Gb  (201.52Mb/sec)
12897.22 Requests/sec executed

Test execution summary:
    total time:                          60.0001s
    total number of events:              773835
    total time taken by event execution: 59.0285
    per-request statistics:
         min:                            0.0001s
         avg:                            0.0001s
         max:                            0.0086s
         approx.  95 percentile:         0.0001s

Threads fairness:
    events (avg/stddev):           773835.0000/0.00
    execution time (avg/stddev):   59.0285/0.00
The output is pretty verbose, which scares many people, but there is basically one important number here: 12897.22 req/sec. So the RAID cache on this system can do about 13,000 16K reads/sec from the cache, which is not bad at all.
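As a quick sanity check, the requests/sec and MB/sec figures in the sysbench output should agree once you account for the 16KB block size; a minimal sketch:

```python
# Cross-check sysbench's reported throughput: requests/sec times the
# 16KB block size should match the reported MB/sec figure.
BLOCK_SIZE = 16 * 1024  # sysbench file block size in bytes (see output above)

def iops_to_mb_per_sec(iops, block_size=BLOCK_SIZE):
    """Convert IO operations per second to MB per second."""
    return iops * block_size / (1024 * 1024)

mb_sec = iops_to_mb_per_sec(12897.22)
print(round(mb_sec, 2))  # ~201.52, matching "201.52Mb/sec" in the output
```

The same arithmetic works in reverse when a vendor quotes MB/sec and you need to know how many random IOs that really is.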
With 64 threads I get 23727.46 req/sec, which shows that even in such a cached mode you can get better performance by having many outstanding requests.
Running with more than one thread, you may also take a look at another table SysBench displays:
per-request statistics:
     min:                            0.0008s
     avg:                            0.0027s
     max:                            0.0112s
     approx.  95 percentile:         0.0030s
So we get about a 3ms 95th percentile request time, which is not bad: all requests were executed with pretty uniform performance.
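These latencies are consistent with the throughput via Little’s law (average concurrency equals throughput times average latency); a quick check using the 64-thread numbers above:

```python
# Little's law: average concurrency = throughput * average latency.
# With 64 threads doing synchronous IO, throughput times the average
# per-request time should come out to roughly 64 requests in flight.
throughput = 23727.46   # requests/sec with 64 threads (from above)
avg_latency = 0.0027    # average per-request time in seconds (from above)

concurrency = throughput * avg_latency
print(round(concurrency))  # ~64, i.e. all threads were kept busy
```

If this product comes out well below your thread count, the threads are being serialized somewhere rather than all waiting on the device.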
Let us now do the test with a single thread and a write-through RAID cache (as if there were no BBU):
[root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000 --file-num=1 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=128M --num-threads=1 --file-test-mode=rndwr run
sysbench v0.4.8:  multi-threaded system evaluation benchmark
Ouch… We get 142.62 Requests/sec, which is extremely poor considering this is an 8-drive array. You would likely do a bit better with sequential log writes, but not much.
Let’s see what we get with 64 threads… hm, again 142.83 Requests/sec, and looking at vmstat I can see only 1 blocked process the whole time, which does not sound right.
Honestly, I did not expect this result while running benchmarks for this article (and it is always very exciting to find something unexpected while doing benchmarks). Fortunately, I had already seen this some time ago, though I hoped the issue had long since been fixed… Then again, this box is running CentOS 4, which is not the newest OS out there.
The problem is that with O_DIRECT writes, only one IO can be in flight per file at any given time. To show this is the problem, let us create a working set consisting of a number of files and repeat the run:
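A toy model of this serialization (my simplification, not sysbench code): if at most one O_DIRECT write can be in flight per file, effective parallelism is capped by the file count no matter how many threads you run.

```python
# With one in-flight O_DIRECT write allowed per file, effective
# parallelism is bounded by both the thread count and the file count.
def effective_parallelism(threads, files):
    return min(threads, files)

print(effective_parallelism(64, 1))    # 64 threads on 1 file behave like 1
print(effective_parallelism(64, 128))  # 128 files let all 64 threads progress
```

This is exactly why the 64-thread, single-file run above performed no better than the single-thread run.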
[root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000 --file-num=128 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=128M --num-threads=64 --file-test-mode=rndwr cleanup
sysbench v0.4.8:  multi-threaded system evaluation benchmark

Removing test files...
[root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000 --file-num=128 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=128M --num-threads=64 --file-test-mode=rndwr prepare
sysbench v0.4.8:  multi-threaded system evaluation benchmark

128 files, 1024Kb each, 128Mb total
Creating files for the test...
[root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000 --file-num=128 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=128M --num-threads=64 --file-test-mode=rndwr run
Now we get 827.94 Requests/sec, which is not bad considering this is a RAID5 volume.
Let us now repeat the tests with the write-back cache configuration.
We get 4735.27 Requests/sec from 1 thread and a single file; 4740.19 Requests/sec from 64 threads and 1 file; and 5228.68 Requests/sec with 64 threads and 128 files.
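Inverting the single-thread figures gives the per-write latency, which makes the difference between the two cache modes concrete: write-back acknowledges at cache speed, write-through waits for the disks. A rough check, using the numbers above:

```python
# 1 / requests-per-second gives the average time per synchronous write.
writeback_iops = 4735.27    # write-back cache, 1 thread, 1 file (above)
writethrough_iops = 142.62  # write-through cache, 1 thread, 1 file (above)

wb_ms = 1000.0 / writeback_iops
wt_ms = 1000.0 / writethrough_iops
print(round(wb_ms, 2))  # ~0.21 ms: acknowledged by the controller cache
print(round(wt_ms, 2))  # ~7.01 ms: waiting on the physical drives
```

A roughly 30x latency gap per write is what the BBU is buying you on every synchronous log flush.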
As you can see, the numbers are a lot better, especially when there is a single thread doing IO (just as with the database log). We can still see the O_DIRECT serialization, but its impact is much smaller than in write-through mode because we’re mostly limited by the RAID controller’s capacity.
Let us now do the test with a larger data set. We’ll use 2GB in 128 files (because we already discovered the serialization issue with O_DIRECT). This sounds small, but as the controller cache is just 256MB it should be good enough.
[root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000 --file-num=1 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=128M --num-threads=64 --file-test-mode=rndwr cleanup
sysbench v0.4.8:  multi-threaded system evaluation benchmark

Removing test files...
[root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000 --file-num=128 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=2G --num-threads=1 --file-test-mode=rndwr prepare
sysbench v0.4.8:  multi-threaded system evaluation benchmark
There is also one trick to running such larger-scale benchmarks. SysBench by default uses a repeatable sequence of requests, because it does not initialize the random generator. If you’re doing multiple runs, make sure to add the --init-rng=1 option, otherwise results can be skewed.
So we run:
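The underlying issue is ordinary PRNG behavior: a generator that is never seeded (or is seeded with the same value) produces the identical "random" request sequence on every run, as this small sketch shows.

```python
import random

# Two generators with the same fixed seed produce the identical sequence
# of "random" offsets, so repeated runs would hit the same blocks in the
# same order. Seeding from the clock (what --init-rng=1 effectively does
# for sysbench) gives each run a fresh sequence.
gen1 = random.Random(42)
gen2 = random.Random(42)

run1 = [gen1.randint(0, 2**20) for _ in range(5)]
run2 = [gen2.randint(0, 2**20) for _ in range(5)]
print(run1 == run2)  # True: identical request pattern on every "run"
```

With a repeatable pattern, later runs can be skewed by whatever the earlier runs left in the various caches.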
[root@DB10 sysbench]# ./sysbench --init-rng=1 --test=fileio --max-time=60 --max-requests=1000000 --file-num=128 --file-extra-flags=direct --file-fsync-freq=0 --file-total-size=2G --num-threads=1 --file-test-mode=rndrd run
For a single thread we get 259.28 Requests/sec, which approximately matches what a single drive can do. This is an important lesson: even if you have many hard drives, you can’t really take advantage of them unless you can issue multiple outstanding requests.
Performance with 64 threads was 2334.63 Requests/sec, which is excellent scalability for the 6 hard drives this array has. We actually get about 9x, even more than the number of drives, because 64 outstanding requests allow the RAID controller to optimize seeks.
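Two quick derived numbers from this test, computed from the figures above: the single-thread service time (in line with a 10K RPM drive’s seek plus rotational latency) and the 1-to-64-thread scaling factor.

```python
single_thread_iops = 259.28  # uncached random reads, 1 thread (above)
many_thread_iops = 2334.63   # uncached random reads, 64 threads (above)

service_ms = 1000.0 / single_thread_iops     # avg time per random read
scaling = many_thread_iops / single_thread_iops

print(round(service_ms, 1))  # ~3.9 ms per read, plausible for a 10K RPM drive
print(round(scaling, 1))     # ~9.0x speedup from concurrency
```

Scaling beyond the raw drive count is possible because a deep queue lets the controller reorder requests into shorter seeks.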
Let us see how writes are doing.
For writes we’re getting 583.43 Requests/sec with 1 thread and 605.88 Requests/sec with 64 threads.
As you can see, we got more writes per second from a single thread than we got reads, and using multiple threads did not improve write performance as dramatically as it did reads. This is because, with a BBU, writes are simply stored in the controller cache and flushed to disk in the background; in this case there is little difference between one thread and many.
As you can also see, the BBU is helpful even for random writes performed from a single thread, so it does not only help database log writes but has a much larger positive impact on the workload.
I should also note it is a very good idea to watch vmstat while running the test. Here is a snippet of healthy vmstat output from the read test:
0 64  90560 3196492  86832 12631072    0    0 37760    23 3402  6932  0 12 64 24
0 64  90560 3196500  86836 12631068    0    0 37222    26 3328  7694  0  1 75 24
0 64  90560 3196436  86836 12631068    0    0 36998     6 3340  7716  0  1 74 24
0 64  90560 3196428  86840 12631064    0    0 37395    12 3320  7774  0  1 74 24
0 64  90560 3196436  86840 12631064    0    0 37498     6 3365  7758  0  1 74 24
0 64  90560 3196436  86840 12631064    0    0 37702     6 3370  7680  0  2 74 24
As you can see, all 64 threads are waiting on IO. Now let’s look at our write test, which seems to have some serialization issues:
2 39  90560 5305556  86920 10534204    0    0     0  9345 1625  4587  0  0 70 30
1  1  90560 5305548  86920 10534204    0    0     0  9017 1661  4934  0  0 74 26
1  1  90560 5305556  86924 10534200    0    0     0  9223 1638  4881  0  0 73 27
1  3  90560 5305556  86932 10534192    0    0     0  9634 1715  5033  0  0 74 26
1  1  90560 5305556  86932 10534192    0    0     0  9235 1643  5035  0  0 73 26
You can get similar information by analyzing iostat numbers, which may well be more accurate, though looking at vmstat is often good enough to see the problem.
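A small helper makes the difference between the two snippets explicit, assuming the standard vmstat column order where the second column ('b') counts processes in uninterruptible (IO) sleep:

```python
# The second vmstat column ('b') counts processes blocked on IO.
# 64 blocked processes means all benchmark threads are driving the disks;
# 1 blocked process during a 64-thread write test signals serialization.
def blocked_processes(vmstat_line):
    return int(vmstat_line.split()[1])

read_row = "0 64 90560 3196492 86832 12631072 0 0 37760 23 3402 6932 0 12 64 24"
write_row = "1 1 90560 5305548 86920 10534204 0 0 0 9017 1661 4934 0 0 74 26"

print(blocked_processes(read_row))   # 64
print(blocked_processes(write_row))  # 1
```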
P.S. If you’re wondering which hardware I used for these benchmarks: a Dell PowerEdge 2950 with a PERC5 controller and six 10K RPM hard drives in RAID5.
Battery-backed caches are really a nice feature. Some time ago we were wondering why one of our test servers performed abysmally compared to the other few, even though they were supposed to be identical.
Only after some time did we realize that that particular one had been shipped without a battery pack installed…
Absolutely. I am still surprised how some people buy $10K+ servers with RAID and 10+ hard drives and do not buy a BBU. It does not cost a lot these days, and even if your load is read-only you’ve got to load data or restore from backup sometime.
Hey peter,
Great details, thanks!
Does this mean that *not* using O_DIRECT could be the right way to go, contrary to popular opinion, if you want to have multiple I/O to a single file in flight? With InnoDB, even with file_per_table, I would think you’d want multiple I/O to the iblog* files and probably ibdata*, not to mention busy tables.
Or am I misunderstanding your results?
Don,
As you can see, if there is a BBU the serialization with O_DIRECT does not hurt that much for a large working set, and the small one mainly corresponds to log writes, which are done serially anyway.
Log files are only appended to anyway, so a large number of them does not help.
If you do not have a BBU, then O_DIRECT can already be slowing you down, because InnoDB performs writes in the background, basically flushing pages one after another.
It would be interesting to check with RHEL5 whether the serialization is still in place. I thought it was already long gone… I reported it to the kernel developers last time and they told me it would be fixed. That was long ago.
The IO scheduler matters for this test as well; even when using O_DIRECT, deadline should win. What is your IO scheduler setting? In most real-world cases the cache on the controller is not a win and is really just used for making sure outstanding writes are not lost on crash. How did you fully remove the file cache from the equation to make sure the 256MB test is not skewed by the system cache?
So, why choose RAID5 for this test, when RAID10 is perfect for random IO? For testing disk setups to get theoretical throughput, have you tried raw device writing?
What’s your method for achieving theoretical throughput, to verify your test is testing the RAID setup itself and not the various OS layers in front of the disks?
Dathan,
First, I should say the point of this article was not to show the best configuration out there, but to show how you can measure performance and which numbers you should look at.
I just used the box I had available for testing, which happened to have this OS setup and RAID level.
This box used the cfq scheduler, which works pretty well for workloads like this.
Honestly, I’m surprised by the claim that cache on the controller is not a win. What particular workload are you speaking about, and what storage engine? For MyISAM it may well be the case, because it does not use synchronous IO, but even then it is going to help the journaling file system you’re likely using.
I am also not sure what you mean by outstanding writes not being lost on a crash. Which outstanding writes? To the OS device, to the RAID controller, or to the disk drive? Buffered or unbuffered?
If you have buffered writes going to the OS cache, they can be lost on power-down anyway. If you’re speaking about unbuffered IO to the device (O_SYNC, O_DIRECT), they must not be lost after the call returns. If they are, you have configured something incorrectly (i.e. forced write-back cache mode on the RAID without a BBU), and this is exactly what should be avoided, as a loss in this case will corrupt an InnoDB database, and even with MyISAM you’re risking journaling file system corruption.
The file system cache is removed from the equation by the use of O_DIRECT, which bypasses the OS cache for reads and writes. The 256MB cache on the controller can still potentially skew read results for the 2GB tests if the read cache is enabled on the controller, but it is not that large.
Regarding theoretical numbers, I usually just look at the specs and see how close we get. Even if you test a raw RAID partition, you’re not testing the drives; you’re testing the RAID controller with its drivers anyway. This is why I prefer to test the whole stack and see whether the results make sense.
Ah, so this article is about which numbers to look at in SysBench.
InnoDB with 200GB of data.
The memory on the controller helps, but I’m not going to get enough read or write performance from it to improve end query performance of a MySQL database. On a six-disk RAID10 15K RPM system, 22MB/sec of random reads in a mixed read/write setup is about as good as it gets. The biggest win I get from the battery-backed cache is knowing that the last IO op from the controller made it to disk, and that flushes are less expensive. So having 256MB on a controller is not going to reduce response time for an application doing filesorts or temp tables, which many people think.
Also, just for fun, here is a good link on what Linus Torvalds thinks of O_DIRECT.
http://kerneltrap.org/node/7563
Dathan,
First, 200GB of InnoDB does not tell me much about the workload: how many reads and writes? Also, how did you figure out that there is no gain from having a BBU?
In your case it is quite possible that reads are what limits you, in which case you would not get much benefit from a BBU.
The benefits of a BBU come in a number of ways: your InnoDB log writes will be faster (of course, if you do not care much about transactions and run with innodb_flush_log_at_trx_commit=0, it does not apply much), it will make flushes from the buffer pool faster, and it gives the RAID more room to optimize requests so writes impact reads less.
Temporary tables and sort files are absolutely unrelated to the BBU because they use MyISAM, and hence buffered IO, so writes to them can be cached (and delayed) by the operating system. A good way to optimize them, though, is to keep them on tmpfs, unless your sorts are too large for that.
Regarding O_DIRECT and Linus: I know, so what? The point is what works right now, not what would in theory be the best solution.
Linus feels the page cache should be used to cache data instead of the InnoDB buffer pool. Well, this would first require using mmap IO, because you can’t pay the system call penalty for access to cached data, which makes it quite tricky to handle IO failures. But what is even more important, databases need very strict control over flush order, which the page cache does not provide.
So the approach Linus thinks is the right way to go is not supported by MySQL/InnoDB right now, and cannot be in the near future.
I ran sysbench on our new hardware before and did not notice much difference between the cfq and deadline schedulers. Here is an article from Red Hat’s website: http://www.redhat.com/magazine/008jun05/features/schedulers/ It is probably why cfq is RHEL’s default scheduler now.
We had a couple of DB servers whose disk controller battery died, and they started to crawl. So a BBU is a must-have for DB servers.
I tested this on a RAID10, ServeRAID 8i with 256MB and got exactly 12897.22 Requests/sec on some runs of the read test and not others. It seems weird to get the exact same number. It’s interesting to see the controller difference with the various caches on and off; I assumed read cache off would be better for MySQL?
@Peter: I totally agree on the benefits of a BBU; I’m agreeing with you. I’m just injecting that having 256MB of cache on a controller is not going to give any more throughput (T), and it will not solve problems that schema design will.
I’m doing around 200K selects per second (various large ranges, primary key lookups, secondary key lookups) across a few MySQL servers, with a 70% read / 30% write ratio, without using O_DIRECT (can’t, because it’s broken in my kernel version). The BBU helps, but to attain such speed the RAID setup and schema use is how I am able to attain 22MB/sec of random read IO and 10MB/sec of random write IO on 3 effective spindles for writes and 6 for reads, with less than 10% of the CPU in WIO.
Recently the cache was turned off and there was a hit on performance, but that was related to the flushing of IOPS, so the WIO went up, increasing load; yet I was still able to sustain 22MB/sec of random IO. This is the behavior I noticed (which was about a year ago).
So I agree that a BBU is needed for better responsiveness R(t), but it’s not directly but indirectly associated with throughput (T).
I would love to see the same test with the different cache profiles that the PERC5 offers, and also a test with the cache turned off to see the difference.
Mark,
Getting exactly the same number is strange, as it varied between runs even for me. Being close, however, can be explained: for example, we may be hitting the latency of the PCI bus or something similar.
Right, the read cache should normally be off for MySQL: you want to save your RAID cache for writes.
My experience is that with a BBU on a RAID unit (whether controller or disk stack), the cfq elevator comes with a ~20% performance penalty. Why? Because unless you’re exposing the spindles to the OS one by one (or in RAID1 pairs striped with MD/LVM), cfq orders all the random reads/writes to the disk within one IO queue, practically guaranteeing that only a maximum of two, perhaps three, of your n spindles are performing any useful work.
The noop scheduler, on the other hand, lets the requests go to the RAID (and BBU cache in the case of writes) in random order, letting it utilize all spindles to their best effect.
Osma,
What are you comparing? A system without a BBU to a system with one? Why would it cause performance degradation, even if what you’re describing took place?
What RAID did you test? Normally you will have TCQ or similar enabled, so there are multiple outstanding commands issued to the device.
I’m glad you asked, Peter. Since the explanation is on the longish side, I just posted it on my blog at http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAID. Would love to hear your comments.
Very, very interesting post! It helped me a lot in finding what caused a high IO wait problem resulting in poor performance. I was not familiar with SysBench, and now I use it on every server I manage. I love it.
I use RAID10 with no BBU. Intense writing is slower on it than on my old server with a single disk and no RAID. Performance also decreased when I tried O_DIRECT to flush the log, so I switched back to the default. I haven’t tried with a BBU, but if I can’t get everything faster otherwise, I will try (and buy a battery).
Writes can be slower with O_DIRECT if you do not have a BBU. However, it is often not that bad, as a log write requires a single IO anyway, and losing the optimization for dirty page flushes may not affect subjective performance because flushing happens in the background.
Holy cow, thank you for this. You just helped me prove conclusively (and finally!) that a mysterious IO problem was my host’s fault and not, as they claimed, bad code.
Sudheer,
Enabling the binary log breaks group commit, which is critical for commit throughput. You can try setting innodb_flush_log_at_trx_commit=2.
Hi Peter,
I am facing a performance issue after enabling the binary log on my server. I am using RAID10 and O_DIRECT, without a BBU. My application processes 2500 transactions per second (90% inserts, 9% updates, only 1% selects, plus commits). After enabling the binary log, inserts get slow (only 550 inserts per sec) and application queries pile up. We have 65% MyISAM and 35% InnoDB tables.
Is this a problem with O_DIRECT? Can I disable O_DIRECT?