I’m often asked how one can evaluate IO subsystem (hard drive RAID or SAN) performance for MySQL needs, so I’ve decided to write down some simple steps you can take to get a good feel for it. The method is not perfect, but it can usually tell you quite a lot about what to expect from the system.

What I usually look at for MySQL is random read and random write performance. Sequential reads and writes are rarely the problem for OLTP workloads, so we will not look at them.

I also prefer to look at performance with the O_DIRECT flag set to bypass the OS cache. This may exercise a separate code path in the kernel and so show a somewhat different performance pattern compared to buffered IO (even buffered IO followed by regular fsync calls), but it easily bypasses the OS cache both for reads and for writes, so it does not require creating large working sets on boxes with significant amounts of memory (or reducing the amount of usable memory).

The system I’m testing on has 256MB of battery-backed cache (BBU) on the RAID controller, so we will test two workload sizes: a small one which fits in the cache, and a second one large enough that it does not. This lets us see both RAID-cached and uncached IO performance.

We’re interested in cached reads because they show the latency to the RAID cache if the RAID read cache is enabled. If it is disabled, you may be reading from the drives’ caches or even the drives themselves, which will affect performance significantly. In any case, especially if you play with the data size a bit, you will learn a lot about how your cache behaves.

We’re even more interested in cached writes, which show us how many writes per second we can do to the cache. This is important for log writes, which are synchronous IO operations critical to database performance and which typically have rather close data locality.

We’re also interested in uncached reads and writes because these correspond to the general database workload.

It is worth running the test with 1 thread and with some higher number (say 64) to see how things scale.

The tool I’m using for this is SysBench, which was designed by my team when I still worked for MySQL; we implemented a bunch of tests specifically to ease hardware evaluation for the things that matter to MySQL.

To prepare a small 128MB single-file working set we can use a command along the following lines (the invocations in this post assume SysBench 0.4 option names, which may differ in other versions):
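  # assumed SysBench 0.4 invocation; exact option names may vary by version
  sysbench --test=fileio --file-num=1 --file-total-size=128M prepare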

And then we can run the test. The sketch below again assumes SysBench 0.4 options: 16K blocks to match the InnoDB page size, O_DIRECT via --file-extra-flags, and a time-limited run; for the 64-thread run mentioned later only --num-threads changes:
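  # random reads, O_DIRECT, 16K blocks, single thread (assumed 0.4 syntax)
  sysbench --test=fileio --file-num=1 --file-total-size=128M \
    --file-test-mode=rndrd --file-extra-flags=direct --file-block-size=16K \
    --num-threads=1 --max-time=60 --max-requests=0 run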

The output is pretty verbose, which scares many people, but there is basically one number that matters here: 12897.22 req/sec. So the RAID cache on this system can do about 13,000 16K reads/sec from the cache, not bad at all.

With 64 threads I get 23727.46 req/sec, which shows that even in such a cached mode you can get better performance by keeping many outstanding requests.

Running with more than one thread, you may also take a look at the per-request statistics table SysBench displays:

So we get about a 3ms 95th percentile request time, which is not bad: all requests were executed with pretty uniform performance.

Let us now run the test with a single thread and the RAID cache in write-through mode (as if there were no BBU). The command is the same as before except the test mode is rndwr for random writes (--file-fsync-freq=0 disables the periodic fsync calls, which O_DIRECT makes redundant):
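  # random writes, O_DIRECT, single thread (assumed 0.4 syntax)
  sysbench --test=fileio --file-num=1 --file-total-size=128M \
    --file-test-mode=rndwr --file-extra-flags=direct --file-block-size=16K \
    --file-fsync-freq=0 --num-threads=1 --max-time=60 --max-requests=0 run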

Ouch… We get 142.62 Requests/sec, which is extremely poor considering this is a six-drive array. You would likely do a bit better with sequential log writes, but it is not going to be much higher.

Let’s see what we get with 64 threads… hm, again 142.83 Requests/sec, and looking at VMSTAT I can see only 1 blocked process at any time, which does not sound right.

Honestly, I did not expect this result while running the benchmarks for this article (and it is always very exciting to find something unexpected while doing benchmarks). Fortunately I had already seen this some time ago, though I hoped the issue had long since been fixed… Then again, this box is running CentOS 4, which is not the newest OS out there.

The problem is basically that with O_DIRECT only one write IO can be in flight per file at any given time. To show this is the problem, let us create a working set consisting of a number of files and repeat the run:
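  # split the same 128MB across 128 files to work around per-file
  # O_DIRECT write serialization (assumed 0.4 syntax)
  sysbench --test=fileio --file-num=128 --file-total-size=128M prepare
  sysbench --test=fileio --file-num=128 --file-total-size=128M \
    --file-test-mode=rndwr --file-extra-flags=direct --file-block-size=16K \
    --file-fsync-freq=0 --num-threads=64 --max-time=60 --max-requests=0 run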

Now we get 827.94 requests per second, which is not bad considering this is a RAID5 volume.

Let us now repeat the tests with the cache in write-back configuration. The SysBench commands stay the same; the cache policy is switched on the RAID controller itself:
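  # cache policy is a logical drive property on LSI-based controllers such as
  # the PERC5; MegaCli syntax shown as an assumption, your tool may differ
  MegaCli -LDSetProp WT -LAll -aAll    # write-through, as if there were no BBU
  MegaCli -LDSetProp WB -LAll -aAll    # write-back, using the BBU-protected cache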

We get 4735.27 Requests/sec with 1 thread and a single file; 4740.19 Requests/sec with 64 threads and 1 file; and 5228.68 Requests/sec with 64 threads and 128 files.

As you can see, the numbers are a lot better, especially when a single thread is doing IO (just as with the database log). We can still see the O_DIRECT serialization, but its impact is much smaller here than in write-through mode because we’re mostly limited by the RAID controller capacity.

Let us now run the test with a larger data set. We’ll use 2GB in 128 files (because we already discovered the synchronization issue with O_DIRECT). This sounds small, but as the controller cache is just 256MB it should be good enough.

There is also one trick to running such larger-scale benchmarks. SysBench by default issues a repeatable sequence of requests because it does not seed the random generator. If you’re doing multiple runs, make sure to add the --init-rng=1 option, otherwise results can be skewed.

So we run the tests as follows:
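  # larger working set: 2GB in 128 files, seeded RNG (assumed 0.4 syntax)
  sysbench --test=fileio --file-num=128 --file-total-size=2G prepare
  sysbench --init-rng=1 --test=fileio --file-num=128 --file-total-size=2G \
    --file-test-mode=rndrd --file-extra-flags=direct --file-block-size=16K \
    --num-threads=1 --max-time=60 --max-requests=0 run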

For a single thread we get 259.28 Requests/sec, which approximately matches what a single drive can do. This is an important lesson: even if you have many hard drives, you can’t really take advantage of them unless you keep multiple outstanding requests.

Performance with 64 threads was 2334.63 Requests/sec, which is excellent scaling for the 6 hard drives this array has. We even get a bit more than an 8x speedup because 64 outstanding requests allow the RAID controller to optimize seeks.

Let us see how writes are doing:
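  # same run with the test mode switched to random writes (assumed 0.4 syntax)
  sysbench --init-rng=1 --test=fileio --file-num=128 --file-total-size=2G \
    --file-test-mode=rndwr --file-extra-flags=direct --file-block-size=16K \
    --file-fsync-freq=0 --num-threads=1 --max-time=60 --max-requests=0 run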

For writes we’re getting 583.43 Requests/sec with 1 thread and 605.88 Requests/sec with 64 threads.

As you can see, we get more writes per second from a single thread than we got reads, and using multiple threads did not improve write performance as dramatically as it did reads. This is because with a BBU, writes are simply stored in the controller cache and flushed to disk in the background; in this case it makes little difference whether we use one thread or many.

As you can also see, the BBU is helpful even for random writes performed from a single thread, so it does not only help database log writes but has a much broader positive impact on the workload.

I should also note that it is a very good idea to watch VMSTAT while running the tests. Run it in a second terminal with a one-second interval; the ‘b’ column counts processes blocked waiting on IO, and the ‘wa’ column shows the percentage of CPU time spent waiting for IO:
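  vmstat 1    # watch the 'b' (blocked processes) and 'wa' (IO wait) columns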

In the read test all 64 threads show up as waiting for IO, which is what you want to see. Our write test, on the other hand, seems to have some serialization issues: the blocked process count stays low.

You can get similar information by analyzing iostat numbers, which may well be more accurate, though looking at VMSTAT is often good enough to spot the problem. For example, the extended statistics give you per-device queue depth, latency, and utilization:
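  iostat -x 1    # watch avgqu-sz, await and %util per device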

P.S. If you’re wondering which hardware I used for these benchmarks: it is a Dell PowerEdge 2950 with a PERC5 controller and six 10K RPM hard drives in RAID5.

Comments
Daniel Schneller

Battery-backed caches are really a nice feature. Some time ago we were wondering why one of our test servers performed abysmally compared to the other few, even though they were supposed to be identical.
Only after some time did we realize that that particular one had been shipped without a battery pack installed…

Don MacAskill

Hey peter,

Great details, thanks!

Does this mean that *not* using O_DIRECT could be the right way to go, contrary to popular opinion, if you want to have multiple I/O to a single file in flight? With InnoDB, even with file_per_table, I would think you’d want multiple I/O to the iblog* files and probably ibdata*, not to mention busy tables.

Or am I misunderstanding your results?

Dathan Pattishall

The IO scheduler matters for this test as well; even when using O_DIRECT, deadline should win. What is your IO scheduler setting? In most real-world cases the cache on the controller is not a win and is really just there to make sure outstanding writes are not lost on a crash. How did you fully remove the file cache from the equation to make sure the 256 MB test is not skewed by the system cache?

So, why choose RAID-5 for this test when RAID-10 is perfect for random IO? And for testing the disk setup’s theoretical throughput, have you tried raw device writes?

What’s your method for achieving theoretical throughput to verify your test is testing the RAID-setup itself and not the various OS Layers in front of the disks?

Dathan Pattishall

Ah so this article is on what numbers to look at in Sysbench.

Honestly, I’m surprised by the claim that the cache on the controller is not a win. What particular workload are you speaking about, and what storage engine?

INNODB with 200 GB of data.

The memory on the controller helps, but I’m not going to get enough read or write performance from it to improve end query performance of a MySQL database. On a six-disk RAID-10 15K RPM system, 22 MB/sec of random reads in a mixed read/write setup is as good as it gets; the biggest win I get from the battery-backed cache is knowing that the last IO op from the controller made it to disk, and that flushes are less expensive. So having 256 MB on the controller is not going to reduce response time for an application doing filesorts or temp tables, which is what many people think.

Also just for fun here is a good link on what Linus Torvalds thinks of O_DIRECT.

http://kerneltrap.org/node/7563

Michael Jiang

I ran sysbench on our new hardware before and did not notice much difference between the cfq and deadline schedulers. Here is an article from Red Hat’s website: http://www.redhat.com/magazine/008jun05/features/schedulers/ It is probably why cfq is RHEL’s default scheduler now.

We had a couple of DB servers whose disk controller batteries died, and they started to crawl. So a BBU is a must-have for DB servers.

Mark Harburn

I tested this on a RAID 10, ServeRAID 8i with 256M, and on some runs of the read test got exactly 12897.22 Requests/sec, on others not. It seems weird to get the exact same number. It’s interesting to see the controller difference with the various caches on and off; I assumed read cache off would be better for MySQL?

Dathan Pattishall

@Peter: I totally agree about the benefits of a BBU; I am agreeing with you. I’m just adding that 256MB of cache on a controller is not going to give any more throughput (T), and it will not solve problems that schema design will.

I’m doing around 200K selects (various large ranges, primary key lookups, secondary key lookups) per second across a few MySQL servers, with a 70% read / 30% write ratio, without using O_DIRECT (I can’t, because it’s broken in my kernel version). The BBU helps, but the RAID setup and schema use are how I am able to attain 22MB a sec of random read I/O and 10MB a sec of random write I/O on 3 effective spindles for writes and 6 for reads, with less than 10% of the CPU in WIO.

Recently the cache was turned off and there was a hit on performance, but that was related to the flushing of IOPS, so the WIO went up, increasing load; yet I was still able to sustain 22MB a sec of random I/O. This is the behavior I noticed (about a year ago).

So I agree that a BBU is needed for better responsiveness R(t), but it is only indirectly, not directly, associated with throughput (T).

I would love to see the same test with different cache profiles that the PERC-5 offers. Also a test with the cache turned off to see the difference.

Osma Ahvenlampi

My experience is that with a BBU on a RAID unit (whether controller or disk stack), the cfq elevator comes with a ~20% performance penalty. Why? Because unless you’re exposing the spindles to the OS one-by-one (or in RAID-1 pairs striped with MD/LVM), you’re having cfq order all the random reads/writes to the disk within one I/O queue, practically guaranteeing only a maximum of two, perhaps three of your n spindles is performing any useful work.

The noop scheduler, on the other hand, lets the requests go to the RAID (and BBU cache in the case of writes) in random order, letting it utilize all spindles to their best effect.

Osma Ahvenlampi

I’m glad you asked, Peter. Since the explanation is on the longish side, I just posted it on my blog at http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAID. Would love to hear your comments.

Patrick Lafontaine

Very, very interesting post! It helped me a lot in finding what caused a high IOwait problem that resulted in poor performance. I was not familiar with SysBench, and now I use it on every server I manage. I love it.

I use RAID 10 with no BBU. Intense writing is slower on it than on my old server with a single disk and no RAID. Performance also decreased when I tried O_DIRECT to flush the log, so I switched back to the default. I haven’t tried with a BBU, but if I can’t make everything faster otherwise, I will try (and buy a battery).

josh vermette

holy cow thank you for this. You just helped me prove conclusively (and finally!) that a mysterious I/O problem was my host’s and not, as they claimed, bad code.

Sudheer

Hi Peter,
I am facing a performance issue after enabling the bin-log on my server. I am using RAID 10 and O_DIRECT, without a BBU. My application processes 2500 transactions per second (90% inserts, 9% updates, only 1% selects and commits); after enabling the bin-log, insert commands have become slow (only 550 inserts per sec) and application queries are piling up. We have 65% MyISAM and 35% InnoDB tables.

Is this a problem with O_DIRECT? Can I disable O_DIRECT?