Let us talk a bit more about disks. You might have read my previous post and Matt's reply, and it looks like there are a few more things to clarify and explain.

Before I get to the main topic of the article, let me comment on the IO vs Disk question. If you look at disk-based databases, all data accesses are treated as IOs – they can be "logical" if they are served from the cache or "physical" if they require actual disk IO, but in the general sense all data accesses are IOs. If you use this terminology, most problems come down to IO – making queries touch fewer rows (or row portions), or having these "touches" resolved as logical IO rather than physical IO. There is still locking, networking, etc. to deal with, but that is a minor part of the story.

This is not, however, how most people understand IO, and it is not how I typically use these terms. For me, IO means an IO-bound workload – the disks are busy and the CPU sits idle. With that terminology a much smaller number of cases are about IO, because we would call the cases where too much logical IO is happening CPU bound. The beauty of this terminology (and why I use it) is that it is very easy to see whether a system is IO bound or CPU bound, while understanding whether MySQL goes through more rows than it needs to requires looking at the queries and schema.
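
To make this concrete, here is a minimal sketch of checking where a server sits, using the standard Innodb_buffer_pool_read_requests and Innodb_buffer_pool_reads counters from SHOW GLOBAL STATUS. The counter values below are made up for illustration:

```python
# Rough "logical vs physical IO" check for Innodb: the buffer pool hit ratio.
# The counter names are the standard SHOW GLOBAL STATUS variables; the sample
# values below are hypothetical.

def buffer_pool_hit_ratio(read_requests: int, physical_reads: int) -> float:
    """Fraction of page accesses served from the buffer pool (logical IO)."""
    if read_requests == 0:
        return 1.0
    return 1.0 - physical_reads / read_requests

# Hypothetical values, as reported by:
#   SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
innodb_buffer_pool_read_requests = 1_250_000_000  # logical page accesses
innodb_buffer_pool_reads = 4_200_000              # accesses that went to disk

ratio = buffer_pool_hit_ratio(innodb_buffer_pool_read_requests,
                              innodb_buffer_pool_reads)
print(f"buffer pool hit ratio: {ratio:.4%}")
# A high hit ratio with saturated disks, or a low one with idle disks, tells
# you which definition of "IO problem" you are actually dealing with.
```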

OK, let us now get back to the main point of the article.

In the original article I mentioned that having multiple hard drives does not help if you have a single query (or stream of queries) to deal with. This is indeed not exactly the case – my point is that you should not expect as much gain as you might expect from having, say, 8 hard drives instead of one.

Let us first look at how a single query is executed, for the Innodb storage engine to be more specific. Let's look at update queries (from the replication thread, for example). When an update is performed, the first problem is actually reading the data. If you're updating a row you need to fetch the page containing the old row version (and possibly the index pages you're going to modify) – even if you're doing an INSERT you will need to fetch at least the clustered index page to do it. These reads are issued by Innodb one by one – the next read request can only be issued after the previous request is completed.
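
Here is a toy model of why this one-by-one pattern does not get faster with more drives – it is not Innodb code, just an illustration, and the 5ms per random read is an assumed figure:

```python
# Toy model of the single-threaded read path described above: each page read
# is issued only after the previous one completes, so total time is the sum
# of per-read latencies no matter how many drives could serve them.
import math

READ_LATENCY_MS = 5.0   # assumed average random-read latency of one drive

def serial_read_time(n_pages: int) -> float:
    """Time to fetch n pages one by one, as a single query thread does."""
    return n_pages * READ_LATENCY_MS

def ideal_parallel_read_time(n_pages: int, n_drives: int) -> float:
    """Idealized time if the same reads could be spread across n drives."""
    return math.ceil(n_pages / n_drives) * READ_LATENCY_MS

print(serial_read_time(100))             # 500.0 ms, regardless of drive count
print(ideal_parallel_read_time(100, 8))  # 65.0 ms, but only with concurrent requests
```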

Innodb tries to optimize these reads a bit – there is sequential read-ahead and random read-ahead, which are designed to spot data access patterns and prefetch the data before it is needed. They can execute in parallel with the normal read operations issued by the thread and so can result in multiple outstanding requests to the disk. This does not help dramatically for many "random" update queries, though.
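
To illustrate the idea (this is a deliberately simplified toy heuristic, not Innodb's actual read-ahead algorithm; the trigger count and extent size are made up), spotting a sequential pattern and prefetching ahead of it could look like this:

```python
# Toy sequential read-ahead heuristic: if the last few page accesses look
# sequential, queue up the next extent of pages to be fetched in the
# background, in parallel with the thread's own reads.

EXTENT_PAGES = 64        # assumed prefetch unit
TRIGGER = 8              # consecutive pages that trigger read-ahead

def maybe_prefetch(recent_pages: list[int]) -> list[int]:
    """Return page numbers to prefetch if recent accesses look sequential."""
    if len(recent_pages) < TRIGGER:
        return []
    tail = recent_pages[-TRIGGER:]
    if all(b - a == 1 for a, b in zip(tail, tail[1:])):
        start = tail[-1] + 1
        return list(range(start, start + EXTENT_PAGES))  # issued asynchronously
    return []

print(maybe_prefetch([10, 11, 12, 13, 14, 15, 16, 17]))  # prefetch next extent
print(maybe_prefetch([10, 90, 12, 44, 14, 77, 16, 17]))  # random pattern: nothing
```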

After pages are fetched and modified they need to be written to disk. This, however, does not happen at once – the thread executing the request does not have to wait for the dirty pages to flush; flushing happens in the background on its own schedule. Such flush activity is another thing that will happen in parallel even if you have a single running query. This one is specific to write requests, though – if you have some reporting queries to deal with, you will not benefit from this parallelization.

Of course, another thing an update query needs to do is flush the transactional log. With innodb_flush_log_at_trx_commit=1 (the default) this is a synchronous operation and the thread has to wait for it to complete before continuing, though on decent systems you have a battery-backed-up cache, so this wait is not long. With innodb_flush_log_at_trx_commit set to 0 or 2 the physical IO happens in the background, giving yet another request which can be executed in parallel.
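
A quick back-of-envelope sketch of what this wait means for a single thread (the flush latencies below are assumptions for illustration, not measurements):

```python
# With innodb_flush_log_at_trx_commit=1 every commit waits for the log flush,
# so a single thread cannot commit faster than the flush latency allows.

def max_commits_per_sec(flush_latency_ms: float) -> float:
    """Upper bound on serial commits/sec when each commit waits for a flush."""
    return 1000.0 / flush_latency_ms

print(max_commits_per_sec(8.0))   # flush hitting the disk itself: ~125 commits/s
print(max_commits_per_sec(0.5))   # battery-backed write cache: ~2000 commits/s
```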

So the first thing to consider – a single-client workload does not necessarily mean a single outstanding disk IO request at all times, so it can naturally benefit from multiple hard drives this way.

But what about the case when you have only one IO request to deal with? In this case you can also benefit from multiple drives, for a different reason. If you're using RAID10 – the most commonly used RAID level for write-intensive database systems – you will have 2 hard drives to pick from when reading any block. The RAID controller (and even a software RAID driver) typically tracks disk head positions, so it will pick the drive with the shorter seek time if both hard drives are idle.
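
Here is a small Monte Carlo sketch of that effect, under an idealized model where head positions and the target track are uniform across the drive and seek time is proportional to seek distance:

```python
# "Pick the drive with the closer head" on a two-way mirror: compare the
# average seek distance of one drive against the closest of two drives.
import random

def avg_seek(n_heads: int, trials: int = 200_000) -> float:
    total = 0.0
    for _ in range(trials):
        target = random.random()                       # track we need to read
        heads = [random.random() for _ in range(n_heads)]
        total += min(abs(h - target) for h in heads)   # controller picks closest
    return total / trials

single = avg_seek(1)
mirror = avg_seek(2)
print(f"single drive avg seek distance: {single:.3f}")
print(f"mirror (closest of 2 heads):    {mirror:.3f}")
print(f"improvement: {1 - mirror / single:.0%}")       # roughly a third shorter
```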

The other seek time optimization comes from striping – consider a 100GB database which is stored on a single disk. Assuming it is not fragmented, it will take N tracks on this disk. Now if we stripe the same data across a couple of hard drives it will take N/2 tracks on each, so the average seek time will be shorter. A similar effect comes from using larger hard drives – if you put 100GB on a 100GB hard drive you will have it spread across all tracks, while if you put a 100GB non-fragmented block on a 1TB hard drive you will have your seeks condensed into 1/10 of the full seek distance.
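
The same argument as rough arithmetic, under the simplifying assumption that average seek time shrinks in proportion to the span of tracks the data occupies:

```python
# How much of the drive's track range the arm ever has to cover, assuming
# the data is laid out contiguously (unfragmented).

def seek_span_fraction(data_gb: float, drive_gb: float) -> float:
    """Fraction of the drive's full seek distance the data spans."""
    return min(1.0, data_gb / drive_gb)

print(seek_span_fraction(100, 100))    # 1.0 -> seeks over the whole drive
print(seek_span_fraction(50, 100))     # 0.5 -> two-drive stripe, half the span each
print(seek_span_fraction(100, 1000))   # 0.1 -> 100GB block on a 1TB drive
```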

I must say, however, that some people expect too much improvement from optimizing seeks. You can have shorter seeks, but as soon as you have any seeks you will be far from sequential disk access performance. First, consider what has to happen for a hard drive to read/write the data you have requested. The drive has to seek to the proper track, and then the platter needs to rotate to the place where your data is stored to perform the IO operation. When we speak about seeks we imply this rotation as well, though people often forget to account for it. If we take a 15K RPM hard drive, it does 250 rotations per second (4ms per rotation), which gives us 2ms of average rotational latency – exactly what you can see in the specs. Looking at the same specs (for reads) we get 0.2ms track-to-track seek time (our best case) and 3.5ms average seek time, which is a significant difference. Now if we add the 2ms of average rotational latency, which we have to deal with in both cases, we get 2.2ms vs 5.5ms, a 2.5 times difference. This is a lot, but remember this was the best case. Typically you will not be able to get more than a 30-50% improvement from the second hard drive (and diminishing returns if you keep more copies). So seek time is worth optimizing, but do not expect magic, because rotational latency is not going away.
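
Here is the arithmetic from this paragraph spelled out, using the spec figures quoted above:

```python
# Seek time vs rotational latency for a typical 15K RPM drive, using the
# figures quoted in the text.

rpm = 15_000
rotation_ms = 60_000 / rpm                     # 4 ms per full rotation
avg_rotational_latency_ms = rotation_ms / 2    # 2 ms: on average half a turn

track_to_track_seek_ms = 0.2                   # best-case seek from the spec
average_seek_ms = 3.5                          # average seek from the spec

best_case = track_to_track_seek_ms + avg_rotational_latency_ms   # 2.2 ms
typical = average_seek_ms + avg_rotational_latency_ms            # 5.5 ms

print(f"best case: {best_case:.1f} ms")
print(f"typical:   {typical:.1f} ms")
print(f"ratio:     {typical / best_case:.1f}x")  # 2.5x once latency is included
```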

Let us get back to the multiple outstanding IOs case – Matt points out the issue of disk contention, which people constantly forget about. If you ask most people what would be faster for random reads (writes are obvious) – RAID0 or RAID1 – most will think they are about the same: in both cases you have 2 disks to deal with. There is a serious difference, however – with RAID1 you can use either disk to serve ANY read request, while with RAID0 it is only the disk which has the data (even keeping aside partial "border" IO requests which require reading from both drives). If you have 2 random outstanding IO requests, there is a 50% chance they will require blocks from the same drive and will need to be serialized if you're using RAID0 with 2 drives.

This number improves as you get more drives, because there is less chance 2 requests will hit the same drive, and it also improves with increased concurrency. If you had 256 concurrent IOs, for example, this effect would almost disappear. This is why I think people often do not see this difference – IO subsystem capacity is often tested with a single thread or with some high number of threads, neither of which shows this effect well.
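
A small simulation makes both points visible. The model is deliberately simple – each request lands on a uniformly random drive of the stripe and every IO takes the same amount of time:

```python
# Contention on a stripe vs an "any drive can serve any read" layout.
# Striped (RAID0-style): each block is pinned to one drive, so random requests
# can pile up on the same spindle and the slowest drive gates the batch.
# Balanced (mirror-style): requests can always be spread evenly.
import random
from collections import Counter

def batch_time_striped(n_drives: int, queue_depth: int, trials: int = 20_000) -> float:
    """Average batch completion time (in IO units) when blocks are pinned to drives."""
    total = 0
    for _ in range(trials):
        load = Counter(random.randrange(n_drives) for _ in range(queue_depth))
        total += max(load.values())        # slowest drive finishes last
    return total / trials

def batch_time_balanced(n_drives: int, queue_depth: int) -> int:
    """Ideal time when any drive can serve any request."""
    return -(-queue_depth // n_drives)     # ceiling division

for drives, depth in [(2, 2), (8, 2), (2, 256)]:
    striped = batch_time_striped(drives, depth)
    balanced = batch_time_balanced(drives, depth)
    print(f"{drives} drives, {depth} outstanding IOs: "
          f"striped {striped:.2f} vs balanced {balanced} IO units")
# With 2 drives and 2 requests the stripe loses noticeably; with more drives
# or 256 outstanding IOs the gap nearly vanishes, as described above.
```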

So as you can see, tuning the IO subsystem can indeed be fun – there is a lot to deal with, not even mentioning various serialization issues, stripe sizes, cache policies, filesystem and OS issues 🙂

Comments
Matt Yonkovit

Nice follow-up, Peter. For me I am looking beyond the single-thread IO bottleneck, and mainly concerned with customer workloads. In these cases I can see AWAIT times that approach some insane levels. In fact I want to say maybe 80% of the people I deal with have substandard latency coming from their IO subsystem. I don't think I have ever encountered a single-threaded latency issue. So mostly I see high queues, tons of requests, high wait times, etc. Sometimes adding more memory gets the job done, but many times they are maxed out.

So it's all about the IO (contention)!

Marki

And what about SAN disk arrays like EVA? They use all available disk spindles for every LUN – so when you have 30x 73 GB 15k disks and create a 100 GB LUN, it will be distributed over all 30 disks. I don't remember the chunk size now (it is 32 MB, but I think the piece which is striped is 2 MB). So there is a very big chance that every request will go to a different disk. Also you have mirroring, so every chunk can be read from 2 drives.

David Holoboff

Peter,

Do you have any posts that compare filesystem and IO schedulers, and perhaps a SATA vs. SCSI comparison?