I read very nice post by Matt today and it has many good insights though I can’t say I agree on all points.
First there is a lot of people out where which put it as disk is everything. Remember Paul Tuckfield saying “You should ask how many disks they have instead of how many systems they have” on MySQL UC2008 Scalability Panel ? Indeed disks MAY be the most important part in your system performance or it may not be. Different people get to deal with different systems and so acquire different feeling about percentage of cases when disk would be the problem.

However it is not always the case. There are whole classes of systems where Disk performance is not that important – consider for example systems where most of the database fits in memory. These days we can get 64G of memory for pretty commodity prices and this allows you to get a lot of data in. Many Web 2.0 sites in particular design for having everything they need in memory, which is not that hard after you got VC funding as soon as you can shard your data properly so you would not have to replicate everything 🙂

There are also cases when storage size becomes the limit rather than number of IOs it can handle. Imagine for example data archive storage. On BoardReader for example we’re limited by space and how much data we can comfortably put in MySQL Instance without getting in too much trouble with backups etc – because Sphinx is doing all search heavy lifting we only need to insert new data and fetch few rows to show search results which does not require too much IO capacity or high in memory fit.

For large number of applications optimizing IO performance will be number one problem do not get me wrong, I just it is not always the case.

Another mistake is to measure IO capacity in spindles – So your hard drive can do 200 random reads per second and your SSD drive can do 10000 does it mean if you have array with 50 hard drives it would perform as good as SSD ? Not really. Leaving all potential serialization issues along you need enough concurrency to utilize multiple hard drives and with 50 drives you would need at least 50 outstanding requests all the time to fully utilize them. So for example these 50 drives will unlikely be helpful solving replication delay or speeding up this 3 hours reporting query or 5 hours ALTER TABLE
command.

Let me also comment on the memory fit diagram – understanding your working set is paramount but taking number from other application is one of the worst mistakes you can make. One application can become CPU bound even with 5% of total data set in memory while over may require full 100%. Performance gains graphs in relation to fit in memory will also be different.

Now let me come to filesystem layout issues This is where a lot of roads meet and you have to consider a lot of topics – security, manageability, performance. I tend to advice different things depending on what is the most important. Assuming you have fixed amount of disk at your disposal and if you have BBU most likely data files is where all IO will go. If you dedicate RAID1 volume for Operating system and another one for Transactional Logs you will often wast 4 hard drives from performance standpoint without a good reason.

Having OS and tmp away of the data is a good idea as Matt says – you do not want your runaway logs or extreme temporary space usage to stop the database, but they usually can keep the same RAID volume.
I tend to keep Innodb data files (and whole datadir) and log files on the same partition as well because this makes it very easy to use LVM for backups while assuming you have BBU on the raid volume impact on transactional log write speed is usually minimal. At the same time if reliability is desired it might be good idea to keep Binary Logs on the separate volume – you may need them to do point in time recovery if you want to recover to last committed transaction in case your data RAID volume was trashed.

Regarding SWAP – I tend to have it but not optimize for it, putting it on shared drive (or OS+Stuff drive if there is one) – you do not want your database box to swap actively – if it does your’re in trouble anyway so it is better to fix the things rather than optimize swap performance.

I also fully support Matt in his views of black boxes. If you want to be able to resolve your performance problems you need transparency, you need to have X-Ray vision, you need to understand what is happening and why. A lot of pieces – filesystem, disk scheduling, database buffer management can be rather complex but you need to have at least basic understanding of the processes to be able to reliably identify problems.
At the same time I would be very careful with your assumptions – in many instances I’ve seen different parts of the system working not as I would expect them to and not as it would make sense in my opinion but quite differently.

7 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
daniel

> If you dedicate RAID1 volume for Operating system and another one for Transactional Logs you will often wast 4 hard drives from performance standpoint without a good reason.
> I tend to keep Innodb data files (and whole datadir) and log files on the same partition as well because this makes it very easy to use LVM for backups while assuming you have BBU on the raid volume impact on transactional log write speed is usually minimal.

I definitely support these statements from testing and experience.

This reminds me of a discussion with a virtualization vendor about IO rate limiting, of which the main point was that if you force specific IO requests to be reordered based on userspace priorities, you prevent the kernel (and bbwc units) from properly serializing IOs based on physical disk layout and end up with a large net reduction in disk performance. I believe a fairly similar case is also true when data/logs are physically separated (artificially “rate limiting” — may be imagined as “protecting” — io by splitting data/logs) and disk writes are not allowed to be globally managed; you’ll end up losing a lot of IO performance unnecessarily.

Only in sensitive situations where periodic backups would not be sufficient for recovery (financial etc.) would separate partitioning be necessary, as you mentioned.

Ries van Twisk

Any user needs to take a look at what he does with his database, if you do mostly reads on your database, then for sure having 64GB of memory will help you a lot during your queries, sorting and doing joins…

However if you are doing a lot of write to your database then IO performance is everything in the end of the transaction you want your data to be save on your HD, not floating around in memory.

The article above needs to clarify about what is needed under what condition…

Ries

Matt yonkovit

I followed up a bit on my blog about this. It is a worthwhile discussion to have. At the UC I don’t remember who on the panel during the scale out/up keynote said it ( it was not you tube, I think it was one of the photo sites ) said that they were constantly battling with random IO. IO does not always mean disk either… Anyones thanks for comments.

Ries van Twisk

,

in my case with IO I mean the channel to HD… Random IO is indeed a hard problem to tackle… There are some nice disk controllers with a heap of memory that can hanlde random IO better then disk controllers with less IO. Just pick a disk controller with a little battery so it can cope with power faults (Disk writes will continue from on-board memory as soon as power comes back up).