I mentioned earlier that the CFQ I/O scheduler, the default in RedHat / CentOS 5.x, may not be a good fit for MySQL. Yesterday a customer reported that simply switching from cfq to noop solved their InnoDB I/O problems. I ran the tpcc scripts against XtraDB on our Dell PowerEdge R900 server (16 cores, 8 disks in RAID10, PERC 6/i controller with BBU) to compare cfq, deadline, noop and anticipatory (the last one just to get a number; I did not expect much from anticipatory).
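
For readers who want to repeat the comparison: the scheduler can be inspected and switched per block device through sysfs. Below is a minimal Python sketch, not the exact procedure used for this benchmark. The device name `sda` and the helper names are assumptions for illustration; the sysfs file lists the available schedulers and marks the active one with square brackets.

```python
from pathlib import Path


def parse_scheduler_line(line):
    """Split a sysfs scheduler line into (active, available).

    The kernel marks the active scheduler with square brackets,
    e.g. "noop anticipatory deadline [cfq]".
    """
    active = None
    available = []
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_scheduler(device="sda"):
    """Read the scheduler line for one block device ("sda" is an assumption)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())


def set_scheduler(device, name):
    """Select a scheduler at runtime; needs root and does not survive a reboot."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    path.write_text(name)
```

Calling `set_scheduler("sda", "noop")` as root would switch that one device; the change applies immediately and has to be repeated for every device behind the database.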

Here are the results (in transactions per minute, more is better):

cfq: 2793.5
noop: 6586.4
deadline: 6513.7
anticipatory: 1465
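
To put the gap in relative terms, here is a small Python sketch that normalizes the numbers above against cfq. The helper name `relative_to` is made up for illustration; the figures are copied from the results.

```python
# Transactions per minute, copied from the results above.
tpm = {"cfq": 2793.5, "noop": 6586.4, "deadline": 6513.7, "anticipatory": 1465.0}


def relative_to(results, baseline):
    """Return each result divided by the baseline's result."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


ratios = relative_to(tpm, "cfq")
# Print the schedulers from fastest to slowest, relative to cfq.
for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name:<13}{ratio:.2f}x")
```

On these numbers noop and deadline come out at roughly 2.3x to 2.4x the cfq throughput, while anticipatory reaches only about half of it.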

Here is a graph of disk writes (the bo column in vmstat) during the benchmark.

As you can see, noop and deadline utilize the disks much better.
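
The bo column (blocks written out per second) can be pulled from captured vmstat output with a few lines of Python. This is only a sketch of one way to do it, not the script behind the graph; the function name is hypothetical and the sample text is made-up data that merely follows the vmstat layout.

```python
def column_values(vmstat_text, column="bo"):
    """Extract one numeric column from captured `vmstat` output."""
    index = None
    values = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if column in fields and not fields[0].isdigit():
            # Row with the column names (vmstat repeats it periodically).
            index = fields.index(column)
            continue
        if index is not None and len(fields) > index and fields[0].isdigit():
            values.append(int(fields[index]))
    return values


# Made-up sample in vmstat's layout, for illustration only.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  5      0 812340  10240 902100    0    0  1200 18400 3100 5200 20 10 30 40
 1  6      0 811200  10240 903000    0    0  1100 21600 3300 5400 22 11 27 40
"""

writes = column_values(sample)
print(sum(writes) / len(writes))
```

Running the same function over one capture per scheduler gives the series to plot against each other.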

For reference, I used the tpcc scripts from https://launchpad.net/perconatools, generated 100 warehouses (about 9.5GB of data on disk), and used the following XtraDB params:

19 Comments
Markus Blaschke

CentOS 5.2 is using kernel 2.6.18, right?
The CFQ scheduler was rewritten after 2.6.18, so it would be nice to see how performance changes on later kernels; let’s try 2.6.27 or 2.6.28 🙂

benpi

I suspect the results would be different (maybe still in favour of noop, but probably not by that much) without a BBU (and therefore without O_DIRECT), or with the controller configured for write-through, or with a cheaper controller, and perhaps with a more write-costly RAID level (i.e. RAID 5). A setup with storage accessible through multiple LUNs and/or multiple controllers may also be a good use case for cfq. On the other hand, the results may favour noop even more if you used SSD drives. I mean, chances are that these results are somewhat hardware dependent (though I think the hardware you used for this test has the merit of being a quite typical setup).

How much RAM do you have on this PERC6 (I think Dell has options for 256 MB and 512 MB, at least)?

By the way, has anyone already benchmarked how the amount of RAM in the controller affects I/O-bound loads with MySQL?

pat

This is interesting stuff. Can anybody speculate on the variables at work here? Would I expect to get similar results on RAID vs single drives vs SAN vs NFS, etc?

Is it reasonable to generalize a benchmark like this and say:

Looks like noop is the best scheduler for an INNODB server?

Or is this a case of run it yourself and test because your mileage may vary?

Domas Mituzas

I have detailed why this happens at http://dammit.lt/2008/02/05/linux-io-schedulers/ – though it all started simply when we compared how I/O behaves on identical machines ( http://dammit.lt/2007/12/13/io-scheduler-deadline-cfq/ )

Peter Zaitsev

Vadim,

Did you measure queue depth and request latency, by chance?

Peter Zaitsev

In my experience, scheduler results are often rather platform and workload specific. These are the results for this hardware and this benchmark; in other conditions I’ve seen other schedulers do better.

The results are not surprising: a smart RAID controller with cache and multiple devices is smart enough by itself to optimize IO, and it also has much more information about the storage, so it can do it much more efficiently. What is surprising is that the gap between deadline and noop is so small.

Peter Zaitsev

benpi,

An interesting note about O_DIRECT though, and how it can affect results: in this kernel (at least), O_DIRECT writes are serialized at the inode level, which can affect things significantly.

I also did not quite get the point about “without BBU and so without O_DIRECT”. Even if you do not have a BBU, O_DIRECT is often very helpful because it avoids double buffering. It may be slower, though, if the workload is write bound, because of the IO serialization for O_DIRECT.

Domas Mituzas

IO serialization doesn’t happen with XFS 🙂

And there’s no difference between noop and deadline simply because deadline is noop with starvation protection. If requests do not starve much, there’s not much need for deadline.

benpi

Vadim,
fair enough, RAID 10 + BBU is probably the most common and natural choice for write-stressed servers (PS: I asked about RAM size because I still don’t understand why hardware vendors don’t load them with a lot more RAM (even if this implies a bigger battery, that shouldn’t cost much more); a possible answer would be that performance wouldn’t improve much after a certain size is reached).

Domas,
does this differ from ext3’s data=writeback side effects?

Peter,
my comment about O_DIRECT w/o BBU was based on the (very naive and unverified) belief that Linux I/O schedulers reorder based on the kernel’s page cache content (so it seems I was wrong), and on the fact that software-based reordering matters when one doesn’t have hardware to handle it, with MySQL using O_DIRECT to move data from sequentially written log files to the more write-costly, scattered/random InnoDB data file. Anyway, I stand corrected now, thank you 🙂

You are also very much on point when you bring up latency (and also queue sizes/depths); indeed, scheduling is not only about aggregating write requests and bulk performance, but also about arbitrating resource usage among consumers and making decisions about average vs. worst-case latency vs. throughput, which sometimes does matter.

I suspect a mixed workload (alternating write-intensive, read, and CPU-bound phases, not just huge sustained writes) would give the controller a better chance to reorder properly before its cache fills up. If so, the scheduler’s added complexity may be even more of a waste, but in those conditions user-perceived performance may be better reflected by per-request time to completion (so, latency). I mean workloads including possibly concurrent reads and CPU work, idle moments, etc., as opposed to steady, bulk, write-only workloads (e.g. importing dumps), where one would rather count the total wall-clock time needed to write the whole dataset without interruptions. Having a nicely interactive ajax interface get stuck because some bulk job fills the BBU cache, the I/O queue and the InnoDB log file in the background may not be the expected behaviour, even if it gives a lower cumulative time to complete (and therefore more TPS).

On this matter: thanks to the recent work on cgroups and container resource control, Linux got a lot of refinement in order to expose and enforce I/O priorities and arbitration. I’m not sure whether MySQL leverages those features, though (maybe it would be useful to lower the priority of logs->data moves, or to raise priorities when big locks are held, or when deadlocks start to happen?), nor how well the different I/O elevators handle those features. And to the best of my knowledge, SQL lacks a way to specify the expected QoS per request (I mean, something like the TCP/IP TOS field), which would allow exposing this to consumers (although MySQL offers INSERT DELAYED).

Back to the subject, I don’t know anything about tpcc (and even less about how it’s used here): does this benchmark really spread writes to emulate some randomness (with respect to block device layout), or is it more of a “plain sequential bulk insert/update on very few tables” thing?

Seekwatcher (http://oss.oracle.com/~mason/seekwatcher/) graphs would be interesting for such benchmarks, if only to show what patterns you actually measure (that’s to insidiously revive the “Tools” post ;).

Peter Zaitsev

Domas,

Yeah XFS rocks in terms of not serializing O_DIRECT writes and also does not have contention problems on meta data updates
http://www.mysqlperformanceblog.com/2009/01/21/beware-ext3-and-sync-binlog-do-not-play-well-together/

The problem is most people run on EXT3 and it is hard to make them change 🙂

Peter Zaitsev

benpi,

Did Linux finally get per-request priorities? I worked with a university some time ago to see how IO priorities can be useful, and they were quite helpful, allowing us to work with large queue sizes without starving log writes etc.

Now about disk IO scheduling and cache. Really, they are a bit different, though connected. The dirty pages to be flushed are picked by the kernel (background flush or forced by fsync) and then submitted together with read requests. If you have a lot of dirty pages, there is indeed a better chance to optimize IO.

benpi

Peter,
yes, that’s why I got it wrong: I thought that because O_DIRECT prevents to-be-written data from going from userland through the kernel page cache before being sent to the controller, writes wouldn’t be well scheduled (and consequently believed that it would be a huge loss on BBU-less systems). So I was triply dumb: because the alternative to O_DIRECT is fsync()ing the data file after the write, which doesn’t leave much room for merging requests either; because this (O_DIRECT) only applies to data moved from logs to the data file (MySQL can probably order this well already); and because schedulers (at least cfq) are also able to arbitrate which requests (not only “which pages”), even direct ones, can go to a device at any moment (so even then the scheduler shouldn’t be totally bypassed, if I’m not mistaken again).

And sadly no, there are no per-request priorities yet as far as I know (not even per POSIX thread yet, just per process, which indeed is way more practical to use for PostgreSQL than it is for MySQL right now, but that may deserve a small little smallish tiny fork, maybe ;). Same goes for ioprio_set(2).
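
For reference, the priority value handed to ioprio_set(2) packs a scheduling class and a level (0 to 7) into one integer. A minimal Python sketch of that packing follows; the constants mirror the kernel's ioprio definitions, while the function names are made up for illustration and nothing here issues the syscall itself.

```python
IOPRIO_CLASS_SHIFT = 13  # class sits above the low 13 bits

IOPRIO_CLASS_RT = 1    # real-time
IOPRIO_CLASS_BE = 2    # best-effort (the default)
IOPRIO_CLASS_IDLE = 3  # served only when the disk is otherwise idle


def ioprio_value(io_class, level=0):
    """Pack a class and a 0-7 level the way IOPRIO_PRIO_VALUE does."""
    if not 0 <= level <= 7:
        raise ValueError("level must be between 0 and 7")
    return (io_class << IOPRIO_CLASS_SHIFT) | level


def ioprio_unpack(value):
    """Split a packed priority back into (class, level)."""
    mask = (1 << IOPRIO_CLASS_SHIFT) - 1
    return value >> IOPRIO_CLASS_SHIFT, value & mask
```

A background job in the idle class would use `ioprio_value(IOPRIO_CLASS_IDLE)`, and a best-effort job at the middle level `ioprio_value(IOPRIO_CLASS_BE, 4)`. Whether the value has any effect depends on the active scheduler honouring I/O priorities.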

dermoth

I used deadline with MySQL for many years… Recently I wanted to test the CFQ idle class for an IO-intensive job that runs every day and causes a slight lag in replication. The end result was even worse: the job ran faster and caused more lag in MySQL.

With <2.6.24 kernels I even hit a bug in which both MySQL and the job ran ridiculously slowly, with barely any IO operations on the RAID.

It might be different for other subsystems, but for me deadline just rocks, even without a BBU (6-drive RAID10 though).

steven

benpi: Is this what you’re referring to in your first message ? It makes sense to me and this is what I think is happening in this case with CFQ :

http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAID

(read the last part of the article).

Fredrik Widlund

Stumbled on this thread. I’m benchmarking fs/schedulers on a small HW RAID-5 set of 4x400GB SATA. It’s a lab-setup used for parallel sequential streams and not DB usage. Raid adapter is PERC6/E, 1MB raid element size, Cached, Adaptive Read-ahead. Kernel 2.6.31.5 (Arch Linux).

Very short recap with some numbers (40 simultaneous seq reads):
XFS+noop: 140MB/s
EXT3+noop: 133MB/s
EXT4+noop: 111MB/s (exploded once with corrupt data when changing scheduler)
EXT2+noop: 97MB/s
XFS+cfq: 30MB/s

Some comments on this setup:
– XFS beats EXT*, with EXT4 performing badly and also being unstable
– cfq vs noop makes a *lot* of difference. In this setup cfq seems close to broken. deadline is similar to noop.