As you might know, the InnoDB storage engine uses a fuzzy checkpointing technique as part of its recovery strategy. It is a very nice approach: the database never needs to stall to flush all modified pages at once; instead, dirty pages are flushed gradually in small chunks, so the database load stays very even.

This works great in most cases. Recently, however, while running benchmarks with our partners, we observed very different behavior: instead of spreading the flushing over time, at certain points InnoDB starts to flush pages very aggressively, seriously slowing down or even almost stalling for a few minutes. We were finally able to track down the bug.

Here is what is happening. When planning fuzzy checkpoints, InnoDB assumes a roughly uniform distribution of LSNs (Log Sequence Numbers) across the pages, which basically corresponds to the age of unflushed pages. In that case the current implementation works just fine: each round, the InnoDB flush thread selects a range of LSNs to flush which is small enough. In certain workloads, however (happily, most of these are benchmarks), a very large number of pages is modified at once. Many pages then fall into the same LSN range scheduled for flushing, and InnoDB can end up flushing most of the buffer pool pages in one round. As soon as the pages are flushed they get modified again, and again within a very short period of time, so the process repeats itself.
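
To make the failure mode concrete, here is a minimal sketch of LSN-range-driven flushing. The structure and names (flush_round, FLUSH_LSN_WINDOW, oldest_lsn) and the window size are invented for illustration; this is not the actual InnoDB code:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct page {
    uint64_t oldest_lsn;   /* LSN of the oldest unflushed modification */
    int      dirty;
} page_t;

#define FLUSH_LSN_WINDOW (1024 * 1024)  /* width of LSN range flushed per round */

/* One flush round: write out every dirty page whose oldest modification
 * falls inside [low_lsn, low_lsn + FLUSH_LSN_WINDOW).  Normally that is a
 * small slice of the buffer pool; but if the workload dirtied most pages
 * at nearly the same LSN, one round flushes almost everything at once. */
size_t flush_round(page_t *pool, size_t n_pages, uint64_t low_lsn)
{
    size_t flushed = 0;
    for (size_t i = 0; i < n_pages; i++) {
        if (pool[i].dirty &&
            pool[i].oldest_lsn >= low_lsn &&
            pool[i].oldest_lsn <  low_lsn + FLUSH_LSN_WINDOW) {
            pool[i].dirty = 0;   /* stands in for the actual disk write */
            flushed++;
        }
    }
    return flushed;
}
```

If nearly every page's oldest_lsn falls into the same window, a single round flushes most of the buffer pool instead of a small, even chunk, and that is exactly the stall we saw.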
There is one more related bug which makes the problem worse. If there are too many dirty pages, InnoDB is supposed to start checkpointing more actively; specifically, this kicks in when more than 70% of buffer pool pages are dirty. However, if your database is small (i.e., no larger than the allocated buffer pool), this may never happen: a 1 GB database in a 2 GB buffer pool can never have more than 50% of its pages dirty. This is the very unusual case where increasing your buffer pool size may actually decrease performance.
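
In hypothetical pseudo-C, the trigger amounts to something like the check below; dirty_pages can never exceed the number of distinct data pages in the database, so a database that fits in less than 70% of the pool never trips it:

```c
#include <stddef.h>

/* Illustrative form of the 70% dirty-page trigger (not InnoDB's code).
 * With pool_pages = 2x the database size, dirty_pages caps out at 50%
 * of the pool and this function always returns 0. */
int need_aggressive_checkpoint(size_t dirty_pages, size_t pool_pages)
{
    return dirty_pages * 100 > pool_pages * 70;
}
```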

What would I do to fix it? I think fuzzy checkpointing should be dynamic, based on the amount of free log space rather than on LSN numbers. For example, we could flush N pages per round until the log files are 50% full (the target can be different), then progressively increase the number of flushed pages as free log space drops, making sure it never reaches 100%. We could also have a low-water mark, say 30% utilized, below which checkpoint-style flushes stop entirely, which would avoid unneeded IO.
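
Here is a rough sketch of what I have in mind. The thresholds, names (BASE_FLUSH_PAGES, the water marks) and the quadratic ramp are all made up to show the shape of the idea, not a tested design:

```c
#include <stddef.h>

#define BASE_FLUSH_PAGES 64   /* "N" pages per round in the steady state   */
#define LOW_WATER_PCT    30   /* below this, stop checkpoint-style flushes */
#define HIGH_WATER_PCT   50   /* above this, ramp the flush rate up        */

size_t pages_to_flush(double log_used_pct)
{
    if (log_used_pct < LOW_WATER_PCT)
        return 0;                    /* log mostly free: avoid unneeded IO */
    if (log_used_pct < HIGH_WATER_PCT)
        return BASE_FLUSH_PAGES;     /* steady trickle of N pages          */

    /* Between the high-water mark and 100%, grow the rate progressively
     * with log pressure so free log space never actually runs out. */
    double pressure = (log_used_pct - HIGH_WATER_PCT)
                      / (100.0 - HIGH_WATER_PCT);       /* 0.0 .. 1.0 */
    return BASE_FLUSH_PAGES + (size_t)(pressure * pressure * 4096.0);
}
```

The exact ramp function matters less than the principle: the flush rate should respond to log pressure, not to where page LSNs happen to cluster.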

One more related problem InnoDB has: too many hard-coded numbers. For example, InnoDB counts the number of IOs done to gauge how loaded the disk subsystem is and compares that count to a constant. However, the shared drive on a shared MySQL virtual hosting box and a powerful RAID system have very different performance properties. So if you have a RAID system capable of 5000 IOs/sec and a relatively idle interval where it is doing just 500 IOs/sec, InnoDB will not recognize it as idle and will not use the spare capacity to flush dirty buffers aggressively and perform other pending tasks, which can hurt performance once peak load returns.
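
Schematically, the heuristic boils down to something like this (the constant and function are made up for illustration, not InnoDB's real code):

```c
#define SRV_IO_IDLE_THRESHOLD 100   /* fixed constant: IOs per interval */

/* 500 IOs/sec on an array capable of 5000 IOs/sec still fails this test,
 * so the idle 90% of device capacity is never used for background work. */
int server_looks_idle(unsigned ios_last_interval)
{
    return ios_last_interval < SRV_IO_IDLE_THRESHOLD;
}
```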

This is actually a question I do not have a good answer for: how can we detect how good an IO subsystem we have and how much load it can sustain? The problem gets even worse with a SAN, or simply a shared disk, where the available bandwidth may be variable. Can we look at the latency of submitted requests? Probably, but we need to account for caching by the OS as well as by the disk volume itself. Should we get utilization ratios from the OS (i.e., iostat)? This is nice, but it needs to be done for every OS, which makes it less portable. Also, utilization alone does not give us enough data: 100% "utilization" corresponds to both 1 and 20 outstanding requests, which are very different situations. One could, however, probably combine this data with device queue sizes and average request execution time to get some decent values. In any case, it should be better than fixed constants.
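
As a speculative sketch, combining those three OS-level numbers might look as follows. Where the stats come from (iostat, /proc/diskstats, and so on) is exactly the portability problem mentioned above:

```c
typedef struct io_stats {
    double util;        /* 0.0 .. 1.0, device utilization              */
    double avg_queue;   /* average number of outstanding requests      */
    double avg_svc_ms;  /* average request service time, milliseconds  */
} io_stats_t;

/* Rough throughput estimate via Little's law: sustained IOPS is roughly
 * queue depth divided by service time; the unused fraction of utilization
 * then approximates the headroom left for background flushing. */
double estimate_spare_iops(const io_stats_t *s)
{
    if (s->avg_svc_ms <= 0.0)
        return 0.0;
    double queue     = s->avg_queue > 1.0 ? s->avg_queue : 1.0;
    double peak_iops = queue * 1000.0 / s->avg_svc_ms;
    return peak_iops * (1.0 - s->util);
}
```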

At the very least, I would like to get control over this myself and be able to say: this device can handle 1000 requests per second, and the best policy is to keep 16 outstanding requests at a time. The number of outstanding requests is also very important: if you do not submit enough concurrent requests, the IO subsystem will not be able to perform at its full capacity. That is one more thing to take into account.
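
If these were exposed as knobs, they might look like the following; the names are invented for illustration, as no such InnoDB settings exist:

```c
typedef struct io_tuning {
    unsigned max_io_per_sec;     /* e.g. 1000 requests/sec the device handles */
    unsigned target_outstanding; /* e.g. 16 requests to keep in flight        */
} io_tuning_t;

/* Keep the configured number of requests in flight so the device runs at
 * full concurrency without being swamped by an unbounded flush burst. */
void submit_flush_batch(const io_tuning_t *t,
                        unsigned *in_flight, unsigned *pending_pages)
{
    while (*in_flight < t->target_outstanding && *pending_pages > 0) {
        /* an async page write would be issued here */
        (*in_flight)++;
        (*pending_pages)--;
    }
}
```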

Now some good news. As I mentioned, this fuzzy checkpointing issue does not happen that frequently in real workloads. If you run into what you think is one, please help us by commenting on this bug. Real-world use-case reports are the best way to boost the priority of such performance bugs.