You may have seen in the last couple of weekly news posts that Baron mentioned we are working on a new adaptive flushing algorithm in InnoDB. In fact, we already have three such algorithms in Percona Server (reflex, estimate, keep_average). Why do we need one more? Okay, first let me start by showing the current problems, and then we will go to solutions.
The basic problem is that, unfortunately, none of the existing flushing implementations (including both MySQL's native adaptive flushing and the ones in Percona Server) can properly handle a heavy write load on slow storage. Our last invention, “keep_average”, does a very good job on systems based on SSD/Flash storage, but it is not so good for regular slow hard drives.
Let me state the following: If you have a lot of memory (and this is not rare nowadays, for example Cisco UCS C250), your database fits into memory, and your write rate is significant, then the slow disk is, by definition, not able to keep up with all the changes you do in memory.
Here is a graph for the tpcc-mysql benchmark (100W ~ 10GB of data, 12G innodb_buffer_pool_size, 1G innodb_log_file_size, MySQL 5.5.10 with innodb_adaptive_flushing=ON (default)). The problem here is that the system has quite slow hard drives (2 hard disks in software RAID0).

As you can see, after the initial warmup, throughput drops constantly, all the way to zero, and may stay in that state for a long time (minutes). This comes from the fact that MySQL performs changes in memory at a faster rate than it can write to disks. Eventually it gets into an “async” state, where InnoDB tries to flush as many pages as possible in order to keep the checkpoint age fitting into the InnoDB transaction logs, and this makes the situation even worse (these are the periods of zero throughput on the graph).
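To make the mechanism concrete, here is a minimal toy model in Python of how the checkpoint age grows when changes arrive faster than the disks can flush them. It is only a sketch: the function name, the rates, and the 75% “async” threshold are assumptions chosen for illustration, not values taken from InnoDB or from the benchmark above.

```python
def simulate_checkpoint_age(log_capacity_mb, change_rate_mb_s, flush_rate_mb_s,
                            async_fraction=0.75, seconds=600):
    """Return the second at which the checkpoint age reaches the (assumed)
    async threshold, or None if it never does within `seconds`."""
    # Assumed threshold: a fixed fraction of the total log capacity.
    async_limit = log_capacity_mb * async_fraction
    age = 0.0
    for t in range(1, seconds + 1):
        # Each second, changes add to the backlog and flushing removes from it.
        age = max(0.0, age + change_rate_mb_s - flush_rate_mb_s)
        if age >= async_limit:
            return t
    return None


# Hypothetical numbers: 2048 MB of log, 20 MB/s of changes, 12 MB/s of flushing.
print(simulate_checkpoint_age(2048, 20, 12))  # 192
# If flushing keeps up with the changes, the threshold is never reached.
print(simulate_checkpoint_age(2048, 10, 12))  # None
```

With these made-up numbers the backlog grows by 8 MB every second, so the threshold is reached after 192 seconds. When the flush rate matches or exceeds the change rate, it is never reached.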
MySQL’s general recommendation for this case is to increase innodb_io_capacity and decrease innodb_max_dirty_pages_pct so as to have fewer dirty pages. But I call that tuning by shooting yourself in the left foot. (You can’t run fast with a broken foot, right?) And it does not actually work, as MySQL is not able to keep the number of dirty pages within the given innodb_max_dirty_pages_pct limit.
Another possible solution would be to increase innodb_log_file_size, but that: 1) only delays the problem until later; 2) increases recovery time (and that is an important factor with slow disks); and 3) MySQL does not support innodb_log_file_size > 4GB (it is supported in Percona Server).
To make things more interesting, let’s touch on the topic of flushing of neighbor pages. While doing some research, we found it works in a way we really did not expect. My understanding was that flushing neighbor pages was implemented to perform sequential writes where possible and avoid random writes, which is quite critical for hard drives. What we found is that InnoDB does it in the following way. Say we want to flush page P. InnoDB looks in an area of 128 pages around page P, and flushes all the pages in that area that are dirty. To illustrate, say we have an area of memory like this:
...D...D...D....P....D....D...D....D
where each dot is a page that does not need flushing, each “D” is a dirty page that InnoDB will flush, and P is our page.
So, as a result of how it works, instead of performing 1 random write, InnoDB will perform 8 random writes. I do not have a good explanation for why it is implemented this way.
To make the situation even worse, the flushed neighbor pages are counted toward the number of pages we asked to be flushed. For example, say we see that to make an improvement in the checkpoint_age we need to flush 8 pages. InnoDB flushes page P and its 7 neighbors (which are not really neighbors), and then stops (alas, 8 pages flushed). That is, instead of flushing the 8 pages that we would expect to be flushed, InnoDB flushes 1 needed page and 7 random pages, which may not even be relevant for improving the checkpoint age.
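The accounting described above can be sketched with a small Python model. This is an illustration of the behavior as described in this post, not InnoDB code: the function name and the string representation of the area are hypothetical, and the assumption that adjacent pages can be merged into one write is a simplification.

```python
def flush_with_neighbors(area, budget):
    """Model one neighbor flush.

    area:   string where 'P' is the page we want to flush, 'D' is a dirty
            neighbor, and '.' is a page that does not need flushing.
    budget: number of pages we asked to be flushed.
    Returns (pages_flushed, write_ios, budget_left).
    """
    # Every dirty page in the area is flushed together with P.
    flushed = [i for i, page in enumerate(area) if page in "DP"]

    # Assumption: only directly adjacent pages can share one write,
    # so each separate run of flushed pages costs one random write.
    write_ios = 0
    previous = None
    for i in flushed:
        if previous is None or i != previous + 1:
            write_ios += 1
        previous = i

    # Neighbors are charged against the budget just like the page we wanted.
    return len(flushed), write_ios, budget - len(flushed)


# The area from the example above: 8 pages flushed with 8 random writes,
# and the whole budget of 8 is spent on a single needed page.
print(flush_with_neighbors("...D...D...D....P....D....D...D....D", 8))  # (8, 8, 0)
# Truly adjacent dirty pages would need only one write.
print(flush_with_neighbors("PDD", 8))  # (3, 1, 5)
```

In the first call the budget is exhausted after flushing only one page that was actually requested, which is why the number of pages needed to advance the checkpoint is so hard to calculate.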
This makes calculating how many pages we need to flush extremely hard (read “impossible”), and this is one of the reasons why innodb_adaptive_flushing is not able to keep up.
What if we disable flushing of neighbor pages (we have the option innodb_flush_neighbor_pages in Percona Server)? Well, it may help to some extent, but remember that hard disks love sequential operations, and your throughput in this case may turn out significantly worse.
So what is the solution? This was exactly the topic of our research. We set these goals:
- Provide stable throughput by avoiding big jumps in flushing pages.
- Make the algorithm independent of innodb_io_capacity and innodb_max_dirty_pages_pct. (This is important for systems where I/O capacity may vary a lot, like EC2 systems, or is affected by periodic tasks like backup or heavy background queries.)
- If we see that flushing to disk is not able to keep up with changes in memory, we need to have a throttling mechanism that will limit the rate of changes in memory.
- Optimize the flushing of sequential neighbor pages in a way that makes sense.
We have made good progress, and I will keep you posted on our ideas and results.


Peter,
If you have a “D.PDD” situation, where “.” is a page that is not in the buffer pool, you can’t really flush the whole range with a single IO; you can only do it with 2 IOs. I’m not sure whether under some conditions they will be performed within the same disk rotation; in my benchmarks I could not see it. Yes, if the situation is “DCPDD”, where C is a clean page, you can flush all the pages; we just have to be careful to see how flushing a page that was not changed will play with all the other aspects of InnoDB.
Patrick,
When we speak about throttling, we mean throttling done to even out performance: instead of having 100 q/sec for 5 min and then 0 for 5 min, our goal is to give you 50 q/sec for 10 minutes, or as much as your flushing speed allows you to get.
Peter,
With the “.” notation I meant that the page is in the buffer pool; in the post these notations are for areas of memory. Still, 2 IOs could be better than 5, but in this case we could do only 1 IO by flushing the PDD range.
I didn’t think about the other aspects, and this is interesting. Do you mean that we can’t make sure the clean page won’t change while we are running the decision logic about what to flush? In this specific DCPDD case it may be OK in some cases, but what if we want to flush a DDDCCCPDDD range? Is there any way to tell that the middle C page didn’t become dirty while we are doing all this? Apart from some kind of range flushing mutex, which doesn’t sound good. Even serializable isolation won’t provide a locking mechanism covering this.
Peter,
Right now we flush all dirty pages in the area, so PDD will be done with one write. We have been thinking about experimenting with flushing clean pages too, if it allows us to get bulkier writes. What I mean is mainly the internal interaction in InnoDB. The code change to push different dirty pages to disk is a lot simpler and safer than working out whether it is safe to make InnoDB flush pages that are already clean.
Peter, what’s the potential failure mode of flushing a clean page? Naively it’d seem like a perfectly safe, albeit unnecessary, operation.
Patrick,
It is mostly a matter of code. InnoDB is not designed to flush clean pages from the buffer pool, so we need to ensure we can do it properly. There are also good questions, such as: do clean pages need to go to the doublewrite buffer? Probably not, and if so, how can we write them while skipping the doublewrite buffer?
Dimitri,
There are a lot of parameters to play with, but we want to make it as tuning-free as possible.
One of the parameters that it is acceptable to change is innodb_log_file_size, as it defines the balance between performance and recovery time.
So setting innodb_log_file_size as small as possible is not under consideration.
If we went that way, we could just flush all pages as fast as possible. In fact, I tried this way of flushing, and performance is pretty stable, at around 5-10 NOTPM.
> Then I’m curious:
> – why XtraDB “estimate” mode was not ok here?..
We still fall into async mode. The algorithm is not able to keep up flushing at the needed rate.
> – are you sure there was no LRU flushing involved in parallel?..
I am pretty sure. LRU flushing is active when we need to replace dirty pages in order to read new pages. As we have a lot of free pages there, LRU flushing does not kick in.
> – what is the impact when “neighbor flushing” is OFF / ON ?..
In some cases “neighbor flushing” OFF may help, but in fact the average throughput is worse than the average throughput with ON.
> – did you try to test without O_DIRECT ?..
Nope
> – did you test with innodb_flush_log_at_trx_commit=1 or other?..
Nope