In our recent release of Percona Server 5.5.19 we introduced a new value for innodb_flush_neighbor_pages: cont. With it we are trying to address the problem of InnoDB flushing.

There is actually also a second fix, for what we think is a bug in InnoDB, where it blocks queries when it does not need to (I will refer to it as the “sync fix”). In this post, however, I will focus on innodb_flush_neighbor_pages.

By default, InnoDB flushes so-called neighbor pages, which are really not neighbors at all.
Say we want to flush page P. InnoDB looks at an area of 128 pages around page P and flushes all the pages in that area that are dirty. To illustrate, say we have an area of memory like this: ...D...D...D....P....D....D...D....D, where each dot is a page that does not need flushing, each “D” is a dirty page that InnoDB will flush, and P is our page.
So, as a result of how this works, instead of performing 1 random write, InnoDB will perform 8 random writes.
This is quite far from the original intention of flushing as many pages as possible in a single sequential write.
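To make the selection logic concrete, here is a minimal sketch of the default “area” behavior. This is illustrative pseudologic, not InnoDB's actual code; the aligned 128-page window and the page numbers are assumptions for the example.

```python
# Hypothetical sketch of the default "area" neighbor flushing: around a
# target page P, scan a 128-page window and flush every dirty page in it,
# even when those dirty pages are nowhere near each other on disk.

AREA_SIZE = 128  # pages scanned around the target page (illustrative)

def area_neighbors(dirty, target, area_size=AREA_SIZE):
    """Return the page numbers the 'area' policy would flush.

    dirty  -- set of dirty page numbers
    target -- the page we actually wanted to flush
    """
    low = (target // area_size) * area_size   # aligned window (assumption)
    high = low + area_size
    return sorted(p for p in dirty if low <= p < high)

# Example matching the picture above: 8 scattered dirty pages around
# page 64 turn one intended flush into 8 random writes.
dirty = {3, 17, 33, 64, 80, 95, 110, 120}
print(area_neighbors(dirty, 64))  # all 8 dirty pages in the window
```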

So we added the new innodb_flush_neighbor_pages=cont mode. With it, only a truly sequential write is performed.
That is, in the case ...D...D...D..DDDPD....D....D...D....D, only the following pages will be flushed:
...D...D...D..FFFFF....D....D...D....D (marked as “F”)
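The “cont” selection can be sketched the same way: starting from the target page, extend left and right only while the neighbors are also dirty. Again, this is an illustrative sketch under the same assumptions, not the server's actual implementation.

```python
# Hypothetical sketch of the "cont" policy: flush only the contiguous run
# of dirty pages around the target, so the flush is one sequential write.

def cont_neighbors(dirty, target):
    """Return the contiguous run of dirty page numbers around the target."""
    lo = target
    while lo - 1 in dirty:   # extend left while the neighbor is dirty
        lo -= 1
    hi = target
    while hi + 1 in dirty:   # extend right while the neighbor is dirty
        hi += 1
    return list(range(lo, hi + 1))

# Same shape as the post's picture ...D..DDDPD...: only the run around
# P (page 64) is flushed; the isolated dirty pages are left alone.
dirty = {3, 17, 61, 62, 63, 64, 65, 80, 95}
print(cont_neighbors(dirty, 64))  # the run 61..65
```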

Besides “cont”, in Percona Server 5.5.19 innodb_flush_neighbor_pages also accepts the values “area” (the default) and “none” (recommended for SSDs).
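For reference, the policy can be selected from the configuration file; a minimal my.cnf fragment, assuming a Percona Server 5.5.19 install:

```ini
# my.cnf fragment -- pick one of: area (default), cont, none (SSD)
[mysqld]
innodb_flush_neighbor_pages = cont
```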

What kind of effect does it have? Let’s run some benchmarks.

We repeated the same benchmark I ran in Disaster MySQL 5.5 flushing, but this time we used two servers: a Cisco UCS C250 and an HP ProLiant DL380 G6.

First, the results from the HP ProLiant.

Throughput graph:

Response time graph (the y-axis has a logarithmic scale):

As you can see, with “cont” we are able to get a stable line. And even with the default innodb_flush_neighbor_pages, Percona Server has smaller dips than MySQL.

Now, to show the effect of the “sync fix”, let’s compare Percona Server 5.5.18 (without the fix) and 5.5.19 (with the fix).

You can see that the fix keeps queries running in cases where before there was a “hard” stop and no transactions were processed.

The previous result may give you the impression that “cont” guarantees a stable line, but unfortunately that is not always the case.

Here are the results (throughput and response time) from the Cisco UCS C250 server:

As you can see, on this server we have longer and deeper periods when MySQL gets stuck in flushing, and in such cases innodb_flush_neighbor_pages=cont only helps to relieve the problem, not solve it completely.
That, I believe, is still better than a complete stop for a significant amount of time.

The raw results, scripts, and various CPU/IO metrics are available from our Benchmarks Launchpad.

10 Comments
Olivier Doucet

Hi Vadim,
Interesting tests, as always. Just to share my own benchmarks: this problem can also be seen when storage is fully in memory (using /dev/shm on a Linux server with 128GB of memory). So the stalls are definitely not due to the storage itself, but to the InnoDB engine. To my mind, this is proof that something can be done about this problem 🙂
What was the CPU usage you observed? Is it the same when the stalls happen?

Peter Zaitsev

Vadim,

Great results! I assume this means the change fixes the problem for some workloads but does not fix it completely (though it makes things better) for others. Indeed, it is possible to create a case where it is impossible for the server to keep up with flushing dirty pages, as they can be dirtied at a faster rate than they are flushed. The only way to keep uniform performance would be to throttle the rate at which pages can be made dirty to match the speed at which the disk can write.

You also call “D….D….D…D” flushes a random write, which is not quite the case, as the writes are not completely random – they are likely to be physically close, and as such no disk seek is likely to be required (for a single drive). Yet it is still a far cry from sequential write performance. From what I tested on conventional drives, it looks like performing 4 such writes with holes between them makes the hard drive do only one write per rotation, which is very expensive.

Henrik Ingo

Vadim: So it is the same benchmark on two different servers? Could you then summarize what is different about the two servers that might cause this?

Henrik Ingo

Ok, so you have fixed the problem on the HP gear, but insert a twice-as-fast server (let’s assume disk performance is roughly the same) and the same problem comes back. Makes sense.

So it’s a battle you can’t win without giving up on disks… or at least without a really powerful disk subsystem.

Ryan S

Mr Tkachenko,

This might be of interest
http://blogs.innodb.com/wp/2012/04/new-flushing-algorithm-in-innodb/

Info on O_DSYNC being faster than O_DIRECT
http://dba.stackexchange.com/questions/1568/clarification-on-mysql-innodb-flush-method-variable/1575#1575

Even though MySQL 5.6 is still a development release, I went here to check out the improvements in MySQL 5.6:
http://dev.mysql.com/tech-resources/articles/whats-new-in-mysql-5.6.html

These improvements sound relevant to page flush and I/O problems:

Explicit Partition Selection, Split Kernel Mutex, Multi-Threaded Purge, Separate Flush Thread, Pruning the InnoDB Table Cache

This article was about scaling inserts, so I don’t know if it will help for selects and updates:
http://mysqlopt.blogspot.com/2012/06/mysql-how-to-scale-inserts.html

[Quote]
“However, you can reach better through put (inserts per second) with less threads. So after dropping number of threads on both clusters by 50% initially – taking it to 20-20 sessions. The problem almost disappeared and when we further reduced number of threads to 10-10 sessions, the problem disappeared!”

“Also writing the redo logs, binary logs, data files in different physical disks is a good practice with a bigger gain than server configuration” [End Quote]

This is another article on log file sizes and stalls/lockups; even though the lock issue might be solved for this, I still think the idea can be used to solve other problems with flushing:

http://www.mysqlperformanceblog.com/2012/05/24/binary-log-file-size-matters/

[Quote]
“Here’s what we did: we have reduced the size of binary log file from default 1GB (some systems have it set to 100MB in their default my.cnf) down to 50MB and we never saw this problem ever again. Now the files were removed much more often, they were smaller and didn’t take that long to remove.”

“Note that on ext4 and xfs you should not have this problem as they would remove such files much faster”
[End Quote]

innodb_flush_log_at_trx_commit = 0 (this seems to help a lot with write performance)

The flush problem seems to be a chasing-our-tails problem.

Use more eager writing in parallel instead of lazy writing to disk, set a lower dirty ratio to start flushing, and terminate worker flushing threads at a faster rate.

There should be better ways of allowing reads and writes without locking, i.e. a special algorithm.

I think something like the Deadline scheduler would help, because it imposes a time limit on I/O requests, so flushing would not have to wait on stalled reads/writes.
I believe the Deadline scheduler is not the default scheduler on Linux, but it was recommended for ext4.

If they made an SSD as fast as RAM, or if one were to RAID 10 a bunch of SSDs, then we could put InnoDB straight on SSD and not even use any RAM, and it would remove this problem?

Including/improving some sort of cache to disk (not RAM) for MySQL, like Varnish for HTTP web servers, should speed up MySQL, since the data is already technically on the disk.

Also, btw, does anyone know when Percona 5.6 is going to be released, or if I should switch to MySQL 5.6 for the flushing and other benefits?

Ryan S

I forgot to add this

The fractal tree idea sounds like it would help increase throughput.

Use an LSM tree for inserts/merges and a B-tree for search queries; this would be a hybrid setup with a superindex between the two to balance cheap LSM inserts against B-tree searches.

The block sizes could be increased, which would make fragment cleanup on the block easier to do, or prevent fragmentation from happening at all.

Alfie John

Was innodb_flush_neighbor_pages removed from 5.6? 5.6.12-60.3 says it’s an unknown variable.

Hrvoje Matijakovic

Hi Alfie,

Yes, that variable has been removed. The Improved InnoDB I/O Scalability patches have been replaced by the improvements and changes in MySQL 5.6, although Percona may make further improvements in the future.

This variable has been replaced by the upstream variable innodb_flush_neighbors (http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_neighbors).