Swapping has always been bad for MySQL performance, but it is even more critical for HA systems. Avoiding swap is so important for HA that NDB Cluster basically forbids calling malloc after the startup phase, which is one reason for its rather complex configuration.

Probably most readers of this blog know (or should know) about the Linux swappiness setting, which basically controls how much weight Linux gives to the file cache when deciding what to keep in memory. Since the file cache is not important for InnoDB, we add “vm.swappiness = 0” to “/etc/sysctl.conf”, run “sysctl -p”, and we are done.
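For reference, that amounts to roughly the following (the sysctl.conf path may differ slightly by distribution):

# Apply immediately, without a reboot
sysctl -w vm.swappiness=0

# Make the setting persistent across reboots
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -p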

Swappiness solves part of the swapping issue, but not all of it. On NUMA systems the picture is more complex, and swapping can occur because of a memory imbalance between the physical CPUs, meaning the sockets, not the cores. Jeremy Cole explained this here and here. In summary, you need to interleave the memory allocation of the MySQL process using the numactl utility, drop the file cache and pre-allocate the InnoDB buffer pool with the innodb_buffer_pool_populate option.
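A rough sketch of such a setup, assuming Percona Server (which provides innodb_buffer_pool_populate) and an illustrative mysqld path; adjust to your own init scripts:

# 1. Drop the file cache so the large buffer pool allocation does not
#    compete with cached pages concentrated on a single NUMA node
sync && echo 3 > /proc/sys/vm/drop_caches

# 2. Start mysqld with its memory interleaved across all NUMA nodes
numactl --interleave=all /usr/sbin/mysqld --defaults-file=/etc/my.cnf &

# 3. In my.cnf, pre-allocate and touch every buffer pool page at startup
#    [mysqld]
#    innodb_buffer_pool_populate = 1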

That solves most of the swapping issues but… I recently ended up in a situation where, after having done all of that, a server was still swapping episodically for a few minutes at a time, enough to cause serious issues to the Pacemaker setup of which the server is one of the nodes. That caused resource failovers. A resource failover when there is no fencing, and when the servers are unresponsive because of swapping, is pretty bad for a cluster, and we often ended up in strange, unexpected states. What was I missing?

In order to figure it out, I started gathering metrics with tools like pt-stalk, vmstat, top, iostat, etc., all saved to disk with the current timestamp in front of each line. I quickly found that the issue happened during the backups, but also during the push of the backups to S3. Why? The push to S3 was especially puzzling since MySQL was not involved. The server was running with nearly 7GB of free memory, but the push to S3 involves a split operation to create 5GB files, since the backup is about 28GB and the S3 file upload size limit is 5GB. So, about 28GB was written to disk in a short while. Here’s the vmstat output during the split process, collected with a small timestamping wrapper like the sketch below; the cron job started at 10am.
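The exact collection script does not matter much; a wrapper along these lines produces output in the format shown below (the log path is just an example):

# Prefix each vmstat sample (every 2 seconds) with the current UTC timestamp
vmstat 2 | while read line; do
    echo "$(date -u) $line"
done >> /var/log/vmstat-timestamped.log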


Fri Mar 22 10:03:22 UTC 2013 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
Fri Mar 22 10:03:22 UTC 2013 r b swpd free buff cache si so bi bo in cs us sy id wa st
Fri Mar 22 10:03:22 UTC 2013 2 1 291992 224892 138848 7254284 0 0 148504 476 24190 24233 8 6 81 5 0
Fri Mar 22 10:03:24 UTC 2013 2 1 291992 222228 138848 7258064 0 0 194040 96 22443 20425 7 6 83 5 0
Fri Mar 22 10:03:26 UTC 2013 3 1 291992 224088 138852 7260196 0 0 172386 156 27143 27637 8 7 80 5 0
Fri Mar 22 10:03:28 UTC 2013 1 4 291992 219104 138832 7262820 0 0 174854 160 18893 17002 6 5 83 6 0
Fri Mar 22 10:03:30 UTC 2013 1 2 291992 324640 138836 7153440 0 0 143736 132 19318 17425 7 5 76 12 0
Fri Mar 22 10:03:32 UTC 2013 0 2 291984 292988 138840 7183556 0 0 138480 1206 19126 16359 3 5 81 12 0
Fri Mar 22 10:03:34 UTC 2013 4 0 291984 216932 138832 7255856 0 0 169072 936 20541 16555 3 7 83 8 0
Fri Mar 22 10:03:36 UTC 2013 3 7 300144 242052 138836 7216444 0 4080 53400 53714 18422 16234 2 7 72 18 0
Fri Mar 22 10:03:38 UTC 2013 0 4 395888 355136 138844 7197992 0 47872 3350 125712 24148 21633 3 4 70 23 0
Fri Mar 22 10:03:40 UTC 2013 4 4 495212 450516 138864 7208188 0 49664 2532 164356 30539 21120 2 3 81 13 0

Note the free column going down and the cache column going up, and then swapping starting at 10:03:36. Also, the server is reading a lot (~150MB/s) but doing very few physical writes. The writes were cached, and the write cache was causing the problem. It turned out that vm.dirty_ratio was set to 20 on a 48GB server, allowing nearly 10GB of “write cache”. A quick survey of the servers I have access to showed that the common values are 20 and 40. I simply set vm.dirty_ratio to 2 (about 1GB of write cache) and the issue has been gone since.

So, add the Linux dirty-page settings to your verification list. You can set either “vm.dirty_ratio” or “vm.dirty_bytes”, but keep in mind that only one is used at a time; setting one to a non-zero value sets the other to zero.
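Checking and adjusting these settings looks like the following; the value of 2 is what worked here on a 48GB server, not a universal recommendation:

# Show the current values (only one of the two is non-zero)
sysctl vm.dirty_ratio vm.dirty_bytes

# Apply a lower limit immediately
sysctl -w vm.dirty_ratio=2

# Persist it across reboots
echo "vm.dirty_ratio = 2" >> /etc/sysctl.conf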

7 Comments
Alex

This is one great find! Thanks!

Dane

Nice write up. I’m not clear what was being swapped out starting at 10:03:36… Is Linux swapping out mysql malloc’d memory to make room for all those dirty buffers in the write cache?

Vojtech Kurka

We are also using:

vm.dirty_expire_centisecs = 200
vm.dirty_writeback_centisecs = 100

to make the i/o throughput more stable. The default timeout is quite long and can cause (micro) stalls when flushing a huge amount of data to disk/controller at once.

ethaniel

Setting vm.dirty_ratio to 2 can kill your system if a big write happens (like it just did to mine).

It’s better to set:
vm.dirty_ratio=20
vm.dirty_background_ratio=2

This way you can sustain a sudden surge of writes of up to 20% of your RAM, while starting to write to disk at 2%.