It has taken years to get the integration between the operating system kernel, device drivers and hardware right so that caches and IO modes behave correctly. I remember us having a lot of trouble with fsync() not flushing the hard drive write cache, so data could potentially be lost on power failure. Happily most of these problems are resolved now on “real hardware”, and I’m pretty confident running Innodb with either the default (fsync based) or O_DIRECT innodb_flush_method. Virtualization however adds yet another layer, and we need to ask again whether IO is really durable in virtualized environments. My simple testing shows this may not always be the case.
I’m comparing O_DIRECT and fsync() based single page writes to a 1MB file using SysBench on Ubuntu with ext4, running on VirtualBox 4.0.4 on Windows 7 on my desktop computer with a pair of 7200 RPM hard drives in RAID1. Because there is no write cache I expect no more than a bit over 100 writes per second: even when there is no disk seek, we have to wait for the disk head to complete a full rotation. I’m however getting rather bizarre results:
Using fsync()
pz@ubuntu:~/test$ sysbench --num-threads=1 --test=fileio --file-num=1 --file-test-mode=rndwr --file-total-size=1M --max-requests=10000000 --max-time=60 --file-fsync-freq=1 run
sysbench 0.4.10:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 1Mb each
1Mb total file size
Block size 16Kb
Number of random requests for random IO: 10000000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!

Time limit exceeded, exiting...
Done.

Operations performed:  0 Read, 1343 Write, 1343 Other = 2686 Total
Read 0b  Written 20.984Mb  Total transferred 20.984Mb  (357.62Kb/sec)
   22.35 Requests/sec executed

Test execution summary:
    total time:                          60.0863s
    total number of events:              1343
    total time taken by event execution: 0.0808
    per-request statistics:
         min:                                  0.04ms
         avg:                                  0.06ms
         max:                                  0.34ms
         approx.  95 percentile:               0.06ms

Threads fairness:
    events (avg/stddev):           1343.0000/0.00
    execution time (avg/stddev):   0.0808/0.00
Ignore the response times here, as they time only the writes, not the fsync() calls. 22 fsync() requests per second is pretty bad, though I assume it can be realistic given the overhead.
Now let’s see how it looks using O_DIRECT:
pz@ubuntu:~/test$ sysbench --num-threads=1 --test=fileio --file-num=1 --file-test-mode=rndwr --file-extra-flags=direct --file-total-size=1M --max-requests=10000000 --max-time=60 run
sysbench 0.4.10:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 16384
1 files, 1Mb each
1Mb total file size
Block size 16Kb
Number of random requests for random IO: 10000000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!

Time limit exceeded, exiting...
Done.

Operations performed:  0 Read, 33900 Write, 339 Other = 34239 Total
Read 0b  Written 529.69Mb  Total transferred 529.69Mb  (8.8278Mb/sec)
  564.98 Requests/sec executed

Test execution summary:
    total time:                          60.0019s
    total number of events:              33900
    total time taken by event execution: 37.5364
    per-request statistics:
         min:                                  0.10ms
         avg:                                  1.11ms
         max:                                259.69ms
         approx.  95 percentile:               5.31ms

Threads fairness:
    events (avg/stddev):           33900.0000/0.00
    execution time (avg/stddev):   37.5364/0.00
I would expect results rather similar to the fsync() test, yet we’re getting numbers 20 times better… surely too good to be true. This means I can be fairly sure the system is lying about write completion when we’re using O_DIRECT IO.
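The “too good to be true” arithmetic is easy to check. A quick sketch, using the 7200 RPM figure and the two request rates reported above, and the same pessimistic one-full-rotation-per-write assumption from earlier in the post:

```python
# Upper bound on durable random writes/sec for a single 7200 RPM disk
# with no write cache, assuming each write waits for a full rotation.
ROTATIONS_PER_SEC = 7200 / 60        # 120 rotations per second

max_durable_writes = ROTATIONS_PER_SEC  # one durable write per rotation

observed_fsync = 22.35     # requests/sec from the fsync() test above
observed_odirect = 564.98  # requests/sec from the O_DIRECT test above

print(observed_fsync <= max_durable_writes)    # plausible
print(observed_odirect <= max_durable_writes)  # physically impossible
```

The fsync() number fits comfortably under the ~120/sec ceiling; the O_DIRECT number exceeds it several times over, which is only possible if some layer is acknowledging writes before they hit the platter.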
What is my takeaway from this? I did not have time to research whether the problem is related to VirtualBox or to some configuration issue, and things may be working correctly in your case. The point is that virtualization adds complexity, and there are at least some cases where you may be lied to about IO completion. So if you’re relying on the system being able to recover from a power failure or VM crash, make sure to test it carefully.
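One simple way to run such a check yourself, without sysbench, is a small probe that does synchronous single-page writes and reports the achieved rate. A minimal sketch; the file path, page size and iteration count are arbitrary choices for illustration, not taken from the tests above:

```python
import os
import time

def durable_write_rate(path, iterations=200, page_size=4096):
    """Time write()+fsync() pairs and return achieved writes/sec."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    page = b"\0" * page_size
    try:
        start = time.monotonic()
        for _ in range(iterations):
            os.pwrite(fd, page, 0)  # rewrite the same page in place
            os.fsync(fd)            # ask for it to reach stable storage
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return iterations / elapsed

rate = durable_write_rate("/tmp/fsync_probe.dat")
# On a cache-less 7200 RPM disk anything much above ~120/sec is suspicious;
# thousands per second mean some layer acknowledges writes it hasn't persisted.
print("%.0f durable writes/sec" % rate)
```

Run it against the actual data volume your database uses; a rate far above what the physical spindles can deliver is the same red flag the O_DIRECT numbers above are waving.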
This is one of those reasons why I often recommend not virtualizing high volume database servers. It makes me feel like an old fuddy-duddy whenever I say it, because I really do think virtualization represents the future, but it’s so opaque.
With physical hardware, I can understand *exactly* what the hardware is capable of, what it is doing, and what I can expect out of it. Once virtualized, there’s a host of other factors at work that can impact performance and throughput. To add insult to injury, most of these secondary factors can’t be monitored from within the virtual machine (the matrix problem), and a lot of the tools within the instance will give erroneous or misleading results because of the virtualization layer.
I’d virtualize lab machines or low volume production databases without a concern in the world, but with a busy production box I still prefer to run on bare metal.
By default VirtualBox virtual machines ignore fsync requests. You can make them honor flushes though:
http://www.virtualbox.org/manual/ch12.html#id411470
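For reference, the setting that manual section describes is per-disk extradata on the VM. A sketch for the first disk on the IDE controller; the VM name and the controller/LUN path are placeholders you need to adjust for your own setup (e.g. `ahci` instead of `piix3ide` for SATA disks):

```shell
# Tell VirtualBox NOT to ignore flush (fsync) requests from the guest,
# here for IDE controller 0, LUN 0 of a VM named "MyVM".
VBoxManage setextradata "MyVM" \
  "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0
```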
If you are using VMware, there is no way to guarantee flushes unless the HOST operating system is Microsoft Windows, or you use ESX Server. If you use a Linux host, the virtual machines (Linux or otherwise) won’t have flushing guarantees.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008542
What are the implications here for a Linux/Amazon EC2 instance?
Any virtual server load testing, in my opinion, isn’t worth spending time on unless it’s on ESX (at least 3.5 or newer) and on a fibre channel SAN (not iSCSI, not local storage). Use a decent queue depth too; the defaults seem to be way too low (depending on your array, of course). I use raw device maps on all of my databases so I can use SAN-controlled snapshots; I’m pretty sure that takes out some extra overhead associated with VMFS as well (not my priority though, I want the SAN snapshots).
In fact, testing on a poor virtualization platform may actually be harmful, because it may scare people away from virtualization when a good platform on solid hardware really runs quite well when configured right. (VMware is so easy to configure that it’s not too uncommon for inexperienced people to horribly misconfigure it, for example by massively overcommitting CPU cores with lots of SMP VMs, or massively overcommitting memory and causing swapping.)
Nate,
I’m not doing any performance benchmarks here. I’m just noticing that the IO speeds are too good to be true, so at least in some cases you can get data loss in case of a crash.
Richard – I can’t say anything about Amazon EC2 based on these tests. The problem with Amazon EC2 is that you can’t really test what happens on power failure of the physical box, though most deployments assume the instance is dead in such cases anyway and switch to a replica or something similar.
Patrick,
Yeah. I don’t think anybody is saying that virtualization will increase performance. All the discussion and work is rather about reducing overhead.
I kind of put virtualization in the same bucket as SAN: there are reasons to use it, such as convenience, manageability and efficient use of resources. Just don’t put performance on that list.