I was working with a customer today investigating MySQL over DRBD performance issues. His basic question was why there is so much overhead with DRBD in his case, when it is commonly said DRBD should add no more than 30% overhead.

The truth is – because of how DRBD works, it does not add a fixed overhead that could be quoted as 10% or 80%. You really need to understand how DRBD works, and how your IO system is used, to know how much overhead to expect.

First, let’s talk about what kind of IO performance you care about when running MySQL over DRBD. Your reads are going to be serviced from the local hard drive; it is only writes which suffer the overhead of DRBD.
If you’re using MySQL with Innodb (and running MyISAM with DRBD makes little sense anyway) you have two kinds of writes to care about: background random IO coming from buffer flush activity, which is typically not latency critical and rarely the problem, and log writes, which are critical and latency sensitive.

What many people do not realize is that MySQL has to deal with multiple logs and perform multiple flushes when operating in the maximum safety configuration, which is what DRBD users often prefer – after all, if you could afford to lose transactions, why bother with DRBD at all instead of just using MySQL replication?

So assuming you have innodb_flush_log_at_trx_commit=1 and sync_binlog=1, you get 4 “sync” operations per commit in MySQL 5.0: an event is written to the binary log to “prepare” the XA transaction, then a record is written to the Innodb log to commit it, then a record is written to the binary log to commit the transaction, followed by the transaction commit in the Innodb storage engine.
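
To make that concrete, here is a tiny illustrative sketch in Python (just a model of the sequence as described above, not anything taken from the MySQL source) that enumerates the per-commit sync operations under these settings:

```python
# Per-commit sync sequence with innodb_flush_log_at_trx_commit=1 and
# sync_binlog=1, in the order described above (illustrative model only,
# not derived from the MySQL source code).
COMMIT_SYNC_SEQUENCE = [
    ("binary log", "XA prepare event"),
    ("Innodb log", "commit record"),
    ("binary log", "transaction commit event"),
    ("Innodb log", "transaction commit"),
]

print(f"sync operations per commit: {len(COMMIT_SYNC_SEQUENCE)}")
for log, record in COMMIT_SYNC_SEQUENCE:
    print(f"  fsync {log}: {record}")
```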

Moreover, these operations can cause (depending on the filesystem and configuration) even more synchronous IO. MySQL binary logs are not preallocated as Innodb logs are, which means both data AND metadata have critical changes on each binary log write, and both have to be synced to avoid data loss. With “data journaling”, though, this can be done with a single journal write, with the actual modifications applied in parallel.
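
As a rough illustration of the data-versus-metadata point, here is a simplified sketch (assuming a Linux filesystem and Python’s os module; this is not what MySQL actually does internally). A file that grows on every write, like a binary log, needs fsync() so the changed metadata reaches disk too, while a preallocated file, like an Innodb log, can get away with fdatasync():

```python
import os

def append_and_sync(path: str, payload: bytes, preallocated: bool) -> None:
    """Write a log record and make it durable.

    For a file that grows on every write (binary log style) the size changes,
    so both data and metadata must reach disk: fsync().  For a preallocated,
    fixed-size file (Innodb log style) only the data changes, so fdatasync()
    is enough on Linux.  Simplified sketch, not MySQL's actual IO code.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        if preallocated:
            os.fdatasync(fd)  # data only: size and allocation stay the same
        else:
            os.fsync(fd)      # data + metadata: the file just grew
    finally:
        os.close(fd)

append_and_sync("binlog.000001", b"commit event\n", preallocated=False)
append_and_sync("ib_logfile0", b"commit record\n", preallocated=True)
```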

Anyway, the point is there are a lot of synchronous writes, and they are fully serialized because group commit in Innodb was broken in 5.0 and is still not fixed in 5.1 to date. So this is the access pattern which is often going to define your MySQL on DRBD performance.

Let’s now see how DRBD works so we can analyze how much overhead we should expect. In the case of a single outstanding synchronous request it is pretty easy.

When a request goes to a DRBD device, in addition to being performed locally it has to be performed on the remote device, which means sending the block over the network, executing the write on the remote node, and responding with an ACK – assuming DRBD is configured with maximum durability settings.

So overhead can vary a lot depending on the speed of the disk subsystem and network.

If you do not have a BBU on your disk subsystem, you will be able to do up to about 200 serialized synchronous IO operations per second, meaning each operation will take about 5000 microseconds. At the same time, on a gigabit network the round trip to send 4096 bytes of data (a typical filesystem block size) and get the ACK packet back will be about 200 microseconds.

In such a case, even allowing for extra overhead besides the network IO, we’re speaking about 300 microseconds vs 5000 microseconds, and DRBD overhead can be well below 10%.
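
As a back-of-the-envelope check, here is the same arithmetic in a few lines of Python (the numbers are the illustrative figures from above, not measurements from the customer’s system):

```python
# Latency model for one serialized synchronous write on DRBD, using the
# illustrative numbers from the text (all values in microseconds).
LOCAL_SYNC_NO_BBU_US = 5000    # ~200 serialized fsyncs/sec without a BBU cache
DRBD_NETWORK_PENALTY_US = 300  # GbE round trip for a 4KB block + ACK, with slack

total_us = LOCAL_SYNC_NO_BBU_US + DRBD_NETWORK_PENALTY_US
overhead_pct = 100.0 * DRBD_NETWORK_PENALTY_US / LOCAL_SYNC_NO_BBU_US
print(f"per-write latency: {total_us} us, DRBD overhead: {overhead_pct:.0f}%")  # ~6%
```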

The problem, however, is that such a configuration will likely have extremely poor performance because of the number of synchronous operations required – which we counted to be 4 per transaction commit, or could be 6 or more depending on how the filesystem does its job. Rates of 40-50 transactions per second are not encouraging for many applications.
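
Roughly, with every sync fully serialized (again using the illustrative figures from above):

```python
# Serialized sync writes cap the commit rate: with group commit broken, each
# transaction pays for all of its syncs back to back.
PER_SYNC_US = 5000 + 300  # slow disk plus the DRBD network penalty, as above

for syncs_per_commit in (4, 6):
    commit_us = syncs_per_commit * PER_SYNC_US
    print(f"{syncs_per_commit} syncs/commit -> ~{1_000_000 // commit_us} commits/sec")
# prints ~47 and ~31 commits/sec, in line with the 40-50 tps mentioned above
```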

My typical advice: if you want to have things highly durable and performing well at the same time, you must have a BBU (battery backup unit) on your hardware RAID card, or something with the same effect, especially as these have become pretty cheap these days.

With a good RAID card, as I have benchmarked, you can get over 10000 req/sec of in-cache write speed (and this is what a lot of transaction log and binary log writes are) – in this case request execution takes about 100 microseconds, while the DRBD overhead remains at 200-300 microseconds per request. This means DRBD slows down writes by 3x or more.
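
The same arithmetic with a battery-backed write cache (again illustrative numbers, not benchmark results):

```python
# With a BBU the local sync is fast, so the fixed DRBD network penalty dominates.
LOCAL_SYNC_BBU_US = 100        # ~10000 in-cache writes/sec on a good RAID card
DRBD_NETWORK_PENALTY_US = 250  # GbE round trip + ACK, roughly 200-300 us

slowdown = (LOCAL_SYNC_BBU_US + DRBD_NETWORK_PENALTY_US) / LOCAL_SYNC_BBU_US
print(f"write latency: {LOCAL_SYNC_BBU_US} us -> "
      f"{LOCAL_SYNC_BBU_US + DRBD_NETWORK_PENALTY_US} us (~{slowdown:.1f}x slower)")
```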

There is nothing wrong with DRBD – it is a great piece of software, and it is not that DRBD itself is slow; it is the relative performance of the system components that causes such overhead.

If you’re looking for less DRBD overhead with fast storage (i.e. with a BBU), you should be looking at low latency network communications.

It would actually be rather interesting to see DRBD gain direct support for Infiniband or Dolphin interconnect sockets, which are getting low cost these days and could offer significant performance improvements for DRBD. You should already be able to use these over standard TCP/IP, which by itself makes things a lot faster than 1Gb Ethernet.
Though this is only a theory – I have not had a chance to play with DRBD on this kind of network yet.

As a summary: in this case we investigated the system, and the “surprising” overhead of DRBD matched the performance capacity of the system components perfectly.
As a lesson: do not take the overhead as a fixed number, but learn where this overhead comes from so you can figure out how much it will be for your system.

4 Comments
Florian Haas

Peter,

it is, as always, important to point out that “performance” in I/O is a combination of two distinct things that may affect applications in very different ways: throughput and latency. Your post talks mostly about latency, so that’s what I am going to tackle here.

Consider a local disk with about 3ms latency for a small (one sector, 512 bytes) write request. Consider further that with Gigabit Ethernet, we get about .1 to .3ms network latency. DRBD’s latency penalty amounts to roughly twice the network latency, so that would be .2 to .6ms or about 7 to 20% latency overhead.

Now when you introduce a RAID controller with battery-backed write cache to the setup (and you should), your local I/O latency decreases dramatically. Now your local writes “complete” in something like .1 to .3ms. So in relation, all of a sudden DRBD’s latency penalty “jumps up” (in percentage, not in absolute terms) to perhaps 100% or 300%. And there’s little you can do about this, because it’s almost impossible to tune the Gigabit Ethernet-based TCP/IP stack to deliver packets significantly faster.

It should be noted that unless your application is highly write intensive, you’ll probably not notice much of this. However, for extremely high write loads, DRBD currently does add comparatively large performance penalties percentage-wise. Which is subject to change when we move off Ethernet-based IP as our sole network protocol. Some of the low-latency network technologies you mention are already being worked on — stay tuned on the drbd-user mailing list for announcements.

And just in case other readers here are interested in learning more about DRBD performance tuning considerations, I encourage you to watch my blog (http://blogs.linbit.com/florian) for a free DRBD performance tuning webinar we will be announcing shortly.

Cheers,
Florian

Rob

Hi Peter,

I do have some experience of running DRBD over Infiniband. We used 3rd gen Mellanox PCI-E 20Gb Infiniband in conjunction with IPoIB-CM with a 32Kb MTU (useful for other uses, not DRBD). I get a total sync rate of around 300 MBytes/sec with multiple arrays syncing (this is limited by the arrays; TCP/IP performance manages about 700 MBytes/sec). It’s not really bandwidth I’m interested in though, it’s random write performance, which seems to be affected somewhat by DRBD.

RDMA (used by Infiniband and iWARP NICs) would likely help greatly to reduce the additional DRBD write latency (more important to most people) and increase usable bandwidth to about 1400 MBytes/second in each direction on current 4th gen 20Gb Infiniband cards.

I was told by Linbit 6 months ago that RDMA support was only on the maybe list for DRBD 9 and support for RDMA was quite a long way away. Have you heard any additional info on RDMA support in DRBD?

Venkata Krishnan

Hello Peter,

Great blog – I enjoyed reading it. BTW, I am following up on an old posting of yours, “How much overhead DRBD could cause?”.

You mentioned in the blog, and I quote…
……….
“If you’re looking for less DRBD overhead with fast storage (i.e. with a BBU), you should be looking at low latency network communications.

It would actually be rather interesting to see DRBD gain direct support for Infiniband or Dolphin interconnect sockets, which are getting low cost these days and could offer significant performance improvements for DRBD. You should already be able to use these over standard TCP/IP, which by itself makes things a lot faster than 1Gb Ethernet.
Though this is only a theory – I have not had a chance to play with DRBD on this kind of network yet.

……….

I would like to post an update that we have used DRBD with Dolphin Interconnect and fast storage (using our new PCI Express based flash storage product called StorExpress). The combination of Dolphin’s StorExpress and its low latency/high bandwidth communication model using the Dolphin Express interconnect/SuperSockets software provides an ideal environment for supporting an extremely efficient replicated block storage mechanism. As a result, we have seen around a 50% improvement in performance over a 10 Gigabit Ethernet solution across a wide-ranging set of block sizes. Dolphin Interconnect along with StorExpress-based storage provides the best replication performance for DRBD.

For more details, please see
http://www.dolphinics.com/products/storexpress.html
http://www.dolphinics.com/solutions/dolphin_express_drbd_speedup_linbit.html

Thanks.