August 22, 2014

Faster MySQL failover with SELECT mirroring

One of my favorite MySQL configurations for high availability is master-master replication, which is just like normal master-slave replication except that you can fail over in both directions. Aside from MySQL Cluster, which is more special-purpose, this is probably the best general-purpose way to get fast failover and a bunch of other benefits (non-blocking ALTER TABLE, for example).

The benefit is that you have another server with all the same data, up and running, ready to serve queries. In theory, it’s a truly hot standby (stay with me — that’s not really guaranteed). You don’t get this with shared storage or DRBD, although those provide stronger guarantees against data loss if mysqld crashes. And you can use the standby (passive) master for serving some SELECT queries, taking backups, etc as usual. However, if you do this you actually compromise your high-availability plan a little, because you can mask the lack of capacity that will result when one of the servers is down and you have to rely on just one server to keep everything on its feet.

If you need really high availability, you can’t load the pair of servers more than a single server can handle. (You can always use the passive server for non-essential needs — it doesn’t have to be completely dead weight.) As a result, some people choose to make the passive server truly passive, handling none of the application’s queries. It just sits there replicating and doing nothing else.

The problem is that the passive server’s caches start to get skewed to handle the write workload from replication, and not the read workload it will have to handle if there’s a planned or unplanned failover. This isn’t a big problem on small systems, but with buffer pools in the dozens of gigabytes (which is arguably “small” these days), it starts to matter a lot. Warming up a system so it’s actually responsive can take hours. As a result, the passive master isn’t truly hot anymore. It needs to handle the workload it’s supposed to be ready to take over. If you fail over to it, it might perform very badly — get unresponsive, cause tons of I/O, etc. In reality, it can be completely unusable for a long time.

To measure how much this really matters, I did some tests for a customer who was having troubles with this type of scenario. I used mk-query-digest (with some new features) to watch the traffic on the active master and replay SELECT queries against the passive one. I timed the results and ran them through the analysis part of mk-query-digest. A simple key lookup ran in tens of milliseconds on the active master, but executed for up to dozens of seconds on the passive one.

After a couple of hours of handling SELECT traffic, these same queries were responding nicely on the passive master, too.

Is that all? “Buffer pool warmed up, performance is better, case closed!” No. This isn’t as simple as it sounds on the surface. There are two things happening and both are important to understand.

The first, most obvious phenomenon is that the buffer pool gets skewed to handle the write workload. Since we’re running Percona’s patched server, we can actually measure what’s in the buffer pool. I measured the active master’s buffer pool with the following query:

I loaded this file into a table on my laptop with LOAD DATA INFILE and kept it for later. I did the same on the slave. Then I used mk-query-digest to watch the traffic on the active master:

After a bit I CTRL-C’ed it and it printed out the analysis of the time taken to run the queries against the passive master. I restarted it and after a few hours of this I did the same thing; the query timings were dramatically better now. Then I just let it keep running without any aggregation options to avoid any overhead of storing and analyzing queries. (I added –mirror and –daemonize options so it can run in the background and follow along when the passive/active roles switch.)

After a day or so of doing this, I re-sampled the buffer pool contents on the passive server. With all three samples stored in tables on my laptop, I wrote a query against these three sets of stats to find the top tables on the active server and left-join those against the tables on the passive server, with both a mixed workload from my mirrored SELECT statements and with the “pure” replication workload. I totaled the pages up into gigabytes. Here’s the result:

db_table active passive + SELECT passive
site.benefits 8.30 5.73 1.32
. 3.13 0.94 0.50
site.user_actions 2.55 4.09 6.29
site.user_achievements 1.36 1.20 0.35
site.clicks 1.26 3.05 5.13
site.actions_finished 1.14 0.46 0.74
site.ratings 0.91 0.89 0.48

The difference is clear. The buffer pool contains over 8G of data for the site.benefits table on the active master, but if you just put a replication workload on the server, that falls to 1.32G. Other tables are similar. The mixed workload with some SELECT queries mirrored is somewhere between the two.

One thing we don’t know is which pages are in the pool. Same table, same size of data doesn’t mean same buffer pool contents. An insert-only workload will probably fill the buffer pool with the most recent data; a mixed workload will usually have some different hot spot or mixture of hot spots, so it’ll bring different parts of the table into memory.

So that’s the first thing that’s happening. The second is the insert buffer. Notice the pages with no database or table name — the second row in the table above. Those are a mixture of things, but it’s overwhelmingly the insert buffer.

As Peter explained in his recent post on the insert buffer, the other thing the SELECTs do is keep the insert buffer in a production steady-state. The buffered records are forced to be merged by the SELECTs, and a lot more of the pages from the insert buffer are in the buffer pool, not on disk. So it’s not just the buffer pool that gets skewed with a write-only workload! The insert buffer can also cause terrible performance. There are some subtleties about exactly what’s happening that I’m still investigating and may write more about later, in this particular case.

So what can we conclude from this? Simply this: if you have a standby server that’s not under a realistic workload, you won’t be able to get good performance after a failover. You need to use some technique to mirror the read-only workload to the passive server. It doesn’t have to be the tools I used — it could be MySQL Proxy or a TCP sniffer or anything else. But if you need fast failover, you need some way to at least partially emulate a production workload on the standby machine.

PS: I see Robert Hodges just published an article on warm standby for PostgreSQL. Link love for interested readers.

About Baron Schwartz

Baron is the lead author of High Performance MySQL.
He is a former Percona employee.

Comments

  1. Baron. where to read about master-master MySQL replication in details?
    We ran into problems with master-master replication in case of nested sets storage of hierarchical data.

  2. You can read about it in our book High Performance MySQL Second Edition (soon to be translated to Russian, btw!). Honestly I think it’s the best source of practical, sound advice. There is a lot of bad information on the Internet.

    However, I don’t see any connection between m-m replication and nested sets, so maybe you have several things you need to think about. It sounds like a complex topic.

  3. Sebastian says:

    Do you actually recommend m-m replication?
    I tried it when it was “fresh” and a lot more hackish a few year back:
    it never performed reliable. Most important for me were my scripts to get it back up fast…
    Is it safe to use it these days under heavy load with MyISAM-tables?

    thanks ;-)

  4. pat says:

    Sebastian,

    I know of only three ways to get a consistent backup of myisam suitable for a replication seed. Maybe one of the guys here has a fourth (in which case I’d be curious myself), but here’s the three I know of:

    1) Stop the database server, copy the data directory (this is an outage)
    2) use mysqldump with the lock all tables option (this is an outage)
    3) if you’re using LVM, FLUSH TABLES WITH READLOCK … snap … RELEASE LOCK (this is a very brief outage)

    If you’re using innodb, you can just do a dump with –single-transaction in the flags and all should be well.

    We run a site with a large number of myisam tables that we’re seriously considering moving to innodb; myisam has a lot of issues around the whole concept of 100% availability.

  5. Sebastian,

    You should check out MMM.

    A very reliable daemon for running a highly available master-master cluster.

    Oren

  6. Ron says:

    Doesn’t flushing the buffer pool fix the problem for passive standby’s?

  7. Ron, I’m not sure what you mean.

  8. Hi Baron,

    This is an excellent article and real food for thought. Your workload simulation idea using a sniffer looks very tractable for a couple of reasons. First, the basic problem with read/write splitting is making an up-front decision when it’s safe to dispatch synchronously to a slave. Here we don’t care about the result of the query or even very greatly when it actually runs. Second, it seems it would not be necessary to run all queries or even a lot of them: a representative distribution should be sufficient to load popular pages into cache.

    If we could get a subset of sniffed packets delivered as consistent tcp/ip sessions I think we could adapt our Myosotis proxy to perform this kind of cache heating operation. It could be a simple add-on that has little or no impact on underlying HA mechanisms (i.e., does not make them more complex).

    Cheers, Robert

    p.s, thanks for the link

  9. @pat, there is a 4th way which also minimizes downtime: use rsync -avz.

    Steps:
    1) Use rsync -avz to copy the data folder to the new slave. This should take a long time and it will likely complain files change during read, which is fine.
    2) Repeat step 1. This should take considerably less time; basically only transfers the data that has changed during the initial copy.
    3) Shut down master mysqld and repeat step 1 again.
    4) Start master and slave mysqld.

    On small dbs step 2 can be skipped since very little is expected to have changed during initial copy, but on larger dbs it is beneficial to do two live synchronizations.

Speak Your Mind

*