I talk with a lot of people who are really interested in Percona XtraDB Cluster (PXC), and mostly they are interested in PXC as a high-availability solution.  But what they tend not to think much about is whether moving from asynchronous to synchronous replication is right for their application.

Facts about Galera replication

There are a lot of facts about Galera that come into play here, and it isn’t always obvious how they will affect your database workload.  For example:

  • Transaction commit takes approximately the worst packet round trip time (RTT) between any two nodes in your cluster.
  • Transaction apply on slave nodes is still asynchronous from the client commit (except on the originating node, where the transaction is committed locally).
  • Galera prevents conflicting writes to these pending transactions while they are in flight by returning deadlock errors.  (This is actually a form of eventual consistency where the client is forced to correct the problem before it can commit.  It is NOT the typical form of eventual consistency, known as asynchronous repair, that most people think of.)

Callaghan’s Law

But what does that all actually mean?  Well, at the Percona Live conference a few weeks ago I heard a great maxim that really helps encapsulate a lot of this information and put it into context with your application workload:

[In a Galera cluster] a given row can’t be modified more than once per RTT

This was attributed to Mark Callaghan from Facebook by Alexey Yurchenko from Codership at his conference talk.  Henceforth this will be known as “Callaghan’s law” in Galera circles forever, though Mark didn’t immediately recall saying it.

Applied to a standalone Innodb instance

Let’s break it down a bit.  Our unit of locking in Innodb is a single row (well, the PRIMARY KEY index entry for that row).  This means that typically on a single Innodb node we can have all sorts of modifications floating around as long as they don’t touch the same row.  Row locks are held for modifications until the transaction commits, and that commit takes an fsync to the redo log by default.  So, applying Callaghan’s law to single-server Innodb, we’d get:

[On a single-node Innodb server] a given row can’t be modified more than once per fsync

You can obviously relax that by simply not fsyncing every transaction (innodb_flush_log_at_trx_commit != 1), or work around it by fsyncing to memory (a battery- or capacitor-backed write cache), etc., but the principle is basically the same: if we want this transaction to persist after a crash, it has to get to disk.
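
For illustration, here is how that is typically relaxed (a minimal sketch; innodb_flush_log_at_trx_commit is a dynamic variable, so it can be changed at runtime):

    -- default: fsync the redo log on every commit (full durability)
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;

    -- relaxed: write to the OS cache at commit, fsync roughly once per second
    -- (a crash can lose up to about a second of committed transactions)
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;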

This has no effect on standard MySQL replication from this instance, since MySQL replication is asynchronous.

What about semi-sync MySQL replication?

It’s actually much worse than Galera.  As I illustrated in a blog post last year, semi-sync must serialize all transactions and wait for them one at a time.  So, Callaghan’s law applied to semi-sync is:

[On a semi-sync replication master] you can’t commit (at all) more than once per RTT. 

Applied to a Galera cluster

In the cluster we’re protecting the data as well, but not by ensuring it goes to disk (though you can do that too).  We protect the data by ensuring it gets to every node in the cluster.

But why every node and not just a quorum?  Well, it turns out transaction ordering really, really matters (really!).  By enforcing replication to all nodes, we can (simultaneously) establish global ordering for the transaction, so by the time the original node gets acknowledgement of the transaction back from all the other nodes, a GTID will also (by design) be established.  We’ll never end up with non-deterministic ordering of transactions as a result.

So this brings us back to Callaghan’s law for Galera.  We must have group communication to replicate and establish global ordering for every transaction, and the expense of doing that for Galera is approximately one RTT between the two nodes in the cluster that are furthest apart (regardless of where the commit comes from!).  The least amount of data we can change in Innodb at a time is a single row, so the most any single row can be modified cluster-wide is once per RTT.

What about WAN clusters?

Callaghan’s law applies to WAN clusters as well.  LANs usually have sub-millisecond RTTs; WANs usually have anywhere from a few milliseconds up to several hundred.  This opens a large window where a given row can only be updated a few times per second at best: with a 100ms WAN RTT, for example, any single row can be modified at most about 10 times per second, cluster-wide.

Some things the rule does not mean on Galera

  • It does NOT mean you can’t modify different rows simultaneously.  You can.
  • It does NOT mean you can’t modify data on multiple cluster nodes simultaneously.  You can.
  • It does NOT set a lower bound on performance, only an upper bound.  The best performance you can expect is modifying a given row once per RTT; it could get slower if apply times start to lag.

So what about my application?

Think about your workload.  How frequently do you update any given row?  We call rows that are updated heavily “hotspots”.

Examples of hotspots

Example 1: Your application is an online game and you keep track of global achievement statistics in a single table with a row for each stat; there are just a few hundred rows.  When a player makes an achievement, your application updates this table with a statement along these lines (the table and column names here are illustrative):
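
    -- a hypothetical hotspot update; all names are assumptions
    UPDATE achievement_stats
       SET counter = counter + 1
     WHERE achievement = 'live_forever';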

How many players might accomplish this achievement at the same time?

Example 2: You have users and groups in your application.  These are maintained in separate tables, and there also exists a users_groups table to define the relationship between them.  When someone joins a group, you run a transaction that adds the relationship row to users_groups, but also updates groups with some metadata, along these lines (column names are illustrative):
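
    -- a hypothetical join-group transaction; column names are assumptions
    BEGIN;
    INSERT INTO users_groups (user_id, group_id) VALUES (100, 1);
    UPDATE groups SET joined = NOW() WHERE group_id = 1;   -- hotspot: the parent group row
    COMMIT;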

How often might multiple users join the same group?

Results

In both of the above examples you can imagine plenty of concurrent clients attempting to modify the same record at once.  But what will actually happen to the clients who try to update the same row within the same RTT?  This depends on which node in the cluster the writes are coming from:

From the same node: This will behave just like standard Innodb.  The first transaction will acquire the necessary row locks and hold them while it commits (which will take approximately one RTT).  The other transactions will lock wait until the lock(s) they need are available.  The application just waits in those cases.

From other nodes: First to commit wins.  The others that try to commit AFTER the first and while the first is still in the local apply queue on their nodes will get a deadlock error.
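
What the losing clients see is MySQL’s standard deadlock error, so from the application’s perspective it looks just like an ordinary Innodb deadlock:

    ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction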

So, the best case (which may not be best for your application’s throughput) is more write latency into the cluster.  The worst case is that your transactions won’t commit at all and you have to take some action you normally wouldn’t have had to take.

Workarounds

If your hotspots were really bad in standalone Innodb, you might consider relaxing the fsync:  set innodb_flush_log_at_trx_commit to something besides 1 and suddenly you can update much faster.  I see this tuning very frequently for “performance” reasons when data durability isn’t as crucial.  This is fine as long as you weigh both options carefully.

But in Galera you cannot relax synchronous replication.  You can’t change the law; you can only adapt around it.  But how might you do that?

Write to one node

If your issue is really the deadlock errors and not so much the waiting, you could simply send all your writes to one node.  This should prevent the deadlock errors, but will not change the lock waiting that your application will need to do for hotspots.
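
One common way to arrange this is a load balancer in front of the cluster with a single active writer node, for example with HAProxy (a hypothetical config; the node names and addresses are placeholders):

    # all writes go to node1; node2/node3 are used only if node1 fails
    backend pxc_writes
        mode tcp
        server node1 10.0.0.1:3306 check
        server node2 10.0.0.2:3306 check backup
        server node3 10.0.0.3:3306 check backup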

wsrep_retry_autocommit

If your hotspots are all autocommit updates, you can rely on wsrep_retry_autocommit to auto-retry the transactions for you.  However, each autocommit is retried only the number of times specified by this variable (the default is 1 retry).  This means more waiting, and after the limit is exceeded you will still get the deadlock error.
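
Raising the limit is a one-line change (the value shown is arbitrary):

    -- retry conflicting autocommit statements up to 4 times instead of once
    SET GLOBAL wsrep_retry_autocommit = 4;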

This is not implemented for full BEGIN … COMMIT multi-statement transactions, since it cannot be assumed that any application logic applied between the statements is safe to retry after the database state changes.

Retry deadlocks

Now we start to get into (*gasp*) territory where your application needs to be modified.  Generally, if you use Innodb, you should be able to handle deadlock errors in your application anyway.  Raise your hand if your application has that logic (I usually see fewer than 5 people out of 100 who do).

But, what to do?  Retrying automatically, or giving your end user a chance to retry manually are typical answers.  However, this means more latency waiting for a write to go through, and possibly some poor user experience.

Batch writes

Instead of updating global counters one at a time (from Example 1, above), how about maintaining the counter in memcache or redis and only flushing to the database periodically?
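
A sketch of the periodic flush, reusing the illustrative names from Example 1:

    -- every few seconds, drain the accumulated delta from memcache/redis and apply it
    UPDATE achievement_stats
       SET counter = counter + 57   -- 57 = delta drained from the cache
     WHERE achievement = 'live_forever';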

Change your schema

In Example 2, above, how about moving the ‘joined’ column to the users_groups table so we don’t need to update the parent group row so often?
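
With that change (continuing the illustrative schema from Example 2), joining a group no longer touches the hot parent row at all:

    -- 'joined' now lives on the relationship row; no UPDATE to groups is needed
    INSERT INTO users_groups (user_id, group_id, joined) VALUES (100, 1, NOW());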

Conclusion

Choosing a system to replicate your data across a distributed system requires trade-offs.  Most of us are used to the trade-offs we accept when deploying conventional stand-alone MySQL Innodb with asynchronous slaves.  We may not think about those trade-offs, but we’re making them (anyone obsessively checking slave position to ensure it’s caught up with the master?).

Synchronous replication with PXC and Galera is no different in that there are trade-offs, they just aren’t what we commonly expect.

If Callaghan’s law is going to cause you trouble and you are not prepared to adapt to work with it, PXC/Galera Synchronous replication is probably not right for you.

Comments
Mark Callaghan

Excellent description. I have been waiting for someone to begin writing about Galera at this level of detail.

I have not reviewed or tested semi-sync in official MySQL so I am surprised by this and don’t understand why it occurs. But my peers are interested in semi-sync now so we will verify this.
“Semi-sync performs dismally under concurrency.  Think about it: our single inserts took ~100ms, so we are effectively serializing our 32 clients with semi-sync replication.  Apparently each client must wait for its turn writing to the binlog and waiting for confirmation from the slave; there can be no parallelization here.”

Peter Zaitsev

Jay,

Great article !

A couple of comments.  First, I think it is important to point out that in your case you’re speaking about “commits” when it comes to modification; you can perfectly well modify the same row many times per RTT as long as you commit all those changes in one go.  Some other solutions might actually incur an RTT penalty for each update.

Second, I think as the technology evolves we should see more workarounds.  Queuing and batching technologies can be very interesting when used together with sync replication.  Hotspots can also often be reduced by using multiple rows (say 10) and modifying different rows to reduce contention… but your point lookup queries then have to be aggregates, i.e. running SUM() or MAX() for the examples you provided.

Todd Lipcon

The single-node example with fsync is probably correct given the current design of InnoDB, but it’s not fundamental — see the 2009 paper on Early Lock Release (http://infoscience.epfl.ch/record/152158) or the followup from VLDB ’10: http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

The paper doesn’t reference multi-node replicated systems, but I’m fairly certain the same optimization could be applied.

-Todd

Mark Callaghan

We implemented early lock release for InnoDB but didn’t write a paper about it. Unfortunately it broke assumptions for crash recovery so the change was removed.

Todd Lipcon

Interesting. Was this on a public branch where I might be able to see the discussion? Or FB-internal?

Serge Knystautas

Great article!

Shouldn’t you be able to use one of the Percona Toolkit tools to analyze where those hotspots are happening in a production environment by looking for deadlocks?  They would be resolving gracefully in a single-Innodb-instance scenario, but would then potentially require the application rewrite or the wsrep_retry_autocommit setting as a short-term solution.  Hopefully you could then judge whether you can move to sync replication simply by monitoring your live database rather than doing separate analysis or tests.

Bill Karwin

Jay, I can think of a scenario where you update a row multiple times before committing:

Your apps send frequent updates into a message queue. Then a consumer thread reads from the MQ and applies updates to the database, many MQ events per commit.

If a given system uses this strategy for hotspot activity, but otherwise everything goes through synchronous replication, then it mitigates the overhead of the RTT. The events in the MQ may not be available synchronously, but that might be okay for a given workload. Not all of an application’s traffic requires the same level of synchronicity.

Mark Callaghan

All this talk of queues and they are likely to be a worst-case for Galera because there is shared state that all transactions must update. Wonder if we need skip-locked or even more support to improve innodb-as-queue concurrency with and without Galera?
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2060739900346201280

Bill Karwin

No, I was trying to describe a system that uses a separate service for a message queue, not MySQL at all. E.g. ActiveMQ, RabbitMQ, Resque, etc. All applications would send updates to the MQ in lieu of writing directly to the database. Sometime later (asynchronously), a worker app would read items from the MQ and then apply them to the database. Possibly committing sets of events in a single transaction to reduce the overhead of the RTT.

I was just trying to think of a scenario that might address your comment, “I’m not sure how useful it is to update the same row multiple times in a single transaction.”

Robert Hodges

This is an excellent article.  The rule of thumb that updates on individual rows cannot happen more than once per network round-trip is the most succinct summary of the Galera conflict resolution algorithm I have seen to date.

To get Galera to work well applications really need to avoid conflicts as much as possible. This seems to be a fundamental truth of any multi-master approach. It does not really matter whether your replication approach is fully synchronous, asynchronous, or like Galera somewhere in between. The trade-off between the approaches seems to be what kind of mess you have to deal with when conflicts do occur.

Robert Hodges

Jay, following up on your reasoning it seems as if one of the places where Galera’s optimistic locking gets into trouble is large transactions that touch a lot of rows. These can happen through administrative actions for example. Say you want to change the value of a column for 1M rows in a single transaction. If there are any other updates on those rows within the RTT limit either those latter updates fail *or* the main update does.

In fact it seems in this case that the time window for conflicts is potentially far greater than the RTT limit, which is a lower bound. Any update that commits and gets into group communications from the time of the BEGIN on the large transaction could cause the large transaction to roll back. So the conflict window is actually time_of_large_transaction + RTT.

Does this reasoning seem correct or am I missing something?

Mark Callaghan

Jay – Yoshinori found one performance bug for semi-sync — http://bugs.mysql.com/bug.php?id=69341. Wonder if there are others.

Michael

@Mark The performance bug reported by Yoshinori was in reference to MySQL 5.6 in particular.  Is it correct to assume this is also occurring in 5.5?  It seems to be consistent with testing I’ve done against my 5.5.29 WAN-distributed Galera cluster.