I have previously addressed how multi-node writing causes unexpected deadlocks in PXC; at least, they are unexpected unless you know how Galera replication works.  This is a complicated topic, and I personally feel like I’m only just starting to wrap my head around it.

The magic of Galera replication

The short of it is that Galera replication is not doing a full two-phase commit to replicate (where all nodes commit synchronously with the client), but is instead something Codership calls “virtually synchronous” replication.  At face value, that could be either clever marketing or clever engineering.  However, I believe it really is clever engineering, and that it is probably the best compromise between performance and data protection out there.

There’s likely a lot more depth we could cover in this definition, but fundamentally “virtually synchronous replication” means:

  • Writesets (or “transactions”) are replicated to all available nodes in the cluster on commit (and enqueued on each).
  • EDIT: Writesets are then “certified” on every node (in order).  This certification should be deterministic on every node, so every node either accepts or rejects the writeset.  There is no way for a node to tell the rest of the cluster a writeset didn’t pass certification (or this would be a form of two-phase commit), so the only way nodes might get different certification results is if there is a Galera bug.
  • Enqueued writesets are applied on those nodes independently and asynchronously from the original commit on the source node.  And:
  • At this point the transaction can and should be considered permanent in the cluster.  But how can that be true if the writeset has not yet been applied everywhere?  Because:
    • Galera can do conflict detection between different writesets, so enqueued (but not yet committed) writesets are protected from local conflicting commits until our replicated writeset is committed. AND:
    • When the writeset is actually applied on a given node, any locking conflicts it detects with open (not-yet-committed) transactions on that node cause that open transaction to get rolled back.
    • Writesets being applied by replication threads always win.
So why is this “virtually synchronous”?  Because simply getting our writesets to every node on commit means they are guaranteed to (eventually) apply, so we don’t have to force all nodes to commit simultaneously to guarantee cluster consistency as we would in a two-phase commit architecture.
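
To make first-committer-wins concrete, here is a minimal sketch of what it looks like to the application when two nodes commit conflicting changes at nearly the same time (the table t, the row id, and the exact error text are illustrative assumptions, not output from a real test run):

  -- On node1:
  BEGIN;
  UPDATE t SET val = 1 WHERE id = 100;
  COMMIT;   -- returns OK: this writeset wins the race and replicates cluster-wide

  -- On node2, at nearly the same moment:
  BEGIN;
  UPDATE t SET val = 2 WHERE id = 100;
  COMMIT;
  -- typically fails with ERROR 1213 (40001): Deadlock found when trying to get lock;
  -- try restarting transaction.  This is the "unexpected deadlock", and the right
  -- response is simply to retry the whole transaction.

Note that node2 gets this error even though nothing was visibly touching that row locally; the conflict arrived via replication.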

Seeing when replication conflicts happen

This brings me to my topic for today, the mysterious SHOW GLOBAL STATUS variables called:

  • wsrep_local_cert_failures
  • wsrep_local_bf_aborts

I found that understanding these helped me understand Galera replication better.  If you are experiencing the “unexpected deadlocks” problem, then you are likely seeing one or both of these counters increase over time, but what do they mean?
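
If you just want to keep an eye on them, a quick check on each node looks like this (only the LIKE patterns below are my own choice):

  SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
  SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';
  -- or everything wsrep-related at once:
  SHOW GLOBAL STATUS LIKE 'wsrep%';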

Actually, they are two sides of the same coin (kind of).  Both apply to some local transaction getting aborted and rolled back, and the difference comes down to when and how that transaction conflict was detected.  It turns out there are two possible ways:

wsrep_local_cert_failures

The Galera documentation states that this is the:

Total number of local transactions that failed certification test.

What is a local certification test?  It’s quite simple: on COMMIT, Galera takes the writeset for this transaction and does conflict detection against all pending writesets in the local queue on this node.  If there is a conflict, the deadlock-on-COMMIT error happens (which shouldn’t happen in normal InnoDB), the transaction is rolled back, and this counter is incremented.

To put it another way, some other conflicting write from some other node was committed before we could commit, and so we must abort.

This local certification failure is only triggered by a Galera writeset comparison operation, comparing a given to-be-committed writeset against all other writesets enqueued on the local node.  The local transaction always loses.

EDIT: Certification happens on every node.  A ‘local’ certification failure is only counted on the node that was the source of the transaction.
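
That last point is easy to check: after a COMMIT loses as described above, only the node that issued the losing COMMIT should see the counter move (the before/after values here are of course just illustrative):

  -- On the node whose COMMIT just failed with the deadlock error:
  SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';   -- e.g. was 41, now 42

  -- On every other node in the cluster:
  SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';   -- unchanged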

wsrep_local_bf_aborts

Again, the Galera documentation states that this is the:

Total number of local transactions that were aborted by slave transactions while in execution.

This kind of sounds like the same thing, but this is actually an abort from the opposite vector:  instead of a local transaction triggering the failure on commit, this is triggered by Galera replication threads applying replicated transactions.

To be clearer: a transaction was open on this node (not yet committed), and a conflicting writeset from some other node that was being applied caused a locking conflict.  Again, the first committer (from some other node) wins, so our open transaction is again rolled back.  “bf” stands for brute force: any transaction can get aborted by Galera any time it is necessary.

Note that this conflict happens only when the replicated writeset (from some other node) is being applied, not when it’s just sitting in the queue.  If our local transaction got to its COMMIT and this conflicting writeset was in the queue, then it should fail the local certification test instead.

A brute force abort is only triggered by a locking conflict between a writeset being applied by a slave thread and an open transaction on the node, not by a Galera writeset comparison as in the local certification failure.
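
Here is a rough sketch of that scenario (again, the table t, the row id, and the exact error text are illustrative assumptions):

  -- On node1, an open, not-yet-committed local transaction takes a row lock:
  BEGIN;
  SELECT * FROM t WHERE id = 100 FOR UPDATE;

  -- On node2, a conflicting transaction commits (autocommit) and its writeset replicates:
  UPDATE t SET val = 2 WHERE id = 100;

  -- When node1's applier (slave) thread applies that writeset, node1's open
  -- transaction is brute-force aborted.  Its next statement or COMMIT fails:
  UPDATE t SET val = 1 WHERE id = 100;
  -- typically: ERROR 1213 (40001): Deadlock found when trying to get lock;
  -- try restarting transaction, and wsrep_local_bf_aborts is incremented on node1.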

Testing it all out

So this is the part of the post where I wanted to show these counters being incremented using the examples from my last post.  Those examples should trigger brute-force aborts, but they didn’t seem to increment either of these counters on any of my test nodes.   Codership agrees this seems like a bug and is investigating.  I’ll update this post if and when an actual bug is opened, but I have seen these counters being incremented in the wild, so any bug is likely some edge case.

By the way, I can’t think of how to reliably produce local certification errors without just a lot of fast modifications to a single row, because those depend on the replication queue being non-empty and I don’t know any way to pause the Galera queue for a controlled experiment.

 

18 Comments
Mark Grennan

To produce certification errors, try using a “wayback”. 🙂 I.e., clever use of the sleep() function.

This post by Peter Laursen might help you understand what I’m thinking about.

http://blog.webyog.com/2012/11/13/the-wonderful-way-back-machine-in-mysql/?utm_source=rss&utm_medium=rss&utm_campaign=the-wonderful-way-back-machine-in-mysql

Sergey

I don’t have enough understanding of bf_abort.
As I understand it, after the certification stage the transaction is physically committed on the source node. So, after a physical commit failure on the local (slave) node, Galera would have to abort a transaction that has already been committed on the source node. Is that correct, and does Galera do it?

Henrik Ingo

Hi Jay

I easily get a small percentage of deadlocks just by running sysbench:

– create a table with small number of rows
– use small buffer pool / disk bound workload
– use lots of sysbench threads, multi-master of course
– expect 1-5% of deadlocks

With bigger tables and an in-memory workload the number of deadlocks is negligible, so you may get less than one per minute.

henrik

Sergey

Sorry, but I still don’t understand.

For example, one of my questions: where does Galera run local certification, on both sides (master and slave)? I previously thought it was the master’s job: it must send a certification request to the slave and get some response from the slave (success or not), and only if the response is success can it do the physical commit. Is that wrong?

gpfeng.cs

Thanks Jay Janssen. It helped me a lot.

gpfeng.cs

I’m quite clear about the difference between wsrep_local_cert_failures and wsrep_local_bf_aborts after reading this post.

From what you say:

“Writesets (or “transactions”) are replicated to all available nodes in the cluster on commit (and enqueued on each).
Enqueued writesets are applied on those nodes independently and asynchronously from the original commit on the source node. And:
At this point the transaction can and should be considered permanent in the cluster”

My conclusion: only local trxs will be aborted or rolled back, while broadcast and enqueued trxs are promised to commit.

But I got confused when I read this post: http://www.mysqlperformanceblog.com/2012/01/19/percona-xtradb-cluster-feature-2-multi-master-replication/ (the diagram there is similar to http://www.codership.com/wiki/doku.php?id=certification).

From what I see: broadcast trxs (sitting in the queue) need to be certified and may be rolled back, as the total order service guarantees that every node enqueues trxs in the same order and then produces the same result for the certification test.

Is there something wrong with my understanding?

gpfeng.cs

Jay Janssen

Good to see the reply!

So wsrep_local_cert_failures and wsrep_local_bf_aborts only record the abort/rollback information for local trxs on the source node (before they are replicated).

One more question:
Is there a status variable that counts the certification test failures of replicated (enqueued) trxs? A high rate of certification test failures should be found and explained.

gpfeng.cs

>>local_cert_failure is after replication, but before final commit on the source node of the transaction

from your comment:
–>
The flow is:

source node commit: local certification, enqueue on all other nodes, commit locally

then:

–<

My understanding: when a local trx tries to commit, it must first pass the local certification and then it will be broadcast; otherwise it will be rolled back and nothing is replicated to other nodes.

Or is the true logic that at COMMIT the trx is first bundled into a writeset and replicated to all nodes, and then the local certification is tested on the source node?

I'm not so sure.

gpfeng.cs

My understanding of the flow:

1. source node commit
2. local certification on the source node (local_cert_failure will be incremented if it fails)
3. enqueue on all other nodes if it passes the local certification test
4. certification test (all nodes produce the same result)
5. commit on all nodes if it passes the certification test (this will cause other local open trxs to roll back), or rollback if it fails

gpfeng.cs

Good work!

>>If you certified locally before replication, you would have a race condition where certification might fail the second time on some other node, and now you have 2 phase commit.

Why is 2PC required here?

gpfeng.cs

Hi Jay Janssen:

Sorry, I must have misunderstood you. I guessed that you meant 2PC would be required if the logic were certification before replication, based on the following:

>>If you certified locally before replication, you would have a race condition where certification might fail the second time on some other node, and now you have 2 phase commit.

And now everything is clear to me. Thank you!

Pablo

Hi,

What happens if the apply fails on just one node? For example, if the disk is full. Does the apply retry indefinitely?
In general, what are the guarantees that a writeset will succeed on all nodes? Are there any?

Thanks!