August 20, 2014

Automatic replication relaying in Galera 3.x (available with PXC 5.6)

A decade ago MySQL folks were in love with the concept of a relay slave for MySQL high availability across data centers.  A relay is a single slave in a remote data center that receives replication from the global master and, in turn, replicates to all the other local slaves in that data center.  This saved a lot of bandwidth, especially back in the days before memcached when scaling reads meant lots of slaves.  Sending 20 copies of your replication stream cross-WAN gets expensive.

In Galera and Percona XtraDB Cluster (PXC), by default when a transaction commits on a given node it is sent to every other node in the cluster from that node.  That is, the actual writeset payload (the RBR events) are sent over the network to every other node, so the bandwidth to replicate is roughly:

If any of your nodes happen to be in a remote data center, the replication is still duplicated for each remote node, much like a master-slave topology without a relay.

Replication traffic with default Galera tuning (and pre-3.x)

To illustrate this I setup a 3 node PXC 5.6 cluster test environment: (it would work the same on PXC 5.5 and Galera 2.x)

nosegments

 

This isn’t the best design for HA, but let’s assume nodes 2 and 3 are in a remote data center.  If I use some simple iptables ACCEPT rules in the OUTPUT chain, I can easily track the amount of bandwidth replication uses on each node in a simple 1 minute sysbench update-only test that writes only on node1:

We can see that node1 sends a full 18M of data to both node2 and node3.  The traffic from nodes 2 and 3 between each other and back to node1 is group communication, you can think of it as replication acknowledgements and other cluster communication.

Replication traffic with Galera 3 WAN segments configured

Galera 3 (available with PXC 5.6) introduces a new feature called WAN segments that basically implements the relay-slave concept, but in a more elegant way.  To enable this, we simply assign each node in a given data center a common gmcast.segment integer in wsrep_provider_options.  Each data center must have a distinct identifier and each node in that data center should have the same segment.

If we apply this configuration to our above environment where node1 is in gmcast.segment=1 and nodes 2 and 3 are in gmcast.segment=2, we get the following network throughput from the same 1 minute test:

We can now clearly see that our replication is following this path, using node2 as a relay:

segments

 

So our hypothetical WAN link here between segment 1 and segment 2 only needs a single copy of the replication stream instead of one per remote node.

But why is this better than a regular old async relay slave?  It’s better because node2 was chosen dynamically to be the relay, I did not configure anything special besides the segment designation.  The cluster could have just as easily chosen node3.  If node2 failed, node3 will simply take over relay responsibilities (assuming there were more nodes).

Further, as I understand the feature, there’s nothing forcing all replication to get relayed through a single node in each segment.  Any given transaction from any given node in the cluster might use any node in a given segment as a relay.  The relaying is actually per-transaction and fully dynamic.  No fuss, no muss.

What about commit latency?

Astute readers know that node1 still must ultimately get acknowledgement from all other nodes before responding to the client.  When we are using segment relays, this should add some latency to commit time.

In my testing I was on a single virtual LAN, but my commit latency averages came out about pretty close.  I also setup a WAN environment on AWS where node1 was in us-east-1 and nodes 2 and 3 were in us-west-1 and the difference in commit latency was effectively nil.

chart_1 (1)

The additional latency is about 1ms in the LAN test case, these are 3 VMs on the same physical host, so there’s probably some additional overhead here in play.  The high latency between the data centers fully masks the relaying overhead in a true WAN case.

Here are the raw results from the WAN tests:

No Segments

With Segments

 

Test for yourself

I built my test environment on both local VMs and in AWS using an open source Vagrant environment you can find here: https://github.com/jayjanssen/pxc_testing/tree/5_6_segments (check the run_segments.sh script as well as the README.md and documentation for the submodule).

We’ve also released Percona Xtradb Cluster 5.6 RC1 with Galera 3.2 , the above Vagrant environment should pull the latest 5.6 build in automatically.

About Jay Janssen

Jay joined Percona in 2011 after 7 years at Yahoo working in a variety of fields including High Availability architectures, MySQL training, tool building, global server load balancing, multi-datacenter environments, operationalization, and monitoring. He holds a B.S. of Computer Science from Rochester Institute of Technology.

Comments

  1. gpfeng.cs says:

    Very interesting feature, the outgoing bandwidth of the committing node can be saved with segment configuration.

Speak Your Mind

*