When designing a new cluster, it is always advised to have at least three nodes (this is the case with PXC, but it's also true for PRM). But why, and what are the risks?

The goal of having more than two nodes (in fact, an odd number is recommended for this kind of cluster) is to avoid a split-brain situation. This can occur when quorum (which can be simplified as a "majority vote") is not honoured. A split-brain is a state in which the nodes lose contact with one another and then each tries to take control of shared resources or to provide the cluster service simultaneously.
On PRM the problem is obvious: both nodes will try to run the master and slave IPs and will accept writes. But what could happen with Galera replication on PXC?

OK, first let's have a look at a standard PXC setup (no special Galera options) with two nodes:
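For reference, a minimal wsrep-related configuration for such a setup could look like the lines below (the IP addresses are the ones used later in this post; the cluster name, provider path and SST method are just examples and will differ on your system):

[mysqld]
binlog_format = ROW
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = my2nodescluster
wsrep_cluster_address = gcomm://192.168.70.2,192.168.70.3
wsrep_sst_method = xtrabackup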

Two running nodes (percona1 and percona2); communication between the nodes is OK.
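A quick way to confirm this from the MySQL client is to look at the membership status variable (illustrative output, not a copy of the original one):

percona1 mysql> show global status like 'wsrep_incoming_addresses';
wsrep_incoming_addresses : 192.168.70.2:3306,192.168.70.3:3306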

Same output on percona2

Now let’s check the status variables:
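The check itself, and the values we expect on a healthy two-node cluster, look roughly like this (illustrative values, only the relevant variables are shown):

percona1 mysql> show global status like 'wsrep%';
wsrep_local_state_comment : Synced
wsrep_cluster_size        : 2
wsrep_cluster_status      : Primary
wsrep_connected           : ON
wsrep_ready               : ON
wsrep_local_index         : 0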

on percona2:

Only wsrep_local_index differs, as expected.

Now let’s stop the communication between both nodes (using firewall rules):
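The rule is most likely the same one shown again later in this post; on percona2 (192.168.70.3) it looks like this:

iptables -A INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT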

This rule simulates a network outage that makes connections between the two nodes impossible (e.g., a switch/router failure).

We can see that the node appears down, but we can still run some statements on it:

on node1:

on node2:
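What you would typically see at this point, on both nodes, is something like the following (illustrative values):

mysql> show global status like 'wsrep_cluster%';
wsrep_cluster_size   : 1
wsrep_cluster_status : non-Primary

mysql> show global status like 'wsrep_ready';
wsrep_ready : OFF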

And if you try to use the MySQL server:

If you try to insert data right when the communication problem occurs, here is what you will get:
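With default settings, the statement is refused. The errors reported in the original output were a "deadlock" for the statement caught in flight and an "unknown command" for subsequent ones, so something along these lines (test.t1 is just a placeholder table):

mysql> insert into test.t1 values (1);
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
mysql> insert into test.t1 values (1);
ERROR 1047 (08S01): Unknown command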

Is it possible anyway to have a two-node cluster?

So, by default, Percona XtraDB Cluster does the right thing; this is how it needs to work, and you don't suffer critical problems when you have enough nodes.
But how can we deal with that and prevent the resource from stopping? If we check the list of parameters on Galera's wiki, we can see that there are two options related to this:

– pc.ignore_quorum: Completely ignore quorum calculations. E.g. in case master splits from several slaves it still remains operational. Use with extreme caution even in master-slave setups, because slaves won't automatically reconnect to master in this case.
– pc.ignore_sb: Should we allow nodes to process updates even in the case of split brain? This is a dangerous setting in multi-master setup, but should simplify things in master-slave cluster (especially if only 2 nodes are used).

Let’s try first with ignoring quorum:

By ignoring the quorum, we ask the cluster not to perform the majority calculation used to define the Primary Component (PC). A component is a set of nodes that are connected to each other; when everything is OK, the whole cluster is one component. For example, if you have 3 nodes and 1 node gets isolated (2 nodes can see each other and 1 node can see only itself), we then have 2 components: the quorum calculation gives 2/3 (66%) for the 2 nodes that can communicate with each other and 1/3 (33%) for the isolated one. In this case the service is stopped on the node(s) where the majority is not reached. The quorum algorithm helps to select a PC and guarantees that there is never more than one Primary Component in the cluster.
In our two-node setup, when the communication between the two nodes is broken, the quorum is 1/2 (50%) on each node, which is not a majority… therefore the service is stopped on both nodes. In this case, service means accepting queries.

Back to our test, we check that the data is the same on both nodes:
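For example, with the hypothetical test table used above, a checksum on each node should return the same value at this point:

percona1 mysql> checksum table test.t1;
percona2 mysql> checksum table test.t1;
-- both nodes should report the same checksum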

Add pc.ignore_quorum and restart both nodes:

set global wsrep_provider_options="pc.ignore_quorum=true"; (this currently does not seem to work properly; I needed to change it in my.cnf and restart the nodes)

wsrep_provider_options = "pc.ignore_quorum = true"

Break the connection between both nodes again:

iptables -A INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT

and perform an insert; this first insert will take longer:
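Something like this (the timing is illustrative; the first statement hangs for a few seconds while the cluster membership is being re-evaluated):

percona1 mysql> insert into test.t1 values (10);
Query OK, 1 row affected (3.51 sec)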

and on node2 you can also add records:
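For example, still with the same placeholder table:

percona2 mysql> insert into test.t1 values (20);
Query OK, 1 row affected (0.00 sec)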

The wsrep status variables are like this now:
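On each node you would see something like this (illustrative values): each node now considers itself a Primary Component of size 1.

mysql> show global status like 'wsrep_cluster%';
wsrep_cluster_size   : 1
wsrep_cluster_status : Primary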

Then we fix the connection problem:

iptables -D INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT

Nothing changes; the data stays different:
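Comparing the placeholder table on both nodes would show the divergence:

percona1 mysql> select * from test.t1;   -- contains the rows inserted on percona1 only
percona2 mysql> select * from test.t1;   -- contains the rows inserted on percona2 only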

You can keep inserting data: it won't be replicated, and you will end up with two different versions of your inconsistent data!

Also, when we restart a node, it will request an SST or, in certain cases, fail to start like this:

Of course all the data that was written on the node we just restarted is lost after the SST.

Now let’s try with pc.ignore_sb=true:

When the quorum algorithm fails to select a Primary Component, we then have a split-brain condition. In our two-node setup, when a node loses the connection to its only peer, the default is to stop accepting queries to avoid database inconsistency. We can bypass this behaviour and ignore the split-brain by adding

wsrep_provider_options = "pc.ignore_sb = true" in my.cnf

Then we can insert on both nodes without any problem when the connection between the nodes is gone:
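For example, still with the same placeholder table, both inserts now succeed immediately while the nodes are isolated from each other:

percona1 mysql> insert into test.t1 values (100);
Query OK, 1 row affected (0.00 sec)

percona2 mysql> insert into test.t1 values (200);
Query OK, 1 row affected (0.00 sec)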

When the connection is back, the two servers act as if they were independent: they are now two single-node clusters.

We can see it in the log file:

Now, when we set both options to true with a two-node cluster, it acts exactly as when only pc.ignore_sb is enabled.
And if the local state seqno is greater than the group seqno, the node fails to restart. You again need to delete the grastate.dat file to force a full SST, and you again lose some data.
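On the node that refuses to start, forcing the full SST boils down to something like this (the datadir shown is the default one and may differ on your system):

# with mysqld stopped on the failed node
rm /var/lib/mysql/grastate.dat
service mysql start    # the node rejoins the cluster with a full SST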

This is why a two-node cluster is not recommended at all. Now, if you only have storage for two nodes, using the Galera arbitrator (garbd) is a very good alternative.

On a third node, instead of running Percona XtraDB Cluster (mysqld) just run garbd:
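A minimal invocation could look like the following; the group name must match wsrep_cluster_name and the addresses are the two data nodes (both values here are just examples):

garbd -a gcomm://192.168.70.2:4567,192.168.70.3:4567 -g my2nodescluster -d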

Currently there is no init script for garbd, but this is something easy to write, as it can run in daemon mode using -d.

If the communication fails between node1 and node2, they will still communicate, and eventually send their changes, through the node running garbd (node3). If one node dies, the other one keeps working without any problem, and when the dead node comes back it will perform an IST or SST.

In conclusion: a two-node cluster is possible with Percona XtraDB Cluster, but it's not advised at all because it will generate a lot of problems in case of an issue on one of the nodes. It's much safer to add a third node, even a fake one running garbd.

If you plan anyway to have a cluster with only two nodes, don't forget that:
– by default, if one peer dies or if the communication between both nodes is unstable, both nodes won't accept queries
– if you plan to ignore split-brain or quorum, you can very easily end up with inconsistent data

10 Comments
Peter Zaitsev

Fred,

I see that when you're trying to work with node1 you seem to get different error messages, "deadlock" and "unknown command". Why is this? Wouldn't it be more practical if the same error message were used for all cases when the cluster is in a state where it can't run commands?

Asko Aavamäki

Nice rundown!

I wonder if another article could demonstrate how that two-node cluster would perform during the IST/SST mentioned at the end, after the other node rejoins the cluster. Would this show that the node that remained operational would be blocked as well in order to serve the SST request to the node that is returning to the cluster, making the whole DB inoperable?

Maybe the effect could be demonstrated both for mysqldump and rsync SSTs (or even XtraBackup if that could be used to achieve non-blocking in that scenario!)

The issue is noted in the documentation and is the other big reason for having a minimum of three nodes, in addition to split brain etc., but I think it'd be good to see what that type of situation looks like in the logs so that if someone runs into it, it's less of a mystery.

Balasundaram Rangasamy

Nice one. I had a similar problem when I was working with only a two-node cluster. This article nicely explains why we should not go for a two-node cluster.

When I tried to update a table on both nodes simultaneously, I got a deadlock error, an unknown command error, and the second node even became unavailable.

Daniël van Eeden

REJECT doesn't really simulate a failed network connection, as an ICMP reject will be sent. Using DROP is a better test. It probably doesn't change the results… and a misconfigured router/switch might be more common than a failed network connection.

Jalvin Trivedi

If I am working on two nodes and I issue a query, and after some time node1 fails while the query is still not complete, what happens to the query? Does it keep running, or does it restart?

Jonathan Fisher

Is it possible to have a master/slave setup with Percona cluster? Something like if the master is restarted, the slave takes over writes, but the master is caught up when it comes back online?

Steve Jackson

Hei Frederic

We have been experimenting with sb_ignore=true to develop a master-slave setup of sorts, where we know and accept that split brain can potentially occur, but we deal with that issue ourselves, as we know which node is supposed to be the active one and are only writing to that node. We simply want the ease of architecture that comes with the cluster software.

What we found is, when the "failed" node is restarted, it will begin an SST, but about 80% of the time the VM we are testing on will become unresponsive. It appears that the filesystem is filling up, but the db is very small.

Is this anything you have experienced or heard about? I have searched the net a bit, but I guess there are few people using sb_ignore=true, thus few reports on this issue

Ulises_Vargas

Hi! I have one question. I have a Percona XtraDB Cluster with 3 nodes; I restarted the mysql service on one node and the service does not start, or starts with an error:

"mysql.service - Percona XtraDB Cluster
Loaded: loaded (/usr/lib/systemd/system/mysql.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since lun 2018-11-12 15:13:58 CST; 3h 20min ago
Process: 21428 ExecStopPost=/usr/bin/mysql-systemd stop-post (code=exited, status=0/SUCCESS)
Process: 21398 ExecStop=/usr/bin/mysql-systemd stop (code=exited, status=2)
Process: 20816 ExecStartPost=/usr/bin/mysql-systemd start-post $MAINPID (code=exited, status=1/FAILURE)
Process: 20815 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=1/FAILURE)
Process: 27167 ExecStartPre=/usr/bin/mysql-systemd start-pre (code=exited, status=1/FAILURE)
Main PID: 20815 (code=exited, status=1/FAILURE)

nov 12 15:13:58 DES-TX-NODE01 systemd[1]: Starting Percona XtraDB Cluster…
nov 12 15:13:58 DES-TX-NODE01 mysql-systemd[27167]: WARNING: Another instance…
nov 12 15:13:58 DES-TX-NODE01 systemd[1]: mysql.service: control process exi…1
nov 12 15:13:58 DES-TX-NODE01 systemd[1]: Failed to start Percona XtraDB Clu….
nov 12 15:13:58 DES-TX-NODE01 systemd[1]: Unit mysql.service entered failed ….
nov 12 15:13:58 DES-TX-NODE01 systemd[1]: mysql.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

but the variables 'wsrep_local_state_comment', 'wsrep_ready', 'wsrep_connected' are the same on the 3 nodes and the replication is OK: I can create, insert, delete, drop on any node and it replicates fine, but I can't restart the mysql service. Any idea what happened?