In this post I want to show some of the new possibilities that open up with Percona XtraDB Cluster.
We will create a 3-node cluster with nodes on different continents (Europe, USA, Japan), and each node will accept write queries.
Well, you could theoretically build a traditional 3-node MySQL ring replication setup, but that is not something you want to run day-to-day.


To show how it works, I will use Amazon m1.xlarge instances, one each in Tokyo, Ireland and Northern California, running Red Hat Enterprise Linux 6.2 64-bit.

In fact, creating the instances is the most time-consuming task. After that, using my script, you will have the cluster running in 5 minutes or less.

There are, however, some precautions needed if you run Amazon instances.
First, you need to open ports in the firewall. By default, the nodes need ports 4444, 4567 and 4568 for their communication
(see our FAQ for why).
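
For example, if you manage the operating system firewall yourself, opening the ports could look like this (a minimal sketch, assuming iptables on RHEL 6; on EC2 the same ports also have to be allowed in the instance's security group, and the port-to-purpose notes in the comments are my summary, not part of the original post):

# 4567 - Galera group communication, 4568 - incremental state transfer (IST),
# 4444 - state snapshot transfer (SST), 3306 - regular MySQL clients
for port in 3306 4444 4567 4568; do
  iptables -I INPUT -p tcp --dport $port -j ACCEPT
done
service iptables save   # persist the rules across reboots on RHEL 6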

Second, please take into account that the communication goes over open channels, so in real life you will probably want to encrypt the replication traffic.
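
One way to do that (a sketch only; verify the option names against the Galera version you run, and the certificate paths here are hypothetical placeholders) is to let Galera encrypt its own replication traffic via wsrep_provider_options in my.cnf on every node:

# If wsrep_provider_options is already set in your config, merge the
# socket.ssl_* entries into the existing value instead of adding a second line.
cat >> /etc/my.cnf <<'EOF'
[mysqld]
wsrep_provider_options="socket.ssl_cert=/etc/mysql/cert.pem;socket.ssl_key=/etc/mysql/key.pem"
EOF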

Now that we have running instances, you can install the Cluster packages from the RPM repositories.
You can follow the steps from the documentation.
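
In short, on each node the installation boils down to something like this (a minimal sketch, assuming the Percona yum repository is already set up as described in the documentation; the package names match the ones used in the comments below):

yum -y remove mysql-libs          # the stock libraries conflict with the PXC packages
yum -y install Percona-XtraDB-Cluster-server Percona-XtraDB-Cluster-client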

Or you can use the simple script I prepared, which does all the work: http://percona.com/downloads/Percona-XtraDB-Cluster/scripts/intercont.tar.gz.
You can also follow the steps from the script and adjust it for your environment.

You just need to change the node host names and your keys in the file install.nodes.sh.
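
For illustration, the kind of edit I mean looks like this (the variable names here are hypothetical — check install.nodes.sh itself for the actual ones):

# Hypothetical example values; replace with your own instances and key
node1=ec2-XX-XX-XX-XX.eu-west-1.compute.amazonaws.com        # Ireland
node2=ec2-XX-XX-XX-XX.us-west-1.compute.amazonaws.com        # N. California
node3=ec2-XX-XX-XX-XX.ap-northeast-1.compute.amazonaws.com   # Tokyo
sshkey=~/.ssh/your-ec2-keypair.pem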

When all nodes start, we have a running 3-node EU-USA-JAPAN cluster, and each node is ready to execute both read and write queries.

Of course, you may wonder what the query response time is in such an environment. We can check it.

A simple table: CREATE TABLE t (ID INT), and a simple query: INSERT INTO t VALUES (1).
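
If you want to reproduce a similar measurement yourself, one simple way (a sketch; not necessarily how the numbers below were collected) is to use statement profiling from any node:

mysql test <<'EOF'
SET profiling = 1;
CREATE TABLE t (ID INT);
INSERT INTO t VALUES (1);
SHOW PROFILES;   -- reports the duration of each statement, commit round trip included
EOF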

Response time on a single node in EU (no cluster setup): 0.005100 sec
Response time on a two-node (EU-JAP) cluster: 0.275642 sec
Response time on a three-node (EU-JAP-USA) cluster: 0.294754 sec

Well, one may argue that 0.27 sec for a single query is kind of big, but this is a law of physics: you cannot
go faster than the speed of light, and the round trip between Europe and Japan takes time (or at least until scientists figure out how to attach a transaction to a faster-than-light neutrino).
Also note that XtraDB Cluster can apply events in parallel, so throughput should be less affected by the long distance.
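
For reference, the number of parallel applier threads is controlled by the wsrep_slave_threads setting (a sketch; the best value depends on your workload):

cat >> /etc/my.cnf <<'EOF'
[mysqld]
# apply replicated write sets with several threads in parallel
wsrep_slave_threads=8
EOF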

Enjoy!

31 Comments
Harrison

.25 seconds for a write seems pretty slow, particularly if you have to do multiple writes per request. How does Galera work with batching?

If you did a multi-insert packet, would it be .25 seconds? If you did a multi-row insert, would it be .25 seconds? If you had an SP which did 10 writes, would it still be roughly .25?

Harrison

Yup, sounds good.

Florian Haas

For the sake of comparison, what’s the latency penalty when running on real hardware, in a directly switched Gigabit Ethernet network?

Henrik Ingo

To continue Harrison’s train of thought, whether this is truly amazing, OK, or a degradation now depends on what you are comparing against. The question is: what have you done with 3 continents until today?

If the answer is that you have a read-only datacenter with MySQL slaves, and all writes are directed to one master data center, then this is a lot faster. You can now write at whichever datacenter is closest to the user. The commit will be slower, but the actual inserts and updates are now fast. On average it is a net gain (if you have users spread around the world).

If the answer is that you already write in multiple data centers, for instance using a “Master of Record” scheme, then this is not faster and might be slower. Otoh this is much simpler to manage as you don’t need to map each user to a “home” datacenter.

Florian: Codership has done a similar EU-US test with throughput and latency numbers published here: http://www.codership.com/content/synchronous-replication-loves-you-again

Florian Haas

Ingo, thanks a lot, but while that post claims to compare performance both in a LAN environment and on EC2 instances, all I actually see is single-node vs. EC2 numbers. Am I missing something? Please point me to the single-LAN, multi-node, multi-master replication numbers if that post (or another) has them. Thanks.

Florian Haas

Henrik, sorry for mixing up your first and last names. What was I thinking? I’m juggling a few tasks here today.

Henrik Ingo

Hehe, it’s common with Germans and I’m completely ok with it 🙂

The codership blog post has tests with:
– single node
– clusters inside one Amazon data center
– clusters spanning Amazon data centers on two continents

Was there something more you were looking for?

Florian Haas

Hehe, it’s common with Germans and I’m completely ok with it

… as it seems common among Finns to mix up Germans and Austrians. 🙂

I was looking for benchmarks for >= 2 non-virtual hosts running in a local LAN with synchronous replication, no EC2 involved.

Henrik Ingo

OK, touché, but still within German-speaking culture.

I have tested Galera non-virtualized, but then you don’t get a comparison against multiple data center setups:
http://openlife.cc/category/topic/galera

Florian Haas

I’m actually not so much looking for multi-DC setups at this point (although having something like that for MySQL will come in handy), I’m just trying to get an idea about the performance in comparison to “traditional” MySQL HA solutions.

And http://openlife.cc/blogs/2012/january/re-doing-galera-disk-bound-benchmark looks like it’s just what I’ve been looking for, thanks!

Henrik Ingo

Oh, sorry, I didn’t publish a lot of latency numbers though. As far as I remember (I could dig them up from logs one day) they are like for semi-sync replication: noticeable but not of any practical consequence. Most importantly tps on a parallel workload is equal or better than a single node, so 1 ms of latency up or down is a nice tradeoff to make.

marrtins

What happens if say one continent goes offline for a day? Is synchronization done automatically after re-join?
P.S.
This CAPTCHA is very hard to crack.

Alex

@marrtins:
Indeed. It will either replay missed transactions or take a full DB snapshot via mysqldump/rsync/xtrabackup – depending on the size of the transaction cache on the other nodes.
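
The size of that transaction cache is governed by Galera's gcache; a sketch of increasing it so that longer outages can still be recovered by replaying missed transactions instead of a full snapshot (check the option name against your Galera version, and merge it with any existing wsrep_provider_options value):

cat >> /etc/my.cnf <<'EOF'
[mysqld]
wsrep_provider_options="gcache.size=2G"
EOF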

Max

Looks pretty interesting. It’s been a while since I looked at Galera so I can’t remember, but is there an async mode or do you have to synchronously commit everything?

Henrik Ingo

Max: If you want async replication you can just use MySQL replication alongside Galera. For instance 1 node in a Galera cluster can also be a MySQL replication master, to which some other MySQL servers connect.

Otoh Vadim’s tests here show that the most likely reason you’d want to do that is moot: You can just run a Galera cluster over long distances and it will work and the overhead is still acceptable.
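
A minimal sketch of the replication-alongside-Galera idea above (the host and user names are hypothetical; on the Galera node you generally also want log-slave-updates so that events arriving via Galera end up in the binary log):

# On the Galera node that should act as an async master:
cat >> /etc/my.cnf <<'EOF'
[mysqld]
server-id=1
log-bin=mysql-bin
log-slave-updates
EOF
# On the plain MySQL slave, point replication at that node as usual:
mysql -e "CHANGE MASTER TO MASTER_HOST='galera-node-1', MASTER_USER='repl', MASTER_PASSWORD='secret'; START SLAVE;"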

Max

Henrik: I think it’s fair to say that 0.25 sec latency can be too high for many applications. It would be cool to be able to decide between sync and async on a database/table/query basis.

Henrik Ingo

Max: If you want your write to be synchronously written to 2 or more continents, then it’s not going to be faster than that. All of the magical NoSQL solutions (MongoDB, Voldemort, etc…) would be equally slow (or more, considering they don’t support multiple operations within one transaction) too, or you would have to run them in a mode where you don’t really commit to more than one data center after all. Here too you can of course choose to use MySQL replication for Data Center redundancy and only use Galera within a data center (like you would with NDB). But then you’ll have to deal with the mess of eventual consistency some other way. A 0.2 sec latency on commit might be a nice tradeoff to pay for that convenience.

Note that this latency only applies for writes. Reads will not have this latency at all. Also the latency happens at commit time – it will often be possible to design your application (with AJAX or other asynchronous means) such that it is very responsive for any interactive part, and then the commit is more or less silent when everything else is done (only errors would bring something back to the user).

Max

Henrik: Totally agree with you. I was referring to being able to do some operations in async mode to speed up the writes. I agree that 0.25 sec latency for intercontinental transactions is good. It might still be too much in some cases, though.

Maarten

It’d be cool to have benchmarks of a setup that I think a lot of companies that are not operating worldwide but want more or less HA are looking for: a cluster across 2 datacenters in 2 cities in the same country.

Henrik Ingo

Maarten: Benchmark results are of course the real facts, but based on my experience, a Galera cluster spanning something like 50 km with a good fiber connection between the sites will perform really well. For most workloads you might not see any overhead at all.

However, note that because Galera handles failovers by majority vote, you’d really want to have 3 separate data centers.

Suman

Hi, thanks for your notes and the setup script. I had a quick question: I set up three nodes on three different Availability Zones (US East, US West and Europe) to simulate this cluster, and when I run the script, it gives me this error message:
“mysql_install_db: command not found”

I’m not able to find mysql_install_db anywhere on the file system either.

I tried this with both Amazon’s default as well as RHEL 6.2 stock AMI images (I had to modify the script slightly for the Amazon image) but it didn’t work on either.

Should I install Percona server: https://www.percona.com/doc/percona-server/5.5/installation.html
on these images before running this intercontinental script?

Suman

Sorry, I figured out how to get the script working.

Just need to ADD these two lines:
yes | yum -y remove mysql-libs
yes | yum -y install Percona-XtraDB-server Percona-XtraDB-client

right BEFORE:
yes | yum -y install Percona-XtraDB-Cluster-server Percona-XtraDB-Cluster-client

and then mysql_install_db will be installed properly.

Steve Mushero

Great stuff, but I worry about what happens when the link is poor, like we have in China to the outside world. If the link is down/bad for an hour, one server will be thrown out of the cluster, I assume, but what do you do with the clients of that DB, and how do you redirect them to other DBs, especially if the orphaned server takes a long time to recover? I believe on the LAN Galera will move virtual IPs, but that’s useless for WAN use, I’d think.

Also, it’s nice to talk about re-sync by snapshot or backup, but at 100GB+ over global links this is a many-hours or even days-long effort just to move the data, so not realistic. It would have to re-sync automatically at the best possible speed, like a regular MySQL slave would.

Alex

Steve:

1) I guess in the WAN case you could use DNS-based failover, which is analogous to VIP moving. However, in your case the server in China probably has to be dedicated to Chinese clients because it is too slow for the clients to reach the outside. And so if it loses connection to the outside, most likely the clients do too, or it is too slow for them anyway.

2) As of version 2.0, Galera supports replaying missed transactions on rejoining the cluster, like a regular MySQL slave does. Except that it can do it much faster thanks to parallel applying.

Steve Mushero

Yes, we could, and we do this even in LAN-based failover if necessary, via local host files or local/remote DNS. It’s fast enough as long as it’s not global public DNS.

The bigger problem is the slow reads this creates – the cluster is great because normally only writes are slow while reads stay local and fast, but with the local node out of the cluster, both reads and writes are now slow.

Can we have a local read-only slave from the cluster, i.e. use normal MySQL replication with the cluster?

Alex

Steve,

1) yes, you can use normal MySQL master-slave replication from/to any node in the cluster.
2) you can still use the disconnected node for reads if stale data is acceptable. Just set wsrep_on=0 globally or per session in addition to read_only.
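
A sketch of that suggestion, run against the disconnected node (it keeps serving possibly stale local reads while refusing writes):

mysql -e "SET GLOBAL wsrep_on = 0; SET GLOBAL read_only = 1;"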

laapsaap

The article should be corrected. The faster-than-light neutrino anomaly was caused by a measurement error; in reality it was within the speed of light.

Nikke

What is the latency between the nodes? I’m trying to compare this to a data center build that I am interested in.