State of the art: Galera – synchronous replication for InnoDB
I first heard about Galera at the Percona Performance Conference 2009, where Seppo Jaakola presented “Galera: Multi-Master Synchronous MySQL Replication Clusters”. I was impressed, as I personally always wanted this for InnoDB, but we had it at the bottom of our list of plans because it is very hard to implement properly.
The idea itself is not new: I remember synchronous replication being announced for SolidDB at MySQL UC 2007, but the product was later killed by IBM.
A long time after PPC 2009, version mysql-galera-0.6 became available, but it had a serious flaw: to set up a new node you had to take down the whole cluster. All this time Codership (the company that develops Galera) has been working on the 0.7 release, which introduces node propagation while keeping the cluster online. You can try the 0.7pre release yourself: MySQL/Galera Release 0.7pre.
In the current version, propagation is done by a mysqldump from one of the nodes (the “donor”). In the next release Codership is going to support LVM snapshots and xtrabackup, which will make setting up a new node even easier. The current annoyance I see is that if you shut down one node for a short period for quick maintenance, then after it starts the node has to load the whole mysqldump, as if it were a new empty node. I hope the Codership guys will address this as well.
Another thing I miss for now is support for the InnoDB plugin, which as we know performs much better than standard InnoDB.
So what is so interesting about Galera? A couple of things:
- High Availability. Any of the N standby nodes is available immediately when the main node fails. Galera is a serious contender for inclusion in the list Yves put together recently, http://www.mysqlperformanceblog.com/2009/10/16/finding-your-mysql-high-availability-solution-%e2%80%93-the-questions/. I am not sure how many nines it will provide, but the effort of test setup and deployment should be comparable to an MMM setup.
- Scale Writes. Galera allows you to write to any of the N nodes and automatically propagates the changes to the other nodes. It sounds too ideal, and there is a drawback: as the number of nodes you write to increases, your transaction rollback rate may increase, especially if you are working on the same dataset. You can find some results on Codership’s page, and I am going to run my own benchmarks as well. You can also see from the benchmarks that the communication overhead may be significant for short writes.
- Scale Reads. This can be done with regular replication, but with synchronous replication your “slave” nodes are all in the same state, so there is no replication lag. When you read from any slave, you read current data. It also has a serious drawback, though: the cluster is only as fast as the “weakest” node in the chain. So if one node gets overloaded and its performance degrades, the same happens to the whole cluster.
- Heterogeneous-database replication. It is not here yet, and I do not know what is on Codership’s roadmap, but the group manager protocol in Galera is database independent, so it is only a matter of database drivers. For InnoDB it is currently a set of patches, and I think it is quite possible to do the same for Postgres. So a MySQL-Postgres cluster setup may not be so far off.
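To illustrate the “Scale Writes” drawback above, here is a toy sketch of why the rollback rate tends to grow with the number of nodes you write to. It is not Galera’s actual certification algorithm, only a simplified “first committer wins” model under my own assumptions: in every round each writer node submits one transaction touching a few random rows, the transactions are ordered globally, and a transaction is rolled back if it touches a row already written by an earlier transaction in the same round. The function name and all parameters (`rows`, `rows_per_txn`, `rounds`) are hypothetical.

```python
import random


def rollback_rate(writer_nodes, rows=1000, rows_per_txn=5, rounds=2000, seed=0):
    """Fraction of transactions rolled back in a toy first-committer-wins model.

    Assumption: this is NOT Galera's real certification logic, just a
    simplified illustration of write-set conflicts between concurrent writers.
    """
    rng = random.Random(seed)
    aborted = 0
    for _ in range(rounds):
        # Rows written by transactions already certified in this round.
        committed_rows = set()
        for _ in range(writer_nodes):
            write_set = set(rng.sample(range(rows), rows_per_txn))
            if write_set & committed_rows:
                # Overlaps with a concurrent, earlier-ordered transaction.
                aborted += 1
            else:
                committed_rows |= write_set
    return aborted / (rounds * writer_nodes)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "writer node(s): rollback rate", round(rollback_rate(n), 4))
```

In this model a single writer never conflicts, and both adding writer nodes and shrinking the dataset (a smaller `rows`) push the rollback rate up, which is the same qualitative behaviour described above. Real numbers depend entirely on the workload.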
On its “Company” page Codership says its goal is “to promote and exploit the latest developments in computer science to produce fast and scalable synchronous replication solution that “just works” for databases and similar applications”, and I think they are succeeding. Implementing a fast, scalable and working group communication and transaction manager is an art.
For now I would not put the 0.7 release into production, but you may seriously consider playing with it in a test environment and reporting bugs to the Codership team; they are very responsive.
I am waiting for the next releases and looking forward to integrating it with XtraDB.
6 Comments
Hi Vadim,
Thanks for the kind words and nice title!
We are working on an InnoDB plugin version in our current R&D sprint; it should be available in ~3 weeks’ time.
So far, MySQL and MariaDB have taken all our attention and we haven’t been able to devote much time to PostgreSQL work. But in theory, heterogeneous replication is certainly possible. However, it would work only with SQL statement level replication. Note that the write scalability is mostly due to the efficiency of RBR events, so a heterogeneous cluster would be best suited to read scaling.
-seppo
P.S. For those who are wary of running multi-master, it is always possible to direct writes to just one node. This setup then works as synchronous master-slave replication.
Comment :: October 27, 2009 @ 2:58 pm
Sounds like a dream, can’t wait to give it a try. I agree, shutting a node down for a period of time requiring a db dump would be a show stopper for me. Would be better if they had some form of an async “catch up” and once up to speed enable the node as being online and switch to sync mode.
Comment :: October 28, 2009 @ 6:48 am
Hi John,
“Donor” node is not “shut down”, it is just blocked for the duration of state snapshot transfer. This first implementation is using mysqldump which is slow, but at least as reliable as mysqldump itself and “just works”. Later we plan to add state transfer modes which would block donor for much shorter time, but may require some special setup (like LVM). Incremental state transfer is also in the works.
But for now you could just keep a special reserved “donor” node for such purposes, and it does not have to be as powerful as “working” nodes: applying writesets requires much less resources than serving clients.
Comment :: October 28, 2009 @ 8:35 am
What kind of performance penalty does synchronous replication incur compared to no replication and async replication?
In an N-node cluster, an update will only commit after it’s been propagated to all N nodes, right? That sounds like it could introduce a significant performance drop.
Comment :: October 29, 2009 @ 2:49 pm
Andy,
I have not run benchmarks myself yet, but you can find some results on Codership’s website: http://www.codership.com/en/content/benchmarking-write-scalability . I have no reason not to believe them.
As you can see, for short transactions the penalty may be significant (compare the numbers for 1 node), and it decreases with the size of the transaction, which is understandable.
It is certainly a drawback of the technology, but it is the price of consistent data.
Comment :: October 29, 2009 @ 3:02 pm
Andy,
This is a rather interesting question, but the answer is not so straightforward. Galera cluster performance overhead comes not only from group communication latencies, but from a number of other factors and strongly depends on the load profile, number of CPU cores, IO subsystem and network configuration.
You’re absolutely right in suggesting that more nodes mean more overhead, but even with TCP transport, adding a new node results in disproportionately low additional latency. You may want to check http://www.codership.com/en/content/sysbench-ec2-size-matters which is about the only case where it could be reliably measured. With multicast transport, communication latencies should depend on the number of nodes even less.
That said, Galera replication overhead may be significant regardless of the number of nodes. You can see this effect taken to the extreme in a very synthetic mysqlslap benchmark here: http://www.codership.com/en/content/benchmarking-write-scalability. Curiously, one way to make up for the additional communication overhead is to increase the number of concurrent server connections. The rule of thumb is to double the number of connections per node compared to what gives the best performance on a standalone server.
A much bigger overhead comes from certification conflicts and the resulting transaction rollbacks. The conflict rate grows roughly as N^2 and is the major limiting factor for multi-master scalability.
As for a comparison with async master-slave, we don’t have any numbers yet. We don’t know of any benchmarks that can be used to measure the performance of an async master-slave cluster as a whole. Comparing just the master’s performance is not very informative, since that is essentially a standalone server.
Vadim,
I would not call that a “drawback” – it is a limitation. Galera replication is just a tool suitable for some tasks and unsuitable for others. As they say, YMMV.
Comment :: October 29, 2009 @ 5:30 pm