I wrote about Galera about 1.5 years ago: State of the art: Galera – synchronous replication for InnoDB. It was about the 0.7 release, which was more like a proof-of-concept release (though Galera’s developers may not agree with that ) with some serious limitations (like using mysqldump for node propagation). The Galera team heard my suggestions and the new 0.8 release looks very promising. Well, it took 1.5 years to fix the limitations and come up with new features, but there is nothing to complain about it – a synchronous distributed transactional system is not an easy problem to solve, trust me.
So Galera 0.8 comes with many nice features:
- Works with MySQL 5.1 and MariaDB 5.1. The latest is more interesting for us, as it is based on XtraDB. That means Galera supports the XtraDB storage engine.
- Support for multi-threaded slaves.
- Using custom scripts for node propagation.
RSYNC
method comes with the Galera distribution, and it is quite easy to add support for Percona XtraBackup to propagate nodes.
Why MySQL/MariaDB + Galera 0.8 is interesting? It allows to solve following problems:
- A Real High Availability solution for systems based on InnoDB/XtraDB. The recommended setup is 3 nodes and you can add / remove nodes almost transparently.
- It is possible to use it in traditional master-slave setups, but with big difference – with
Galera
we have Semi-synchronous slaves. How it works on slaves: slaves just acknowledge reception of a network packet (not a transaction). Transaction is not guaranteed to be applied – what is guaranteed is that every node will do exactly the same thing with it. With parallel applying on slaves, the latency of round-trip transactions should be in an acceptable range. - It opens the possibility for active master – active master setups. You can write on both masters and do not worry about conflict resolution and get rid of those “slave is out of sync with master” pain-in-the-neck problems.
- Combining all above, now we can setup distributed replication systems with masters in different data centers. This provides a HA solution for MySQL setups in the Cloud. E.g. with current state of MySQL EC2 setups suffer from lacking a good HA schema. With Galera, we can setup replication with the same availability and in different zones. Remember the recent and famous EC2 outage? Having masters in USA and Europe regions would solve this kind of problems.
- Scaling Writes. From benchmarks provided by the Galera team we may see a good scaling of throughput writing to several nodes
As you may see, I am pretty excited that this solution is available for MySQL users as Free / Open Source Software.
“Where is the catch?”, you may ask. Yes, there are couple of points to consider:
- Complexity of setup: I have been playing with Galera 0.8-pre for couple last weeks, and from my observation, the task to setup three node cluster is much more complex than an average MySQL master-slave setup.
- Potential performance penalty: Although to have performance numbers I am going to run different benchmarks, I expect that for simple master-slave setups the response time and throughput will be affected (and not towards better side). But this is price to pay for synchronous not-getting-behind slaves. With additional nodes, the response time will only increase. And in multi-nodes setups, the performance of whole cluster will be defined by slowest server, so it will be recommended to have uniform servers across the cluster.
From my experiments with MariaDB/Galera 0.8 I have one serious feature request for the Galera team: provide the ability for incremental node provisioning.
By this I mean that if right now a node gets disconnected from the cluster, in order to join to the cluster again, it has to copy the whole data set again. But if it was disconnected only for a short period of time, we may want to copy only changes during this period. I believe that, with integration with Percona XtraBackup and its incremental backup features, it is possible to have incremental node provisioning.
To finalize this post, let me invite you to join me in testing MySQL/Galera 0.8, binaries are available from Launchpad.
The downside of MySQL/Galera is that it is based on standard InnoDB 5.1 instead of InnoDB-plugin. Standard InnoDB is seriously behind InnoDB-plugin in terms of features and performance.
Update (6-Jun-2011): It was my misunderstanding: Galera does work with InnoDB-plugin also.
You may try also MariaDB/Galera based on XtraDB, but it is only available in source code, and you may need system based on RedHat 6 or similar to have it compiled. There is also a helpful Wiki page with a bunch of information about Galera replication.
Yes, ok so I forgot that part. It is completely possible that even for 100% write load multi-master cluster would produce some benefit. Certainly HandlerSocket too has proven to us that there is a lot of overhead in the parser and optimizer. But in row based replication we only bypass parser – some optimizer overhead remains – although the row events can be simpler than the original DML statement.
thanks Henrik, Alex…
that makes sense, the RBR event at 2X the speed of the statement for replication would explain the slight increase in write performance for more nodes in those 100% insert benchmarks.
Henrik, mike: Indeed, for 100% write load scalability will be far from linear or maybe even negligible (100% inserts is very hard to scale). But for one thing, Galera uses RBR events for replication, which are rather fast to apply. For example in master-slave sysbench benchmarks I observed slave CPU load to be 1/10 of master’s (to surprisingly good precision). Thus, if you consider that sysbench is almost 25% writes (we don’t know how much of CPU they take though), we can speculate that RBR event is about 2 times faster to apply than the STATEMENT.
mike: You are asking exactly the right question. The answer depends on what you mean by “write scalability”. The published Galera benchmarks have been done with sysbench or dbt2 benchmarks, where you run relatively complex, multi-statement transactions. Their transactions are roughly 20-25% writes, rest are selects. When using replication, only the writes need to be applied on the other nodes, so in an active-active cluster you can scale these read-write transactions about 4-5x. Which is exactly what their results indicate.
But for me the huge expectations with Galera is not so much to get write scalability, rather than just better high-availability (better = easier to setup, easier and less downtime in failure situation and more robust) than current solutions.
This may be a bit of a basic question, but I’m not clear on how Galera is scalable in terms of write performance if each node is more or less an interchangeable DBMS replica that seemingly would need to be written to (now, I’ve seen benchmarks where 100% writes appear somewhat scalable but no explanation as to why)… Or is the answer that you can’t and is similar to straight replication where you will reach a write performance bottleneck?
Seppo,
Sorry for causing confusion, I updated post.
I’m getting questions about innodb_plugin support, and I need to clarify that a bit here.
Apparently it may not be obvious, especially when using the demo package distribution, that innodb_plugin is actually supported both by MySQL and MariaDB versions.It is just a matter of configuration to switch using the innodb_plugin engine.
With the demo distribution innodb_plugin can be started simple by:
$ mysql_galera -g gcomm:// –plugin start
As long as the query does not attempt to lock resources taken by the DDL it will go through. However COMMIT (or autocommit DML) will be blocked until DDL completes.
So regular SELECTs are not affected.
Alex,
Does Server-level TOI mean that cluster won’t be able to execute write transactions
to on whole cluster, but read queries will be allowed, or it is total locking ?
Vadim:
http://www.codership.com/wiki/doku.php?id=faq&#qhow_ddls_are_handled_by_galera
Note that we currently have an annoying deviation of the described behaviour which you experienced in your testing. To be fixed in the next release: lp:798202
Seppo,
Thanks for detailed answer, much appreciated.
Can you describe yet how “ALTER TABLE ” statements work now in 0.8 ?
Uhh, so many comments here, I try to address at least some of them…
* autoinc lock mode 2 was chosen as it was simple to support. Galera has also automatic control of autoincerement tuning (increment, offset), based on changes in cluster configuration.
* handlerSocket support has indeed not been tested at all. I browsed briefly through the code and my expectation is that HandlerSocket/Galera would work in master/slave topology when there is no concurrent DDL load. In multi-master usage, I expect issues in notifying the client about replication conflicts. But anyways, HandlerSocket looks like a powerful feature which we need to support, I will add an item in our work queue.
* causal reads feature is in the works and is within short reach. Read causality will be a session option, so that each session can choose if he wants to see causally ordered reads or if he prefers as fast reading as possible.
* Galera uses MySQL replication events, which are “caught on the fly”, binlog file is not necessarily needed. However logging to binlog files can be enabled just by setting the ‘log-bin’ option. (Then Galera node can operate as MySQL master as well)
* Galera global transaction ID could be mapped with binlog positions and/or LSNs. It is just a matter of implementation decision to choose where and how to store this mapping
* Galera replcates DML currently in ROW format but support for STATEMENT and MIXED formats will be added soon. DDL is replicated in STATEMENT format already now.
BTW, one critical 0.8.0 issue was caught by Oli (FromDual): wsrep_sst_auth contains authentication information, which is visible to the world through global variables. This is a problem in mysqldump based configurations, and we will prepare a hot-fix for that.
Vadim, thanks for the nice article and happy benchmarking!
Mark,
Agreed, I should retract that citing. This is to add to the point that semi-sync replication should be tested yet properly and to get performance numbers.
@Vadim Tkachenko,
I am surprised that you cite the “benchmark” from datacharmer. That was a too-simple test with a silly conclusion. The measured overhead was ~200 microseconds per commit and the conclusion was that the overhead would be unacceptable for many customers.
@Andy,
There is not much deployments yet of both MySQL semi-sync and Galera, so it will be more speculation then real facts, but I can try.
– With MySQL semi-sync you have only one slave out of many to be warranted, other slaves may get behind.
– With MySQL semi-sync you still not able to write to both masters (well, you can, but most likely you are looking for problems)
– MySQL semi-sync performance is yet to understand.
– MySQL semi-sync is still single-threaded, which again not in favor to performance.
For MySQL semi-sync advantage is that comes out of box in MySQL 5.5 and it is based on regular MySQL replication which is well understood by MySQL users.
@Vadim Tkachenko,
In your earlier comment you said Galera is just like 5.5 semi-sync replication. So how do the 2 compare to each other? Any reason to use Galera instead of 5.5 semi-sync or vice versa?
Thanks
Fantastic. Many thanks Vadim.
Tim Vaillancourt
@Tim,
I have started wsrep_sst_xtrabackup, so it will be available.
As for traditional replication – as now Galera uses binlog in ROW format,
I do not see problem that another regular slave connects to cluster and
uses binary log for regular replication. Obviously only ROW format supported.
I am going to run benchmarks during summer ( and publish it). Benchmarks results and
stable work will define if we want to add Galera to Percona Server. Our intent is to extensively evaluate it.
I’ve setup a 3 node Galera cluster with 2 mysql proxy nodes in my testbed and have been really, really impressed by this technology. I agree that XtraBackup would be a great candidate for a new State Snapshot Transfer methods.
Does traditional replication work with the 0.8 release, or still a feature request/bug? In my mind the ability to do regular builtin replication is important for geographically-dispersed clusters. The ability to perform async replication to another datacenter, for example, would be valuable in some use cases.
Does Percona plan to do any benchmarking of Galera?
Could/would Percona consider adding this to Percona Server?
Exciting stuff! Thanks for the great article, as always.
Tim Vaillancourt
@Andy,
I maybe mistaken, I did not play much with 0.7 and it was ~year ago,
but as I remember binlog was not required for 0.7.
So for 0.8 the requirement binlog=ROW is new.
@All,
There is comment on semi-synchronous from Alex Yurchenko @ Galera
“Technically speaking Galera is semi-synchronous – just like semi-sync in 5.5. In fact there are no checks made on slaves – they just acknowledge reception of a network packet (not even a transaction). Transaction is not guaranteed to be applied – what is guaranteed is that every node will do exactly the same thing with it”
I will update post to have correct information.
I love these threads, it seems thinking about replication problems really inspires me…
@Robert: Yes, that’s of course exactly what happens on commit. Thanks for filling in the gaps. So slaves only acknowledged that they have received the transaction (and even then a majority of slaves is enough?). Since the Galera cluster imposes a consistent ordering among transactions, there’s no need for everyone to certify anything, it is enough that one node guarantees it can be committed. (I have still no idea how that guarantee is given though, but so far this makes sense.)
@Vadim Tkachenko: If you store the Global transaction ID into an InnoDB table, there you have a relation to LSN for free. (Credits to Drizzle for this idea.)
Someone running with relaxed durability may still want to have replication, but if you are willing to relax durability anyway, then imho the relevant comparison is with something like mysql asynchronous replication, or why not Galera in asynchronous mode (or Tungsten, since Robert is here . If you add synchronous replication to something that already had innodb_flush_log_at_trx_commit=2 then you get significant improvement in durability. Of course, it is still a relevant question to know how much performance you have to trade off, but it doesn’t mean anything negative about the replication solution.
@Vadim Tkachenko,
Is the binlog requirement new to 0.8? I seem to remember reading in this blog last year that Galera didn’t require binlog for replication.
@Andy,
If slave crashes – after restart it has to copy all data from working node.
So it won’t have out of sync state.
@Andy,
Galera 0.8 requires binlog in ROW format. They use it to send and apply events.
@Henrik,
If Galera is not 100% synchronous, then it’d still have the “slave out of sync” problem, no?
Say a client writes to a master, the slave validates the transaction, returns commit to master, which commits the transaction. But before the slave could commit the transaction it crashes. So now the transaction is missing in the slave.
How do you handle cases like that?
Regarding the sync_binlog setting in benchmarking :
For single node, no replication is going on so there’s no need to even enable the binlog, right?
For Galera cluster, does it require binlog to run? I seem to recall reading a long time ago that Galera replication doesn’t need binlog. is that true?
@Henrik
I agree that question on innodb_flush_log_at_trx_commit=1 and sync_binlog=1 ( and Galera also recommends innodb_doublewrite=0 in cluster setup) is rather tricky, and we may need to cover different combination.
My couple cents: you may use innodb_flush_log_at_trx_commit=2/sync_binlog=0 in single node setup, but you still may want to have HA. So it is not necessary that single host performs with innodb_flush_log_at_trx_commit=1. And RAID with BBU is must have for high performance setups, so it rules out problems with innodb_flush_log_at_trx_commit=1.
I still want to have innodb_doublewrite=1 even in cluster setup, to be able to recovery single node in short time. With support of incremental changes apply it will become more important.
As to ability to apply only lost transactions: in fact Galera uses binary log events in ROW format. There is just no direct relation between binlog_event and Global Transaction ID. I am not sure how hard to implement support for this, maybe Alex / Seppo can answer. From my point it should be possible to have support of XtraBackup incremental schema ( in this case we need to have relation between LSN and Global Transaction ID, which should be possible).
@Henrik
Thanks for info about “semi-sync” nature and possible related problems, that’s good to have in mind.
– On complexity of setup: may be it is just matter this is new product with new features and we just need to get to use it. Also proper packaging and default configuration should solve many questions.
– For auto_increment: I did not research much yet, but I see that Galera strictly requires innodb_autoinc_lock_mode=2 and tries to manage innodb_autoinc_lock_mode / auto_increment_offset
– For deadlocks: I definitely expect bigger rate of deadlock when we write to several nodes and touch the same areas. In fact you can see it from some Galera benchmarks.
@Hingo, You will definitely get a lot of rollbacks in Galera if you have a busy cluster with many nodes updating the same location(s) within a short period of time. This is a characteristic of the certification method used to validate transactions and is effectively a new kind of deadlock. As a practical matter I would guess that this might turn out to be a problem in systems that have regular transaction processing while one node is doing a large batch operation that affects many rows.
One other point: Galera has perhaps changed but with certification-based replication the master should not need commit confirmation back from the slaves. It just needs confirmation of successful atomic broadcast to the network followed by local certification, i.e., confirming it does not deadlock locally. Perhaps Alexey or Seppo could comment on this.
Ah, forgot some other things:
I agree with you on the use case where there is a small disconnect in replication: it should be possible to just replay the lost transactions, not having to do a full copy. Does this mean Galera does not write anything like the binary log? Since they have global transaction id’s, a disconnected slave could just ask to get all transactions after a certain id. Of course, xtrabackup incremental backup is actually doing the same thing, just from innodb log…
For performance penalty, it is an interesting question how do you compare apples to apples here: Alex will argue that the proper comparison is single node with both innodb_flush_log_at_trx_commit=1 and sync_binlog=1, and two Galera nodes with both of those set to zero – as this will give comparable durability. I tend to agree, but those that are already used to relaxing durability may not care (but why then would they want synchronous replication at all?) So comparing performance with Galera against single node case you cannot avoid this discussion, and probably need to compare with different settings to make it useful.
(It also makes a difference if you use an InnoDB version with group commit bug fixed or not. If you have the group commit bug, then it is easy to get much better performance just because the binlog goes away. See http://openlife.cc/blogs/2011/may/drbd-and-semi-sync-shootout-large-server)
Hi Vadim
Great that you are finally looking into this. I’m personally also very excited about this technology and have been following it for some time now. I’ll see if I can do some tests of my own in the near future too, then we could compare experiences.
A few comments I can immediately share:
Too bad you find it difficult to setup. One thing I was hoping to see was that with the theoretical framework being more sound, also the usability would improve over MySQL replication. I’ll share my own opinion once I touch it.
There’s one gotcha you are not aware of here, I learned this when attending their session at MySQL conference. Galera is not really 100% fully synchronous (like NDB) as one would expect from how they describe it. It is something between MySQL semi-sync and fully synchronous. This is what happens:
– Client executes statements on one node (the master)
– Client issues commit
– Transaction is replicated to other nodes (the slaves)
– Slaves perform some verification to guarantee that transaction can be committed without conflicts
– Slaves return the commit
– Master returns commit to client
– Slaves apply the transaction to InnoDB tablespace
The verification done by slaves is supposed to guarantee that the transaction can be applied to the tablespace, but the transaction is not yet visible if you read from the slave – there is a theoretical possiblity for race conditions (depending on application design). However, it is more solid than MySQL semi-sync in that the transaction is guaranteed to apply to InnoDB tablespace – if this fails, the node must shut down.
This results in two issues to follow up on:
– It becomes an obvious task to review and test the code where slaves approve transactions for commit. Are auto_increment and other gotchas handled properly?
– Implement a solution that makes the race condition go away. Alex has a design he calls “causal read mode” where you keep the current design but when starting a select, a marker is appended into the queue of committed transactions still being applied, and the read is only done once transactions up until the marker were applied to the tablespace. I have also thought of a different solution, where after commit you could do something like last_transaction_id() (think like last_insert_id()), and you could then pass that id with your next select, telling Galera that you only want to read data when that transaction id was applied.
Of course if the application thread keeps the connection to the same node, then on that node the previous transaction is always committed. (For other application threads, it will usually be enough that any slave lag remains “small”.)
Finally, I’m interesting to find out if there are workloads that work well on single InnoDB node, but result in high number of rollbacks with Galera. The issue here is that if I have 2 transactions that try to write to same rows on a single InnoDB instance, then locks will be taken on those rows, and possibly the other transaction will succeed in writing to the same rows after waiting for the locks for a while. On the other hand if I use Galera and these transactions are run against separate nodes, there won’t be any locks, but instead both transactions are allowed to proceed with their statements. It is only at commit time that conflicts will be detected, and at this point there may not be any other solution but to roll back one of the transactions. (I suppose this is a variation of optimistic vs pessimistic locking problem.)
Andy.
1. If node fails – query that is running to that node will fail, like with failure of regular mysqld.
You may add node health detection to application, or have load balancer that route traffic to working nodes.
2. No idea on handlersocket. At least if was not designed with handlersocket in mind, but it may work by co-incidence
This sounds nice.
– How does failover work in Galera? If 1 node fail, would a writing node keep waiting for responses from the failed node? Would future requests be automatically routed around the failed node?
– Does Galera work with handlersocket?
I setup a high availability mysql cluster with mariadb galera and wrote about it here if anybody needs a hand setting it up. http://jmoses.co/2014/03/18/setting-up-a-mysql-cluster-with-mariadb-galera.html