I was curious to check how Percona XtraDB Cluster behaves when it comes to MySQL replication latency, or better yet, call it data propagation latency. I wanted to see whether I can get stale reads from other cluster nodes after a write is performed on one specific node. To test it I wrote quite a simple script (you can find it at the end of the post) which connects to one node in the cluster, performs an update, and then immediately reads from a second node. If the data has already propagated, good; if not, the script retries the read until the data finally propagates, and then measures the latency. This shows whether the application can see any stale reads.
My setup is 3 Percona XtraDB Cluster nodes (DPE1, DPE2, DPE3) talking over a dedicated 1Gbit cluster network, and I'm running the test from a 4th server (SMT2), so it is a pretty realistic setup from a typical data center latency standpoint, though the server hardware is not the most recent.
First, let's look at the baseline: the cluster has no load other than the script doing writes to DPE1 and immediately reading from DPE2.
Summary: 94 out of 10000 rounds (0.94%)
Delay distribution: Min: 0.71 ms; Max: 2.16 ms Avg: 0.89 ms
These results tell me two things. First, replication in Percona XtraDB Cluster is by default asynchronous from a data propagation standpoint: it takes time (though a short one in this case) for changes committed on one node to become visible on another. Second, it is actually doing quite well, with less than 1% of tests able to observe any inconsistency, an average delay below 1 ms, and rather stable results.
But we do not set up clusters to be idle, right? So let's do another test, now running a Sysbench load on DPE1. With a concurrency of 32 this corresponds to a pretty significant load.
sysbench --test=oltp --mysql-user=root --mysql-password="" --oltp-table-size=1000000 --num-threads=32 --init-rng=on --max-requests=0 --oltp-auto-inc=off --max-time=3000 run
The results now become as follows:
Summary: 3901 out of 10000 rounds (39.01%)
Delay distribution: Min: 0.66 ms; Max: 201.36 ms Avg: 3.81 ms

Summary: 3893 out of 10000 rounds (38.93%)
Delay distribution: Min: 0.66 ms; Max: 42.9 ms Avg: 3.76 ms
As expected, we can observe inconsistency much more frequently, in almost 40% of rounds, though the average observed delay remains just a few milliseconds, which is something most applications would not even notice.
Now let's run sysbench on DPE2 (putting the load on the node we're reading from):
Summary: 3747 out of 10000 rounds (37.47%)
Delay distribution: Min: 0.86 ms; Max: 108.15 ms Avg: 8.62 ms

Summary: 3721 out of 10000 rounds (37.21%)
Delay distribution: Min: 0.81 ms; Max: 291.81 ms Avg: 8.54 ms
We can observe the effect in a similar share of cases, but the delay is higher in this case, both on average and at the maximum. This tells me that, from a data propagation standpoint, the cluster is more sensitive to load on the nodes which receive the data than on the ones where the writes are done.
Let's remember, though, that Sysbench OLTP has only a rather small portion of writes. What if we look at a workload which consists of 100% writes? We can do that with Sysbench, for example:
sysbench --test=oltp --oltp-test-mode=nontrx --oltp-nontrx-mode=update_key --mysql-user=root --mysql-password="" --oltp-table-size=1000000 --num-threads=32 --init-rng=on --max-requests=0 --max-time=3000 run
Running this load on DPE1 I’m getting:
Summary: 1062 out of 10000 rounds (10.62%)
Delay distribution: Min: 0.71 ms; Max: 285.07 ms Avg: 3.21 ms

Summary: 1113 out of 10000 rounds (11.13%)
Delay distribution: Min: 0.81 ms; Max: 275.94 ms Avg: 5.06 ms
Surprise! The results are actually better than with the mixed load, as we can observe any delay in only about 11% of rounds.
However, if we run the same side load on DPE2, we get:
Summary: 5349 out of 10000 rounds (53.49%)
Delay distribution: Min: 0.81 ms; Max: 519.61 ms Avg: 5.02 ms

Summary: 5355 out of 10000 rounds (53.55%)
Delay distribution: Min: 0.81 ms; Max: 526.95 ms Avg: 5.06 ms
This is the worst result so far, with over 50% of samples producing inconsistent data, an average delay for those of over 5 ms, and outliers going to half a second.
From these results I conclude that side load on the node TO which updates are being propagated causes the largest delay.
At this point I remembered there is one more test I can run. What if I put the side load on the DPE3 server, which the test does not touch at all?
Summary: 833 out of 10000 rounds (8.33%)
Delay distribution: Min: 0.66 ms; Max: 353.61 ms Avg: 2.76 ms
No surprise here: as DPE3 is not being directly read from or written to, the load on it should cause minimal delays to data propagation from DPE1 to DPE2.
The propagation latency we've observed in the tests so far is quite good, but it is not synchronous replication behavior; we still can't treat the cluster as if it were a single server from a generic application's standpoint. Right. The default configuration for Percona XtraDB Cluster at this point is to replicate data asynchronously, while still guaranteeing there are no conflicts or data inconsistencies when updates are done on multiple nodes. There is an option you can enable to get fully synchronous replication behavior:
mysql> set global wsrep_causal_reads=1;
Query OK, 0 rows affected (0.00 sec)
When this option is enabled the cluster will wait for the data to be actually replicated (committed) before serving the read. The great thing is that wsrep_causal_reads is a session variable, so you can mix different applications on the same cluster: some requiring stronger data consistency guarantees, others being OK with slightly stale data but looking for the best performance possible.
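For example, an application that needs consistent reads can enable the option only for its own connection, leaving the global default untouched. A minimal sketch (the table and id below are just the ones from this test setup, used for illustration):

```sql
-- Enable causal reads only for the current connection;
-- other sessions keep the faster default behavior.
mysql> SET SESSION wsrep_causal_reads=1;

-- This SELECT now waits until all write sets committed elsewhere
-- in the cluster before it was issued have been applied locally.
mysql> SELECT k FROM test.sbtest WHERE id=42;

-- Switch back to the default for latency-sensitive queries.
mysql> SET SESSION wsrep_causal_reads=0;
```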
So far so good. We can have the cluster handle significant load with small transactions and still see a very respectable data propagation delay, or we can enable the wsrep_causal_reads=1 option and get full data consistency. But what happens if we have some larger transactions? To test this I created a copy of the sbtest table and ran a long update while running my test, to see how the latency is impacted:
mysql> update sbtest2 set k=k+1;
Query OK, 1000000 rows affected (1 min 14.12 sec)
Rows matched: 1000000 Changed: 1000000 Warnings: 0
Running this query on the DPE1 box I'm getting the following result:
...
Result Mismatch for Value 48; Retries: 1 Delay: 0.76 ms
Result Mismatch for Value 173; Retries: 1 Delay: 1.21 ms
Result Mismatch for Value 409; Retries: 1 Delay: 0.86 ms
Result Mismatch for Value 460; Retries: 142459 Delay: 46526.7 ms
Result Mismatch for Value 461; Retries: 65 Delay: 22.92 ms
Result Mismatch for Value 464; Retries: 1 Delay: 0.71 ms
Result Mismatch for Value 465; Retries: 1 Delay: 0.76 ms
...
Summary: 452 out of 10000 rounds (4.52%)
Delay distribution: Min: 0.66 ms; Max: 46526.7 ms Avg: 104.28 ms
So the propagation delay was pretty good until this particular query had to be replicated, at which point we could observe a replication delay of over 45 seconds, which is quite nasty.
Note, though, that the delay lasted less time than it takes to execute the query on the master. This is because changes can be applied in parallel: updates to the sbtest table and the sbtest2 table can proceed concurrently (even changes to the same table can), but the certification process is serial, as is sending the write set to the other nodes, and it must be taking some 45 seconds to send the write set and perform certification.
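If you want to watch what the cluster is doing during such a stall, Galera exposes a number of wsrep status counters. A quick way to peek at a few of the relevant ones (a sketch; exact counter availability depends on your Galera version):

```sql
-- Write sets queued locally, waiting to be applied:
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';

-- Fraction of time replication was paused by flow control
-- since the counter was last reset:
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';

-- Average distance between transactions that can be applied
-- in parallel (an indicator of apply parallelism):
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance';
```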
If we run the same query on DPE2, an interesting thing happens. The script does not show any data propagation delays, but it visibly stalls, I guess because the UPDATE statement issued to DPE1 is blocked for some time. To check this idea I decided to use the sysbench script with very simple point update queries to see if we get any significant stalls. My base run on DPE1 is as follows:
root@dpe01:/etc/mysql# sysbench --test=oltp --oltp-auto-inc=off --oltp-test-mode=nontrx --oltp-nontrx-mode=update_key --mysql-user=root --mysql-password="" --oltp-table-size=1000000 --num-threads=1 --init-rng=on --max-requests=0 --max-time=300 run
....
per-request statistics:
     min:                                  0.68ms
     avg:                                  0.88ms
     max:                                306.80ms
     approx. 95 percentile:                0.94ms
....
We can see quite respectable performance, with the longest request taking some 300 ms, so no stalls. Let's do the run again, now running the same update statement on a different cluster node:
per-request statistics:
     min:                                  0.69ms
     avg:                                  1.12ms
     max:                              52334.76ms
     approx. 95 percentile:                0.97ms
As we see, there is a stall in updates of 50+ seconds, again while certification is happening. So certification not only delays data propagation but can also stall updates done to different tables on different nodes.
Summary:
Percona XtraDB Cluster performs very well when it comes to small transactions, offering very small propagation delay along with an option of synchronous replication. However, when it comes to large transactions you can get into a lot of trouble, with major stalls both in terms of data propagation and in terms of writes. The system I tested on is pretty old, and I would expect modern systems to run certification several times faster, but even tens of seconds for what I would consider a medium-size transaction modifying 1 million rows is a rather long time. So make sure you have a good understanding of how large the transactions in your application are and how long a stall it can handle.
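One common mitigation (not tested above) is to break a large transaction into smaller batches, so that each replicated write set stays small and certification stays fast. A hedged sketch against the sbtest2 table used earlier; the batch size of 10000 is illustrative, and in practice you would drive the loop from application code or a stored procedure:

```sql
-- Instead of one million-row write set:
--   UPDATE sbtest2 SET k=k+1;
-- update in bounded id ranges, committing after each chunk so every
-- write set sent to the other nodes is small.
UPDATE sbtest2 SET k=k+1 WHERE id BETWEEN 1 AND 10000;
UPDATE sbtest2 SET k=k+1 WHERE id BETWEEN 10001 AND 20000;
-- ... continue until all 1000000 rows are covered.
```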
Appendix:
As promised, the script I was using for testing.
<?php
# The idea of this script is as follows. We have 2 nodes. We write to one node and then read from the second node
# to see whether we get the same data or different data.
$writer_host="dpe01";
$reader_host="dpe02";
$user="test";
$password="test";
$table="test.sbtest";
$increment=2;
$offset=1;
$max_id=1000;
$rounds=10000;
$writer=new mysqli($writer_host,$user,$password);
$reader=new mysqli($reader_host,$user,$password);
$min_delay=100000000;
$max_delay=0;
$delays=0;
$sum_delay=0;
for($val=0; $val<$rounds; $val++)
{
    $id=rand(1,$max_id);
    $id=floor($id/$increment)*$increment+$offset;
    $writer->query("UPDATE $table set k=$val where id=$id");
    $tw=microtime(true);
    /* Loop until we get the right result */
    $retries=0;
    while(true)
    {
        $result=$reader->query("SELECT k from $table where id=$id");
        $row=$result->fetch_row();
        $result->close();
        if ($row[0]!=$val)
            $retries++;
        else
        {
            $tr=microtime(true);
            break;
        }
    }
    if ($retries!=0) /* If we had to retry, compute stats */
    {
        $delay=round(($tr-$tw)*1000,2);
        $delays++;
        $sum_delay+=$delay;
        $min_delay=min($min_delay,$delay);
        $max_delay=max($max_delay,$delay);
        echo("Result Mismatch for Value $val; Retries: $retries Delay: $delay ms\n");
    }
}
if ($delays>0)
    $avg_delay=round($sum_delay/$delays,2);
else
    $avg_delay=0;
$delay_pct=round($delays/$val*100,3);
echo("Summary: $delays out of $val rounds ($delay_pct%)\n");
echo("Delay distribution: Min: $min_delay ms; Max: $max_delay ms Avg: $avg_delay ms\n");
Appendix2: Percona XtraDB Cluster related configuration
# PXC Settings for Percona XtraDB Cluster (GPL), wsrep_23.7.2.r3843
# Version: '5.5.29-55-log'; socket: '/var/run/mysqld/mysqld.sock'; port: 3306
wsrep_node_address=10.9.9.1
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_address=gcomm://10.9.9.1,10.9.9.2,10.9.9.3
#wsrep_cluster_address=gcomm://
wsrep_slave_threads=8
wsrep_sst_method=xtrabackup
wsrep_cluster_name=DPE
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
wsrep_sst_auth=root:
Are there any benchmarks showing the difference in TPS or commit time between setting wsrep_causal_reads to 0 or 1? I did not see any benchmark stating which value of wsrep_causal_reads was used, including the benchmarks done by Percona.
Tobias,
I have not seen much. This is one thing I was going to take a look at 🙂
Hi Peter, thanks so much for this article. I was planning to do exactly the same sort of test as part of preparing for my multi-master talk at the Percona Live conference. This behavior, especially for long transactions, is very much expected so it’s great to see some proof from such a simple benchmark. Another interesting thing to try is generating distributed deadlock through a table hotspot that creates a rollback when you update simultaneously across multiple nodes. I would expect this to become a bigger problem with workloads that (a) have a lot of updates on a small set of rows and (b) as the overall load increases, since this makes the deadlock window bigger when update propagation is delayed in any way. I’ll post my results unless you beat me to it.
Hi Peter,
Great you’re doing tests on the intricacies of Galera, we use it in production, and overall very satisfied with it, but more use cases and advanced testing is needed.
Could you clarify your meaning in your statement,
“First Replication by default in Percona XtraDB Cluster is Asynchronous from Data Propagation Standpoint – it takes time (though short one in this case) for changes committed on the one node to become visible to the other.”
I want to confirm: are you saying your test shows a delay simply between the local commit (not visibility) of the update on one node vs. its visibility on the second node (all nodes, however, would see exactly the same lag from commit to visibility when certification and then the cluster-wide commit is issued)?
Or are you saying it shows data inconsistency between the two nodes in terms of the actual visibility (inconsistency in the cluster wide commit)?
@Mike, what do you mean by a cluster-wide commit? Galera just ensures the transactions get to the DBMS nodes, which certify and commit independently. In this case you can get some differences between when transactions show up on different nodes. In fact, it should be possible for transactions to commit and become visible on replicas before they are visible on the originating master if the master commits more slowly. This could happen if there are differences in the speed of file systems across hosts. (Another interesting case to test.)
Mike – what I mean is that there is a delay between data being committed at DPE1 and it becoming visible at DPE2. I have not tested it, but I assume from the architecture design there is no guarantee that it will be visible at DPE2 and DPE3 at the same time.
If you need to ensure you never read stale data you should be using wsrep_causal_reads=1
Thanks Peter, so indeed you were referring to actual visibility differences on the nodes. When you said "..Asynchronous from Data Propagation Standpoint…" I thought perhaps you meant the time from when the update was first issued (including the time to replicate over and certify).
Robert – when I said cluster-wide commit, I meant at the point at which a global transaction ID is issued, then certification takes place, when I said local commit, I meant when the statement was initiated (meant to write ‘local commit issuance’) just before replication and certification. No argument the commits happen independently (good clarification to bring up) and it’s possible for them to happen at different times, just trying to figure out what was being measured here.
@Mike, thanks for the clarification. One of the issues with Galera overall is that the terminology is sometimes not very clear from the user docs, though of course it’s pretty clear in the heads of the Codership developers. Cheers, Robert
Peter,
The script got totally broken,
so there is no good way to copy-paste it.
Vadim,
Ah… still the highlighting plugin problems. I have changed it to a blockquote; hope it is easier now.
If you’re interested in using Percona xtraDB Cluster for production, I would highly recommend you use something else instead. Trust me on this, it will cost you dearly.
Hi Peter,
Nice article. Recently we shifted to the following new dedicated server. From the first day onwards we were facing a performance issue: MySQL load was very high, though the site was not going down.
Then we did some R&D on the Percona MySQL server and found that Percona needs some configuration settings regarding memory allocation. So we made some memory allocation settings in the my.cnf file. On the first day after making those settings, site performance was good.
But from the second day performance degradation started again, and now it is slow again. MySQL load is not going very high. We have also noticed that many processes stay in a sleeping state for a long time, like 200-300 processes.
Current Server Details (Dedicated Server):
Software:
Apache Version – 2.2.15 (CentOS)
PHP Version – 5.3.3
MySQL – 5.1.66 [Percona XtraDB Cluster (GPL) (5.5.30)] (InnoDB Engine)
PHP My Admin – 5.3.3
Hardware:
Node:2
CPU = Quard Core
RAM = 12 GB
HDD = 1 TB
IP Addresses = 2
Please help me if there is something we missed.
Manish,
Blog post comment is not the best place for asking questions specific to your case. I’d encourage you to ask the question on our forums instead https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster
Jason – Could you clarify what you mean? Why will it cost dearly and what would you propose instead? Percona Server?
In my experience all Percona products are quite stable and are pretty well supported (even if you don’t have the premium consulting services)
Doing a Sysbench test with a standalone MySQL server we got 3000 transactions/sec, and with the Percona cluster we got 400/sec. Is that expected?