One common theme in the questions our MySQL Support customers ask is MySQL replication lag. The story typically goes along the lines of "everything is the same as before, and for some unknown reason the slave has started to lag and is not catching up any more." I always smile at the "nothing has changed" claim, as it is usually wrong; the accurate claim would be "I'm not aware of any change." Digging deeper, we usually find some change, though often a subtle one: computers are state machines, and with no external or internal changes they behave the same way. Let me start by pointing out the common causes of replication lag.

Hardware Faults Hardware faults may reduce performance, causing replication to lag. A failed hard drive putting a RAID array into degraded mode is one common example. Sometimes it is not a fault per se but an operating-mode change that causes the problem: many RAID controllers have to periodically go through a battery learning cycle, which switches the RAID cache to write-through mode and can cause severe performance degradation.
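If you suspect a battery learning cycle, you can often confirm it from the controller itself. A minimal sketch, assuming an LSI/MegaRAID controller with the MegaCli64 utility installed (command paths and output vary by controller model and firmware):

    # Battery (BBU) state -- "Learn Cycle Active: Yes" usually means the
    # controller cache has dropped to write-through mode
    MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | grep -i learn

    # Current cache policy of the logical drives
    MegaCli64 -LDInfo -Lall -aALL | grep -i 'cache policy'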

Configuration Changes Have you changed the MySQL configuration? The OS or hardware configuration? Could anyone have changed it without your knowledge? If yes, this could cause problems with replication lag (and performance in general). Typical problem settings for replication include sync_binlog=1, enabling log_slave_updates, and innodb_flush_log_at_trx_commit=1. Also note there is no such thing as a safe change. Even an "it can only do good" change such as increasing the buffer pool size can cause performance problems.
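A quick way to verify the settings mentioned above on the slave you are diagnosing (a minimal sketch using standard MySQL commands):

    # Show the current values of the variables most often implicated in slave lag
    mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
              ('sync_binlog', 'log_slave_updates', 'innodb_flush_log_at_trx_commit')"

Comparing the output against your my.cnf will also tell you whether someone changed a value at runtime without updating the configuration file.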

MySQL Restart MySQL restarts may include recovery time, but most importantly they usually require a warmup period, because the buffer pool content is lost on restart. Many people who understand this still do not quite recognize how long warmup takes, and removing slave read traffic from a non-warmed-up slave will often make warmup take longer, because it then has to happen in the single replication thread. In the worst-case scenario you might be looking at some 100 pages per second, which is 1.6MB/sec and can take 24+ hours to fill a 200GB buffer pool. The automatic buffer pool save/restore available in Percona Server is a great help in this case.
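The warmup arithmetic is easy to reproduce. A back-of-the-envelope sketch, assuming the default 16KB InnoDB page size:

    # 100 pages/sec * 16KB/page = 1.6MB/sec of warmup throughput.
    # Hours to fill a 200GB buffer pool at that rate:
    echo "scale=1; 200 * 1024 * 1024 / (100 * 16) / 3600" | bc    # ~36.4 hours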

Changing Slave Role Changing a slave's role, such as swapping the active and passive masters in a master-master setup, often causes problems for a similar reason as a MySQL restart: the traffic on the different boxes might be different, and hence so is the cache content.

Replication Traffic Changes Changes in the replication traffic (what updates are being replicated) are a frequent cause of concern. Remember that replication runs as a single thread, and something which might not look like a lot of load for your powerful master may be enough to push replication overboard. Application changes and traffic increases are frequent causes of problems, but more subtle things, such as a "small" batch job which does a lot of updates, can also be the problem. It is a good idea to analyze your replication traffic regularly with mk-query-digest (Percona Server marks the slave thread in the slow query log, so it is easy to filter out its events) so you can see whether there are any suspicious events in the replication traffic. An interesting thing to note: scripts can hurt replication well after they have ended. I've seen cases where an overly eager batch job doing updates from 20 threads on the master created enough updates in a few hours to make replication lag for over 24 hours until they were all applied.
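A sketch of such an analysis, assuming you log slow queries on the slave and that the replication SQL thread shows up as "system user" in the log (the attribute name may differ between versions):

    # Digest only the events coming from the replication SQL thread
    mk-query-digest --filter '($event->{user} || "") =~ m/system user/' \
        /var/log/mysql/mysql-slow.log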

Slave Read Traffic Changes Slave read traffic impacts replication performance dramatically, and even if nothing has changed in the replication traffic, changes to the concurrent traffic on the slave can affect replication capacity a lot. Checking the traffic for changes with mk-query-digest is a great way to deal with this. Note that you need to check not only the types and frequencies of queries but also how much IO they cause and how many rows they examine. The changes could be in the data accessed by the queries or in the query execution plans.
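For a quick signal on whether read queries are suddenly examining more rows, the standard status counters help (a minimal sketch; -r makes mysqladmin report relative per-interval values):

    # Per-10-second counters: a jump in rows read with a flat query rate
    # suggests changed execution plans or data access patterns
    mysqladmin extended-status -r -i 10 | grep -E 'Innodb_rows_read|Questions'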

Side Load on IO Subsystem Could there be any side load on the IO subsystem? This can take many forms: you running a backup through xtrabackup or an LVM snapshot, the RAID controller doing RAID validation, some centrally kicked-off SAN backup, or intensive activity from other tenants sharing physical hardware with you in SAN, NAS or cloud installations. Any of these can be the cause of the problem. If the side load goes through the same operating system, you should be able to see it by comparing the IO MySQL is doing to the IO going to the devices from the OS standpoint. If it is external and not recorded by the operating system, you can often see it as higher IO load (IO utilization, response time) while less work is being done in terms of requests per second. This should be easy to see on graphs; if you have not yet installed something for graphing, check out the Cacti Templates, which contain a lot of such goodness.
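A sketch of the comparison described above, using standard tools (sample intervals are arbitrary):

    # Device-level view: utilization and response time per disk, every 10 seconds
    iostat -x 10

    # MySQL-level view: InnoDB IO counters to compare against the device numbers
    mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_data_%'"

If the devices show high utilization while the MySQL-side counters stay flat, the extra load is coming from somewhere else.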

Data Size Changes The more data you have, the more IO-bound the workload is likely to be, and it is also possible that queries become more complicated, scanning more rows than before. Data growth is a frequent cause of performance problems, slave lag among them. The relationship is also very workload dependent: some workloads may see only a modest slowdown when the data doubles, while in other cases a 10-20% difference in data size can be the difference between a slave that never lags and a slave that can never catch up.
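Tracking data size over time makes this cause easy to spot. A minimal sketch using information_schema (note the statistics are approximate for InnoDB):

    # Total data + index size per schema, in GB
    mysql -e "SELECT table_schema,
                     ROUND(SUM(data_length + index_length)/1024/1024/1024, 1) AS size_gb
              FROM information_schema.tables
              GROUP BY table_schema"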

MySQL or OS Version Changes MySQL version changes (even a minor upgrade) may introduce issues which cause replication lag. The same may happen with the operating system and other system software involved. It could even be a system which queries MySQL: I've seen monitoring updates which added an excessively expensive and frequent query on information_schema and caused a lot of issues. Note there is a lot of complicated software running on the devices themselves now, so watch for things like firmware and BIOS updates which, even though infrequent, may be the cause of issues.
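If you suspect a monitoring change like the one described, the processlist can reveal long-running information_schema queries (a minimal sketch; the query will also match itself, so ignore that row):

    # Look for expensive queries against information_schema, e.g. from monitoring
    mysql -e "SELECT user, host, time, info FROM information_schema.processlist
              WHERE info LIKE '%information_schema%' ORDER BY time DESC\G"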

Finally, do not be mistaken: it is not always just one issue which is responsible for everything. Quite often many things come together to form the perfect storm: a bit of data growth, plus a slightly different workload, plus a backup taken at the wrong time may together be responsible for unexpected MySQL replication lag.

12 Comments

James Cohen

Is it worth adding “checking that master/slave system clocks are synced” in there?!

Rajaram Satyanarayanan

Hi Peter,
Thank you for this valuable checklist for replication lag. I do have some questions about configuration and network. In the my.cnf config, the binary log format is a key factor; is there any specific reason it has been omitted? Also, could you please share your experience with 5.6 semi-sync replication and the newer multi-threaded implementation of slave threads?

Thanks,
Satya

NSB

Hi Peter,

I have a problem with MySQL master-master replication:
Master A = in the production environment
Master B = at the DR site

Changes from Master A to B are not replicated, while changes from Master B to A are replicated. Master A is in the production environment, is in use, and has just 2 relay log files, while Master B was working up until yesterday and all of a sudden it is not moving ahead (not updating); it is stuck at one position. When I run SHOW SLAVE STATUS\G, both the IO and SQL threads say Yes and the status is "Waiting for master to send event", but no changes are being replicated. This is a problem for me. How can I get rid of this? One more thing: there are too many relay log files in Master B's /var/lib/mysql directory.

Jonathan Valerio

I had an issue with replication lag before, and I ended up determining that the cause of the unstable Seconds_Behind_Master value was both slaves having the same server-id.

james ransom

I have noticed that lag can happen out of the blue; as this article points out, it can happen for many reasons. You can use PHP to detect how far the machine is behind and send yourself a text message. Here is the code:

http://www.jamesransom.net/?p=196

Chan Kim

I also noticed that the slave starts reporting an incorrect Seconds_Behind_Master value if the system time changes on the slave. It does not matter much if the time decreases, but if it increases, the replication thread reports wrong values. A quick fix was to restart replication, or just never let the system time drift too much.

John Deliver

Thanks for your good work, Peter Zaitsev! So, what is the solution for this issue?

Drew Daniels

Any ideas about occasionally having it be off by 14399 seconds? I know in 5.0 there were some occasional incorrect values that were fixed.

Rishav Rastogi

Now that MySQL 5.6 has multi-threaded support for slaves, how will this change the lag situation?

Boris

Hello. In my case, innodb_flush_log_at_trx_commit=2 really helped. On my slave, Seconds_Behind_Master was 173000, and now this difference is gone.