One common theme in the questions our MySQL Support customers ask is MySQL replication lag. The story typically goes along the lines of "everything is the same as before, and for some unknown reason the slave has started to lag and is not catching up any more." I always smile at the "nothing has changed" claim, as it is usually wrong; the accurate claim would be "I'm not aware of any change." Digging deeper, we usually find some change, though often a subtle one: computers are state machines, and with no external or internal changes they behave the same way. Let me start by pointing out the common causes of replication lag.

Hardware Faults Hardware faults may reduce performance, causing replication to lag. A failed hard drive putting a RAID array into degraded mode is one common example. Sometimes it is not a fault per se but an operating-mode change that causes the problem: many RAID controllers have to periodically go through a battery learning cycle, which switches the RAID cache to write-through mode and can cause severe performance degradation.
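If you suspect a battery learning cycle, you can often confirm it from the controller itself. A minimal sketch, assuming an LSI/MegaRAID controller with the MegaCli64 utility installed (command paths and output vary by controller model and firmware):

    # Battery (BBU) state -- "Learn Cycle Active: Yes" usually means the
    # controller cache has dropped to write-through mode
    MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | grep -i learn

    # Current cache policy of the logical drives
    MegaCli64 -LDInfo -Lall -aALL | grep -i 'cache policy'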

Configuration Changes Have you changed the MySQL configuration? The OS or hardware configuration? Could anyone have changed it without your knowledge? If yes, this could cause problems with replication lag (and performance in general). Typical problem settings for replication include sync_binlog=1, enabling log_slave_updates, and innodb_flush_log_at_trx_commit=1. Also note there is no such thing as a safe change. Even an "it can only do good" change such as increasing the buffer pool size can cause performance problems.
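A quick way to verify the settings mentioned above on the slave you are diagnosing (a minimal sketch using standard MySQL commands):

    # Show the current values of the variables most often implicated in slave lag
    mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
              ('sync_binlog', 'log_slave_updates', 'innodb_flush_log_at_trx_commit')"

Comparing the output against your my.cnf will also tell you whether someone changed a value at runtime without updating the configuration file.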

MySQL Restart MySQL restarts may include recovery time, but most importantly they usually require a warmup period, because the buffer pool content is lost on restart. Many people who understand this still do not quite recognize how long warmup takes, and removing slave read traffic from a non-warmed-up slave will often make warmup take longer, because it then has to happen in the single replication thread. In the worst-case scenario you might be looking at some 100 pages per second, which is 1.6MB/sec and can take 24+ hours to fill a 200GB buffer pool. The automatic buffer pool save/restore available in Percona Server is a great help in this case.
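The warmup arithmetic is easy to reproduce. A back-of-the-envelope sketch, assuming the default 16KB InnoDB page size:

    # 100 pages/sec * 16KB/page = 1.6MB/sec of warmup throughput.
    # Hours to fill a 200GB buffer pool at that rate:
    echo "scale=1; 200 * 1024 * 1024 / (100 * 16) / 3600" | bc    # ~36.4 hours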

Changing Slave Role Changing a slave's role, such as swapping the active and passive masters in a master-master setup, often causes problems for a similar reason as a MySQL restart: the traffic on the different boxes might be different, and hence so is the cache content.

Replication Traffic Changes Changes in the replication traffic (what updates are being replicated) are a frequent cause of concern. Remember that replication runs as a single thread, and something which might not look like a lot of load for your powerful master may be enough to push replication overboard. Application changes and traffic increases are frequent causes of problems, but more subtle things, such as a "small" batch job which does a lot of updates, can also be the problem. It is a good idea to analyze your replication traffic regularly with mk-query-digest (Percona Server marks the slave thread in the slow query log, so it is easy to filter out its events) so you can see whether there are any suspicious events in the replication traffic. An interesting thing to note: scripts can hurt replication well after they have ended. I've seen cases where an overly eager batch job doing updates from 20 threads on the master created enough updates in a few hours to make replication lag for over 24 hours until they were all applied.
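A sketch of such an analysis, assuming you log slow queries on the slave and that the replication SQL thread shows up as "system user" in the log (the attribute name may differ between versions):

    # Digest only the events coming from the replication SQL thread
    mk-query-digest --filter '($event->{user} || "") =~ m/system user/' \
        /var/log/mysql/mysql-slow.log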

Slave Read Traffic Changes Slave read traffic impacts replication performance dramatically, and even if nothing has changed in the replication traffic, changes to the concurrent traffic on the slave can affect replication capacity a lot. Checking the traffic for changes with mk-query-digest is a great way to deal with this. Note that you need to check not only the types and frequencies of queries but also how much IO they cause and how many rows they examine. The changes could be in the data accessed by the queries or in the query execution plans.
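For a quick signal on whether read queries are suddenly examining more rows, the standard status counters help (a minimal sketch; -r makes mysqladmin report relative per-interval values):

    # Per-10-second counters: a jump in rows read with a flat query rate
    # suggests changed execution plans or data access patterns
    mysqladmin extended-status -r -i 10 | grep -E 'Innodb_rows_read|Questions'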

Side Load on IO Subsystem Could there be any side load on the IO subsystem? This can take many forms: you running a backup through xtrabackup or an LVM snapshot, the RAID controller doing RAID validation, some centrally kicked-off SAN backup, or intensive activity from other tenants sharing physical hardware with you in SAN, NAS or cloud installations. Any of these can be the cause of the problem. If the side load goes through the same operating system, you should be able to see it by comparing the IO MySQL is doing to the IO going to the devices from the OS standpoint. If it is external and not recorded by the operating system, you can often see it as higher IO load (IO utilization, response time) while less work is being done in terms of requests per second. This should be easy to see on graphs; if you have not yet installed something for graphing, check out the Cacti Templates, which contain a lot of such goodness.
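A sketch of the comparison described above, using standard tools (sample intervals are arbitrary):

    # Device-level view: utilization and response time per disk, every 10 seconds
    iostat -x 10

    # MySQL-level view: InnoDB IO counters to compare against the device numbers
    mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_data_%'"

If the devices show high utilization while the MySQL-side counters stay flat, the extra load is coming from somewhere else.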

Data Size Changes The more data you have, the more IO-bound the workload is likely to be, and it is also possible that queries become more complicated, scanning more rows than before. Data growth is a frequent cause of performance problems, slave lag among them. The relationship is also very workload dependent: some workloads may see only a modest slowdown when the data doubles, while in other cases a 10-20% difference in data size can be the difference between a slave that never lags and a slave that can never catch up.
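Tracking data size over time makes this cause easy to spot. A minimal sketch using information_schema (note the statistics are approximate for InnoDB):

    # Total data + index size per schema, in GB
    mysql -e "SELECT table_schema,
                     ROUND(SUM(data_length + index_length)/1024/1024/1024, 1) AS size_gb
              FROM information_schema.tables
              GROUP BY table_schema"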

MySQL or OS Version Changes MySQL version changes (even a minor upgrade) may introduce issues which cause replication lag. The same may happen with the operating system and other system software involved. It could even be a system which queries MySQL: I've seen monitoring updates which added an excessively expensive and frequent query on information_schema and caused a lot of issues. Note there is a lot of complicated software running on the devices themselves now, so watch for things like firmware and BIOS updates which, even though infrequent, may be the cause of issues.
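If you suspect a monitoring change like the one described, the processlist can reveal long-running information_schema queries (a minimal sketch; the query will also match itself, so ignore that row):

    # Look for expensive queries against information_schema, e.g. from monitoring
    mysql -e "SELECT user, host, time, info FROM information_schema.processlist
              WHERE info LIKE '%information_schema%' ORDER BY time DESC\G"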

Finally, do not be mistaken: it is not always just one issue which is responsible for everything. Quite often many things come together to form the perfect storm: a bit of data growth, plus a slightly different workload, plus a backup taken at the wrong time may together be responsible for unexpected MySQL replication lag.

12 Comments

James Cohen

Is it worth adding “checking that master/slave system clocks are synced” in there?!

Rajaram Satyanarayanan

Hi Peter,
Thank you for this valuable checklist for replication lag. I do have some questions about configuration and network. In the my.cnf config, the binary log format is a key factor; is there any specific reason it has been omitted? Also, could you please share your experience with 5.6 semi-sync replication and the newer multi-threaded implementation of slave threads?

Thanks,
Satya

NSB

Hi Peter,

I have a problem with MySQL master-master replication:
Master A = in the production environment
Master B = at the DR site

Changes from Master A to B are not replicated, while changes from Master B to A are replicated. Master A is in the production environment, is in use, and has just 2 relay log files, while Master B was working up until yesterday and all of a sudden it is not moving ahead (not updating); it is stuck at one position. When I run SHOW SLAVE STATUS\G, both the IO and SQL threads say Yes and the status is "Waiting for master to send event", but no changes are being replicated. This is a problem for me. How can I get rid of this? One more thing: there are too many relay log files in Master B's /var/lib/mysql directory.

Jonathan Valerio

I had an issue with replication lag before, and I ended up determining that the cause of the unstable Seconds_Behind_Master value was both slaves having the same server-id.

james ransom

I have noticed that lag can happen out of the blue; as this article points out, it can happen for many reasons. You can use PHP to detect how far the machine is behind and send yourself a text message. Here is the code:

http://www.jamesransom.net/?p=196

Chan Kim

I also noticed that the slave starts reporting an incorrect Seconds_Behind_Master value if the system time changes on the slave. It does not matter much if the time decreases, but if it increases, the replication thread reports wrong values. A quick fix was to restart replication, or just never let the system time drift too much.

John Deliver

Thanks for your good work, Peter Zaitsev! So, what is the solution for this issue?

Drew Daniels

Any ideas about occasionally having it be off by 14399 seconds? I know in 5.0 there were some occasional incorrect values that were fixed.

Rishav Rastogi

Now that MySQL 5.6 has multi-threaded support for slaves, how will this change the lag situation?

Boris

Hello. In my case, innodb_flush_log_at_trx_commit=2 really helped. On my slave, Seconds_Behind_Master was 173000, and now this difference is gone.