April 23, 2014

5 reasons why MySQL replication lag is flapping between 0 and XXXXX

Working day to day with Percona Remote DBA customers, we face an issue from time to time where MySQL replication lag flaps between 0 and XXXXX constantly – i.e. Seconds_Behind_Master is 0 for a few seconds, then jumps to something like 6287 or 25341, drops back to 0, and so on. I would like to describe 5 different scenarios and their symptoms – some are obvious and some are not.
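
To make the symptom concrete, polling the lag on an affected slave looks roughly like this (a sketch; the sample output values are illustrative):

    # Poll replication lag once per second on the slave
    while true; do
        mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
        sleep 1
    done
    # Typical flapping output:
    #   Seconds_Behind_Master: 0
    #   Seconds_Behind_Master: 6287
    #   Seconds_Behind_Master: 0
    #   Seconds_Behind_Master: 25341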

1. Duplicate server-ids on two or more slaves.

Symptoms: the MySQL error log on the slave shows the slave thread constantly connecting to and disconnecting from the master.
Solution: check that all nodes in the replication topology have unique server-ids.
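
A quick sanity check is to pull server_id from every node and eyeball the values (a sketch; the host names are placeholders for your topology):

    # Print each node's server_id; every value must be unique
    for h in master1 slave1 slave2; do
        echo -n "$h: "
        mysql -h "$h" --skip-column-names -e "SELECT @@server_id;"
    done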

2. Dual-master setup, “log_slave_updates” enabled, server-ids changed.

Scenario: you stop MySQL on the first master, then on the second one. Afterwards, you perform some migration or maintenance. Then you decide it would be better to change the server-ids to match the end part of the new IP addresses. You bring everything back online and notice strange behaviour: a portion of the binlogs cycles between the masters because log_slave_updates is enabled. But how can that be if the new server-ids are different? It can happen when some data was written to the second master before its shutdown, while the first one was already off. Since the server-id recorded in those binlog events no longer matches either node's new server-id, both nodes treat the cycling events as not their own, apply them, and pass them down the replication stream again. So we have got an infinite loop.
Solution: simply reset the slave position anew on one of the masters, so the “ghost” binlog events finally stop cycling.
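
A sketch of such a reset on one of the masters (the binlog file and position below are placeholders; take the current coordinates from SHOW MASTER STATUS on the other node):

    # Stop replication, discard the old relay logs and re-point the
    # slave thread at the other master's current coordinates
    mysql -e "STOP SLAVE; RESET SLAVE;"
    mysql -e "CHANGE MASTER TO
                MASTER_LOG_FILE='mysql-bin.000123',
                MASTER_LOG_POS=4;"   # placeholders: take from SHOW MASTER STATUS
    mysql -e "START SLAVE;"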

3. MySQL options “sync_relay_log”, “sync_relay_log_info”, “sync_master_info”.

Symptoms: according to the “SHOW SLAVE STATUS” output, at one moment the slave thread queues events and shows some delay, at another it waits for the master showing 0 lag, and so on. Considering the real master position, there should indeed be an XXXXX delay. Other symptoms are IO saturation and a high disk IOPS number, while the disk is only about half busy as shown by pt-diskstats. In my case I observed 1500 IOPS on the master and 10x more, 15000 IOPS at 60% busy, on the slave. You may think it could be row-based replication (binlog_format=ROW) and constant updates on the master, but then why is the delay flapping and disk IO high yet not at 100% of capacity?
Possible reason: check whether you have sync_relay_log, sync_relay_log_info or sync_master_info enabled, which make the slave sync data to disk on every event; e.g. sync_relay_log=1 ensures one write to the relay log per statement if autocommit is enabled, and one write per transaction otherwise. Enabling those options makes your system slow unless the disk has a battery-backed cache, which makes synchronization very fast.
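
To check whether that is the case on the slave (a sketch; relaxing these settings trades crash-safety of the relay log for speed):

    # Show the current sync settings; a value of 1 means an fsync on every event
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'sync_%';"
    # Relax them at runtime if losing relay-log durability on a crash is acceptable
    mysql -e "SET GLOBAL sync_relay_log = 0;
              SET GLOBAL sync_relay_log_info = 0;
              SET GLOBAL sync_master_info = 0;"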

4. Network latency.

Symptoms: there is a poor network link between master and slave. With a saturated link the slave's IO_thread has a hard time keeping up with the master. The SQL thread reports zero lag while it simply waits for more events to be stored by the IO_thread. Once the IO_thread is able to write more into the relay log, the SQL_thread will report XXX seconds behind again.
Solution: check the network throughput.
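
Besides raw throughput tests, comparing positions on both sides shows whether the IO_thread is the one falling behind (a sketch; master1 and slave1 are placeholder host names):

    # On the master: current binlog file and position
    mysql -h master1 -e "SHOW MASTER STATUS\G" | grep -E 'File|Position'
    # On the slave: how far the IO_thread has read from the master
    mysql -h slave1 -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Log_File|Read_Master_Log_Pos'
    # A steadily growing gap between the master's Position and the slave's
    # Read_Master_Log_Pos (with Slave_IO_Running: Yes) points at the network.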

5. Late committed transactions.

Scenario: if you open a transaction on the master and then for some reason (maybe an application bug or frontend problems) commit it some time later, it will show up as replication latency on the slave. For example, if you wait one hour before committing, the slave will report 3600 seconds of latency while that transaction is being executed.
Solution: use the pt-heartbeat tool.
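
pt-heartbeat measures the delay from a replicated timestamp instead of Seconds_Behind_Master, so a long-open transaction does not distort it. A minimal setup sketch (the database name is a placeholder):

    # On the master: update a heartbeat row every second
    pt-heartbeat --database percona --create-table --update --daemonize
    # On the slave: report the real delay from the replicated timestamp
    pt-heartbeat --database percona --monitor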

About Roman Vynar

Roman is a Remote DBA at Percona. His duties include handling remote DBA operations and developing and improving monitoring tools. He also maintains the Percona Monitoring Plugins project.

Comments

  1. Rolf says:

    Can you explain how simply resetting the master slave position in #2 will fix the problem? What about missed statements that weren’t executed after the skip?

  2. Roman Vynar says:

Rolf, I didn’t write that there were missed statements, and in fact there were none, as was proven by running table checksumming afterwards; the app was also turned off. It was just my assumption of what it could be…

  3. Rolf says:

Roman, it was an excellent summary. Still, with the given “solutions” it can be dangerous for those who are looking at how to fix an issue they have.

“Enabling those options makes your system slow unless the disk has a battery-backed cache, which makes synchronization very fast.” MySQL recognizes the fact that you have a battery-backed cache?

  4. Ernie Souhrada says:

    @Rolf–

    MySQL itself has no clue as to whether or not your machine has a battery-backed write cache; that all happens at the OS level and below. A RAID controller with BBWC (and write-back caching enabled – write-through caching is a performance killer) is generally going to return that a write or an fsync has completed back to the OS as soon as it hits controller memory rather than waiting for the bits to physically hit the disk. Hence, much higher throughput.

  5. Neil says:

    I think I’ve seen this issue in the past when the server time was set incorrectly when mysql started, and then updated after mysql had been running. The replication lag would flap between 0 and the time difference, say, 3600 if the adjustment was 1 hour.

    Restarting mysql would fix the problem.

  6. Jon says:

Could you go into more depth as to why and how #5 occurs?

  7. Jo Valerio says:

Had this kind of issue before, and I wrongly blamed the sysadmins and network engineers, thinking there was an ongoing network issue. It turned out that a junior DBA had copied the config from another slave without changing the server_id.

Great article that summarizes all the common issues around replication lag.
