May 19, 2013

Here’s a quick way to Foresee if Replication Slave is ever going to catch up and When!

If you ever had a replication slave that is severely behind, you probably noticed that it’s not catching up with a busy master at a steady pace. Instead, the “Seconds behind master” is going up and down so you can’t really tell whether the replica is catching up or not by looking at just few samples, unless these are spread apart. And even then you can’t tell at a glance when it is going to catch up.

Normally, the “severely behind” thing should not happen, but it does often happen in our consulting practice:

  • sometimes replication would break and then it needs to catch up after it is fixed,
  • other times new replication slave is built from a backup which is normally hours behind,
  • or, it could be that replication slave became too slow to catch up due to missing index

Whatever the case is, single question I am being asked by the customer every time this happens is this: When is the replica going to catch up?”

I used to tell them “I don’t know, it depends..” and indeed it is not an easy question to answer. There are few reasons catching up is so unstable:

  1. If you have restarted the server, or started a new one, caches are cold and there’s a lot of IO happening,
  2. Not all queries are created equal – some would run for seconds, while others can be instant,
  3. Batch jobs: some sites would run nightly tasks like building statistics tables or table checksum – these are usually very intense and cause slave to backup slightly.

I didn’t like my own answer to The question, so I decided to do something about it. And because I love awk, I did that something in awk:

I don’t know if this is ever going to become a part of a Percona Toolkit, however since it’s pretty much a one-liner, I just keep it in my snippets pool for easy copy’n'paste.

Here’s a piece of an output from a server that was almost 27 days behind just yesterday:

Of course, it is still not perfectly accurate and it does not account for any potential changes in queries, workload, warm-up, nor the time it takes to run the mysql cli, but it does give you an idea and direction that replication slave is going. Note, negative values mean replication isn’t catching up, but values themselves are mostly meaningless.

Here’s what the weird acronyms stand for:

  • s_behind – current Seconds_Behind_Master value
  • d_behind – number of days behind based on current s_behind
  • c_sec_s – how many seconds per second were caught up during last interval
  • eta_d – this is ETA based on last interval
  • O_c_sec_s – overall catch-up speed in seconds per second
  • O_eta_d – ETA based on overall catch-up speed (in days)
  • O_eta_h – same like previous but in hours

Let me know if you ever find this useful.

About Aurimas Mikalauskas

Aurimas joined Percona in 2006, a few months after Peter and Vadim founded the company. His primary focus is on high performance, but he also specializes in full text search, high availability, content caching techniques and MySQL data recovery.

Comments

  1. Table checksums should not add delay, because pt-table-checksum won’t do anything until replication is caught up.

  2. Michael Wehrle says:

    Can you trust “Seconds_Behind_Master” to be accurate? I have been rebuilding servers lately for a client, and typically by the time we get the new slave built, its between 24-36 hours behind, and something like a dozen binlogs “behind”. What I have found is that the “Seconds_Behind_Master” doesn’t even show non-zero until the slave is on the same binlog as the master. That doesn’t even account for the fact that its often very inaccurate.

    Also… 27 days behind? Why not rebuild the slave with a hot backup from the master?

  3. Michael Wehrle says:

    Also, the code is quite nice actually. Succint and useful, given the intent.

  4. Baron,

    normally – indeed pt-table-checksum wouldn’t do that. Interestingly though, the slave in the example above was actually a [3 days] delayed slave (using pt-slave-delay) but pt-table-checksum was configured to not check its status, for a pretty obvious reason that it will always be at least 3 days behind. We still want checksums to run though, to check that our last-resort backup is consistent.

  5. Michael, -

    it’s not completely accurate, but you also have to be sure you know what the number means. It is NOT “how long until the slave will catch up”, but rather “what is the difference between NOW() and the time when master was writing the events [I'm currently executing] into the binary log”. In that sense, it is accurate to one second.

    One problem with Seconds_Behind_Master is that if will show zero if it is IO thread, not SQL thread that is behind.

  6. Bill Willits says:

    Very nice post. And, super-useful. I’ve had the same need for this tool many times. Thanks for creating.

  7. aftab says:

    Would mk-slave-prefetch help in this situation?

  8. Thanks for sharing the script.

    The line with:
    grep Seconds_Behind_Master | awk ‘{print \$2}’

    Could be rewritten as:
    awk ‘/Seconds_Behind_Master/ { print \$2 }’

    It might be possible to use getline and make it more pure awk.
    http://www.gnu.org/software/gawk/manual/html_node/Getline_002fVariable_002fPipe.html#Getline_002fVariable_002fPipe

    AWK is a great tool. The only downside is that it’s probably not available by default on most Windows and other non-UNIX (like) operating systems.

    Rewriting this script to run server side (stored procedure/view/etc) seems favorable, but is not possible:
    http://code.openark.org/blog/mysql/reading-results-of-show-statements-on-server-side

  9. aftab, it might, though I’d probably try Yoshinori’s Replication Booster or Domas’ Replication prefetcher first, if need be.

  10. Daniël, -

    thanks for your tips!

  11. Thanks for sharing this script! It is quite useful!

Speak Your Mind

*