MySQL Replication is useful and easy to setup. It is used for very different purposes. For example:

  • split read and writes
  • run data mining or reporting processes on them
  • disaster recovery

Is important to mention that a replication server is not a backup by itself. A mistake on the master, for example a DROP DATABASE or an UPDATE without a WHERE clause will be replicated nearly instantly to all slave servers and just having a slave is not going to be helpful here. How can we avoid that kind of mistakes? Having a slave server lagging behind.

pt-slave-delay

pt-slave-delay is another tool that comes with Percona Toolkit and it is very easy to use. It works stopping and starting the SQL Thread in order to archive the lag that you have specified in command line. Lets see an example:

With “–delay=10m” I’m asking the tool to lag the slave 10 minutes behind. The rest of options are just to specify how to connect to the slave. Piece of cake 🙂 If you run the tool in order to lag a slave, for example 2 hours behind, you could have enough time to stop the replication and skip an offending query executed by mistake on the master.

Conclusion

We can improve the security and availability of our infrastructure using delayed replication slave. Is useful not only for disaster recovery but also to test how our application deal with lagged replication.

With MySQL 5.6 this technique will be included natively:

http://dev.mysql.com/doc/refman/5.6/en/replication-delayed.html


That command will do the magic, until we have 5.6 as GA pt-slave-delay can help us to get the same functionality.

5 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Sheeri Cabral

Sadly, you have to know how long you want a delay for. An hour? 4 days? I’ve had developers who dropped tables and databases who came to me a week later (and in one case a month later!) and asked if there was any way to retrieve the missing data.

And there was, due to backups. I always make logical (export with mysqldump) backups and physical backups, so that if I just need to restore one database/table, I can. I tried using pt-slave-delay and did all the work to set it up, and all the work to ensure that monitoring checked to make sure replication wasn’t completely broken, just on a 4-hour delay…..it’s a difficult problem to solve. What if replication breaks entirely? how are you monitoring that?

Twirrim

In a previous job this tool was absolutely essential (under its previous mk- guise). We had a developer who was fond of dropping databases and tables in production without checking with sysadmin first, and unfortunately someone we couldn’t block access to for various reasons. We ran a 24 hour lagged slave as an essential way of staying sane.

mike morse

This is one of my favorite pt/mk tools, probably not unreasonable to say data poisoning, (particularly admin error) is one of the more frequent causes of bad/lost data. I’ve seen it first hand in the form of a senior linux admin mistakenly working in the wrong environment, ‘poof’ half a table disappeared from the master and 3 real-time slaves in a production environment. Quite a headache to recover from backups and the binlog, a delayed slave makes life a lot easier. I’ve found a delay of 4 hours usually covers the most serious cases where either the admin realizes the mistake immediately, and/or it’s readily apparent from the behavior of the environment.

However, as noted above, a slave (even delayed) is not a silver bullet, if it’s past the delay time, you’re out of luck, conversely the longer you set the delay, the more you’ll need to catch up, along with the fact that the delay can be unpredictable (single-threaded execution and I/O lag). Goes without saying multiple logical/physical backups are a must.

Sheeri, are there any issues pt-heartbeat doesn’t solve regarding the monitoring issue?

Sheeri Cabral

Mike – in fact, that’s how we dealt with the monitoring issue. But pt-heartbeat doesn’t stay up, iirc whenever we restarted mysql we had to make sure pt-heartbeat was started. It’s another thing to worry about. But yeah, it worked, and when pt-heartbeat died, we got paged, so at least there are no SILENT failures.

Frankly, I’ve found that one time having a painful recovery leads to it happening a LOT less frequently. Making it easy to recover doesn’t train people nearly as effectively.

Sheyda

Is mysql ‘start slave until master-log-file master-log-pos’ reliable enough to synch up two slaves with their master?

I have a set up now with master and 2 slaves off of it, let’s call it slave1 and slave2. I want to swing slave 2 behind slave1 to make it a chain replication , master => slave 1 = > slave 2. The only option I see to stop slave on slave2 and wait a few seconds to make sure master_log_position on slave 1 is ahead of slave 2 and stop slave 2, take the master positions off of it and start slave until these positions on slave 2 to make them in sync and to further be able to make the chain replication work, without breaking replication.

Please advise.