April 21, 2014

Another reason why SQL_SLAVE_SKIP_COUNTER is bad in MySQL

It is everywhere in the world of MySQL that if your replication is broken because an event caused a duplicate key or a row was not found and it cannot be updated or deleted, then you can use ‘ STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; ’ and be done with it. In some cases this is fine and you can repair the offending row or statements later on. But what if the statement is part of a multi-statement transaction? Well, then it becomes more interesting, because skipping the offending statement will cause the whole transaction to be skipped. This is well documented in the manual by the way. So here’s a quick example.

3 rows on the master:

2 on the slave:

Execute a transaction on the master to break replication:

Broken slave:

An attempt to fix replication only caused bigger inconsistencies on slave:

This happens because the replication honors transaction boundaries, and is definitely something you should consider the next time you try to use this workaround on a broken slave. Of course, there is pt-table-checksum and pt-table-sync to rescue you when inconsistencies occur, however, prevention is always better than cure. Make sure to put safeguards in place to prevent your slaves from drifting.

Lastly, the example above is for ROW-based replication as my colleague pointed out, but can similarly happen with STATEMENT for example with a duplicate key error.  You can optionally fix the error above by temporarily setting slave_exec_mode to IDEMPOTENT so errors because of missing rows are skipped, but then again, it does not apply in all cases like an UPDATE statement that cannot be applied because the row on the slave is missing.

Here is a demonstration of the problem with STATEMENT-based replication:

 

About Jervin Real

Jervin is a member of Percona's Rapid Response Consulting team. When you come to Percona for consulting, chances are he'll be greeting you first. His primary role is to make sure customer issues are handled efficiently and professionally. Jervin joined Percona in May 2010.

Comments

  1. vishnu rao says:

    An eye opener. Cool post.

  2. quaker says:

    I wonder, how anyone can use in production MySQL database knowing, that every thing is made “partial”. We give you row based replication which is, quoting from manual: “(..) is the same as in most other database management systems; knowledge about other systems transfers to MySQL.”, but hey – it can make your slave drift, because we broken something. Of course you can report this to our bug system, and maybe in this decade we will fix it.

    Yes, this is my point of view and maybe I dramatize to much. But for example – PostgreSQL had make it replication “slave drift safe”. I wonder how they could done that and MySQL is unable to do it for a long time.

    And now technical question. What is the reason that such errors are likely to happen? I mean – database is running and some day you noticed that your slave is out if sync? Mysql is missing something like CRCs for records? I assume that queries sent to master are written to be “replication safe”.

  3. Jervin Real says:

    @quaker

    There is a number of reasons your slave can get out of sync i.e. a crash on the master where sync_binlog = 0, writing directly on the slave, abuse of sql_log_bin, non-deterministic queries, etc. Many of these are likely to be user triggered choices or mistake if not bugs. On Percona Server 5.5 and MySQL 5.6 there are additional measures to prevent some of these like crash-safe-replication http://www.mysqlperformanceblog.com/2013/07/15/crash-resistant-replication-how-to-avoid-mysql-replication-errors/ binary log checksums http://www.mysqlperformanceblog.com/2013/02/15/replication-checksums-in-mysql-5-6/ while some measures are left for the users to implement i.e. setting your slaves read_only and limiting users with SUPER privileges.

  4. Eye opener indeed! I didn’t know it killed the entire transaction. Good stuff.

Speak Your Mind

*