Quite frequently I see customers looking at recovery as on ability to restore data from backup which can be far from being enough to restore the whole system to operating state, especially for complex systems.
Instead of looking just at data restore process you better look at the whole process which is required to bring system to the working state, including data consistency requirements and times. This has to be considered for different data loss scenarios which may happen.
Let us look at simple example – a master with 1TB of database size replicating to 50 servers in 5 different Data Centers via single Replication Relay server in each. Forget the single point of failure for the second and just think what problems we may have to deal with.
First lets look at the master. What may happen to it ? We can have Master having soft crash in which case it will be unavailable for some time but we can get all the data back… or sort of. In practice you have to be very careful such as using sync-binlog innodb_flush_logs_at_trx_commit=1 and only using Innodb tables to be OK in most cases. There are still some edge cases such as modifying meta data stored in MyISAM tables which can get master out of sync with slaves in case of soft crash. Unless you got into one of these rare cases slave should be able to continue after Master is back online.
Do you have to wait for master to recover ? This is where your data consistency requirements come in play. Remember replication is asynchronous so whenever you switch to the slave in case of master failure you may loose transactions. Google semi-synchronous replication patches can help with it… but they are not yet in the stock MySQL. Yet another way is using DRBD to get a standby MySQL server or at least synchronously replicated master binary logs. If you can’t loose any single transaction you’ve can’t simply switch to the slave.
What if you can ? The switch to the slave in this case is not very easy too – all slaves can be on different positions at the master and you need to pick the most up to date to promote. Plus you need to recompute positions as they should be on promoted slave and slave should have –log-slave-updates enabled so it somethat has copy of master logs. In many cases I’ve seen people do not do that and simply point slaves to the starting position of the promoted master – this is dangerous because you’re risking all slaves to be inconsistent withe each other, plus if server was seriously behind you’re risking to get major inconsistence because relay logs will be lost if you just re-point slave. So at least you should wait for slave to process all its relay logs before re-pointing it.
Interesting enough Google has solution for us again which comes as “log mirroring” patches which make sure Slaves has copy of logs as they are on the master.
Now what do you do in case of hard crash this is when the data is lost on the master ? This is when you have master data lost, such as you have RAID or disk failure. Though it also can be things like Innodb corruption or soft crash which you can’t recover promptly enough.
In this case most typically you would plan recovery by switching to the slave (as described) or standby server via DRBD or SAN.
As you can see we never mentioned recovering from backup so far. It will be needed in the worst case of data loss which is trashing the data which gets to all slaves via replication. This can be caused by user or application error or security breach.
What choices do you have in this case ? Your main options are using Backup or Slave with delayed replication (which you already could have set up with mk-slave-delay from Maatkit.
Delayed slave is especially helpful if application can operate with just master as in this case you can switch very quickly (just skipping bad statements and catching up)
The main challenge in such failure is the fact you have many trashed copies to deal with. If you have just one or several small tables corrupted you can reload them. One option is to reload them on the master (and they will be replicated down to all slaves) the faster however (especially if you have many tiers of replication) is to bring all slaves to the same point in time and load data locally with SQL_LOG_BIN=0 set for session.
If the large portion of data trashed you may need to recover full database on all slaves which is best done in binary mode for large data sets. Such global recovery can also put very high stress on your network and backup storage and take a lot of time. It also may be extremely difficult to get the large backup in timely fashion over long distance network, meaning it is best to have local backup (and delayed slave if you use one) in each data center you have.
The complexity of recovery is another “liability” of compex replication tree setup. On the contrary sharded master-master pairs (or master with few slaves) are much easier to deal with.
Recovering the data with replication you always have to keep replication positions in mind. Such if you recover master you need to recover slaves to matching snapshot – either it has to be same state (which is hard to manage) or you need to ensure you understand the position on the master to which backup corresponds to. This becomes more complex if you have complex replication hierarchy as slave only knows its position on its own master not on the “root” master.
Note there are also some solutions based on “Continuous Data Protection” class of backup which can be very helpful to go back in time with your data. One of vendors offering solution for MySQL is R1Soft. Though I have not had a chance to look at it in details.
What is about slave loss ? The slave loss is normally less of the problem. You can reclone slave from the master, another slave or restore from backup. So this is just question of having decent capacity planning (such as being able to shut off 2 slaves and still operate normally), have LVM setup if you want to avoid shutting off slave or master to clone the data and making sure the logs on the master go far back so you can restore from several of backup generations and do point in time recovery.
Timing recovery is also important. Especially for write intensive environments it may take many days to catchup from weekly backups by binary logs so make sure to time it properly.
In the real life environments can be even more complicated – one may use partial replication, replication to different storage engine, add some tables beyond tables which are being replicated which all has to be accounted for for in the process of replication.
It is also worth to note beyond these 3 main recovery scenarios there are number of other cases which you have to deal with (which often can be resolved by doing recovery be one of these 3 protocols, but you can also take as shortcut) – for example you may have master or relay binary log corruption. Master or Slave running out of space, Slave crashing (and loosing its position on master), Replication breakage (or running out of sync) due to MySQL bugs or wrong use.
Interesting enough very few people have their data recovery practices ironed out so they can answer how they would handle at least these 3 data loss cases for each of servers they have deployed. Even fewer have gone beyond theory and have tested the processes or have regular testing in place.