August 1, 2014

Data Corruption, DRBD and the story of a bug

Working with a customer, I faced a pretty nasty bug. That in itself is not a rare situation, but this particular case holds some lessons I would like to share.

The case is pretty much described in bug 55981, or in pastebin.

Everything below applies to InnoDB-plugin/XtraDB, but not to the regular InnoDB (i.e. the one in MySQL 5.0).

In short, if you use big BLOBs (TEXT, MEDIUMBLOB, etc.) that are allocated in an external segment in InnoDB, you can get your database into a trash state just by executing an update on a row with a BLOB and rolling back the transaction, twice on the same row.
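To make the shape of that sequence concrete, here is a rough sketch that only builds the SQL statements as strings. The table name, column names and BLOB size are hypothetical, and this is not the verified test case; that one is in the bug report. Run anything like it only on a throwaway instance.

```python
# Sketch: the statement sequence described above (update a row that holds a
# large BLOB, roll back, then do the same again on the same row).
# Table name, columns and BLOB size are hypothetical illustrations.

def blob_rollback_statements(table="t_blob", blob_bytes=100000, rounds=2):
    """Return the SQL statements as a list of strings; nothing is executed."""
    statements = [
        f"CREATE TABLE {table} (id INT PRIMARY KEY, payload MEDIUMBLOB) ENGINE=InnoDB",
        f"INSERT INTO {table} VALUES (1, REPEAT('a', {blob_bytes}))",
    ]
    for round_no in range(rounds):
        fill = chr(ord("b") + round_no)  # a different fill character per round
        statements += [
            "BEGIN",
            f"UPDATE {table} SET payload = REPEAT('{fill}', {blob_bytes}) WHERE id = 1",
            "ROLLBACK",
        ]
    return statements


if __name__ == "__main__":
    for stmt in blob_rollback_statements():
        print(stmt + ";")
```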

The keywords that tell you that you have hit this bug are:

Trash state means that InnoDB won’t start, and you need to use innodb_force_recovery=3 and mysqldump your data. What makes the problem even worse is that InnoDB does not report the table name, so you are pretty much blind and need to dump the whole dataset, which can be a long process.
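As a minimal sketch of that recovery path, assuming a stock my.cnf and the standard mysqldump client: the helper name, user and output path below are hypothetical, and the code only builds the config fragment and the command line without running anything.

```python
# Sketch of the recovery path described above: start mysqld with
# innodb_force_recovery=3, then dump the whole dataset with mysqldump.
# Helper name, user and output path are hypothetical.
import shlex

# Fragment to add to my.cnf before starting mysqld (remove it after the dump).
FORCE_RECOVERY_CNF = "[mysqld]\ninnodb_force_recovery = 3\n"


def full_dump_command(user="root", outfile="/backup/full_dump.sql"):
    """Build (not run) a mysqldump command line that dumps all databases."""
    return [
        "mysqldump",
        f"--user={user}",
        "--password",            # no value given: mysqldump prompts for it
        "--all-databases",       # the corrupted table is unknown, so dump everything
        f"--result-file={outfile}",
    ]


if __name__ == "__main__":
    print(FORCE_RECOVERY_CNF)
    print(" ".join(shlex.quote(arg) for arg in full_dump_command()))
```

Since the corrupted table is not reported, the sketch dumps every database rather than trying to guess which one is affected.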

This is where DRBD comes into play: if you use DRBD for HA purposes (as in the case I worked with), you are screwed, because DRBD mirrors the physical data, and MySQL keeps crashing on both instances, active and passive.

So DRBD can’t be considered a fully reliable HA solution if there is a risk that the application corrupts data by itself.

Now to bug 55981. It is filed against MySQL version 5.6, but the problem exists in MySQL 5.1.50 and below and in MySQL 5.5. The corresponding bug is 55543, which you actually can’t see (“You do not have access to bug #55543.”), because it is marked as a “Security problem”.

And I actually tend to agree that the bug can be considered a “security” one.
If you are running public hosting, or your public users can execute direct SQL statements, I strongly recommend upgrading to MySQL version 5.1.51+.

Now another interesting point: how can you be sure that 5.1.51 works?

Bug 55543 is not mentioned in the ChangeLog for 5.1.51 or for 5.1.52. However, if you look into the source code and revision history, you can see that bug 55543 is fixed in MySQL 5.1.51. I assume it is a technical problem with the ChangeLog process and it will be fixed soon, but I reported it so it is not lost.

At the end, let me reiterate my points:
- If you have BLOB/TEXT fields in your InnoDB-plugin schema, it is recommended to upgrade to 5.1.51+.
- If you provide public access to a MySQL instance (hosting provider, etc.) with InnoDB-plugin installed, it is STRONGLY recommended to upgrade to 5.1.51+.
- Review your HA schema. DRBD by itself (without additional solutions) can’t guarantee a decent level of High Availability. And just to be clear, it is not a DRBD problem: DRBD basically can’t help if there is a possibility that the application corrupts data by itself. For this case a regular Master-Slave setup (in addition to DRBD) would protect from such a problem.
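For checking a fleet of servers against the 5.1.51 threshold, a small sketch like the following may help. The function names are hypothetical, the input is assumed to be the string returned by SELECT VERSION(), and only the 5.1 series is judged because no fixed release is named here for 5.5. It also cannot tell whether InnoDB-plugin is actually in use.

```python
# Sketch: flag 5.1.x servers that are older than the 5.1.51 release named above.
# Function names are hypothetical; the input is assumed to look like the
# output of SELECT VERSION(), e.g. "5.1.50-log".
import re

FIXED_IN_5_1 = (5, 1, 51)


def parse_version(version_string):
    """Extract (major, minor, patch) from a MySQL version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version_string):
    """True or False for the 5.1 series; None when the series is not covered."""
    version = parse_version(version_string)
    if version[:2] != (5, 1):
        return None  # no fixed release is given for other series
    return version < FIXED_IN_5_1


if __name__ == "__main__":
    for candidate in ("5.1.50-log", "5.1.51", "5.5.6-rc"):
        print(candidate, needs_upgrade(candidate))
```

Comparing integer tuples rather than raw strings avoids the trap where "5.1.9" would sort after "5.1.51".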

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. “DRBD can’t be considered fully reliable HA solution if there is risk that application corrupts data by itself.”

    In other news, DRBD does not replace backup (RAID doesn’t either). And, when is there ever no (I mean zero) risk that an application corrupts data?

  2. Vadim says:

    Florian,

    It is not about backup. In this particular case the system had downtime for a couple of hours, and DRBD had the role of the primary HA solution.

  3. peter says:

    Florian, Vadim,

    I think this is an interesting point. When designing an HA system, one has to choose when the system can survive a failure with short downtime (like switching to a standby) and when longer downtime, such as recovery from backup, is needed.

    Some people need to ensure that even database/OS/FS bugs that cause corruption and are replicated by DRBD do not cause downtime. For these guys DRBD alone is not a solution, so they may use DRBD+Replication, for example.

    On the other hand, there are some people who only want to protect from disk failures… RAID is good enough for them, probably with streaming binary logs. And if the whole box melts down… one can resort to a long recovery from backups.

  4. Patrick Casey says:

    Most of our production servers have both HA via a replication slave and more traditional backups.

    I can count on one hand the number of times I’ve needed to go to the backups, but they’ve saved my bacon each time:

    - App bug issues a drop database on master. Slave happily replicates. Backup time.
    - Customer deletes a bunch of data they actually need. Slave happily replicates. Backup time.
    - Variants on the same.

    An HA setup protects you from hardware failures on the primary.
    Backups protect you when something bad happens to your data.

    Ironically, the better something is as an HA solution (meaning it’s realtime or near realtime), the less value it has as a backup solution, since any kind of logical error instantly shows up there too.

  5. I don’t consider this a failing of DRBD in the slightest. DRBD is a very valuable *component* of an HA system. It’s for this reason that we always deploy a replicant node as part of an HA cluster. DRBD protects at the block device level. Replication protects at the application level. Both are necessary but not sufficient.

  6. Vadim says:

    Jason,

    That is my point. If you are looking for a better HA solution, you may consider both DRBD and replication (however, it is an additional cost).
