April 16, 2014

Lost innodb tables, xfs and binary grep

Before I start a story about the data recovery case I worked on yesterday, here’s a quick tip – having a database backup does not mean you can restore from it. Always verify your backup can be used to restore the database! If not automatically, do this manually, at least once a month. No, seriously – in most of the recovery cases I worked on, customers did have some sort of backup, but it just wasn’t working, complete and what not. Someone set it up and never bothered to check if it still works after a while.

Anyway, this post is not really about the backups but rather about few interesting things I learned during last recovery case.

First, some facts about the system and how data was lost:

  • MySQL had a dedicated partition on XFS file system
  • Server was running innodb_file_per_table
  • There was a production master and two slaves, all had same setting
  • Developer accidentally ran DROP DATABASE X on the wrong machine (production master)
  • All slaves followed and dropped their copy of the data
  • The important tables were all InnoDB
  • Having a backup, customer has first attempted to restore from backup on the production master

Luckily (or rather, unfortunately) backup only had table definitions but not the data so no data was written to file system. Mind however that restoring a backup could have been fatal if it would have written some junk data as that would have overwritten the deleted files. Now, here’s what I learned while working on this case:

Recovering from XFS is possible. Just a month ago we had a team meeting in Mallorca where we went through various data loss scenarios. One of them was deleted files on xfs – we all agreed on few things:

  • recovering files from xfs is hard, if at all possible
  • we had no recovery cases on xfs, most likely because:
  • whoever is using xfs, is smart enough to have backups set up properly

Now I’m not picking on the customer or anything – indeed they did have a backup set up, it’s just that some (most important) tables weren’t backed up. We did not try any of the file recovery tools for xfs – apparently they are all targeting specific file types and sure enough InnoDB is not one of the supported files. What we did is we simply ran page_parser on the (already) unmounted file system treating it as a raw device. I was surprised how amazingly simple and fast it was (did you know that latest version of page_parser identifies pages by infimum and supremum records?) – 10G partition was scanned in like 5 minutes and all 4G of innodb pages were successfully written to a separate partition. That’s the easy part though – you run page parser, wait and see what you get.

If InnoDB Data Dictionary was not overwritten by an attempt to restore from the backup, actually second part would’ve been quite easy too, but it was so I could no longer identify correct PK id for specific tables by just mapping data dictionary table records to index records. Instead I had to grep for specific character sequences against all pages. Note however that only works for text in uncompressed text columns (varchar, char, text) but what if tables don’t have any text columns at all? Then, you read further.

GNU grep won’t match binary strings. This isn’t new, I kind of knew grep couldn’t look for binary “junk”, but I really needed it to. Why? Well, here’s few of the scenarios we’ve gone through yesterday:

1. There was this rather big table with integer and enum columns only, where we knew a rather unique PK, well something like 837492636 so we needed a way to find pages that match it. InnoDB would internally store integers in 4-bytes rather than 10 bytes if it were stored as a sequence of characters, so “grep -r 837492636 /dir” would not have worked.

2. There was another table, a small one with 4 smallint columns where all we could match on was a sequence of numbers from a single record – customer knew that there was at least one row with the following sequence: 7, 3, 7, 8. Matching by any of the numbers would be insane as it would match all of the pages while matching on numbers as a sequence of characters would not work for many reasons.

This is where I found bgrep which was exactly the tool for the task. In the case number one, I have just converted number 837492636 to it’s binary representation 0x31EB1F9C and ran “bgrep 31EB1F9C /dir” – there were only like 10 other matches across the 4 gigabytes of pages, some of them probably from the secondary pages, but when you only have that many pages to check it’s really simple.

Second case seemed somewhat complicated, but it really wasn’t – all of the columns were fixed size – 2bytes each, so the thing we had to look for was this sequence: 0007000300070008. I was expecting a lot of mismatches but in fact I ended up with only one match pointing exactly to the right page and so the right index id.

The other thing I would note about bgrep – it was so much faster than matching text using grep, so if you happen to have a lot of data to scan and you have to choose between matching text and number, matching a number using bgrep may work much better.

We are considering shipping bgrep as part of percona recovery toolset, with some additional converters so we can match against various date/time columns as well.

About Aurimas Mikalauskas

Aurimas joined Percona in 2006, a few months after Peter and Vadim founded the company. His primary focus is on high performance, but he also specializes in full text search, high availability, content caching techniques and MySQL data recovery.

Comments

  1. Using XtraBackup on NFS for MySQL backups | MySQL Performance Blog says:

    [...] are completely ready to use. (You should test your backups even more than that — we’ve blogged before about that. Let us know if you want help with [...]

  2. bob says:

    “cat -v” can sometimes help when trying to grep binary files

  3. Bob, I’m not sure you could have used cat in this case – I don’t think it would represent numbers in a way you can grep then. You could use “od” or “hexdump” however, though it’s still not as convenient and fast.

  4. LenZ says:

    Awesome work, thanks for sharing this! How much of the customer’s data were you able to recover this way?

  5. Thanks, Lenz. It is aways a tricky question, when it comes to recovery. Customers usually don’t ask us to assess that and it’s not an easy question to answer anyway, unless you know exactly how much data you should have had. Either way, I should have mentioned in the post that all of the tables customer asked for were recovered (recovery is done manually table by table) and the amount of data recovered was, at least how it seemed to the customer, correct. There is a chance some of data was overwritten but it was negligible (if anything at all) compared to what was restored.

  6. John Laur says:

    The method I use to verify that my production backups are working is to restore them to our staging and development database servers as part of the normal course of test and dev. We use a complex environment with NDB, mixed storage engines, and replication. Making the restore process integral to some part of day-to-day work ensures that if it stops working it gets noticed very quickly.

  7. John Laur says:

    Oh yeah we store all of it on xfs too so there you have some more corroboration for your point “whoever is using xfs, is smart enough to have backups set up properly”

    Personally I wouldn’t even bother attempting to recover from xfs; I did a recovery on reiserfs with tailing once and that was enough for me! Kudos to you guys.

  8. John, congratulations — your recovery testing is not typical, sadly. I think testing once a month is way too seldom. I would not go longer than a week. Especially for a fast-growing app, when data size might grow a lot from day to day. See comments on Morgan’s previous post: http://www.mysqlperformanceblog.com/2010/11/08/an-argument-for-not-using-mysqldump-in-production/

    My least favorite experience with reiserfs was after a power outage. Afterwards, the server ran fsck all day and all night for a fairly small volume — probably 100GB or less, as I recall. In the morning, the filesystem had nothing on it; fsck had moved everything to the “I don’t know what this data is, here’s a snippet” directory. That soured me on reiserfs. A good filesystem should be able to recover from something like that, and reasonably fast.

  9. Baron, sure – by “at least once a month” I only meant that once a month is much better than never and if you (seem to) not have time to implement something smart around that like John has, having that done manually every now and then still leaves a great chance of catching a problem early on. YMMV though :)

  10. John, that’s indeed a very nice approach to it. I’ve only seen very few customers do it that way though..

  11. Aurimas, yeah I didn’t mean to disagree with you — it’s hard enough to convince people to test backups at all :-)

Speak Your Mind

*