April 19, 2014

How reliable RAID really is

This post is not exactly about MySQL performance, or about performance at all, but I guess it should be of interest to many MySQL DBAs and other people involved in running MySQL in production.

Recently I’ve been involved in troubleshooting a Dell PowerEdge 2850 system running RAID5 on six 300GB internal hard drives, which give about 1.4TB of usable space.

The problem started when one of the hard drives was set to the “Predicted Failure” state by “Patrol Read”, which is run automatically by the PERC4 (LSI Logic MegaRAID) controller. Dell was prompt to ship a replacement and the drive was replaced. This should have been the happy end of the story, but in reality the troubles had only begun.

After a hard drive is replaced the RAID has to be rebuilt, but the problem in this case was that the rebuild failed and took the whole logical drive down, because yet another hard drive had a bad block. The replacement drive was marked “failed” because it could not be rebuilt, and the other one because of the read failure. So my first advice would be: run a consistency check before replacing a hard drive with a predicted failure, to minimize the chance of a double drive failure in the RAID. It is good to run consistency checks on a regular basis anyway, but this final run would not hurt.
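To see why a single bad block on a second drive is enough to stop a RAID5 rebuild, here is a minimal Python sketch of XOR parity. It is only an illustration of the general RAID5 idea, not the PERC4 firmware logic; the function names `xor_blocks` and `rebuild_block` and the tiny 4-byte blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(stripe, missing):
    """Reconstruct the block at index `missing` from the rest of the stripe.

    `stripe` holds one block per drive (data and parity alike);
    None marks a block that cannot be read.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != missing]
    if any(blk is None for blk in survivors):
        # Parity can cover exactly one missing block per stripe, not two.
        raise IOError("second unreadable block in stripe: cannot rebuild")
    return xor_blocks(survivors)


# Three data blocks plus one parity block form a stripe (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]

# Drive 1 has been replaced, so its block must be recomputed from the others.
degraded = list(stripe)
degraded[1] = None
print(rebuild_block(degraded, missing=1))

# A bad block on another drive in the same stripe makes the rebuild impossible.
degraded[2] = None
try:
    rebuild_block(degraded, missing=1)
except IOError as err:
    print("rebuild failed:", err)
```

A consistency check before the swap reads every stripe while the array is still fully redundant, so a latent bad block can be found and repaired while there is still parity to repair it from.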

The next interesting thing is that there is not much advice to be found in the Dell documentation about handling a RAID with two failed hard drives. The impression is that it should never happen (while it does, and not as rarely as one would hope), and that if it does happen you should just go and get your backup. Restoring over 1TB of data is never fun, but in this case there was no backup, which made recovery all the more important.
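To put a rough number on “not as rarely as one would hope”, here is a back-of-the-envelope Python sketch. It assumes the rebuild has to read five surviving 300GB drives in full, and it uses unrecoverable read error (URE) rates of the kind usually quoted on drive datasheets (one error per 10^14 to 10^16 bits). Both the drive count and the rates are my assumptions, not measurements from this system, and real errors are not independent, so treat the result as an order of magnitude only.

```python
import math


def p_clean_read(bytes_to_read, ure_per_bit):
    """Probability of reading `bytes_to_read` without hitting a single URE,
    assuming every bit fails independently with probability `ure_per_bit`."""
    bits = bytes_to_read * 8
    # (1 - p) ** bits, computed through logarithms to stay numerically stable.
    return math.exp(bits * math.log1p(-ure_per_bit))


SURVIVING_DRIVES = 5   # assumption: six-drive RAID5 with one drive being rebuilt
DRIVE_BYTES = 300e9    # 300GB drives, decimal gigabytes

for rate in (1e-14, 1e-15, 1e-16):  # assumed datasheet-style URE rates
    p_fail = 1 - p_clean_read(SURVIVING_DRIVES * DRIVE_BYTES, rate)
    print(f"URE rate {rate:.0e} per bit: rebuild hits a bad sector with p = {p_fail:.1%}")
```

Under the most pessimistic of these rates the chance of a failed rebuild comes out at roughly one in nine, which is hardly “never”.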

Interestingly enough, the logical drive could be brought online and used by forcing the newly failed drive online. It probably just had a couple of bad blocks, but there was no way to resync the logical drive in this situation.

What one would like to do in such a case is to force the SCSI drive to remap those bad blocks. A couple of files could be corrupted, but that is much better than losing everything. Unfortunately neither the RAID BIOS nor the RAID tools provide such a feature.

Happily, the Dell BIOS has a little option which allows you to disable the RAID controller and access your disks as plain SCSI. Changing this option results in various scary messages such as “Data loss will occur”, but in reality you can change it back and forth; you just have to be careful and know what you’re doing.

In the SCSI BIOS there is an option to perform “Verify Media”, which can be used to scan a hard drive and remap bad blocks. After the remapping is done, RAID mode can be enabled again and the array can be rebuilt just fine. There is of course a chance that some data is corrupted, so checking the file system and the MySQL database is a good idea.

So my story had a happy ending with only minimal (yet to be discovered) data loss, but it could have been worse.

There are a few things this case is a reminder of:

Do not assume RAID is reliable. RAID is more reliable than a plain disk, and RAID6 is more reliable than RAID5, but all of them can fail, even expensive SAN systems. So if you care about your data, make sure you have a backup plan for when that happens. This is of course not to mention software bugs and user errors, which are other reasons why you want backups. Do not trust any single piece of hardware in HA scenarios.

Have backups ready. If you care about your data, backups are a must, whatever other HA methods you use.

Large data sets take time. Restoring a 1.5TB volume is likely to take hours; can you afford that? Even verifying media on a single 300GB hard drive took several hours. This could be one more reason to scale out and keep a manageable amount of storage on each node. At the very least, multiple smaller RAID volumes could be used so that rebuilding any one of them takes less time.
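The “can you afford it?” question is easy to estimate. The Python sketch below computes how long it takes to move a volume of a given size at a sustained transfer rate. The rates are purely illustrative assumptions (what you actually get depends on the backup media, network and array), and `transfer_hours` is a helper made up for this example.

```python
def transfer_hours(size_bytes, mb_per_sec):
    """Hours needed to move `size_bytes` at a sustained `mb_per_sec` (10**6 bytes/s)."""
    return size_bytes / (mb_per_sec * 1e6) / 3600


VOLUME_BYTES = 1.5e12  # the 1.5TB volume from the text

for rate in (30, 60, 100):  # assumed sustained restore rates in MB/s
    print(f"{rate:>4} MB/s: {transfer_hours(VOLUME_BYTES, rate):.1f} hours")
```

Splitting the same data across several smaller volumes does not make the total any smaller, but it does mean a single failure only costs the restore time of one volume.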

There are also a couple of ideas:

Dell: why don’t they have Verify Media, with the ability to remap bad blocks, in the RAID BIOS itself or in the RAID tools? It should not be a big deal, especially for an offline drive.

Backups with instant recovery: it could be interesting to try to integrate DRBD with LVM so that a snapshot could be taken and synchronized over the network as a backup. If quick recovery is needed, the snapshot could be attached over the network and operations started while it is gradually restored in the background to the local volume. Local networks are fast these days, so it could perform very well.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Matthew says:

    I know your pain. We run twice monthly verifies on our arrays. Of course this slows the array down so we do it on the inactive side of our redundant pairs of machines. And of course in addition to the redundant machines we have backups. We’ve had to use them on occasion too.

    If you don’t already know about the megarc tool then you should check it out. It’ll let you run a verify from the command line instead of the dellmgr/megamgr gui.

  2. If you can get away with it you can just have a redundant array of inexpensive database servers.

    For the price of a RAID card you can buy another cheap server. If you can load balance SELECTs across the boxes and have few writes you can get RAID performance and reliability with numerous cheap MySQL boxes.

    Commodity hardware is cheap and modern disks are pretty damn fast if you don’t need ONE box to exec all your queries.

  3. With RAID10, you lose half the physical disk space to mirroring, but it would have survived up to three disk failures, provided none of those failures were on the same submirror.

    At the very least, you should consider setting aside one disk as a hotspare to reduce the time window of having a second disk fail while the RAID5 array is degraded.

  4. Apachez says:

    2. Kevin: That’s basically what Google does :P Having the RAID at the machine level instead of the hard drive level.

  5. Brice says:

    Join BAARF (http://www.miracleas.com/BAARF/BAARF2.html) :-)

    It’s an association of knowledgeable sysadmins who won’t ever use RAID5 (or 3 or 4) on a production system again…

    RAID5 alone is dangerous. You can mitigate the risk with a hot spare drive, but frankly, RAID10 is really better (for a lot of good reasons: http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt)
    Disks are cheap nowadays…

  6. peter says:

    Matthew,

    Thanks for your hints. Yes, running a consistency check on a regular basis is surely a good idea. What is interesting is that in the Dell/LSI docs “Patrol Read” is positioned as a lower-overhead alternative to a consistency check. What I’ve found, however, is that it does not really catch errors well enough (as in this case), plus it has some strange performance problems: in certain cases I’ve seen it slow the array down to probably 20% of its capacity for 20-30 minutes. It could be a bug, but Dell just told us to disable Patrol Read.

  7. peter says:

    Kevin,

    Yes, of course you should have multiple servers; that is much better for high availability, especially as RAID is one of the components which fails. However, building a “Google style” system from a lot of crappy hardware might not always be the best solution. There are a lot of things to consider: power requirements, which are often the main cost factor in colocation environments; maintenance, since recovering an “inexpensive” box might be pretty expensive, especially if it is installed in a remote location; not to mention the various weird problems you need to be ready for, such as database corruption due to bad memory or cooling, which can sometimes be replicated, so even replication might not save you.

    So my choice is normally to have decent boxes for MySQL servers: pretty commodity ones, not high end but reliable. This applies especially to small and medium companies, where there is no time to implement the management infrastructure which would allow broken servers to be replaced cheaply and transparently.

  8. peter says:

    Vince,

    You’re right, with RAID10 the probability of losing data is lower, even though it can also fail by losing a second hard drive if both drives come from the same mirror pair. That means you should still use a hot spare to minimize that window.

    A hot spare would not have helped in this case, however. The first drive did not fail but was manually replaced, so it just started to resync as a hot spare would. The resync then failed due to the bad block on the second drive.

    In general the point is that there is data of different levels of importance, and different decisions come from that. Of course it is good to have data on RAID10 with a spare disk, to have redundant servers and of course a backup from which you can do point-in-time recovery. That is not always the case in real systems, however.

  9. peter says:

    Brice,

    Nice to find you linking to this document. I was sending many customers to read it.

    I also generally recommend using RAID10 but I avoid saying it is always must. Each case is unique in practice.

    Cost would often be the reason, but not the only one. In some tests I’ve done I found RAID5 to perform _faster_ than or close to RAID10, on the PowerEdge 2850 for example. You can see the benchmark data in the “Performance Landscape” presentation from this site. Yes, this does not make much sense, but I’m not the only person to observe this behavior. Here are some benchmarks for SQL Server:
    http://www.developersdex.com/sql/message.asp?p=580&r=4986921

    My feeling is that Dell or LSI just spent more time optimizing RAID5, or that there are some severe performance bugs in the RAID10 implementation, as I do not see any physical reason why this would be happening.

    Another reason could be that certain hardware might not have proper RAID10. Sometimes what is named RAID10 is implemented as concatenated RAID1 (especially in some older models), which would have pretty bad performance in many cases.

    Besides hardware limitations and bugs, I can see RAID5 being used in a replicated environment with a low volume of writes. It gives a certain security for the slaves, so you do not have to reclone them whenever a drive fails. It also means you can promote a slave to master without running the master on insecure storage.

  10. Chris says:

    Too bad that it isn’t possible to add a drive to a raid 5 array, and then say that the new drive will be replacing another drive, after which it should sync to the new drive while keeping redundancy during rebuild.

    That way if a bad block is encountered, the data could be resolved from the redundancy.
    Also, before marking a drive with a bad block as failed, it should try to write the data back to the disk with the bad block.
    This should cause the disk to try and remap the bad block.

    RAID 6 should be capable of this already, but i’m very doubtful that many controllers handle bad blocks in this way.

  11. Ryan says:

    I just wanted to comment on a well written article.
    I’ve seen people rant and rave about a “double fault” or “double drive failure” in the past, but in all honesty most of them don’t understand exactly what is going on and why the RAID array goes to a failed state once the replacement drive starts rebuilding.

    I recommend a consistency check every month.
    RAID is absolutely NO substitute for a backup. I think of it as a convenience; that is all.
    Not that it matters, but I do work for Dell as L2 support.
    Backups people! :D

  12. Paul Meiners says:

    From my understanding a consistency check only checks the areas of the RAID drives which contain data, omitting the free space. The issue is that multiple physical media errors can build up in the unused areas of the RAID disks and only be found during a rebuild, at which point the RAID controller cannot handle the number of errors.
    This is where Patrol Read comes in. Patrol Read checks the entire drive for media errors, so disabling it is not a good idea. What I do is disable the automatic runs and use a bat file from Task Scheduler to run it manually off hours, weekly or bi-weekly.

  13. Paul,

    A consistency check checks the full media. In many cases “checking” and “rebuilding” are essentially the same process: all reads are performed and data is updated if any difference is spotted.
