July 22, 2014

Redundant Array of Inexpensive Servers

So you need to design a highly available MySQL-powered system. How do you approach that?
Too often I see this question approached by focusing on expensive hardware which in theory should be reliable. And this really can work quite well for small systems. In my experience, with quality commodity hardware (Dell, HP, IBM etc.) you would see a box failing once per couple of years of uptime, which is enough to maintain the level of availability needed by many small systems. In fact, such systems typically have an order of magnitude more availability issues caused by their own software bugs, DoS attacks and other problems.

However, as your system grows the reliability goes down. If you have 100 servers with each failing every 2 years, this is about one server failure a week, which is bad. If you’re into thousands and tens of thousands of servers, server failures become commonplace, so it is important to make sure a failing server does not affect your system and also that you can recover from a server failure easily.
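The arithmetic behind "about a server a week" can be checked with a small sketch. It assumes failures are independent and spread evenly over time, which is a simplification, and the function name and parameters are made up for illustration.

```python
def expected_failures_per_week(servers, years_between_failures):
    """Expected server failures per week across a fleet.

    Assumes each server fails on average once every
    `years_between_failures` years, independently of the others.
    """
    weeks_per_year = 52
    failures_per_year = servers / years_between_failures
    return failures_per_year / weeks_per_year


# Fleet sizes are illustrative; the 2-year interval comes from the text.
for fleet in (1, 100, 1000, 10000):
    rate = expected_failures_per_week(fleet, 2)
    print(f"{fleet:>6} servers: {rate:.2f} failures per week")
```

With 100 servers this comes out just under one failure a week, and the rate grows linearly with the fleet size.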

So you should assume every component in the system can fail (be it a server, switch, router, cable, SAN etc.) and be ready to deal with this. It does not mean you always have to stay fully operational after any failure, but at least you should understand the risks. For example, you may choose to keep a single Cisco router because it has its own internal high availability on the component level which makes it extremely unlikely to fail, because you have a 4 hour onsite repair agreement, and because it is just freaking expensive. Though maybe redundant, less expensive systems could be a better choice.

I would highlight again: every component can fail, no matter how redundant it is inside. The SAN is a very good example – I’ve seen firmware glitches cause failure in a SAN which was fully redundant on the component level. And it is not only hardware components; any code may fail as well. This is actually what often makes your own code the weakest link in availability.

Depending on the failure rate you should also be thinking about automation. For frequent failures you want recovery (like getting a spare web server and putting it online) to be automatic or done with a simple manual command. For complex and rare failures you may have less automation: if a certain type of failure happens once per couple of years, then for many evolving systems there is a very high chance the old automation tools will not work well (unless, of course, you regularly test all automated failure scenarios).

So if we’re designing the system so it can tolerate hardware failures, should we bother about hardware quality at all? The answer is yes, in particular for classic database/storage systems. Few systems are designed with as much error detection and automated handling in mind as Google File System.

In particular you want to make sure error detection is at a good level. For example, if you’re running the system without ECC memory, chances are your data will get corrupted and you will not notice it for a long time (in particular if you’re using MyISAM tables). This can cause the error to propagate further in the system and make recovery much more complicated than simply swapping the server. This is exactly one of the reasons many high scale installations prefer InnoDB – it is paranoid, and this is how you want your data storage to be. This is also why Sun is so proud of the file system level checksums in ZFS.
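To illustrate what checksum-based error detection buys you, here is a minimal sketch that stores a CRC32 next to a block of data and verifies it on read. It is only an illustration of the general idea, not the actual page format or checksum algorithm used by InnoDB or ZFS; the function names and the sample data are hypothetical.

```python
import zlib


def store_page(data: bytes) -> bytes:
    """Prefix the page with a 4-byte CRC32 of its contents."""
    return zlib.crc32(data).to_bytes(4, "big") + data


def load_page(stored: bytes) -> bytes:
    """Return the page contents, raising if the checksum does not match."""
    expected = int.from_bytes(stored[:4], "big")
    data = stored[4:]
    if zlib.crc32(data) != expected:
        raise ValueError("page checksum mismatch: data is corrupted")
    return data


page = store_page(b"balance=1000")

# Simulate a single flipped bit, as faulty non-ECC memory might produce.
corrupted = bytearray(page)
corrupted[-1] ^= 0x01

print(load_page(page))  # the intact page reads back normally
try:
    load_page(bytes(corrupted))
except ValueError as err:
    print("detected:", err)
```

Without the checksum the flipped bit would be read back as valid data and could spread to replicas and backups before anyone noticed.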

What about RAID then? As strange as it may sound, you should not rely on RAID for your data safety. There are many ways to lose data on a RAID system even if you’re running RAID6 with a couple of hot spares. RAID just dramatically reduces the chance of data loss in case of hard drive failure, and this is good because recovering database servers is not fully automated in most cases. Plus there may be a system performance impact and (in particular if you use MySQL Replication for HA) the switch to the new server may not be 100% clean, with a few updates lost. RAID, especially with a BBU, also makes good sense for getting extra performance out of the box.

Some installations are using RAID0 for slaves – in these cases there are typically many slaves and recovery of the slave is very easy and causes no business impact. This is fine assuming you do the math and the performance gains or cost savings are worth it.

Another good RAID question is whether a hot spare should be used. I normally do not use one because it is a large waste, especially as most systems have an even number of drives, so if you’re looking at RAID10, setting up a hot spare costs you 2 drives. Having a hot spare does not add a lot to high availability: if you have proper RAID monitoring in place and keep a couple of spare hard drives on the shelf in the data center, we’re speaking about a couple of extra hours running in degraded mode. Even if you do not have a spare hard drive you can often pull one from a spare server and have the “warranty man” replace that one instead.

It is also a good question whether you need redundant power supplies. In my experience they rarely fail, so redundant power supplies do not increase availability that much when it comes to hardware failures, and if you look only from this angle they may be justified only for the most critical servers. Do not forget that redundant power supplies also increase server power usage a bit. Redundant power supplies are helpful, however, if you have multiple power feeds, so the server can stay up if one of the phases has a power loss. Another benefit is that a redundant power supply will often allow you to do some power work (like moving the server to a different circuit) without downtime, which may or may not be important for you.

Finally, I should mention spare components. These are paramount if you’re designing a highly available system: spare drives on the shelf, spare switches, and spare servers (which are the same as or better than the servers in production). It is important that promotion happens easily and there are no performance gotchas (e.g. an 8 core server can be slower than a 4 core one with MySQL). It is best if you just include a couple of spare servers in each purchase batch so they are absolutely the same configuration, but I know it is not always possible. Dealing with spares is yet another reason to avoid the “server zoo” and have a limited set of purchased configurations which are reviewed yearly (or on another regular interval) rather than finding a different best configuration each week.

Having spare servers also means you often do not need the most expensive support agreements, and Business Hours Monday-Friday is good enough for you: you’re not waiting on support for production anyway, you just fall back to another server and use it. Of course you can imagine cases when a problem could affect all servers of the same type, but it is not that frequently seen in practice.

To avoid multiple servers failing at the same time it is of course important to QA/stress test servers before you put load on them. I’ve seen multiple cases when something went wrong and all servers of the same configuration experienced the same problem. A proper QA/stress test reduces the chance of this, but you had better test with a load similar to what you expect in production.

The requirement to have spare hardware is also the reason why inexpensive commodity hardware is often the better choice. If you have a couple of $1M servers in production you need another $1M server as a spare, and this is expensive. If instead you have 10 pairs of $10K boxes, having a couple of spares would cost you only $20K. Plus I have found it in many cases much easier to convince “finance” people to buy something cheap which is not used most of the time than to spend a lot of money on a server which will be sitting there idle.
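The spare cost comparison can be worked through in a few lines. This sketch reads "a couple" as exactly two in both cases, which is an assumption, and the helper name is made up for illustration.

```python
def spare_overhead(production_servers, spare_servers, unit_price):
    """Return (spare cost, spare cost as a fraction of production spend)."""
    production_cost = production_servers * unit_price
    spare_cost = spare_servers * unit_price
    return spare_cost, spare_cost / production_cost


# Two $1M servers in production with one $1M spare.
big_cost, big_ratio = spare_overhead(2, 1, 1_000_000)

# Ten pairs of $10K boxes (20 servers) with two spares.
small_cost, small_ratio = spare_overhead(20, 2, 10_000)

print(f"big iron:  ${big_cost:,} in spares ({big_ratio:.0%} of production spend)")
print(f"commodity: ${small_cost:,} in spares ({small_ratio:.0%} of production spend)")
```

Under these assumptions the spare for the big servers costs half as much again as the production hardware, while the commodity spares add about a tenth.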

How many spare servers do you need? You will see it in practice. As I mentioned, at least one for each hardware class you have. If you have many failures you need more, of course. You may also decide to keep more spare systems when you can use them to help with capacity management, especially if you have multiple applications which do not share hardware but share the data center. You can use “spares” to provide extra on-demand capacity for web servers or memcached quite easily, or, say, increase the number of slaves if you have an unexpectedly high number of reports launched by users.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. One of my favorite HA techniques is simply to avoid putting important data with unimportant data, so you can isolate the important components. These can be smaller then, and thus much easier to make HA. If you do not do this, when something happens everything goes down the toilet. Database down? Oh well, there goes the click tracking! And the this, and the that, and… suddenly the emergency is REALLY an emergency, and you’re recovering a massive system instead of a little component, and it is much much harder.

  2. Does MyISAM not have checksums for on-disk data as InnoDB does? Running without on-disk checksums would be very, very awful for me.

    My crash rate has gone up to about 1 crash in 1000 to 2000 days and at least half are caused by hardware. I think this is an amazing number, but I don’t have anything to compare it to. Are numbers published for any other DBMS vendors?

  3. peter says:

    MyISAM does not have checksums by default – you can enable them but very few people do it.
    Also, because MyISAM does not have a page structure, the checksums can’t be verified during operations. Maria should have this fixed.

    1 Crash to 1000 to 2000 days is very good number indeed :)

  4. Mike says:

    Our (MySQL) failure rate is… unknown, really. The #1 cause of database downtime has been me (running destructive commands on the wrong host. Twice, a year apart…). #2 is planned outages (install bigger disk, migrate to a new host, attach to SAN, etc). #3 is datafile corruption (3 times in 1 week, 2 years ago, running MyISAM). We’ve since switched to InnoDB and haven’t seen any non-human-driven failures since then. I come from Oracle, and never had the opportunity to run instances for uptime, so I can’t compare there, but from what I’ve experienced, hardware failure is far more likely than DB failure at this point. I have to say I’m completely blown away by MySQL’s stability. Apart from the datafile corruption a couple of years ago (MyISAM under very heavy load), we haven’t seen any failures in 3+ years, but we do wind up rebooting the nodes for various reasons (company-wide power outage to upgrade municipal power feeds, for instance) about once every 12-18 months.

    That said, we currently run master->mirrors. Our primary systems are single master -> two mirrors, mostly SAN-attached. The masters are written to and read from only by the renderfarm. User connections are directed against one mirror, which can be promoted (manually) to master in a matter of a couple of minutes. The second mirror exists as a failsafe, and as the system that gets backed up a few times a day. 1400qps 24/7, 650GB, growing by about 120 GB/yr. These are commodity (Dell) servers – cheap and easy. Hell, with the starting price of EMC Clariions down in the low tens of thousands of dollars, even that becomes a no-brainer for moderate organizations.

  5. leo says:

    From my point of view, hardware failures happen much less than people think. We now have around 50 cheap servers (some of them as old as 5 years) and they rarely fail. From time to time we have disk failures, but otherwise I do not remember any hardware failure. (We manage to keep our uptime from getting too high with software glitches and configuration errors ;)

    I would tend to keep spare parts hot. If you have 10 web servers (or MySQL slaves), just add some more to have some free capacity if one server fails (and make your load balancing handle the failure). It saves you the trouble of bringing a server into production when another fails (people tend to be lazy and not rehearse their emergency plans regularly). Sure, this needs very careful capacity planning.

  6. Gil says:

    Peter, your point about having extra hardware available is a very good one. This is something I’ve always been very paranoid about, especially since we use a managed host. Our current strategy is to ensure that every piece of hardware is operating redundantly, so that even if one appliance fails we know that there is *at least* one more of the same appliance in the datacenter. Moving to a new, untested hardware model is the last thing we want to do in the event of a critical hardware failure.

    We use DRBD with a spare master to approach MySQL HA. Although many people feel like it is a waste of a perfectly good server, we simply can’t afford to wait an hour (or more) for a technician at the datacenter to realize our primary server is failing, find the problem, replace the part, wait for the machine to reboot, and so on. As with all scalability plans this one is probably not permanent, but for now it works well enough and fits within our budget.

    Mike, I’ll definitely lament with you that my #1 source of downtime is running a query that I shouldn’t have. With large data sets under high load, sometimes it can be difficult to understand the full consequences of queries that in other circumstances would not be an issue. My colleagues worry about downtime while I’m on vacation, but they should worry more when I’m at my desk tinkering :-)

  7. peter says:

    Mike thanks for sharing,

    I would really like to see more availability data shared by people who run a large number of boxes. I think you’re right about human error, and here a lot depends on automation and how uniform the system is. If you have 100 different systems it is much easier to screw up compared to running 1000 instances of “sharded” data, where you only use some toolset to do global operations (because they typically have to be done on all instances at once and because it would be a pain to do it manually).

  8. peter says:

    Leo,

    Indeed, the disk is the most typical component to fail. Speaking about slaves – indeed it is easy. It is easy to load balance transparently so you do not even notice if a slave is gone, and it is very easy to re-clone slaves automatically from another slave or the master. What people are typically concerned about is master failures :)

  9. One should also think about using semi-expensive servers.

    *inexpensive* servers implies that you’re buying the cheapest boxes imaginable.

    It’s really a “bang for your buck” computation.

    We’re using somewhat mid-level machines now, which has been yielding significant cost savings in terms of machine pricing, power, and administration costs.

    Kevin

  10. peter says:

    Kevin,

    Inexpensive does not mean cheap :) It means commodity more than cheap in this case.
    Same as with RAID – you’re not saying that because it says Inexpensive you have to fill your RAID with the cheapest SATA drives? You can quite well go with SAS drives, which are more expensive but still commodity.

    Also, the numbers matter – it may be that for your application $20K servers are the best choice, while for another it is $5K servers. As long as you have many of them and you’re not relying on all of them staying up, it is fine :)

  11. …. just to clarify … and this might be a better way of explaining things.

    If your cluster is < 50 machines you’re probably better off scaling diagonally vs scaling horizontally / vertically.

  12. peter says:

    Kevin,

    This makes sense if you happen to have “low end” boxes. There are many clusters, especially for high growth applications which got to this stage quickly, which have pretty top end commodity hardware. But I guess we understand each other.

    I would focus on price/performance for the choice; however, the price should also include manpower, power, space etc. In this case cheap low end boxes will often be a loss.

  13. peter says:

    Thank you for your feedback Gil,

    Indeed, DRBD is one of the good approaches to MySQL HA, and an especially good one if you can’t afford losing any transactions but do not mind several minutes of downtime (and some slowness while you warm up the cache) as you switch. Others choose to rely on MySQL replication (sometimes with the Google Synchronous replication patch), which has fast recovery and fewer warmup problems but has a bunch of problems of its own.

  14. gentlemoose says:

    Kevin:
    “inexpensive” does not mean “cheap”. As peter said, it’s a price/performance equation. A cluster of dell 2950s vs a sun e15k, for instance. It all depends on your workload. I can buy literally hundreds of “commodity”, supported, non-crap machines for the price of a piece of big iron and not have to worry about unplanned outages because I have the resources to failover easily.

  15. Jestep says:

    As far as bang for the buck, I love the Tyan Intel 701 servers. They have been extremely reliable in my experience, and are significantly less expensive than Dell’s, Sun’s and pretty much anything else I’ve come across. They work great with a variety of OS’s and support high-end RAID controllers. Almost entry level price, with performance to match anything.

    Something that I think a lot of people / companies overlook is power protection. I’ve seen massive clusters plugged directly into standard outlets (Gulp)… Invest in a good UPS system if for nothing else than to bring your servers down nicely, it really goes a long way to preserve hardware.

    Not sure how people keep their jobs when they lose hardware every time the lights flicker, but there’s a ton of companies doing it.
