For my first contribution to the MySQL Performance Blog (I joined Percona at the beginning of September), I chose to cover the various high-availability (HA) options available for MySQL.  I have done dozens of MySQL HA related engagements while working for Sun/MySQL over the last couple of years, using Heartbeat, DRBD and NDB Cluster, and I’ll probably be doing the same at Percona.  I built my first DRBD-based HA solution nearly 10 years ago.

There is quite a lot of confusion surrounding HA solutions for MySQL, so I will try to present them objectively; my goal here is not to sell any specific technology but to help people choose the right one for their needs.  This post is the first of a series, and I don’t yet know how many posts the series will contain.

Before we start, it must be stated that high-availability is not only a matter of technical solutions; good management practices covering monitoring, alerting, security and documentation are also needed to ensure a successful solution. In other words, no solution is foolproof: if a high-availability solution is running in recovery mode for months without anybody caring about it, the risk of a complete failure is much higher.

So that we are all on the same page, I will first give some definitions of the key terms.  I don’t pretend those definitions are perfect, but let’s build on them.


High-Availability

Let’s first define what is meant by high-availability.  The most general definition would be that a high-availability setup is a special computer architecture designed to improve the availability of a computer service, like a MySQL database.  High-availability, HA for short, introduces a number of specific concepts; we will first review the main ones.

Uptime/Downtime

Uptime means the service is available, even if degraded, as long as it performs above some defined performance threshold. Downtime means the opposite: the service is either completely down or unresponsive according to the defined performance threshold.  In many cases, people don’t define a performance threshold explicitly; it is basically the service monitoring frequency and timeout that define it.
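
As an illustration of that last point, the monitoring settings bound how quickly an outage can even be noticed; the check interval and timeout below are made-up values just for the sake of the example:

```python
# Hypothetical monitoring configuration: a service check every 60 seconds
# that is declared failed after a 10 second timeout.
CHECK_INTERVAL_S = 60
CHECK_TIMEOUT_S = 10

# Worst case: the service dies right after a successful check, so the failure
# is only noticed once the next check has started and then timed out.
worst_case_detection_s = CHECK_INTERVAL_S + CHECK_TIMEOUT_S
print(f"An outage can go unnoticed for up to {worst_case_detection_s} seconds")
```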

Level of Availability

The level of availability is basically the guaranteed percentage of uptime you will get over a year.  It has always been a subject of debate, and it is hard to evaluate since, most of the time, the samples are small and all the conditions of the deployments are not easily controlled. See the level of availability as the availability you, as the operator of the service, can promise in a worst-case scenario. For example, 98% availability means a downtime of a little more than 7 days per year.  The cost is approximately an exponential function of the level of availability and has to be compared with the cost of downtime. While an HA setup with a level of availability of 99% is fairly simple and affordable, moving to 99.9% and 99.99% can be much more expensive and complex. Also, you need to consider the environmental factors.  If your ISP cannot guarantee a level of availability of more than 99.9% for the Internet access, it is useless to go beyond that no matter how important the application is.
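
To make the numbers concrete, here is the usual back-of-the-envelope calculation, sketched in Python, that converts a level of availability into the downtime it allows per year:

```python
def downtime_per_year_hours(availability_pct):
    """Yearly allowed downtime, in hours, for a given availability percentage."""
    hours_per_year = 365 * 24  # 8760 hours
    return hours_per_year * (1 - availability_pct / 100)

for pct in (98.0, 99.0, 99.9, 99.99):
    print(f"{pct:>6}% availability -> {downtime_per_year_hours(pct):6.1f} h/year of downtime")

# 98.0%  -> 175.2 h (a little more than 7 days)
# 99.0%  ->  87.6 h
# 99.9%  ->   8.8 h
# 99.99% ->   0.9 h (about 53 minutes)
```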

Single point of failure (SPOF)

Single points of failure are the things you are looking to remove when you build an HA solution.  Basically, they are the devices or components that, if unavailable, take the service down.  A data center can be considered a SPOF at a high enough level of availability.  Usually, the more SPOFs you address, the higher the availability of your solution and the higher its cost.

Recovery/failover

Recovery (or failover) is the process by which an HA setup recovers from a failure. During the recovery time, the service is down.  With the simplest solutions, it can be a manual process, but most of the time it is automatic.  There is also a time associated with the recovery.  If a failure happens during the night and the operator is only available from 8am to 5pm, you might have a recovery time of more than 12 hours.  The more complex solutions have automatic recovery and do not need human intervention.  Once again, although there are some exceptions, faster and automatic recovery usually means higher costs.
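
Recovery time feeds directly back into the level of availability. The sketch below uses the standard steady-state availability relation, MTBF / (MTBF + MTTR), with made-up failure and recovery figures just to show the effect of faster recovery:

```python
def availability_pct(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to recover (MTTR)."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 4380  # hypothetical: roughly one failure every six months
for mttr in (12, 1, 0.1):  # overnight manual recovery vs. quick manual vs. automatic failover
    print(f"MTTR {mttr:>5} h -> {availability_pct(mtbf, mttr):.3f}% availability")

# 12 h  -> 99.727%
# 1 h   -> 99.977%
# 0.1 h -> 99.998%
```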

Cluster

A group of servers used for the same task; in our case, dedicated to the high availability of the MySQL database service.

With these common definitions in place, we will be able to move on to the second step: the questions.

9 Comments
Morgan Tocker

I like the way you pointed out that the level of availability is always subject to debate; I couldn’t agree more.

I wonder if you could comment on unscheduled versus scheduled maintenance? I’ve seen some technology providers claim high “nines” support but they neglect the ability to easily do online upgrades.

Darius Jahandarie

I’ve always found the “Level of Availability” a fairly useless measurement. As more “9”s are tacked onto the end, it just makes it feel even less quantitative and more marketesque. An actual “7 days per year” sounds much more analytical and believable.

Giulio

Can I ask why you left Sun? And what do you think about the future of MySQL at Sun/Oracle and outside?

Bill

Looking forward to more in-depth articles on the subject. It will be nice to read about it without the Sun/MySQL (soon to be Oracle) marketing fluff in the way. All of their case studies and webinars, while emphasizing the benefits of common MySQL HA setups, fail to address the drawbacks.

Travis Sullivan

With the lack of SNMP support in MySQL, monitoring the performance of clustered servers is quite a task. Even if we have several MySQL replication servers, when the master DB fails, I don’t understand how the slave servers continue to operate.

imran

One other term that warrants much love is that of failback – which depending on the setup can either go smoothly or be incredibly painful.

Travis


I was referring to MySQL not being easy to monitor via SNMP, at least not for us less skilled folks. MRTG does a nice job out of the box for CPU, memory, etc. Making changes via my.cnf is definitely a trial-by-wait process. I modified my InnoDB log file size today from 25M to 128M and my disk IO decreased very nicely. I obtained that from your other blog 🙂
Mucho thanks!

Peter Zaitsev

Yves,

Thanks for a great post. Downtime is indeed a tricky measurement. Planned vs. unplanned downtime, and online upgrades of software or schema changes, are one issue. The measurement interval is another: there is a difference between a couple of minutes of downtime every day and going down for a day once in several years. Basically, 99.9% uptime every day of the year is very different from 99.9% measured across the full year.

Finally, I wish it were always black and white, with systems either up or down; in a lot of cases we see a system being degraded, acting funny and unable to perform some of its functions at the required performance, yet not totally down.