As my first contribution to the MySQL Performance Blog, I joined Percona at the beginning September, I chose to cover the various high-availability (HA) options available for MySQL. I have done dozen of MySQL HA related engagements while working for Sun/MySQL over the last couple of years using Heartbeat, DRBD and NDB cluster and I’ll probably be doing the same at Percona. I have built my first DRBD based HA solution nearly 10 years ago.
There is quite a lot of confusion surrounding HA solutions for MySQL, I will try to present them objectively, my goal here been not to sell any specific technology but to help people choose the right one for their needs. This post is first of a series, I don’t yet know how many I will write in the series.
Before we start, it must be stated that high-availability is not only a matter of technical solutions, good management practices covering monitoring, alerting, security and documentation are also needed to insure a successful solution. In other words, no solution is fool proof, if a high-availability solution is running in recovery mode for months without nobody caring about it, the risk of a complete failure is much higher.
In order to all be on the same page, I will first give some definitions of the key terms. I don’t pretend those definitions are perfect but let’s build on them.
High-Availability
Let’s first define what is meant by high-availability. The most general definition would be that a high-availability setup is special computer architecture designed to improve the availability of a computer service, like a MySQL database. High-availability, HA for short, introduces a wealth of peculiar concepts, we will first review the main ones.
Uptime/Downtime
Uptime means the service is available even if degraded as long as it is above some defined performance threshold. Downtime means the opposite, either the service is completely down or unresponsive according to the defined performance threshold. In many cases, people don’t define a performance threshold, it is basically the service monitoring frequency and timeout that fix it.
Level of Availability
The level of availability is basically the guaranteed percentage of uptime you will get over a year. It has always been a subject of debate and it is something hard to evaluate since, most of the time, the samples are small and all the conditions of the deployments are not easily controlled. See the level of availability as the availability you, as the operator of the service, can promise in case of a worse case scenario. For example, 98% availability means a downtime of a little more than 7 days per year. The cost is approximately an exponential function of the level of availability and has to be compared with the downtime cost. If an HA setup with a level of availability of 99% is fairly simple and affordable, moving to 99.9% and 99.99% can be much more expensive and complex. Also, you need to consider the environmental factors. If your ISP cannot guarantee you a level of availability of more then 99.9% for the Internet access, it is useless to go beyond that no matter the importance of the application.
Single point of failure (SPOF)
Single points of failure are the things you are looking to remove when you build an HA solution. Basically, they are the devices/things that if they are not available, the service is down. A data center can be considered a SPOF at a high enough level of availability. Usually the more SPOFs you consider, the higher the availability of your solution and the higher its cost.
Recovery/failover
Recovery (or failover) is the process by how a HA setup recovers from a failure. During the recovery time, the service is down. With the most simple solutions, it can be a manual process but most of the time, it is automatic. Also, there is a time associated with the recovery. If a failure happened during the night and the operator is only available from 8am to 5pm then, you might have a recovery time of more than 12 hours. The more complex solutions have automatic recovery and do not need human intervention. Once again, although they are some exceptions, faster and automatic recovery usually means higher costs.
Cluster
A bunch of servers used for the same task. In our case, dedicated to high availability of the MySQL database service.
With theses common definitions, we will then be able to move to the second step, the questions.
@Yves Trudeau – I like the way that you pointed out how level of availability is always subject to debate, I couldn’t agree more.
I wonder if you could comment on unscheduled versus scheduled maintenance? I’ve seen some technology providers claim high “nines” support but they neglect the ability to easily do online upgrades.
I’ve always found the “Level of Availability” a fairly useless measurement. As more “9”s are tacked onto the end, it just makes it feel even less quantitative and more marketesque. An actual “7 days per year” sound much more analytical and believable.
Can I ask why you leave SUN? And what do you think about the future of mysql at sun/oracle and outside?
Looking forward to more in depth articles on the subject. It will be nice to read from without the Sun/Mysql(Soon to be Oracle) marketing fluff in the way. All of their case studies and webinars, while emphasizing the benefits of common Mysql HA Setups fail to address the drawbacks.
With a lack of snmp support in mysql, monitoring performance of clustered servers is quite a task. Even if we have several mysql replication servers, when the master db fails, I don’t understand how the slave servers continue to operate.
One other term that warrants much love is that of failback – which depending on the setup can either go smoothly or be incredibly painful.
@Morgan Tocker, the availability should count the online upgrade operations. That’s one of the great thing about NDB Cluster, all the system can be upgraded without down time.
@Giulio, it is mainly a question of being comfortable to work a given company. Oracle is a very successful company but it also has values that I am not ready to be associated with.
@Bill, They are coming… I have a few days off after being on site for three weeks, I should publish the next one this week.
@Travis, I am not sure I am following you. Monitoring servers is fairly easy with Cacti or even through SNMP and custom scripts (Maybe I should blog about this…) If the slave has a consistent database image why could it not be able to continue to operate? I mean, the write traffic will of course needed to be redirected to it.
@Imran, You are completely right, failback can be a real pain and I should highlight that point more.
@Yves Trudeau
I was referring to mysql not being easy to monitor via snmp, at least not for us less skilled folks. MRTG makes nice work out of the box for cpu, memory, etc. Making changes via my.cnf is definately a trial by wait process. I modified my innodb log file size today from 25M to 128M and my disk IO decreased very nicely. I obtained that from your other blog 🙂
Mucho thanks!
Yves,
Thanks for great post. The downtime is indeed tricky measurement. The planned vs Unplanned downtime and online upgrades of software or schema changes is one issue. The measurement interval is another issue – there is the difference between couple of min downtime every day vs going down for a day once in a several years. Basically 99.9% uptime every day of the year is very different from 99.9% measured across the full year.
Finally I wish it always would be black and white with systems up or down – in a lot of cases we can see system being degraded, acting funny and being unable to perform some of the functions with required performance but not totally down.