August 23, 2014

When would you use SAN with MySQL ?

One question which comes up very often is when one should use SAN with MySQL, which is especially popular among people got used to Oracle or other Enterprise database systems which are quite commonly deployed on SAN.

My question in such case is always what exactly are you trying to get by using SAN ?

Depending on the goals the advice could be a lot different.

Manageability SAN allows you to manage storage easily, compared to directly attach storage. If you have one box which starts to require more IO you can have more resources allocated to it. This is pattern which is somewhat typical for Enterprise managing a lot of relatively small MySQL installations or managed hosting provider. For large MySQL web MySQL installations using Sharding or replication IO needs are typically rather predictable. There are also often nice monitoring tools available to see disk utilization latencies queues etc. There is however also a downside compared to directly attached storage in terms of performance management – the SAN is shared resource (has more or less shared parts) so it is possible one application to impact other which means you can’t always analyze the local system performance independently from anything else.

Performance This is perhaps the most interesting one. I constantly see SANs sold as magic equipment which magically going to solve all performance problems, with magically means nobody knows how exactly. I would suggest always to question yourself where do you expect these performance gains to come from ? If we’re speaking about purely disk based SAN (no Flash) the drives are same drives you will see in directly attached storage and each can do only so many IO requests per second. True the SAN can have much more hard drives than directly attached storage but usually also at much higher prices per drive. The next possible advantage is software – can SAN have some very smart software instead of simple RAID you use with directly attach storage which can magically improve performance ? There are cases when it can but you surely can ask this question. For example if you share same physical drives among applications which have different peak usage patterns or something similar there can be an advantage. The question is again if it is high enough to pay for the price premium ? The third one is caching. SAN can have a lot of cache though servers can typically have more. If you can afford SAN you should afford 128GB of memory or so on the server too which will make read caching more efficient in its memory while write buffering can too be done by MySQL and local RAID (with BBU) quite efficiently. It is worth to say there is a benefit of read caching on SAN – if you have MySQL/server restart you may have warmup being quite shorter than in case of local storage.

It is also worth to note SAN does not only have advantages compared to directly attached storage but also downsides – SAN typically has better throughput (because of larger amount of drives) but longer latency because extra processing (and extra trip) involved. This in particular hurts log writes which are very latency critical.

Now what is about MySQL/Innodb specifics ? First, in MySQL if you’re looking for durable transactions the log write latency is triple important compared to other database systems. MySQL has to do more than one fsync() for the log because it internally implements XA to synchronize with binary log and also if you’re looking for maximal data consistency (as SAN users often are) you better have binary log flushed on commit too. MySQL also has broken group commit (which we have partial fix for) meaning concurrent transaction commits will need to be serialized.

Second, MySQL is often not able to submit a lot of outstanding IO requests which is needed to utilize SAN efficiently. It is especially problem if you’re running replication Slave as slave thread will issue single IO request most of the time making it sensitive to latency rather than throughput. Also you should plan on any queries executed having only one IO at the time if Innodb or MyISAM tables are used. There is read-ahead functionality but usually it is not able to drive number of outstanding request significantly. For write intensive workload you will have problem with number of outstanding writes too. Though for that we have a fix at least.

Scalability We need more IO or space than we can get from 6-8 build in drives… so lets do SAN is the story here. As I mentioned in Performance section you may be surprised performance is not “scaling” as much as you expected. You should also consider external directly attached storage which is cheaper alternative in most cases and is especially good if your IO needs are predictable, like you need 3TB of space per box – just external disk enclosure does it great. My Advice for Scaling IO capacity with SAN (this applies to directly attached storage too) is to really understand what you’re trying to scale and analyze things appropriately. I’ve seen in so many times this was a poor choice as it did not allow to scale (ie attempt to scale replication) or was a poor choice, with adding memory or SSD being a better choice.

High Availability Some people are got used to using SAN based active-passive clusters for availability purposes and they look to do the same with MySQL. This works though I do not think it is the best choice. SAN is just another object which can fail completely. Sure it is more reliable, same as server with multiple power supplies ECC memory kill-chip and RAID is more reliable than without these technologies but I always in my HA architectures it is a single “point” which can experience “failure”. Note even if hardware fully redundant the Firmware may have a glitch causing failure and data corruption (and this is not just a speculation, some clients really had it)

Even if you do not run any “Cluster” having SAN you can always “connect” the storage to another node – if server fails or if you just want to upgrade to bigger box, which can be more complicated with directly attached storage.

If you want SAN like high availability I would highly consider DRBD which can do storage level replication between directly attached volumes. Now with support for Dolphin Interconnect it can also offer very decent latency. It also have a benefit of having the system which you can actually split in two (ie for recovery purposes). Sure it needs double the storage but you get 2 copies of data too and using SAN you probably pay more than 2x premium anyway.

Yet another approach is to use MySQL replication with something as MMM or Flipper. I think this actually works best for most cases (unless async nature of MySQL replication is a showstopper) and it also solves the other big pain of semi-online MySQL/OS upgrades and Schema changes.

Backups SAN has number of advantages for operations (and I guess these are the guys which often push for it). It often would offer snapshots with low overhead allowing to take a backups conveniently. Also if you keep last uncompressed backup on SAN (or last snapshot) you may be able to minimize recovery time significantly by switching to that backup – no data copy will be required which can be significant concern to 1TB+ databases.

My Take: I view SAN as solution for niche circumstances when it comes to MySQL. It may be you need it but it is not silver bullet and best solution for all problems at once. When evaluating SAN you should also evaluate external directly attached storage, SSDs, Increasing system memory as well as sharding and replication as a scaling solutions.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Thanks, Peter. It is always very nice to get your question answered as blog post – I suspected that might happen :^) especially as I’m not the only one with this question these days..

  2. Peter,

    “MySQL also has broken group commit (which we have partial fix for) meaning concurrent transaction commits will need to be serialized.”

    I would also say, not only are they serialized (i.e. not done in parallel) but they are not grouped together, so there are many more physical I/O operations than if several transactions could be grouped into one commit. At least this is how I read the original group commit code from 4.x, though I haven’t read our patch yet… shame on me.

    About DRBD, don’t we usually advise only to keep the binlog on DRBD to avoid the most significant parts of the performance penalty you get from added write latency?

  3. Arkady Smirnovsky says:

    Thanks to you Peter. It would be interesting as well to see a similar analysis of “When would you use NAS with MySQL?”

  4. peter says:

    Arkady,

    Yes I should write than. With NAS things are close though there is typically higher performance hit and one has to be very careful with NFS bugs, locks etc.

  5. Peter,

    Thanks for this post. I believe the strongest points of SAN are:
    1. It’s ability to auto-extend: thus, when running out of disk space, you can purchase yet another SAN, and let the two sort things out between them
    2. Its snapshots capability
    3. The amount of disk space you can fit in… In local disks, assuming you’re using SAS/SCSI, you are quite limited in disk space. You usually only get up to 6 slots, two of which would normally go to OS, so 4 left for data. With RAID 10 this leaves only volume of two disks. With SATA you can get more capacity per disk, but I would normally consider it as unsafe.
    With SAN devices, you get to have many slots for disks, like 16 slots just for your data.
    4. You can share the storage between different machines. Not all of them have to be MySQL machines.

    I currently work with a company who just purchased a SAN device. Benchmarks they did came up with the results that max IO per second is not too far above that of local disks – that was a bit disappointing…

  6. Shlomi,

    Whether you need so much space I guess depends on what you’re actually using MySQL for. I find that for most applications I’m involved with, MySQL is fine with 6 SAS/SCSI disks in RAID 5 or RAID 10 before it needs to scale out. Though sure, it depends and I agree with points you mentioned. But what I find surprising is quite often people dedicate 2 disks in RAID-1 just for OS while OS uses these just to boot up (like couple times a year?), maybe swap out few pages once in a while and write a line of system log a second (which could well go to remote syslog device for security reasons). It is maybe nice from operations standpoint i.e. you can work with RAID online – format it, move disks around, but really how often people do it on production servers? What are other benefits in that on a dedicated MySQL machine?

  7. @Aurimas,

    The reasoning behind RAID 1 for OS is not performance at all, it’s just high availability. The reasoning says “OS is above all else and should always function”. If OS is down, so is MySQL, right?
    I usually get the smallest disks for OS (Used to be 36GB, nowdays 72GB is minimum for SAS).

    With regard to disk space: I’m working with companies who grew from 30GB to hundreds of GB within less than a year. So I guess you can never know…

    Regards

  8. Marki says:

    Entry-level SAN will probably don’t have much bigger performance than local drives. But everything “virtual” like EVA will be better performance, because it has a virtual RAID – if you have 20-200 drives installed, every LUN will be residing on all spindles. If you add new drives, leveling will move part of all LUNs also to these.
    Write performance should be comparable to local disks – writes are going only to cache, so time is: time to transfer data over fibre channel, write to cache, write to mirror cache, send ACK, time until ACK arrives to server – together should not be more than 1 ms.
    SAN should be fully redundant – 2 FC HBAs per server, 2 switches, array with connection to both switches (these are isolated different SAN networks, so in case one path fails, we still have another). If you really need HA, you have cluster between datacenters with arrays in both and use software (LVM) mirroring between them. Of course redundant ISL links between datacenters, each from different provider and physically different path. But of course we talking about lot of money here, so as everywhere – it is just the equation between what we want and what we are able to pay for it.

  9. Shlomi,

    sure, OS being on separate disks will not hurt, all I say – they could help a lot if MySQL had more disks to operate on i.e. 6 rather than 4 in the case with 6 slots you mentioned with OS sleeping on 5G in the corner anyway. Also, I’m not sure I understand reasoning behind using smallest disks for OS – except for the price of course, though difference ain’t significant these days.

    Regarding “OS is above all else and should always function” – on a dedicated MySQL server I see no big difference whether OS (and MySQL) is down or only MySQL in case of say 2-disk failure in RAID-5 – in both cases MySQL service is not available and you have to physically fix the server anyway (or do some trick over KVM if it is fixable). Filesystem corruption is different case, but sure I agree MySQL is better off on different file system than OS.

    Disk space needs indeed can be very dynamic, but quite often disk space and other resource needs for MySQL-backed applications correlate while scaling and I happen to find server needs more memory or CPUs [than it can accommodate] sooner. This is just my experience and it varies a lot from application to application – indeed “you never know” ;)

    Cheers!

  10. Shlomi,

    The limitations on local disks are usually because the manufacturers want to limit you, not because it’s impossible to do. They have higher margins on SAN and other external storage devices which are not commodity items. But some manufacturers (supermicro, for example) will let you put lots of disks inside the chassis itself.

  11. @Aurimas,
    With regard to sharing OS disks with data disks – those are good points. I myself am not a sys admin expert, and so tend to rely on conservative conventions. Will ask around, though, to learn of various opinions.
    Running out of memory before running out of disk space – well, those two are related, I guess. The larger the dataset, the larger the buffer pool you would want to allocate. As for running out of CPUs – with InnoDB you don’t get much gain over 8 cores (if at all?). Maybe Falcon would be different…

    @Baron,
    I always wondered why they don’t design our normal family vehicles after cool racing cars. It’s just the bending… Everything in our world revolves around that, I suppose…
    Good to head about ‘supermicro’. Will look them up. Need to check if they bring these to my country.

    Shlomi

  12. chad says:

    Peter, Baron,
    I’ve seen a few posts here mention MMM. Do you guys generally set things up in master-master? I guess I’ve always set things up in Master-Slave(*x), I’d worry that we’d run out of read capacity with just two nodes.

  13. peter says:

    Marki,

    I’m comparing the same number of drives. Note if you have SAN with 200 drives you will likely have multiple servers compared to it which all will share resources. For example it could be compared to 10 servers with shelf of 20 drives compared to it. Indeed because the servers may have spikes at different types there is the potential for the gains. With a lot of Sharded applications with MySQL the load is however very ballanced and predictable with spikes well aligned so gain is less. Also note the complication with performance analisis – Virtual volume is not predictable – sometimes you have performance of 200 drives dedicated to you and sometimes you may end up competing with 20 applications for the same.

    Now regarding write acknowledgment. Indeed you get no more than 1ms for SAN while you get something like 50 microseconds for the BBU on the local RAID – you just have PCI Express to reach the cache which is a bit shorter.

    And again you mention the cost here quite correctly. My point is SAN is rarely a path to cost efficient performance.

  14. peter says:

    Shomi,

    Regarding growth – if you use LVM you can also get another shelf of drives and so expand storage.
    Regarding drives the 8 is the classic limit these days though many offer external directly attached storage which is more expensive than just internal drives but far cheaper than SAN.

    What Baron tells about Drives is true. Check this out:
    http://www.supermicro.com/products/chassis/2U/216/SC216A-R900U.cfm

    This is 2U chassis from the inexpensive vendor which offers 24 drives

  15. @Peter, @Baron,

    Thanks a lot for the information, appreciate it. Have found distributors for the Supermicro machines. It’s pretty amazing to see this 24 drives machine.

    Peter: “if you use LVM you can also get another shelf of drives and so expand storage.” – but limited to local drives or external SCSI, right?

  16. Scott says:

    I would recommend avoiding Supermicro machines in production server enviroments. This past august I replaced 3 older super micro machines with IBM Servers. I prefer having technologies like RSA, Server raid, memory chip kill, ability to obtain parts fairly easly even after the server is 3 years old. Also with RSA you can restart the server if its hard locked as well as obtain a fairly good idea whats wrong with the server without visiting your data center. Server raid is a great tool as well, sends you notifications when something is wrong with drives or raid controller. Also we had a number of IBM servers that were purchased initally without power supplies and we had very little trouble tracking down additional power supplies for these machines even after 3 years. All i can say about supermicro after 3 years is good luck even finding a firmware update after 3 years.

    I pulled a number of supermicro servers out of production that were running fedora and i tried to replace with CentOS 4 and 5 and found that it was extremely hard to make the raid controllers on most of these machines work unlike the IBM machines. I’m not 100% sure how fedora was even installed on these machines, best guess is someone spent hours trying all kinds of experaments with drivers disk before find something that worked.

    Also i love being able to download all the firmware and drivers for the machines from one place and being able to punch in a machine type into a search box and get everything i need for the machine. If you want a server that holds lots of disk have a look at the sun storage servers that have 50 disks in them. I’m not a big fan of sun hardware due to some bad experiences with Sun V20 servers, but i like these machines. Were looking into deploying a vmware esxi solution for our App servers with NAS storage and these would be a great idea.

  17. Nils says:

    Scott:

    Usually Supermicro boxes are just commodity hardware, you don’t have to use their boards with the chassis. Also the choice of RAID Controller is usually yours to make (which is nice, given that there are at least some vendors who ship *decent* CLI tools for linux like 3ware, not just drivers). I don’t buy directly from supermicro though, I don’t know if you can even get a complete server (loaded with CPUs, disks, RAM and so on) from them. The reason to buy IBM, Dell, Sun or HP (forgot anyone?) is usually Service Level Agreements, you’ll always have the same guys to yell at when something breaks (be it disk, fan, PSU, whatever) ;)

  18. @Chad,

    Have a look at what we are doing with Tungsten Replicator (http://www.continuent.com/community/tungsten-replicator). One of the problems with multiple slaves is that you have difficulty promoting a slave to master because the other slaves are holding references to the original files in the old master’s binlog directory. We use global replication IDs which solves that problem. There are other solutions mentioned above (like DRBD); however, I think the slave promotion problem is a substantial reason for the popularity of MMM.

    Cheers, Robert

  19. JT Moree says:

    With a completely VM environment local disk is not an option. We can simulate local disk by adding a VM hard drive which sits on the san next to the OS disks. Has anyone compared VM disk to iscsi connected disk on SAN? pros, cons, performance?

Speak Your Mind

*