April 19, 2014

Living with backups

Everyone does backups. Usually it’s some nightly batch job that just dumps all MySQL tables into a text file or ordinarily copies the binary files from the data directory to a safe location. Obviously both ways involve much more complex operations than it would seem by my last sentence, but it is not important right now. Either way the data is out and ready to save someone’s life (or job at least). Unfortunately taking backup does not come free of any cost. On the contrary, it’s more like doing very heavy queries against each table in the database when mysqldump is used or reading a lot of data when copying physical files, so the price may actually be rather high. And the more effectively the server resources are utilized, the more that becomes a problem.

What happens when you try to get all the data?

The most obvious answer is that it needs to be read, through I/O requests, from a storage that it resides on. The storage is handling reads issued by the system, but only at an extremely limited rate. So when a task is reading a lot of data very quickly, just as the archiving process does when it runs, it causes a huge number of requests being pushed down to the disks and saturating the I/O quite easily. Naturally at the same time the database needs to perform all those regular tasks like serving queries, using and competing for the very same disks to read or write whatever comes so that your favorite website can still show up in the browser. Moreover, reads sent from backup process usually want many sequential blocks of data and such access pattern may be preferred by the I/O scheduler over random I/O coming from MySQL, but also such large I/O requests take significant time to complete and the way typical disks work prevents anything else being executed in the mean time. And so database often needs to wait much longer until disk operations are scheduled and executed which converts into slower queries execution and significantly degraded performance.

Anything else?

All modern systems usually use caching of whatever is being read from a storage. It allows to reduce I/O to such devices on frequently accessed information. After a successful read the block of data is placed in the cache and then served only from the memory should anything ask for the same block again. That happens for as long as it does not get flushed. MySQL obviously takes advantage of this functionality just as any other application and this is especially true for MyISAM tables which have dedicated buffers only to store indexes, while the actual data is always read from disk. The active portions of tables will likely be placed by the system in memory and kept there for a long time. Since memory access is way faster than any disk access, even with the fastest drives or RAID configurations, the performance gains are quite clear. Now going back to flushing the data out of the cache. It happens by replacing old and unused blocks with the newly read ones. And so the more new blocks come, the more older ones need to go away. Just imagine what happens to all the cached data during backup run when the process is reading several times more information than there is physical memory installed on the server – it’s not difficult to have a database of such size. Everything is wiped out and replaced by random “garbage” for no good reason. Since the hit ratio becomes worse as the cache is being filled with random information additional I/O occur slowing everything down even further.

What does it all mean?

There is a clear conflict between the regular database activities, which need fast response times, and doing backups which would gladly act as in all-you-can-eat bar. With a busy database server it may turn out that running a simple copy operation from MySQL data directory will result in a total disaster after MySQL stops responding to the incoming queries quick enough.

What can be done about it?

Nothing really when you are using mysqldump. You can play some tricks, but it’s mostly beyond your control.

 

Otherwise, when copying physical data files, in order not to saturate the I/O you can limit the rate at which data is being read. This is unfortunately not something you can do with standard Linux tools like cp, scp or tar. However for example rsync has the possibility to do that with –bwlimit=KBPS option. The problem with rsync is that it needs to build a list of files to transfer before it can take any action and this operation alone is often quite heavy on I/O and is not subject to any limits.

Some time ago we prepared a patch for tar that implements –read-rate=BytesPerSecond. In this case the advantage from using tar over rsync is that you can immediately, on the fly, create a compressed archive. For example:

This will read /mnt/snapshot/mysql directory at 15000000 bytes/s creating a TAR/GZ archive out of it and printing it out to the standard output, which is then is redirected through the pipe over SSH to a remote host where a script reads the standard input and writes the archive into a proper location (where would we be without one-liners)

 

On Linux there is also a utility called ionice. It allows to affect how I/O scheduler will be dealing with I/O request coming from a certain process. Giving the backup application low class or priority will cause it won’t be getting in the way of the database work so much.

Please cosult ionice man page for usage details, it’s really very simple to grasp and use. The restriction for this application to work is that the system must be using CFQ elevator algorithm, it does not work with others. But that is not really a problem since most modern systems run on CFQ by default and even if not, you can change in at runtime anyway. To check what is the current setting you need to query your block devices. In case of the SCSI sub-system (devices named sda, sdb, sdc, etc.) that can be done with:

In order to change you will just need to write to scheduler files with the name of the new scheduler:

That’s it, now you can enjoy experimenting with ionice.

 

It could be also possible to make a backup application that would not interfere with the system cache. By specifying O_DIRECT flag when opening a file, an application tells the system to bypass the cache for it during reads. This is so far only an idea since there are no tools I know of that would support it well. The problem is when accessing a file that was opened with O_DIRECT flag, the file needs to be aligned to the file system block size, so usually it means the size has to be divisible by 4096 in order to read it right. Even though it’s always the case for InnoDB tablespaces, other MySQL data files do not comply with this requirement. The trick could be used here perhaps to read the file with O_DIRECT up to the last full block and then only perform a regular cached read on the last few bytes and append them to the target file.

 

But even if all those precautions have been taken, there are still chances for performance problems to happen on the working instance of MySQL. Such danger may for example come from an unexpected spike in load or traffic, or even from quite expected spikes that you simply can’t do anything about. So what I thought could be done here as the next step was to constantly monitor the database status and if any problems were noticed, the monitoring would simply pause the copying. I did a simple Perl script to do just that. It works by sending signals that can either stop or resume the application that copies the data:

What this script does is to check MySQL status every second and in case the number of connected threads goes above 50, it sends STOP signal to the archiving process. Whenever the number of connected threads drops down back to 9 or less, the script sends CONT signal which resumes archiving. The levels are of course different for every MySQL instance, these are just examples. The cheks can also be more sphisticated or include things like processlist information, CPU load avergages, I/O load, etc.

In this case tar is actually launched by the monitoring script directly, but that’s not really necessary. It simply needs to know PID of the process to manage and have a way to know when it ends.

This is of course just a concept of what I’m using successfully in some difficut environments and you can try building a mechanism suiting your own needs based on that.

Maciek

P.S. If you know someone who does not care about backups, please let him know this URL for our data recovery services.

About Maciej Dobrzanski

An IT consultant with the primary focus on systems, databases and application stacks performance and scalability. Expert on open source technologies such as Linux, BSD, Apache, nginx, MySQL, and many more. Co-author of dba square - a blog about how to manage, scale, and optimize MySQL performance!

Comments

  1. Davy Campano says:

    Excellent article! I have been only using rsync so that I could take advantage of the –bwlimit flag because tar didn’t have a similar flag. This information is great. Thanks for providing it!

  2. Question: you’re archiving mysql binary data *while* mysql is running, and thus the data files are in an open state? I’ve always been under the impression that cold binary backups have to be done with MySQL off in order to get useable data for a backup. I don’t see how this will provide useful files for restore.

  3. Davy Campano says:

    @themattreid,

    This method can be done by using LVM and taking a file system snapshot while mysql is running, and then copy/tar from the snapshot. You can find some more info on this blog post:

    http://www.mysqlperformanceblog.com/2006/08/21/using-lvm-for-mysql-backup-and-replication-setup/

  4. Gotcha, I didn’t see LVM mentioned anywhere so I had to pose the question. We’ll be pushing out LVM for our DAS setups next quarter. Good stuff.

  5. My backups are always taken via lvm snapshot on the passive node of my DRBD cluster. I think of it like this – The I/O on both nodes of the cluster is going to undergo the same about of writes, but by reading the backup from the passive node I don’t interfere with any application reads on the active node. I also know that I don’t have to worry about my network bandwidth as I’m copying the backup off – the outward facing NIC on the passive node isn’t used by anything else.

    Of course, I haven’t experienced any node failures during the backup process. But I think everyone shares the risk that a catastrophic activity can occur during a backup.

    As far as implementation is concerned, I just have a wrapper around mylvmbackup that looks for the mysql socket and exits gracefully if the socket exists. The wrapper script is set in cron on both nodes, but it only runs on the node where MySQL _isn’t_ running.

  6. Pat says:

    If you’re using LVM for your backups though you need to issue: FLUSH TABLES WITH READLOCK prior to taking the snapshot.

    That operation, in turn, waits for any running queries to complete, meaning that if I’ve got a 60 second query in fligh when I issue the flush, my databaze is locked for 60 seconds until that query completes and the flush returns.

    Does anybody know a way around this problem? The apps I work with really can’t tolerate the potential for a 60 second outage every time we do an LVM snapshot.

  7. Davy Campano says:

    Pat,

    If you are using all Innodb tables, then you don’t need to issue the FLUSH TABLES WITH READLOCK since when you start from the backup it will do a restore as if mysql had crashed. Otherwise, you may be able to issue the command and check how long it is taking and kill it after a certain threshold and try again.

  8. Justin says:

    I’ve been waiting for an article hitting these problems for ages.

    To some extent they are solved by having a disk system that has bandwidth to spare
    and can do background volume snapshot, however for the many sites where this is
    not available, these techniques are vital.

    The worst problem IMO is the way large backup operations quickly fill the database
    server memory (OS) file cache with data, pushing all the important stuff out. Without
    the O_DIRECT patch, this is going to happen .. and if your website is unusably slow
    for the first 5 or 10 minutes after it is brought online (while the OS cache is filling up)
    then it will also behave this way during a backup!

    The trick of halting and restoring the backup process while monitoring the mysql run
    queue is good but the risk there is some period of quiet can quickly flush the entire
    OS cache as the backup runs, and then the first large query that comes along is
    halted for 1, 5 or 10+ seconds when it would normally be resolved instantly. It might
    be better to run the backup with no bandwidth limit but interrupt it a fixed intervals
    eg, every 10 seconds, run for 1 second, knowing that over all traffic patterns, this is
    low impact.

    Unfortunately it is also very slow and will backup very inconsistent data that spans
    maybe even hours ..

  9. James Cohen says:

    Here’s a little trick I thought I’d share.

    We have a fairly large amount of data in several different databases. Lots of the data is archived daily tables and load of other tables that get read often but not written to very often.

    I wrote a small script in PHP (it’d be easy to rustle something up in most languages) that iterates through the tables in a database and compares the Update_time from SHOW TABLE STATUS with the mtime of back up files (one per table) on the file system. This incremental backup method works really well for us.

  10. Matt says:

    Another fantastic article – many thanks!

    I have a large (well over the TB mark) mostly MyISAM database I can’t take down to backup, and the extra disk write overhead from snapshotting rules out that method too. My strategy is to have a slave server purely for backups which uses a perl script to do the following:

    1. FLUSH TABLES WITH READ LOCK
    2. SHOW SLAVE STATUS and record the Relay_Master_Log_File and Exec_Master_Log_Pos values. These are almost as important as the data itself as you can’t easily restore a slave without them
    2. Snapshot the filesystem (FreeBSD/UFS2)
    3. UNLOCK TABLES
    4. Mount the snapshot
    5. Copy the directory tree from the snapshot to a backup device, using pbzip2 -9 (http://compression.ca/pbzip2) for any files over 100KB as it can reduce the backup size by around 50% over gzip.
    6. Clean up

    It’s not the cheapest backup method around, but as far as I see it’s the only sane way of backing up large DBs without downtime or spending $$$ – we use RAID cards and disk shelves to keep costs down,

    On my setup the I/O rate is such that pbzip2 doesn’t /quite/ saturate the CPUs so it’s perfectly balanced, but if you copy my methodology there’s scope for tuning by varying the compression level if your CPUs are maxed or by keeping the db locked during the whole backup process if disk I/O is your bottleneck.

    Hope someone finds this useful

  11. Maciej Dobrzanski says:

    James,

    That is a good idea for some environments, however limited to MyISAM-only databases. The other problem often may be that the largest tables, those which are most difficult to backup, are the ones being constantly updated.

  12. James Cohen says:

    Thanks for the reply Maciej

    I didn’t realise that innodb always returns a NULL Update_time. Luckily we’re using MyISAM largely in that environment.

    James

  13. Maciej Dobrzanski says:

    Matt,

    Surely backing up off the slave is always a neat solution where possible.

    I’ve never used pbzip2, but I am rather avoiding regular bzip2 for anything bigger as the compression is painfully slow while backups should be done within a limited time frame (i.e. nightly hours). lzop is fast, but it won’t produce as good ratio as bzip2 or even gzip.

    Maciek

  14. Matt says:

    Maciej,

    I used to use gzip because as you said, bzip2 is a lot slower. However with quad core CPUs the parallel bzip2 actually turns out to be faster, even with the -9 flag. Running pbzip2 under ‘nice’ makes sure the DB’s not too starved of CPU time as well although as I’m sure you’re aware, a write-only DB is mostly I/O bound. Also if you’re backing up to a slow storage device (for instance a USB disk or a slow fileserver), pbzip2 can return a substantial time saving as you’re chucking around half the amount of data.

  15. peter says:

    Matt,

    Using slave for backups is good though you always should consider couple of things

    - good replication (and lag monitoring)
    - checks to ensure master and slaves are fully in sync.

    With maatkit now able to do non-intrusive checks we’ve been checking many replication pairs and the ones which are up and running for a few months quite frequently would have some mild data inconsistencies.

  16. peter says:

    Using bandwidth limit in rsync or compression speed (with low priority) as a way to offload the system is a good strategy. Though it does not work in all cases. For example if you use rsync to really sync the backup with old backup version quite commonly you will have a lot of disk IO with little bandwidth used to only transfer the data which is actually changed. The compression can lead you into the trouble when you have a lot of small files plus it is not very flexible in picking the pace you want backups to be happening at.

  17. dermoth says:

    I played recently with backups while performing server upgrades… My new backup slave performs the whole backup (~60GB uncompressed dump, though using on-the-fly compression with Gzip the file end up slightly over 6GB) in single-transaction while keeping up in replication. Interestingly though, what’s hurting more the server (and sometime tripping off the replication alarms) is the rsync job during the checksuming phase. Quick testing showed that a single process saturating the server with I/O reads cause replicating to slightly lag behind.

    I can most likely avoid the checksums in rsync (haven’t tested yet since it’s not really annoying – I think –whole-file does the trick) though I tested various schedulers/setting do see how it affects the results.

    Originally running Linux-2.6.22, I had to upgrade to 2.6.27 due to a bug in CFQ: performing read I/O in the Idle was making *both* dd (Idle class) and MySQL (default class) to slow down to ridiculously low speed!

    - Using the deadline scheduler (which was the default) gives the best performance in all tested kernels – side read operation are fast and MySQL starts lagging slightly.

    - Using CFQ with default class, side read operation is about twice as fast while MySQL replication is nearly halted.

    - Using CFQ with side I/O in the Idle class (on 2.6.27), side read operations are still a bit faster than using deadline, and MySQL is lagging behind significantly more than using deadline.

    The IO subsystem is a hardware RAID 0+1 (3 stripped RAID1), 6x 10k SCSI-320 disks, 128M controller cache without write caching.

    I think that might be of interest to others.

  18. Vitaly says:

    Thanks for the interesting article!
    I use my DB server as backup/archiving server for some data (no MySQL related).
    I understand that MySQL will respond slowly during when some IO-demanding processes are running (gzip, etc…).
    But many times my MySQL DB server just isn’t responding – at least, the same query that usually takes about 2 secs, isn’t responding after few *minutes*!
    I played with “ionice for poors” perl script and at the end I’ll probably move all non-DB thins to another box, but I’m just curious – is this some bug in scheduler? in MySQL? or this is expected behaviour?

  19. dermoth says:

    Vitaly, you should read my post above… There’s pretty much the answer to your question there.

    Nice value affects only processor scheduler and has no effect on IO-intensive tasks. IO classes (ionice) works only with CFQ, and as I point out in that post, even when using IO classes to give priority to MySQL, CFQ makes MySQL run slower than Deadline on my hardware.

    I haven’t tested with the Anticipatory scheduler, however it should be even worse by design; Anticipatory should make Mysql wait even more on IO expecting sequential reads; while it will likely make random requests.

    Once you find the fastest scheduler for your load/hardware, you may want to tune the scheduler parameters to see if you can get any better results. These are in /sys/block//queue/iosched/.

  20. Vitaly says:

    Thank you, I read your post, probably I just didn’t understand it :-)
    First, I’m still with RHEL4 (2.6.9-42.0.3.ELsmp) – it means, I cannot switch between schedulers.
    Second – my question is mostly why slowing down is *so* drastic, in many cases it seems like mysql is stucked at all.

    Thank you!

  21. dermoth says:

    Vitaly,

    That may be a bug…. In 2.6.22 I had one where running I/O-intensive applications in the IDLE class would slow down everything (both MySQL and the IDLE application were running much slower than when they were both in the default class).

    Check the crond scripts or use ionice to verify you’re not using classes already (i.e. the system may already use the IDLE class for I/O-intensive jobs)… Other than the my only suggestion would be to upgrade.

  22. Vitaly says:

    Thank you!
    In fact, I’m migrating my backup scripts to another host now.

  23. Kattare says:

    Is there anything wrong with doing a slave stop / run the backup (mysqldump & filesystem both) / slave start procedure on a backup slave?

    We lean in that direction whenever possible and I was surprised to not see it mentioned here anywhere…

    Has anyone else noticed that rsync’d ibdata files don’t seem to maintain their integrity whereas an scp’d one does? Maybe rsync just doesn’t handle them well when they get over a certain size? I can’t seem to figure it out.

  24. Kattare, nothing is wrong with that, if you can do that. Lots of people can’t just take a slave offline like that, though.

    You CANNOT rsync InnoDB’s data with the server running, period. This is a common mistake. InnoDB’s background threads are busily writing data, even if no queries are being executed. If you wait long enough all the pending work will get done, but that’ll take an indeterminate amount of time. Just shut down first.

  25. Kattare says:

    Baron, I understand the implications of doing filesystem backups while MySQL is running. I was referring to rsync’s done with MySQL shutdown. EX, when I’m trying to setup a new slave. I’ll shutdown the master, rsync off the data dir, (incl, ~30G ibdata1 file) spin up the master, spin up the slave, attach the slave to the master, and find out (usually within an hour or so) that some queries are failing on the slave because of some kind of corruption in the innodb tables. I’ve seen this behavior with several versions of rsync, with several versions of mysql, on several different linux distros, and on many different pieces of hardware. When I use scp, it works fine. When I use rsync, it “reliably” fails. My theory is that there’s something going on in rsync when the file gets over 4GB, but that’s a very loose theory.

    With every new version of rsync I get excited and try my test case again, but sooner or later I think i’m going to have to put together a bug report for the rsync guys.

  26. Kattare, I see. Have you tried to md5sum files to check if they are the same after rsync?

  27. Kattare says:

    Baron, I went ahead and did some tests. The files were coming out with different md5sums. I expanded my tests to different server model pairs and discovered that some servers worked, others didn’t. They’re all running the same OS, so I’m not sure what the common thread is yet. Whether it’s a cpu issue, mobo issue, memory issue, or “2.6.XX kernel with XX options compiled in on XX hardware” kind of issue. We run a lot of commodity hardware here so it’s not uncommon to find oddities every here and there, but it’s been a while now since I’ve run into one like this.

Speak Your Mind

*