July 23, 2014

Storing MySQL Binary logs on NFS Volume

There is a lot of discussion about whether storing MySQL data on NFS is a good idea. There are many arguments for and against it, and this post is not about them.
The fact is, a number of people run their databases on NetApp and other forms of NFS storage, and this post is about one of the discoveries made in such a setup.

There are good reasons to have binary logs on an NFS volume – binary logs are exactly the thing you want to survive a server crash, since with them you can do point-in-time recovery from a backup.

I was testing high volume replication today using Sysbench.

On this box I got around 12,000 updates/sec, which is not a perfect number, though that was mainly due to contention issues in MySQL 5.0 rather than any NAS issues.
This number was reachable even with the binary log stored on the NFS volume, and it is for sync_binlog=0 and innodb_flush_log_at_trx_commit=2.
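Those settings correspond to a my.cnf fragment like the following (the log path is an illustrative placeholder, not the one used in this test):

```ini
[mysqld]
# binary log written to the NFS volume (placeholder path)
log-bin = /nfs/binlogs/mysql-bin
# do not fsync the binary log on every commit
sync_binlog = 0
# flush the InnoDB log about once per second instead of per commit
innodb_flush_log_at_trx_commit = 2
```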

I noted, however, that if I enable replication – connect a slave to this box – the throughput on the master drops to about 2,800 updates/sec… which is very close to the magic number of network roundtrips per second I can get over a 1Gb link. It gets even more interesting than that: if I pause replication for a prolonged period of time and let a few GB of binary logs accumulate, performance on the master stays high even with replication running, but it slows down as soon as the IO thread on the slave catches up with the master.
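As a rough sanity check on that magic number: assuming one synchronous NFS roundtrip per binlog write and an RTT of about 0.35 ms on a 1Gb LAN (the RTT here is an assumed typical value, not a measurement from this test), the theoretical cap lands close to the observed rate:

```shell
# Roundtrip-bound throughput estimate; 0.35 ms RTT is an assumption.
rtt_ms=0.35
max_ops=$(awk -v rtt="$rtt_ms" 'BEGIN { printf "%d", 1000 / rtt }')
echo "upper bound: $max_ops updates/sec"
```

With one roundtrip per update that gives an upper bound of 2857 updates/sec, which is in the same ballpark as the observed ~2,800.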

When I moved the binary logs to local storage I got very similar performance, but there was no degradation when replication was enabled.

I have not checked in detail why this is the case, but I guess something requires a network roundtrip when the binary log is being written at the same time as the slave-feeding thread is reading it.

I’d be curious to know whether someone else can observe this behavior, and whether there is any NFS tuning which can avoid it, or whether we need to fix MySQL.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Peter Zaitsev says:

    William,

    In this case the point is rather that a network roundtrip seems to be needed (at least in the Linux client implementation) if the file is to be extended. Your last sentence says it all, though: in many cases people think moving to high-end NFS will get far better performance than local storage.

  2. William Jimenez says:

    What type of file system was on this NFS server? Remember that NFS sits on top of the file system residing on the server you are exporting the share from. While I agree that NFS can be tricky, these aren’t plug-and-play technologies, so the makeup of the system stack as a whole has a huge impact on performance.

    NFS is one of those technologies that either performs very well or very poorly, depending on how you configure it. I understand that the point of this article is not to be advice as much as a discussion starter, so I am just adding some more color to the picture :-) .

    ZFS-based NFS on Solaris-based kernels has had very impressive results – something to look into. You won’t see anything like local or direct-attached storage, however….

  3. Patrick Casey says:

    Is there file system level locking going on? I don’t have the source in front of me, but it’s entirely plausible that:

    master can write away w/o locking into the binary log
    slave reader thread needs to get a lock to read (so it doesn’t read partial records off the end of the log if it reads while the master is writing)

    So having a slave in place adds locking overhead and means that sometimes when the master decides to write, it won’t be able to because the slave reader holds a lock, and hence it has to wait, reducing master throughput.

    My file system knowledge is way outdated, but I *think* NFS uses an external protocol for locking that requires even more roundtrips, since I think the original NFS spec didn’t include a locking mechanism? That would make the locking cost a lot higher on NFS than on, say, ext3.

  4. peter says:

    Patrick,

    I did not look at the code or trace it closely, so I do not know… though I think it is unlikely to be locking at the file system level. The threads writing the binary log and the threads reading it and sending events to the slave are part of the same process, and they are synchronized by a mutex.

  5. Rob Wultsch says:

    Peter,
    Do you consider a Linux server having a MySQL data dir on NFS safe? I have no first hand interaction with such a setup and have read numerous ominous reports. I am curious if this is a recipe for disaster.

  6. Yet another interesting artifact to debug. AFAIK, the thread on the master that pushes data to a slave gets the binlog lock, copies events out, releases the lock and then sends data to the slave. The Google patch has a small change to avoid allocating a buffer for each event copied from the binlog as that is done when the binlog lock is held. That makes a big difference when there are many slaves per master, but you aren’t running in that setup.

    This would be a good time to use http://poormansprofiler.org

    I prefer to use local storage, but the MySQL community has yet to implement a tool to archive binlogs (write them locally, archive them to NFS). Such a tool is easier to build with recent changes in mysqlbinlog, and Harrison will soon write about that.

  7. Ajit says:

    One idea to relieve the pressure on the master: mount the binlog file system read-only on a different host, run a mysql server on that host, and use that host as the master for the slave.

    Ajit

  8. I do believe that the approach of using NFS for shipping binlogs is somewhat wrong. It has the advantage of shipping live, or almost live logs off the box, but it creates a severe dependency on the availability of the NFS server – let the NFS hang, and watch your database hang as well (writeable mounts should be hard,intr).

    I’d rather see a mechanism that lets me specify a UNIX shell command to be invoked after a binlog cycle. That shell command would get the full pathname of the old, just-retired binlog file as $1 and would be run asynchronously as a forked and disowned process. Typically it would be a command that then scps the old binlog offsite.
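    MySQL has no such hook built in, but the wrapper itself would be tiny. A sketch under stated assumptions – ARCHIVE_DIR and the cp transfer are stand-ins for a real scp destination, and all names here are hypothetical:

    ```shell
    #!/bin/sh
    # Hypothetical binlog-cycle hook: receives the full pathname of the
    # just-retired binlog as $1. A real deployment would scp the file
    # offsite; copying into ARCHIVE_DIR stands in for the network copy.
    ARCHIVE_DIR=${ARCHIVE_DIR:-/var/backups/binlogs}

    archive_binlog() {
        old_log=$1
        mkdir -p "$ARCHIVE_DIR"
        # production version: scp "$old_log" backup-host:/archive/
        cp "$old_log" "$ARCHIVE_DIR/" && echo "archived $(basename "$old_log")"
    }

    # fork and disown so the server never waits on the copy
    [ -n "$1" ] && archive_binlog "$1" >/dev/null 2>&1 &
    ```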

    Postgres happens to have such a mechanism for their WAL, and MySQL is sorely missing such a feature (which would be very easily implemented, btw).

  9. NFS is more than 20 years old. Why are we still concerned about NFS hangs? Does this still occur in production on modern Linux distros? By “occur” I mean that it occurs frequently enough for it to be an issue. There are lots of things that can go wrong and I won’t end up using anything if I am to avoid all failures. I know of people successfully using NFS on a large scale.

    Copying binlogs only after they are done isn’t good enough for me. I want to have as much of the binlog archived elsewhere to recover from the loss of a master. The loss of a master is more frequent for me than an NFS hang. And if NFS is the problem, I can tail the binlog to remote storage using something other than NFS.

  10. Patrick Casey says:

    I’ve had NFS mount points lock up on my production servers before, mounting a RedHat NFS point onto a RedHat server.
    Triggering factor is usually, but not always, some sort of network interrupt.
    I back up onto NFS mount points, but I don’t run anything realtime critical on them.

    From my perspective, putting the binlogs on an NFS server triples my failure domain.

    I lose the database if:
    1) DB server dies
    2) NFS mount point dies
    3) NFS server dies

    I’d rather risk losing a few binlogs than increase my risk of a service interruption. Naturally, everybody’s use case varies :)

    Where I work now, and everywhere I have worked in the past, NFS fails more often than anything else. Also, recovery usually requires a box reboot rather than a service restart or even a SIGHUP, so this is also the worst possible failure behavior.

    Live binlog shipping exists as well – it is called replication (START SLAVE IO_THREAD, and STOP SLAVE SQL_THREAD, if you are really paranoid; or a time delayed slave, if MySQL were to integrate patches for that into replication after all). But a simple binlog_cycle_command, that would be really easy, and quite useful.

  12. Replication doesn’t archive the master’s binlog. Were that the case, then failing a slave to a new master without losing transactions would be trivial. Today you get to play the game of finding the offset on the new master that corresponds to the offset from the old master. Or you can use global transaction IDs from the Google patch.

    Replication is also much more expensive than archiving the binlog. I want to use both but I don’t want to pay for the hardware on a slave when all that I need is a binlog archive solution.

  13. Patrick Casey says:

    Mark, I think you still have an issue of binlog corruption even if you’ve put the binlogs on an NFS share, don’t you? If I lose the network interconnect, or the master dies, or something else “bad” happens, it’s not guaranteed that I have a consistent binlog on the NFS server. I might have an incomplete record at the end, for example, if we were in the middle of a write when the failure occurred.

    So if your requirement is that *absolutely no* transactions may be lost in a failure, then I don’t think remote-mounting the binlogs gets you there, although it’s probably marginally better than just running a slave.

  14. Yes, this allows for loss of transactions. But losing transactions from the last second might be much better than losing them from the last hour. I need sync replication in MySQL to avoid that or use DRBD to make it less likely. MariaDB is working on that with Galera.

  15. I would prefer to use network attached or remote storage as the place where binlog archiving is done rather than where the binlogs are written in the first place.

    Default InnoDB/MySQL doesn’t report IO latency — I think Percona Server does and I know the Facebook patch also does. I assume that binlog sync to local storage accelerated by HW RAID write cache is much faster than binlog sync to NFS. Given the lack of group commit in InnoDB that can make a huge difference.

  16. I think that people are looking for different things here:

    1) get binlogs off the master to save space
    2) get binlogs off the master to keep them safe

    In my opinion, the best technology for item 2) is probably DRBD at the moment.

  17. peter says:

    Copying binary logs after they rotate is easy enough. If you use an NFS/GFS/DRBD volume to store the logs, they are immediately available after a master crash, which allows you to use those logs to bring a slave up to speed and lose no transactions if you have sync_binlog=1.

    Indeed, local storage is a lot faster – you can often get over 10,000 fsyncs/sec with local RAID, while only some 3,000 with NFS over 1Gb Ethernet.
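    Rates like these are easy to re-measure. A crude probe, assuming GNU dd on Linux (oflag=dsync forces each write to stable storage before dd returns); point the file argument at the NFS mount versus a local disk to compare:

    ```shell
    # Crude flushes-per-second probe; approximates fsync-bound binlog
    # writes. Assumes GNU coreutils (dd oflag=dsync, date +%N).
    fsync_rate() {
        n=$1
        f=$2
        start=$(date +%s.%N)
        i=0
        while [ "$i" -lt "$n" ]; do
            # each 512-byte write is forced to disk before dd returns
            dd if=/dev/zero of="$f" bs=512 count=1 oflag=dsync conv=notrunc 2>/dev/null
            i=$((i + 1))
        done
        end=$(date +%s.%N)
        awk -v n="$n" -v s="$start" -v e="$end" \
            'BEGIN { printf "%.0f flushes/sec\n", n / (e - s) }'
    }

    fsync_rate 100 /tmp/fsync-probe.dat
    rm -f /tmp/fsync-probe.dat
    ```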
