Beware: ext3 and sync-binlog do not play well together

One of our customers reported a strange problem: MySQL performed extremely poorly with sync-binlog=1 enabled, even though the system, with RAID and a BBU, was expected to perform much better.

The problem could be repeated with SysBench as follows:

./sysbench --num-threads=2 --test=oltp --oltp-test-mode=complex --oltp-table-size=100000 --oltp-distinct-ranges=0 --oltp-order-ranges=0 --oltp-sum-ranges=0 --oltp-simple-ranges=0 --oltp-point-selects=0 --oltp-range-size=0 --mysql-table-engine=innodb  --mysql-user=root --max-requests=0 --max-time=60 --mysql-db=test run


On a Dell R900 with CentOS 5.2 and the ext3 filesystem we get 1060 transactions/sec with a single thread and sync_binlog=1, while with 2 threads it drops to 610 transactions/sec. Because writes to the binlog are synchronous and serialized, we do not expect a significant improvement in transaction rate from the second thread (the load is bound by binlog fsync), but we surely do not expect it to drop either.

To make sure this is not related to the RAID controller, we repeated the run on an in-memory block device and got very similar results: 2350 transactions/sec with 2 threads and sync-binlog=0, and just 750 transactions/sec with sync-binlog=1.

We tried mounting with data=writeback and data=journal just to make sure, but the results were basically the same.

Using XFS instead of EXT3 gives the expected results: we get 2350 transactions/sec with sync_binlog=1 and 2550 transactions/sec with sync-binlog=0, which is about 10% overhead.
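To put the figures quoted so far side by side, here is a small sketch that computes the relative slowdown for each pair of numbers. The helper name `slowdown_pct` and the labels are made up for illustration; the transaction rates are the ones reported in this post.

```python
def slowdown_pct(baseline_tps, measured_tps):
    """Relative drop from baseline to measured, in percent."""
    return (baseline_tps - measured_tps) / baseline_tps * 100.0


# (baseline, measured) transaction rates as reported in the post
REPORTED = {
    "ext3 on RAID, sync_binlog=1, 1 -> 2 threads": (1060, 610),
    "in-memory device, sync-binlog 0 -> 1": (2350, 750),
    "XFS, sync-binlog 0 -> 1": (2550, 2350),
}

if __name__ == "__main__":
    for label, (baseline, measured) in REPORTED.items():
        print(f"{label}: {slowdown_pct(baseline, measured):.1f}% slower")
```

The XFS case comes out just under 8%, in line with the rough 10% figure, while the two ext3 cases lose roughly 42% and 68%.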

The EXT2 filesystem also behaves similarly to XFS, so this seems to be an EXT3-specific performance issue. We do not know whether it is CentOS/RHEL specific or whether it is fixed in recent kernels. If somebody can run the same test with different kernels, it would be interesting to know the results.

You may also wonder how this problem could go unnoticed for so long. First, it only applies to the binary log. The Innodb transactional log does not have the same problem on EXT3. I expect this is because the Innodb log is pre-allocated, so there is no need to modify metadata with each write, while the binary log is written as a growing file. Another possible reason is the small size of the writes that are fsync’ed with the binary log. These are just my speculations, though; we did not investigate this in detail.
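The growing-file versus pre-allocated-file guess is easy to probe in isolation. The following is a minimal sketch, not the test we ran: it is single-threaded, so it does not reproduce the 2-thread drop, and the record size, iteration count, and function names are assumptions chosen for illustration. It times write+fsync cycles against a file that grows with every write and against a file whose blocks were written out beforehand.

```python
import os
import tempfile
import time

RECORD = b"x" * 256   # small write, standing in for one binlog event (assumed size)
COUNT = 100           # write+fsync cycles per variant (assumed)


def timed_fsync_writes(path, preallocate):
    """Write COUNT records, fsync after each one, return elapsed seconds."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        if preallocate:
            # Physically pre-allocate by writing zeros, then rewind, so the
            # timed writes never change the file size.
            os.write(fd, b"\0" * (len(RECORD) * COUNT))
            os.fsync(fd)
            os.lseek(fd, 0, os.SEEK_SET)
        start = time.perf_counter()
        for _ in range(COUNT):
            os.write(fd, RECORD)
            os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def run(directory):
    """Time both variants in the given directory and clean up afterwards."""
    results = {}
    for name, prealloc in (("growing", False), ("preallocated", True)):
        path = os.path.join(directory, name + ".log")
        results[name] = timed_fsync_writes(path, prealloc)
        os.remove(path)
    return results


if __name__ == "__main__":
    # Replace the temporary directory with one on the filesystem under test.
    with tempfile.TemporaryDirectory() as d:
        for name, secs in run(d).items():
            print(f"{name:>13}: {COUNT / secs:8.0f} fsyncs/sec")
```

Run it with the directory placed on ext3, ext2, and XFS in turn to compare; the numbers only mean something on the filesystem and hardware where they were measured.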

Another reason is that it only happens with a high transaction commit rate. If you’re running just a couple of hundred durable commits a second, you may not notice this problem.

Third, sync-binlog=1 is a great option for safety, but it is surely not the most commonly used one. The web applications that often have the highest transaction rates typically do not have durability requirements strong enough to call for this option.

As a call to action: I would surely like someone to see whether EXT3 can be fixed in this regard, as it is by far the most common filesystem for Linux. It is also worth investigating whether something can be done on the MySQL side. Maybe opening the binlog with the O_DSYNC flag when sync_binlog=1, instead of using fsync, would help? Or maybe binlog pre-allocation would be a good solution.
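For readers unfamiliar with the two approaches, here is a rough sketch of the difference at the system-call level. It is not how MySQL writes its binlog; the function names are hypothetical, and it assumes a Unix platform where `os.O_DSYNC` (or at least `os.O_SYNC`) is defined.

```python
import os
import tempfile

# O_DSYNC is not defined on every platform; falling back to the stricter
# O_SYNC is an assumption made for this sketch.
SYNC_FLAG = getattr(os, "O_DSYNC", None) or os.O_SYNC


def append_with_fsync(path, records):
    """Buffered writes, each followed by an explicit fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for rec in records:
            os.write(fd, rec)
            os.fsync(fd)
    finally:
        os.close(fd)


def append_with_odsync(path, records):
    """Open with O_DSYNC so each write() is itself synchronous."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | SYNC_FLAG
    fd = os.open(path, flags, 0o600)
    try:
        for rec in records:
            os.write(fd, rec)  # no separate fsync() call is issued
    finally:
        os.close(fd)


if __name__ == "__main__":
    events = [b"event-%d\n" % i for i in range(10)]
    with tempfile.TemporaryDirectory() as d:
        append_with_fsync(os.path.join(d, "fsync.log"), events)
        append_with_odsync(os.path.join(d, "odsync.log"), events)
        print(sorted(os.listdir(d)))
```

Both variants leave the same bytes in the file; whether O_DSYNC actually avoids the slowdown on ext3 is exactly the open question and would have to be measured.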

20 Comments


Mark Callaghan

15 years ago

Are log files for Falcon and Maria pre-allocated?

Peter Zaitsev

Author

15 years ago

Maria logs at least are not. I spoke to Monty about this and he plans to do it at the optimization stage.

Brice Figureau

15 years ago

Hi,

I’m not 100% sure, but I think EXT3 flushes all the transactions affecting the filesystem when you fsync a file and not only the transactions affecting the fsynced file. That could be slow if said filesystem has seen lots of changes all over.

Also, there was a long debate 6 months ago or so about Firefox RC1, ext3 and fsync:
https://kerneltrap.org/mailarchive/linux-fsdevel/2008/5/26/1940114

Anyway, I’ll try your sysbench to see if I can confirm the issue, because that could really be the problem I was seeing when I originally reported the kernel bug #7372:
http://bugzilla.kernel.org/show_bug.cgi?id=7372

Angelo Vargas

15 years ago

Whose command is that? A specific program?

./sysbench

Davy

15 years ago

Thanks for the post Peter. I have been noticing some weird iostat output that shows 4 – 6 MB/sec writes to a logical volume that only has our binary logs on it, even though the logs were only growing at about 1 MB every 8 seconds. After turning sync_binlog off, this behavior stopped and now iostat reports around 0.13 MB/sec writes to that volume.

Peter Zaitsev

Author

15 years ago

Angelo,

Yes, this is the program we quite commonly use for MySQL benchmarks. Download it here:

http://sourceforge.net/projects/sysbench

Peter Zaitsev

Author

15 years ago

Brice,

Yes, as I remember, ext3 has to flush its log when we fsync a file. This is by design, as the log contains a mix of transactions from various files (typically metadata changes) and simply has to be flushed up to that point. The same actually happens with Innodb: when you commit a transaction, changes from other transactions in the log also have to be flushed.

But I do not think this is the issue here: a single binary log file on the volume still has the same problem.

Stewart Smith

15 years ago

on ext3, think of fsync==sync.

The only engine inside MySQL that does pre-allocation correctly is NDB.

Peter Zaitsev

Author

15 years ago

Stewart,

You’re saying FSYNC == O_SYNC for opening files?

Speaking about NDB, I assume you mean besides Innodb, which has fixed-size logs. BTW, even for tablespaces Innodb uses background pre-allocation.

Stewart Smith

15 years ago

well… fsync()==/bin/sync

innodb does “preallocate” but just by writing out the files. NDB (on XFS… or recent glibc and kernel) calls the filesystem’s fallocate function (xfsctl, or posix_fallocate if it is implemented correctly in libc and kernel) to ask it to preallocate the disk space (which usually results in much less fragmented files on any remotely aged file system)

Peter Zaitsev

Author

15 years ago

Stewart,

Are you sure about fsync()==/bin/sync? I know it does flush the log, but I think I commonly see “dirty” pages in the cache on servers running Innodb on ext3 with tens of durable commits per second; if these were full syncs, that number would stay very low.

Speaking about pre-allocation, I think it is a different story. It helps with fragmentation by reserving the space. But if a file changes its size with each write, you get important metadata updates all the time. Though I guess it would be interesting to just see how it affects performance: growing file vs fallocate preallocated vs physically preallocated.
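To make the last two cases concrete, here is a minimal sketch of both ways to pre-allocate a file. The function names and the 1 MiB size are made up for illustration, and `os.posix_fallocate` is only available on platforms that provide it (Linux does).

```python
import os
import tempfile

SIZE = 1024 * 1024  # 1 MiB, an arbitrary size for the example


def preallocate_fallocate(path, size):
    """Ask the filesystem to reserve the space without writing the data."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def preallocate_by_writing(path, size, chunk=64 * 1024):
    """Physically pre-allocate by writing zeros over the whole file."""
    zeros = b"\0" * chunk
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        remaining = size
        while remaining > 0:
            written = os.write(fd, zeros[:min(chunk, remaining)])
            remaining -= written
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for name, func in (("fallocate", preallocate_fallocate),
                           ("written", preallocate_by_writing)):
            path = os.path.join(d, name + ".dat")
            func(path, SIZE)
            st = os.stat(path)
            print(name, st.st_size, st.st_blocks)
```

Either way the file already has its final size, so later writes inside it do not need to extend it; timing fsync-heavy writes against each variant would be the actual experiment.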

David Nguyen

15 years ago

I ran several sysbench tests with ext3, ext2, and XFS. I noticed a huge performance hit with ext3 and sync_binlog=1 and only a ~10% hit with ext2 (just as you have posted above). However, I was not able to get good numbers with XFS on Debian. What are some of the recommended XFS tuning parameters?

Vadim

15 years ago

David,

You may mount the XFS partition with the -o nobarrier option.

Stewart Smith

15 years ago

with -onobarrier, be sure that you have write cache disabled for the drive – otherwise you *will* get filesystem corruption on crash.

Note that ext3, by default, does *NOT* use write barriers – so it’s actually dangerous if write cache is enabled.

David Nguyen

15 years ago

Thanks for your quick replies. I am running MySQL on a Dell PowerEdge 2950 with PERC6 (with BBU) 6 x 15K drives configured with Raid 10.

Is the recommendation to:

1) mount XFS with -onobarrier
2) write cache disabled for the drive
3) write-back enabled for the controller

Stewart Smith

15 years ago

It can depend on the performance characteristics of the device. With a battery-backed RAID cache, barriers should be the best performing setup, as you will want to be using the write cache.

Steve Katz

14 years ago

Does this same problem exist on CentOS 5.4?

Thomas Guyot

13 years ago

Stewart, barriers are used for storage devices that have an unprotected write cache, and they have to be supported by the device. I believe they can have adverse effects on a device with a protected write cache (I saw at least one report where it was much worse than disabling write cache and barriers altogether), and they are definitely not needed there. FWIW, if you use LVM on old kernels (<2.6.29 iirc) barriers will be ignored anyway.

David's post is what I do to have good reliability, and my systems have survived a wide range of failures so far… I use Reiserfs, which has barriers disabled by default (and my LVM version doesn't support them either).

bwillits

15 years ago

Did some work on this today. We were getting 3500 qps replicating only, with sync_binlog=1 and innodb_flush_log_at_trx_commit = 1. This is writing to an HD array of 6 SAS drives in RAID 10. Changing the values to 0 and 2 respectively increased this to almost 11k qps. All on ext3.

Moving to xfs, we got almost identical values with the same configuration. The Databases themselves live on a md0 of 4 FusionIO 640GB Drives.

tushar

10 years ago

Hey Peter,

Can you please tell me why you are not using tpc-c for benchmarking? It is stricter than sysbench.
I used both benchmarks, and tpcc was showing a result that is less than 4% transactions per minute than that of sysbench, using 100 warehouses and 32 connections.
