Yesterday, over at my personal blog, I wrote about the impact of the MySQL slow query log. Since we’re working on Percona Server 5.6, I wondered whether this was a good opportunity to re-examine how we provide slow query log functionality to our users.

The slow query log code inside the MySQL server does several things that aren’t good for concurrency on modern systems. For starters, it holds a mutex both while formatting the log entry and while doing the write(2) system call itself. This is, in fact, entirely unnecessary.

From the man page for the write system call:

If the file was open(2)ed with O_APPEND, the file offset is first set to the end of the file before writing. The adjustment of the file offset and the write operation are performed as an atomic step.

So for our case of writing the slow query log, if we format the whole log entry into a buffer first, we can just call write(2) and never take a mutex. We trade some memory usage (for the buffer) for concurrency, pushing all the concurrency problems down into the operating system.
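To make that concrete, here is a minimal sketch of the idea; the function and buffer names are mine for illustration, not the actual server code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int slow_log_fd = -1;

void slow_log_open(const char *path)
{
  /* O_APPEND is what makes the lock-free write possible */
  slow_log_fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0640);
}

void slow_log_write_entry(const char *query, double query_time)
{
  char buf[4096];
  /* each thread formats into its own buffer: memory traded for concurrency */
  int len = snprintf(buf, sizeof(buf),
                     "# Query_time: %f\n%s;\n", query_time, query);
  if (len > 0)
    write(slow_log_fd, buf, (size_t)len);  /* atomic append, no mutex */
}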

The main challenge with this strategy is log file rotation: if you just close(2) a file descriptor that other threads may still be write(2)ing to, bad things may happen. Namely, the file descriptor can get reused for something else, say an InnoDB data file, and writing part of your slow query log into an InnoDB page will not end in hugs and puppies.

So, in order to test my theory, I sat down and wrote an audit plugin. It turns out I had to extend the MySQL plugin API to be able to reproduce all the information in the slow query log. The code for my prototype can be found at lp:~stewart/percona-server/5.6-slow-query-plugin-prototype (this revision); it will probably build fine against 5.5 too, but I haven’t tested that. To build it you’ll need the Intel Threading Building Blocks headers installed, as I used their tbb::atomic<> implementation.

My solution to log file rotation is basically to count how many threads may currently be using the file descriptor and wait until this drops to zero before closing it. This avoids taking any locks around the call to write(), which is the most time-consuming part of writing the log. I’m not convinced this is the best solution, but it appears to work and I think it’s race free.
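A rough sketch of how such a counting scheme can look follows. The prototype uses tbb::atomic<>; std::atomic is used here only to keep the example self-contained, and the names are illustrative rather than the prototype’s actual code:

#include <atomic>
#include <cstddef>
#include <unistd.h>

static std::atomic<int> log_fd{-1};
static std::atomic<int> nr_thds_using_fd{0};

void log_write(const char *buf, size_t len)
{
  nr_thds_using_fd.fetch_add(1);   // announce that we may use the fd
  int fd = log_fd.load();
  if (fd >= 0)
    write(fd, buf, len);           // no lock around the actual write
  nr_thds_using_fd.fetch_sub(1);   // rotation may proceed once this hits 0
}

The rotation side then waits for the counter to drop to zero before closing the old descriptor.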

No improvement like this is complete without benchmarks. My benchmarks ("mysqlslap -q 'select 1;' --number-of-queries=1000000 --concurrency=64" on my laptop) come out as follows:

  • straight MySQL (no slow query log): 13 seconds
  • MySQL with slow query log enabled: 18 seconds
  • My audit plugin slow query log: 15 seconds
  • straight Drizzle (no slow query log): 8 seconds

So that’s quite a promising start.

12 Comments
Jonathan Levin

Doesn’t MySQL 5.6 log queries in its performance schema? Something called Statement digest?
If that is not what you are looking for, you can use Karlsson’s audit plugin http://karlssonondatabases.blogspot.co.uk/2010/03/more-fun-with-mysql-audit-plugin-api.html

This may not be the same as a slow log, but it is similar to slow log + pt-query-digest.

Peter Laursen

MySQL 5.6 does not exactly log individual queries and their sequence. It stores an ‘aggregated summary’ of queries executed. Literals in the query are replaced with a wildcard. You will be able to query when such a ‘normalized’ query was last executed, how many times it executed, etc., but not the exact sequence of queries. Also I think there is a ‘hard limit’ (a configuration option) on the number of such ‘normalized queries’ recorded.

The 5.6 P_S implementation is interesting, useful and probably enough for most purposes, but it does not have all details (and it cannot because if it did it would use tremendous amounts of memory!).

Davi Arnaut

For rotating, wouldn’t it be possible to use dup2(2) to atomically change the file associated with the file descriptor? In such a case it wouldn’t be necessary to track usage. Rotation could be a matter of renaming the log file, opening the new one and changing the file associated with the file descriptor used for logging.
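A sketch of the rotation being described here might look like the following; the names are illustrative, and whether this is actually safe is debated in the replies below:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void rotate_with_dup2(int log_fd, const char *path)
{
  char rotated[4096];
  snprintf(rotated, sizeof(rotated), "%s.1", path);
  rename(path, rotated);                  /* move the old log aside */
  int new_fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0640);
  /* atomically re-point log_fd at the new file; writers keep using the
     same descriptor number, so no usage tracking is needed */
  dup2(new_fd, log_fd);
  close(new_fd);                          /* log_fd still refers to the new file */
}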

Also, regarding the code, I can’t quite spot what prevents log_fd from being used after that usleep loop. What am I missing?

Peter Zaitsev

Any summary does not give you as much information as the log. There is a lot of data you can mine from the log, and much of it can be specific to the case you’re analyzing at hand.

I think the MySQL 5.6 information schema extensions are a great step forward, but they are not a replacement for the log.

I also think showing queries in their normalized form is a bad decision; it makes it a lot harder to actually take the query and profile it, check out its EXPLAIN, etc.

Stewart Smith

Davi, I think you spotted a bug. Basically, the “int fd= log_fd; log_fd= -1” should be *before* the usleep loop. This way everything after the usleep loop will be using the new file so it’s safe to close() the old fd.

I guess the question is if dup2() is atomic or not… it may work…
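For reference, a sketch of that reordering, reusing the illustrative std::atomic names from the sketch in the post above (the real prototype uses tbb::atomic<> and differs in detail):

#include <atomic>
#include <unistd.h>

extern std::atomic<int> log_fd;             // shared with the write path
extern std::atomic<int> nr_thds_using_fd;   // count of writers in flight

void rotate_log(void)
{
  int fd = log_fd.exchange(-1);        // retire the old fd *before* waiting,
                                       // so late arrivals reopen the new file
  while (nr_thds_using_fd.load() > 0)
    usleep(1000);                      // wait for in-flight write(2) calls
  if (fd >= 0)
    close(fd);                         // safe: nobody can still be using it
}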

Daniël van Eeden

This reminds me of a Java application whose performance was directly related to the speed of the filesystem that the stack traces were written to.

What about benchmarking with a /dev/shm filesystem, or with some very slow storage?

Stewart Smith

So, the problem with using dup(2) is the same as if just using open() and close().

Namely, the reading of log_fd for calling write() is itself atomic, but retrieving the value combined with performing the write is not. There is a small window between getting the parameter for the system call and the system call running during which we could context switch. If we switched to a thread running dup2(2), then it would close the fd, making the write(2) complete on a closed file descriptor (if we’re lucky; if we’re unlucky, somebody else opens an InnoDB data file and gets that fd).

Davi Arnaut

I’m not sure how the scenario you describe would occur with dup2 (dup2, not dup). If an open file descriptor is passed, the file descriptor is changed to refer to a different file. Note that even the function interface itself does not allow for a passed file descriptor number to change, so no other open() can possibly be allowed to get that file descriptor number (in a conforming implementation). Also, dup2 is required to be async-signal-safe, which further indicates that the function is safe even in the face of interruptions.

khan

Hi Stewart, did you run the benchmark with the slow query log output destination set to TABLE?

Stewart Smith

I have not. AFAIK we see this used very little, as the normal slow query log file is used as input to many existing tools (such as pt-query-digest).

Ángel

Davi, the problem arises if a multithreaded kernel implements it as close(fd) + fd = open(), and that code is itself interruptible (i.e. the fd table or the entry for fd isn’t locked). It’s quite unlikely that an OS kernel implements it that way, but it could be.

From the dup2() POSIX man page:

Least obvious is the possible effect of a signal-catching function that could be invoked between steps and allocate or deallocate file descriptors. This could be avoided by blocking signals.

The dup2() function is not intended for use in critical regions as a synchronization mechanism.

The man page for the Linux dup2() doesn’t make any such guarantee, and in fact says:

If newfd was open, any errors that would have been reported at close(2) time are lost. A careful programmer will not use dup2() or dup3() without closing newfd first.

However, looking at the listed errors, it clearly suggests that such atomicity is not guaranteed:

EBUSY (Linux only) This may be returned by dup2() or dup3() during a race condition with open(2) and dup().

EINTR The dup2() or dup3() call was interrupted by a signal; see signal(7).

As for the way of solving it, I agree that “int fd= log_fd; log_fd= -1;” should have been before the loop. Moreover, it should be before the which_nr_thds_using_fd change.

Do note that write_to_log() will happily report as written the lines being processed in the code between ensure_log_file_open() and the write() when another thread has just set log_fd to -1.

I’d probably place the call to ensure_log_file_open() from extended_slow_query_log_notify() at line 64, inside write_to_log()’s nr_thds_using_fd guard.

There’s another race if write_to_log() registers itself with the new which_nr_thds_using_fd but is still using the old log_fd (hence log_fd should be switched before that).

“log_fd= -1;” should also be atomic, although you may have done it that way on purpose, as gcc provides that.

I’m not sure if it would be desirable to change the stat() to an lstat(). I don’t see why someone would make extended_slow_query_log.log.1 a symlink to somewhere else in order to change the folder of the rotated query log file (they couldn’t move it that way to another partition, which is the only interesting use case I see, as rename would fail with EXDEV and no file rotation would take place at all).
However, the log folder needs to be writable only by mysql to prevent the TOCTTOU, so only the admin could have been adding odd symlinks anyway.

Stewart Smith

Re “log_fd= -1; should also be atomic”: the tbb atomic type provides that.