This is a time-honored topic, and there is no shortage of articles about it on this blog.  I wanted to write a post that condenses and clarifies those articles, as it took me a while to really understand this relationship.

Some basic facts

  • Most of us know that writing into Innodb updates buffer pool pages in memory and records page operations in the transaction (redo) log.
  • Behind the scenes those updated (dirty) buffer pool pages are flushed down to the tablespace.
  • If Innodb stops (read: crashes) with dirty buffer pool pages, Innodb recovery must be done to rebuild the last consistent picture of the database.
  • Recovery uses the transaction log by redoing (hence the name ‘redo log’) the page operations in the log that had not already been flushed to the tablespaces.

Ultimately this mechanism was an optimization for slow drives:  if you can sequentially write all the changes into a log, it will be faster to do on the fly as transactions come in than trying to randomly write the changes across the tablespaces.  Sequential IO trumps Random IO.

However, even today in our modern flash storage world where random IO is significantly less expensive (from a latency perspective, not dollars), this is still an optimization: the longer we delay updating the tablespace, the more IOPs we can potentially conserve, condense, merge, etc.  This is because of two effects (a quick example follows the list):

  • The same row may be written multiple times before the page is flushed
  • Multiple rows within the same page can be written before the page is flushed
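To make that consolidation concrete (assuming the default 16KB Innodb page size): if 100 transactions each modify a different row in the same page before that page is flushed, the redo log absorbs 100 small sequential writes, but the tablespace eventually sees only a single 16KB page write.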

Innodb Log Checkpointing

So, first of all, what can we see about Innodb log checkpointing and what does it tell us?
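For illustration, here is roughly what the LOG section of SHOW ENGINE INNODB STATUS shows on a stock MySQL server (the numbers below are invented for this example):

    mysql> SHOW ENGINE INNODB STATUS\G
    ...
    ---
    LOG
    ---
    Log sequence number 9682004056
    Log flushed up to   9682004056
    Last checkpoint at  9682002296
    ...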

This shows us the virtual head of our log (Log sequence number), the last place the log was flushed to disk (Log flushed up to), and our last Checkpoint (Last checkpoint at).  The LSN grows forever, while the actual locations inside the transaction logs are reused in a circular fashion.  Based on these numbers, we can determine how many bytes back in the transaction log our oldest uncheckpointed transaction is by subtracting the 'Last checkpoint at' value from the 'Log sequence number'.  More on what a Checkpoint is in a minute.  If you use Percona Server, it does the math for you by including some more output:
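Again illustrative only, the Percona Server (XtraDB) version of the same LOG section looks something like this (same made-up numbers as above):

    ---
    LOG
    ---
    Log sequence number 9682004056
    Log flushed up to   9682004056
    Last checkpoint at  9682002296
    Max checkpoint age      80826164
    Checkpoint age target   78300347
    Modified age                1760
    Checkpoint age              1760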

Probably most interesting here is the Checkpoint age, which is the subtraction I described above.  I think of the Max checkpoint age as roughly the furthest back Innodb will allow us to go in the transaction logs; our Checkpoint age cannot exceed this without blocking client operations in Innodb to flush dirty buffers.  Max checkpoint age appears to be approximately 80% of the total number of bytes in all the transaction logs, but I’m unsure if that’s always the case.
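If you want to check the "total number of bytes in all the transaction logs" on your own server, it can be computed from the log settings; a quick sketch (remember the 80% figure is only my approximation):

    SELECT @@innodb_log_file_size * @@innodb_log_files_in_group AS total_redo_log_bytes,
           @@innodb_log_file_size * @@innodb_log_files_in_group * 0.8 AS rough_max_checkpoint_age;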

Remember our transaction logs are circular, and the checkpoint age represents how far back the oldest unflushed transaction is in the log.  We cannot overwrite that without potentially losing data on a crash, so Innodb does not permit such an operation and will block incoming writes until the space is available to continue (safely) writing in the log.

Dirty Buffer Pool Pages

On the other side, we have dirty buffers.  These two numbers are relevant from the BUFFER POOL AND MEMORY section of SHOW ENGINE INNODB STATUS:
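For reference, the relevant lines look something like this (illustrative output, trimmed; the numbers are chosen to match the discussion below):

    ----------------------
    BUFFER POOL AND MEMORY
    ----------------------
    ...
    Buffer pool size   262143
    ...
    Modified db pages  3
    ...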

So we have 3 pages that have modified data in them, and that (in this case) is a very small percentage of the total buffer pool.  A page in Innodb contains rows, indexes, etc., while a transaction may modify 1 or millions of rows.  Also realize that a single modified page in the buffer pool may contain modified data from multiple transactions in the transaction log.

As I said before, dirty pages are flushed to disk in the background.  The order in which they are flushed really has little to nothing to do with the transaction they are associated with, nor with the position of their modification in the transaction log.  The effect of this is that as the thread managing the dirty page flushing goes about its business, it is not necessarily flushing to optimize the Checkpoint age; it is flushing to try to optimize IO and to obey the LRU in the buffer pool.

Since buffers can and will be flushed out of order, it may be the case that there are a lot of transactions in the transaction log that are fully flushed to disk (i.e., all the pages associated with said transaction are clean), but there still could be  older transactions that are not flushed.  This, in essence, is what fuzzy checkpointing is.

The checkpoint process is really a logical operation.  It occasionally (as chunks of dirty pages get flushed) looks through the dirty pages in the buffer pool to find the one with the oldest LSN, and that becomes the Checkpoint.  Every modification older than that LSN must already have been fully flushed to the tablespaces.

The main reason this is important is that if the Checkpoint age is not a factor in dirty buffer flushing, it can grow too large and cause stalls in client operations: the algorithm that decides which dirty pages to flush does not optimize for it [well], and sometimes it is not good enough on its own.

So, how can we optimize here?  The short of it is: make Innodb flush more dirty pages.  However, I can't help but wonder if some tweaks could be made to the page flushing algorithm to be more effective at choosing older dirty pages.  It is not clear how that algorithm works without reading the source code.

There are a lot of ways to tune this.  Here is a list of the most significant, roughly ordered from oldest to newest, and simultaneously from least effective to most effective (a sample configuration sketch follows the list):

  • innodb_max_dirty_pages_pct:  This attempts to keep the percentage of dirty pages under control, and before the Innodb plugin this was really the only way to tune dirty buffer flushing.  However, I have seen servers with 3% dirty buffers that are still hitting their max checkpoint age.  The way this increases dirty buffer flushing also doesn't scale well on high-performance IO subsystems: it effectively just doubles the dirty buffer flushing per second when the percentage of dirty pages exceeds this amount.
  • innodb_io_capacity: This setting, in spite of all our grand hopes that it would allow Innodb to make better use of our IO in all operations, simply controls the amount of dirty page flushing per second (and other background tasks like read-ahead).  Make this bigger, you flush more per second.  This does not adapt; it simply does that many iops every second if there are dirty buffers to flush.  It will effectively eliminate any optimization of IO consolidation if you have a low enough write workload (that is, dirty pages get flushed almost immediately; we might be better off without a transaction log in this case).  It also can quickly starve data reads and writes to the transaction log if you set this too high.
  • innodb_write_io_threads: Controls how many threads will have writes in progress to the disk.   I’m not sure why this is still useful if you can use Linux native AIO.  These can also be rendered useless by filesystems that don’t allow parallel writing to the same file by more than one thread (particularly if you have relatively few tables and/or use the global tablespaces) *cough ext3 cough*.
  • innodb_adaptive_flushing: An Innodb plugin/5.5 setting that tries to be smarter about flushing more aggressively based on the number of dirty pages and the rate of transaction log growth.
  • innodb_adaptive_flushing_method:  (Percona Server only)  This adds a few new different algorithms, but the more effective ones adjust the amount of dirty page flushing using a formula that considers the Checkpoint age and the Checkpoint age target (something you can set in Percona Server; otherwise it is effectively just a hair under the Max checkpoint age).  The two main methods here are 'estimate' (a good default) and 'keep_average', designed for SSD-type storage.  Running Percona Server and using this method is my go-to setting for managing Checkpoint age.
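For reference only, here is a minimal my.cnf sketch showing where these knobs live.  Every value below is an arbitrary example that you would need to adjust (or omit) for your own hardware, workload, and server version; it is not a recommendation:

    [mysqld]
    # total redo log space = innodb_log_file_size * innodb_log_files_in_group
    innodb_log_file_size            = 512M
    innodb_log_files_in_group       = 2
    # older-style control: flush harder once this % of the buffer pool is dirty
    innodb_max_dirty_pages_pct      = 75
    # background flushing budget, in IOPs per second
    innodb_io_capacity              = 1000
    innodb_write_io_threads         = 4
    # Innodb plugin / 5.5
    innodb_adaptive_flushing        = ON
    # Percona Server only
    innodb_adaptive_flushing_method = estimate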
To be clear, the best we have today (IMO) is using innodb_adaptive_flushing_method in Percona server.
In any case, if you run any kind of production MySQL server, you should be monitoring your Checkpoint age and your Innodb dirty pages and try to see the relationship between those values and your write operations on disk.  The additional controls in 5.5 and Percona Server are excellent reasons to consider an upgrade.
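The dirty page count is available as a standard status counter, and Percona Server also exposes the checkpoint age directly as status variables (if I recall the names correctly), so a monitoring sketch might poll something like:

    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
    -- Percona Server only (variable names from memory):
    SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint%';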

13 Comments
Simon Martin

Please could you explain exactly what the difference between modified age and checkpoint age is in the Percona Server output?

Shenglin

I had the same issue with MySQL 5.1; unfortunately, I can't tune those parameters beyond decreasing innodb_max_dirty_pages_pct to 1 and increasing the log file size to 3.9G.  As I understand it, once the checkpoint age reaches 7/8 of innodb_log_file_size, the flush process uses sync instead of async flushing and blocks all write operations.  How can I verify this?  I tried pstack (Solaris OS), and the insert is waiting on buf_flush_wait_batch_end().

Shenglin

Thanks for your update, Janssen.  During the issue, the state of the insert statements was "Query", and they waited for a few seconds.  I just want to make sure the root cause is the sync flush and not something else.

Kowl

I am just confused about these questions:
1. When committed transactions in the transaction log are flushed to the data files, will the corresponding dirty pages in the buffer pool be decreased?
2. As I understand it, the checkpoint exists to do the work of flushing the transaction log to disk, doesn't it?
3. What does the variable Innodb_buffer_pool_pages_flushed mean? Does it count each dirty page in the buffer pool once it has been flushed? What is the difference between this and the transaction log checkpoint flush to disk?
4. With a small transaction log buffer (say 256k) and transaction log files of 1M in two groups, if a long-running, uncommitted (big) transaction happens, will older LSNs in the transaction log files be overwritten by new ones after the files are full?

Apologies for my poor English.

Kowl

Sorry, the Innodb_buffer_pool_pages_flushed above should be Innodb_data_fsyncs.  Innodb_buffer_pool_pages_flushed is the number of pages that have been flushed.

Kowl

Thank you Janssen for your answers.

I ran a test with a small redo log buffer (256k), a small redo log file (1MB), and a big transaction.  The big transaction was a LOAD DATA INFILE of a 1.5GB file full of character data, loaded into a table with a single LONGBLOB column.  Once I was sure I had issued the COMMIT a few seconds after the load into the table completed, I immediately forced the server to shut down.  What happened was not what I expected from InnoDB's ACID guarantees: nothing existed in the table after I rebooted the server and restarted MySQL, even though the recovery actions appear in the error file.  And what is weird is that a 1MB redo log could hold a 1.5G character insert transaction; the sharp checkpoint must have run many times?

Kowl

MYSQL VERSION COMMUNITY 5.5.24

Ives Stoddard

@Jay, this is the most concise write-up of this mechanic I've seen yet.  Thank you for taking the time to post it.

This provides excellent information about half of the problem, mainly how to identify and monitor outstanding writes from dirty pages to disk.

As for the other half, understanding the impact when writes do occur, can you (or anyone else on this thread) articulate the best measures of performance impact during writes?  (i.e., is it simply a matter of monitoring transactions or queries per second, or are there stats for things like time spent writing to disk, time waiting for a forced flush, etc.?)

Many thanks.

Alec

Jay, can you show where you got this in your article? None of the numbers equal 80%…
“Max checkpoint age appears to be approximately 80% of the total number of bytes in all the transaction logs”

Where do you show total number of bytes in all the transaction logs?

Will

@Alec, ‘Max checkpoint age’ is a MySQL-internal variable (log_sys -> max_checkpoint_age) that’s roughly 81% of the log group capacity, which itself is equal to ((innodb_log_file_size – 2048) * innodb_log_files_in_group). In stock MySQL, max_checkpoint_age is not reported in any status output, though it’s used by the internal flushing algorithm. Percona’s innodb-io patches expose the variable (unchanged) in its innodb status output.

The exact formula for max checkpoint age (source: https://git.io/vFJP4) is:
(log_group_capacity * 0.9 - page_size * (48 + 4 * innodb_thread_concurrency)) * 0.9

Since the per-thread headroom (48 + 4 * innodb_thread_concurrency pages) is typically very small (unless you have lots of OS threads configured), max checkpoint age usually ends up being roughly 81% of the total log group capacity.
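To put example numbers on that: with innodb_log_file_size = 512M, innodb_log_files_in_group = 2, a 16KB page size, and innodb_thread_concurrency = 0, the log group capacity is (536870912 - 2048) * 2 = 1,073,737,728 bytes, and the max checkpoint age works out to roughly (1,073,737,728 * 0.9 - 16384 * 48) * 0.9 ≈ 869,019,771 bytes, or about 80.9% of the capacity.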