This is a time-honored topic, and there’s no shortage of articles on the topic on this blog. I wanted to write a post trying to condense and clarify those posts, as it has taken me a while to really understand this relationship.
Some basic facts
- Most of us know that writing into Innodb updates buffer pool pages in memory and records page operations in the transaction (redo) log.
- Behind the scenes those updated (dirty) buffer pool pages are flushed down the to the tablespace.
- If Innodb stops (read: crashes) with dirty buffer pool pages, Innodb recovery must be done to rebuild the last consistent picture of the database.
- Recovery uses the transaction log by redoing (hence the name ‘redo log’) the page operations in the log that had not already been flushed to the tablespaces.
Ultimately this mechanism was an optimization for slow drives: if you can sequentially write all the changes into a log, it will be faster to do on the fly as transactions come in than trying to randomly write the changes across the tablespaces. Sequential IO trumps Random IO.
However, even today in our modern flash storage world where random IO is significantly less expensive (from a latency perspective, not dollars), this is still an optimization because the longer we delay updating the tablespace, the more IOPs we can potentially conserve, condense, merge, etc. This is because:
- The same row may be written multiple times before the page is flushed
- Multiple rows within the same page can be written before the page is flushed
Innodb Log Checkpointing
So, first of all, what can we see about Innodb log checkpointing and what does it tell us?
1 2 3 4 5 6 7 | mysql> SHOW ENGINE INNODB STATUSG --- LOG --- Log sequence number 9682004056 Log flushed up to 9682004056 Last checkpoint at 9682002296 |
This shows us the virtual head of our log (Log sequence Number), the last place the log was flushed to disk (Log flushed up to), and our last Checkpoint. The LSN grows forever, while the actual locations inside the transaction logs are reused in a circular fashion. Based on these numbers, we can determine how many bytes back in the transaction log our oldest uncheckpointed transaction is by subtracting our ‘Log sequence number’ from the ‘Last checkpoint at’ value. More on what a Checkpoint is in a minute. If you use Percona server, it does the math for you by including some more output:
1 2 3 4 5 6 7 8 9 10 | --- LOG --- Log sequence number 9682004056 Log flushed up to 9682004056 Last checkpoint at 9682002296 Max checkpoint age 108005254 Checkpoint age target 104630090 Modified age 1760 Checkpoint age 1760 |
Probably most interesting here is the Checkpoint age, which is the subtraction I described above. I think of the Max checkpoint age as roughly the furthest back Innodb will allow us to go in the transaction logs; our Checkpoint age cannot exceed this without blocking client operations in Innodb to flush dirty buffers. Max checkpoint age appears to be approximately 80% of the total number of bytes in all the transaction logs, but I’m unsure if that’s always the case.
Remember our transaction logs are circular, and the checkpoint age represents how far back the oldest unflushed transaction is in the log. We cannot overwrite that without potentially losing data on a crash, so Innodb does not permit such an operation and will block incoming writes until the space is available to continue (safely) writing in the log.
Dirty Buffer Pool Pages
On the other side, we have dirty buffers. These two numbers are relevant from the BUFFER POOL AND MEMORY section of SHOW ENGINE INNODB STATUS:
1 2 3 | Database pages 65530 ... Modified db pages 3 |
So we have 3 pages that have modified data in them, and that (in this case) is a very small percentage of the total buffer pool. A page in Innodb contains rows, indexes, etc., while a transaction may modify 1 or millions of rows. Also realize that a single modified page in the buffer pool may contain modified data from multiple transactions in the transaction log.
As I said before, dirty pages are flushed to disk in the background. The order in which they are flushed really has little to nothing to do with the transaction they are associated with, nor with the position associated with their modification in the transaction log. The effect of this is that as the thread managing the dirty page flushing goes about its business, it is not necessarily flushing to optimize the Checkpoint age, it is flushing to try to optimize IO and to obey the LRU in the buffer pool.
Since buffers can and will be flushed out of order, it may be the case that there are a lot of transactions in the transaction log that are fully flushed to disk (i.e., all the pages associated with said transaction are clean), but there still could be older transactions that are not flushed. This, in essence, is what fuzzy checkpointing is.
The checkpoint process is really a logical operation. It occasionally (as chunks of dirty pages get flushed) has a look through the dirty pages in the buffer pool to find the one with the oldest LSN, and that’s the Checkpoint. Everything older must be fully flushed.
The main reason this is important is if the Checkpoint Age is not a factor in dirty buffer flushing, it can get too big and cause stalls in client operations: the algorithm that decides which dirty pages to flush does not optimize for this [well] and sometimes it is not good enough on its own.
So, how can we optimize here? The short of it is: make innodb flush more dirty pages. However, I can’t help but wonder if some tweaks could be made to the page flushing algorithm to be more effective there in choosing older dirty pages. It is clear how that algorithm works without reading the source code.
There are a lot of ways to tune this, here is a list of the most signficant, roughly ordered from oldest to newest, and simultaneously listed from least effective to most effective:
- innodb_max_dirty_pages_pct: This attempts to keep the percentage of dirty pages under control, and before the Innodb plugin this was really the only way to tune dirty buffer flushing. However, I have seen servers with 3% dirty buffers and they are hitting their max checkpoint age. The way this increases dirty buffer flushing also doesn’t scale well on high io subsystems, it effectively just doubles the dirty buffer flushing per second when the % dirty pages exceeds this amount.
- innodb_io_capacity: This setting, in spite of all our grand hopes that it would allow Innodb to make better use of our IO in all operations, simply controls the amount of dirty page flushing per second (and other background tasks like read-ahead). Make this bigger, you flush more per second. This does not adapt, it simply does that many iops every second if there are dirty buffers to flush. It will effectively eliminate any optimization of IO consolidation if you have a low enough write workload (that is, dirty pages get flushed almost immediately, we might be better off without a transaction log in this case). It also can quickly starve data reads and writes to the transaction log if you set this too high.
- innodb_write_io_threads: Controls how many threads will have writes in progress to the disk. I’m not sure why this is still useful if you can use Linux native AIO. These can also be rendered useless by filesystems that don’t allow parallel writing to the same file by more than one thread (particularly if you have relatively few tables and/or use the global tablespaces) *cough ext3 cough*.
- innodb_adaptive_flushing: An Innodb plugin/5.5 setting that tries to be smarter about flushing more aggressively based on the number of dirty pages and the rate of transaction log growth.
- innodb_adaptive_flushing_method: (Percona Server only) This adds a few new different algorithms, but the more effective ones adjust the amount of dirty page flushing using a formula that considers the Checkpoint age and the Checkpoint age target (something you can set in Percona Server, otherwise it is effectively just a hair under the Max Checkpoint age). The two main methods here would be ‘estimate’ (good default) and ‘keep_average’ designed for SSD type storage. Running Percona Server and using this method is my go-to setting for managing Checkpoint age.
Please could you explain exactly what the difference between modifed age and checkpoint age is on the Percona server output?
Simon,
It’s a good question, and I’ve wondered that too. I need to dig into the code further, but from what I can tell at a high-level glance is that ‘Modified age’ is the smallest LSN in the flush list (i.e., in the list of dirty buffers), where ‘Checkpoint age’ is where it is actually checkpointed to. Checkpoint age could be older if a checkpoint simply hasn’t happened.
Checkpointing takes the ‘Modified age’ and then flushes it to the log files (if the Innodb code comments can be believed), so it’s an official record that verifies what LSN was really flushed for crash recovery purposes. The difference between the two being that information being stored in the logs.
I had the issue issue with mysql 5.1 version, unfortunately, i can’t tune those parameters besides decreasing innodb_max_dirty_pages_pct to 1 and increase log file size to 3.9G. As I understand,once the checkpoing ago is reaching 7/8 of innodb_log_file_size, flush process use sync instead of async and block all the write operations. How can I verify this? I tried to pstack(Solaris OS), and the insert is waiting for buf_flush_wait_batch_end ().
Shenglin, it’s pretty obvious when it starts happening, because your query throughput gets very bad.
Even on 5.1, you may consider using the Innodb plugin, which has some of these variables: http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-parameters-new.html
http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-plugin-installation-overview.html
Thanks for your update, Janssen. During the issue, the statement of insert statements were “Query” and waiting for a few seconds. I just want to ensure the root cause is the sync flush, not others.
I just confuse these question:
1. if committed transactions in transaction log flushed to data file, where there marked dirty pages in buffer pool will be decreased?
2. As I understand checkpoint that it is existed to do work as flush to disk in transaction log, isn’t it?
3. What does it variable mean of Innodb_buffer_pool_pages_flushed, does it in buffer pool to flush dirty pages then count one after flushed? what difference between this one and tranaction log checkpoint flush to disk?
4. If a small transaction log buffer like 256k and transaction file like 1M with two groups, and a long and long transaction happens and uncommitted(big transaction), so will older LSN in transaction file be overwrotten by new one after file full?
apologize for my english poor.
sorry, above Innodb_buffer_pool_pages_flushed should be Innodb_data_fsyncs. Innodb_buffer_pool_pages_flushed is number of pages that be flushed.
Kowl, you asked:
> 1. if committed transactions in transaction log flushed to data file, where there marked dirty pages in buffer pool will be
> decreased?
The dirty page count should go down on flushing (though remember that new writes will keep increasing the dirty page count as well).
Don’t think of it as the transaction log flushing transactions: the dirty buffer page flushing happens independently from transactional considerations.
> 2. As I understand checkpoint that it is existed to do work as flush to disk in transaction log, isn’t it?
The concept of checkpointing is a bit unclear to me as well. However, AFAIK the transaction log is never written to except sequentially as transactions are being executed and committed. A checkpoint (from my understanding) is a logical operation wherein Innodb tracks the position in the transaction log at which every transaction’s change before has been fully flushed. I don’t believe it actually writes into the transaction log.
> 3. What does it variable mean of Innodb_buffer_pool_pages_flushed, does it in buffer pool to flush dirty pages then count one after flushed?
Yes.
> what difference between this one and tranaction log checkpoint flush to disk?
> sorry, above Innodb_buffer_pool_pages_flushed should be Innodb_data_fsyncs.
> Innodb_buffer_pool_pages_flushed is number of pages that be flushed.
An fsync is not the same as a write. The fsync happens after a write to tell the OS to ensure the data was flushed to disk. There might be several writes for a single fsync.
Beyond that, I’m not sure if I’m answering your question well.
> 4. If a small transaction log buffer like 256k and transaction file like 1M with two groups, and a long and long transaction
> happens and uncommitted(big transaction), so will older LSN in transaction file be overwrotten by new one after file full?
This is a good question. The transaction log buffer just holds pending writes to the log file, if the transaction is larger than that, then multiple writes to the log file are necessary before commit happens. Innodb can handle uncommitted transactions in the log file correctly.
The second part of your question is about what happens when you have a big transaction and very small log files. I’m honestly not sure, but the default log file size is so small (5MB x2 logs) that I’m sure it’s handled in some way. The manual simply states that more checkpointing activity will be needed:
https://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_log_file_size
I would not recommend running Innodb with transaction log files so small.
Thank you Janssen for your answers.
I had a test for a small redo log buffer(256k) and small redo log file (1MB) and a big transaction. This big transaction is a load data infile sql from a file 1.5GB within huge characters and load data infile into a table and single longblob column. finally, when I am sure lunch commit command a few seconds after load into table complete, then immediately force shutdown server, it happened situation that was not what I think ACID for innodb, nothing was exsited in the table after reboot server and restart mysql and recovery action done in err file. And weird is 1MB redo log could hold 1.5G characters insert transaction, and sharp checkpoint must run many times?
MYSQL VERSION COMMUNITY 5.5.24
@Jay, this is the most concise write-up of this mechanic i’ve seen yet. thank you for taking the time to post it.
this provides excellent information about half of the problem, mainly how to identify and monitor outstanding writes from dirty pages to disk.
as for the other half, understanding the impact when writes do occur, can you (or anyone else on this thread) articulate the best measures of performance impact during writes? (i.e. is it simply a matter of monitoring transactions or queries per second, or are there stats for things like time spent writing to disk, time waiting for a forced flush, etc.)
many thanks.
Jay, can you show where you got this in your article? None of the numbers equal to 80%…
“Max checkpoint age appears to be approximately 80% of the total number of bytes in all the transaction logs”
Where do you show total number of bytes in all the transaction logs?
@Alec, ‘Max checkpoint age’ is a MySQL-internal variable (
log_sys -> max_checkpoint_age
) that’s roughly 81% of the log group capacity, which itself is equal to ((innodb_log_file_size – 2048) * innodb_log_files_in_group). In stock MySQL, max_checkpoint_age is not reported in any status output, though it’s used by the internal flushing algorithm. Percona’s innodb-io patches expose the variable (unchanged) in its innodb status output.The exact formula for max checkpoint age (source: https://git.io/vFJP4) is:
(log_group_capacity * 0.9 – (page_size * (48 + 4 * innodb_thread_concurrency) ) * 0.9
Since the per-thread headroom (48 + 4 * innodb_thread_concurrency pages) is typically very small (unless you have lots of OS threads configured), max checkpoint age usually ends up being roughly 81% of the total log group capacity.