Xaprb (Baron) recently blogged about how InnoDB performs a checkpoint , I thought it might be useful to explain another important mechanism that affects both response time and throughput – The transaction log.

The InnoDB transaction log handles REDO logging, this is the mechanism that provides the A (Atomic) C (Consistent)  and D (Durability) in ACID. The transaction log keeps a complete record of every change that occurs to the pages inside the database. Instead of logging whole pages (Value / Physical logging) where both the before and after image of a page is logged or by logging logical changes  (Logical logging) to the dataset, InnoDB uses  Physiological logging, this basically means that it combines the two techniques to provide a logging mechanism that is both optimal in terms of the amount of data logged whilst still ensuring the log is action consistent and each operation is idempotent. It is important to note that physiological logging needs to operate on consistent pages, Innodb achieves this by using the double write buffer to ensure consistent page writes.

Why log?

Since InnoDB tries to keep the working set in memory (InnoDB Buffer Pool), therefore the changes made by transactions will occur in volatile memory and later be flushed to disk. So in the event of volatile memory failure or during a system restart InnoDB can guarantee to have a (C)onsistent record of the state of the data in the database in (D)urable memory and that each transaction is (A)tomic.

What is actually logged?

<space id, page no, operation code, data>

This translates into operations such as

<0, 2600, ‘insert’, ‘after record at offset 8192’, ‘(38, Percona, Fortune…)’>

The log header stores information about the last checkpoint. The log is stored as records rather than pages and has 512 byte alignment, matching disk sectors. Percona Server with XtraDB allows you to control the transaction log record size, as better performance may be gained with 4096 byte records with newer media such as SSD cards.

Why Idempotent?

Each record in the log needs to be able to be redone multiple times since at restart and media recovery we will not know if that action has previously been executed as we will start recovery from the last checkpoint.InnoDB achieves this by reading the LSN of the page and compares that to the one of log record, if the LSN is higher that means the page was flushed after this record was processed so the change does not need to be applied.

How does this affect response time?

In order to keep ACID compliance, the transaction log must guarantee the logging action happens before the transaction is committed, this is known as write-ahead-logging. This essentially means that before an update can return it must be logged. As the time to log is added to every update it can become an important overhead to your response time. Indeed if InnoDB cannot log at all, your transaction will never complete.

Both the MySQL configuration and hardware will affect the speed at which InnoDB can log. You can request that InnoDB force its REDO records to the transaction log but not actually flush the log to disk at commit. By this means you are ensuring that in the event of a MySQL crash the data will still be consistent, however in the event of a system crash there may be some records that never make it to durable memory.

The three options available are:

innodb_flush_log_at_trx_commit = 0 /* Write log buffer to log about once a second and flush */

innodb_flush_log_at_trx_commit = 1 /* Write log buffer to log and flush to disk */

innodb_flush_log_at_trx_commit = 2 /* Write log buffer to log and flush about once per second */

(About once a second is related to the fact the flush and sync is controlled by a background thread that gives no guarantee it will be per second)

NB: Percona  Server with XtraDB actually allow you to control this at session level.

This allows you to risk durability by having the logging return before the record is actually on disk. A similar and complementary mechanism is to use a battery backed raid card. This allows updates to be written to the raid cards battery protected memory which has much faster access than the underlying disk(s).

How are the REDO records applied?

InnoDB processes the REDO records from the last checkpoint, each record’s LSN is compared against the LSN of the page and if the on disk LSN is equal or higher the record is skipped. This behavior can be seen by enabling the innodb_recovery_stats option in Percona Server, which allows you to see the progress made during recovery.

How does this affect throughput?

As the REDO log in InnoDB uses a fixed length circular transaction log, the speed at which you can process updates is tightly linked to the speed at which check-pointing can occur. In order to insert a new record in the transaction log, InnoDB must ensure that the previously written record has been flushed to disk . This means that it can be beneficial to have very large transaction logs which allow a larger window between REDO logging and the checkpoint on disk.

Vadim touched upon this subject recently in his post MySQL 5.5.8 – in search of stability, as seen in his post, Percona Server with XtraDB allows much larger logs beyond the InnoDB limit of 4G.

As a side effect of logging there is no need to apply page changes at commit, therefore what would be a random write is translated into a much faster sequential write. The background processing of dirty pages can then apply algorithms to optimize those writes during checkpoint.

What else?

There is of course one component missing to our picture, just because each change is logged it does not mean we can reconstruct a consistent set of data. A change may be logged and then later rolled back do to an error or a specific request, as part of the DO-UNDO-REDO protocol InnoDB also records UNDO information, however that is stored outside of the transaction log and as such is not covered here.

19 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Justin Swanhart

Very nice post Ewen!

Pavel

I didn’t know about the innodb_recovery_stats … thanks for the nice writeup!

Paul Keenan

Good article. Would be more helpful if you could define LSN or provide a link.

Paul Keenan

Thanks Ewen, makes sense.

Swarn Singh

Really helpful and comprehensive article…Ewen

viv

Hi Ewen

I have a question about the innodb flush the data into the redo log. When there is one value updated , the BP would mark this page as dirty, and the changes would be inserted into redo log. When it is time to flush the changed data into data file, it should be flushed from BP or from redo log? That is the part I dont understand, and I can not find any answers on the internet. As if it should flush from redo log, then the page in BP should not be dirty anymore, cause it contains the newest value, which means how can master thread flush the redo log when it reaches max. percentage of dirty page? as it becames clean as soon as it flush the changes into redo log. If the page is still dirty in BP, and the changes still in the redo log, but not data file, if there is a query need to read this value, where should it fetch from?

Thanks if you could explain a bit to me 🙂

Baron Schwartz

Thank you for posting this on our forum at http://forum.percona.com/index.php/mv/msg/1801/ where we will answer instead of in the comments.

Viv

Ewen, thanks a lot for the explanation, I finally understand the relationship now. Thanks a lot!:)

chang

Hi Ewen

what is the relationship between undo and redo log in innodb? How does innodb keep the consistent in the case of innodb_flush_log_at_trx_commit != 1? In other words, some pages are flushed and others aren’t, and while redo log is missed.

one possibility is don’t flush dirty pages until related redo-logs are flushed, but it looks only allowing one transaction per page.

what is innodb’s way?

Viv

Ewen, thanks a lot for the explanation, I finally understand the relationship now. Thanks a lot!:)

Baron Schwartz

Thank you for posting this on our forum at http://forum.percona.com/index.php/mv/msg/1801/ where we will answer instead of in the comments.

viv

Hi Ewen

I have a question about the innodb flush the data into the redo log. When there is one value updated , the BP would mark this page as dirty, and the changes would be inserted into redo log. When it is time to flush the changed data into data file, it should be flushed from BP or from redo log? That is the part I dont understand, and I can not find any answers on the internet. As if it should flush from redo log, then the page in BP should not be dirty anymore, cause it contains the newest value, which means how can master thread flush the redo log when it reaches max. percentage of dirty page? as it becames clean as soon as it flush the changes into redo log. If the page is still dirty in BP, and the changes still in the redo log, but not data file, if there is a query need to read this value, where should it fetch from?

Thanks if you could explain a bit to me :)

feedhoo

undo in redo logfile ?

Karthik.P.R

Nice Post Ewen