This is the second in a series on what’s seriously limiting MySQL in certain circumstances (links: part 1). In the first part, I wrote about single-threaded replication. Upstream from the replicas is the primary, which enables replication by writing a so-called “binary log” of events that modify data in the server. The binary log is a real limitation in MySQL.

The binary log is necessary not only for replication, but for point-in-time recovery, too. Given a backup and the corresponding binary log position, you can replay the binary log and roll forward the state of your server to a desired point in time.
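
To make that concrete, a roll-forward is typically done by piping the binary log back into the server. A minimal sketch, with illustrative file names, positions, and timestamps:

    # Restore the backup first, then replay the binary log from the
    # backup's recorded coordinates up to the desired moment.
    # (File name, start position, and stop time are placeholders.)
    mysqlbinlog --start-position=107 \
                --stop-datetime="2011-05-01 13:00:00" \
                mysql-bin.000042 | mysql -u root -p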

But enabling the binary log reduces MySQL’s performance dramatically. It is not the logging itself that’s the problem; writing the log is usually not much additional work. It’s ensuring consistency and durability that is expensive. Flushing the log to disk adds an fsync call for every transaction, and the server performs an XA transaction between InnoDB and the binary log to keep the two consistent. That adds more fsync calls, causes mutex contention, prevents group commit, and probably has other effects that aren’t coming to mind right now.
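
To put names on those costs: with both of the following settings at their fully durable values, every commit pays for a binary log fsync plus the XA coordination with InnoDB’s own log flush (a my.cnf sketch, not a tuning recommendation):

    [mysqld]
    log-bin                        = mysql-bin  # enable the binary log
    sync_binlog                    = 1          # fsync the binlog at every commit
    innodb_flush_log_at_trx_commit = 1          # fsync InnoDB's log at every commit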

The performance reduction can be an order of magnitude or more.
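
If you want to verify that on your own hardware, a crude before-and-after comparison works: run a write-heavy benchmark once with the binary log enabled and once after restarting the server with --skip-log-bin. The sysbench invocation below is illustrative only; adjust the parameters for your setup:

    # Same benchmark twice: with log-bin and sync_binlog=1, then with
    # --skip-log-bin. Compare transactions per second.
    sysbench --test=oltp --oltp-test-mode=complex \
             --mysql-user=test --mysql-password=test \
             --num-threads=16 --max-requests=100000 run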

What’s the solution? I’m not sure I can summarize it concisely. There is a lot of complexity, and honestly I don’t understand some of the server internals fully enough to have a 50-thousand-foot view of it all. The binary logging and replication code, and its interaction with InnoDB, is difficult to understand. Kristian Nielsen has an extensive series of posts on group commit alone.

I think that a full fix might require significant architectural changes to MySQL. This will be hard. Maybe Drizzle is going in a good direction — time will tell. All of the solutions that I can think of are too simplistic. For example, doing replication through the InnoDB transaction log would work fine if a) all the data were in InnoDB, and b) InnoDB’s data didn’t have to be synchronized with the .frm files (and Drizzle has gotten rid of the .frm files, hooray), and c) privileges and other changes to the non-InnoDB data in MySQL were handled manually.

It could work if you just made sure that you didn’t change privileges or schema, but that’s a description of a pretty limited, clunky replication system from the user’s point of view. Still, I have considered it. There would need to be a mechanism for transporting the log files, and InnoDB would have to be put into a state of constant “recovery,” and it would have to be modified to be available read-only in this state so that it could be used for read queries. This can be done, of course. It’s just a matter of how hard it is.

It’s worth noting that PBXT does replication through its transaction logs, so there’s even precedent for this among MySQL storage engines. And there is Galera’s multi-master synchronization technology to look at, too.

Comments
Shlomi Noach

Can you share if you have encountered cases where you saw PBXT on production?

Kristian Nielsen

Agree with your points here.

My group commit work will provide a partial solution to this (it is currently
available on
https://code.launchpad.net/~maria-captains/maria/mariadb-5.1-mwl116 and is
planned for MariaDB 5.3). This allows parallel transactions to share fsync()
calls, so for highly parallel applications the fsync() cost is dramatically
reduced. The Facebook patch has a similar solution.
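
If you want to see how well commits are being grouped, the patch also
exposes status counters for it (counter names as in the MariaDB tree;
check your version's documentation):

    -- Average transactions per fsync() is roughly
    -- Binlog_commits / Binlog_group_commits.
    SHOW GLOBAL STATUS LIKE 'Binlog_%commits';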

I have an idea for taking this further: drop durability inside InnoDB
(--innodb-flush-log-at-trx-commit=0), and instead get durability and
consistency by recovering any missing transactions from the binary log. Then
only a single fsync() per group commit is needed (on the binary log). This
will make the binlog only slightly more expensive than using InnoDB durably
without the binary log. I have no code for this, but I think it is realistic
to implement within a reasonable timeframe.
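
In configuration terms the idea amounts to something like this (a sketch
only; the recover-from-binlog step is the part that does not exist yet):

    [mysqld]
    log-bin                        = mysql-bin
    sync_binlog                    = 1  # one fsync() per (group) commit
    innodb_flush_log_at_trx_commit = 0  # no fsync() inside InnoDB
    # Missing piece: on crash recovery, replay any transactions that
    # are in the binary log but did not make it into InnoDB.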

Even better would be to use the InnoDB transaction log instead of the binary
log. A different way from what you described could be to somehow augment the
data that InnoDB already stores so that the full binlog events can be
reconstructed solely from the InnoDB transaction log. This would be the best
solution in terms of efficiency, as it allows dropping durability (no fsync()
at all) and still preserve consistency. But I agree with you that this is
quite involved, and maybe not realistic in the short to medium term. However,
I think if we get to one fsync() per group commit, we are in pretty good shape
already, able to do 1000-10000 commits per second on typical applications and
hardware.

I was not aware that Drizzle had worked on getting consistency and durability
in their replication work; I will have to check it out, it sounds interesting.

Jobin Augustine

@Kristian
“I have an idea for taking this further: drop durability inside InnoDB”
What you pointed out is the secret behind Oracle’s scalability 🙂
In Oracle, writing to data files happens asynchronously.
So dbwr (the database writer process) is engaged full-time in flushing things out asynchronously.

Durability is assured through the binary logs / transaction logs (we Oracle people call it the “redo log”),
so a startup following a crash does a quick recovery automatically (we call it “instance recovery”).

Rob Wultsch

@Jobin:
MySQL running InnoDB and replication effectively has two sets of logs. InnoDB has its own logs that it uses for crash recovery and whatnot, and MySQL has a log (known as the binary log) of statements or changes in state, which are then used for replication. Turning on the MySQL binary log has a massive negative impact on performance.
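
You can see both sets of logs from a client session (output will vary):

    -- The server-level binary logs, used for replication:
    SHOW BINARY LOGS;
    -- InnoDB's own log files, used only for crash recovery:
    SHOW VARIABLES LIKE 'innodb_log_file%';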

and Kristian
Perhaps BDB should be resurrected? It has internal replication.

Morgan Tocker

@Jobin: As Rob mentioned:

Redo log == Transaction Log.

The major difference in MySQL is that this log is not used for replication. There are some other subtle differences, such as Oracle’s Redo storing complete pages, whereas InnoDB will store just enough info to recreate the page.

The transaction log could be used for replication, but then we’d not be able to use multiple storage engines.

Patrick Casey

Jobin,

I believe INNODB already does asynchronous writes. If my recollection is correct, dirty blocks are flushed out to the logs on transaction commit. Sometime later, the main db writer process will write the dirty blocks out to the main data files.

In the event of a crash recovery, what the system does is compare all those blocks in the log with the corresponding block in the main data file with the assumption that the log is correct and the data file may not be.

You run into bottlenecks if you produce dirty blocks faster than the system can flush them out, in which case the system starts panic-flushing dirty blocks, which can bog the whole system down pretty aggressively.
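
You can watch for that condition from a client; a rough check (the right threshold depends on version and workload):

    -- How many buffer pool pages are dirty right now:
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
    -- The percentage of dirty pages at which InnoDB flushes aggressively:
    SHOW VARIABLES LIKE 'innodb_max_dirty_pages_pct';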

Robert Hodges

@Kristian
Deriving the binlog from the InnoDB transaction log seems attractive if only because it makes that log the single version of truth and eliminates two-phase commit issues. As far as I can tell, two-phase commit is present to avoid inconsistency due to failures, so if you instead recover from the underlying log(s), that problem is solved, and you can now write binlog data asynchronously.

It would also be nice if the new “meta-binlog” matched journal semantics more accurately and allowed interleaving. Otherwise it seems you still end up creating a major choke point when dumping large transactions into the binlog. You would also potentially be discarding information that allows you to parallelize later on.

@All
Finally, there’s a lot of desire to make all of this work cross-engine. What are the real use cases for cross-engine transactions? It seems to add a lot of complexity to implementation and allows people to mess up replication. What’s the benefit?
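
(By cross-engine I mean a single transaction that touches more than one engine, something like the following, with hypothetical tables:)

    CREATE TABLE orders (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE audit (msg VARCHAR(100)) ENGINE=MyISAM;
    BEGIN;
    INSERT INTO orders VALUES (1);
    INSERT INTO audit VALUES ('order 1 created');  -- MyISAM part cannot roll back
    COMMIT;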

Patrick Casey

Having it work across engines fits with the mysql narrative, e.g. pick the engine that’s appropriate for your workload, but be comfortable that all the engines share the same semantics and behaviors (with noteworthy, “this is the whole point of different engines” differences).

If you introduced replication that, for example, only worked for INNODB (or PBXT, for that matter), then you’d sort of be fighting the mysql marketing story.

That being said, I at least would be perfectly happy with bulletproof INNODB replication since that’s the only engine I use.

Of course what I *really* want is a clustered INNODB (mysql cluster is a non-starter for my workload), but I’m aware that’s not in the cards.