I recently mentioned a few of the big “non-starter” limitations Postgres has overcome for specific use cases. I’ve decided to write a series of blog posts on MySQL’s severe unsolved limitations: the ones that really hobble it for major, important needs. Not the areas where it isn’t used, but the areas where it is used, and is therefore used very wastefully compared to its potential.

The first glaring problem is single-threaded replication. It is severe and getting much worse as servers get more and more CPUs and CPU cores. The replication replay process executes in a single thread on the replicas, and thus has no hope of keeping up with even a moderately busy write load on the primary, where many updates are occurring concurrently.

In a lot of Web applications, this isn’t really seen as a huge limitation. These apps are mostly read traffic, so a single primary can spread the read workload across several replicas, and the write workload remains a relatively small fraction of each server’s total capacity. But if the app gets large enough, it eventually becomes a problem no matter how high the read-to-write ratio is.

What are some workarounds? Here are a few I’ve seen:

  • Use DRBD replication instead of MySQL replication. Problem: you end up with an idle standby server, which can’t be used for serving reads, only for disaster recovery. That’s costly.
  • Shard. Write workload is the single most legitimate reason for sharding, but it’s a shame when “the replica can’t keep up with the write workload” becomes the reason to shard while the primary still has capacity to spare. I wish we could avoid that.
  • Do the replication in the application. I know of applications that just don’t use built-in replication: when they modify some data, they do it in both places. That’s a headache.
  • Try obscure techniques, such as an external process that prefetches the data the replica is about to modify, so the replay thread finds it cached and can apply changes faster (sketched in code just after this list). This rarely works.
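
For the curious, the prefetching hack looks roughly like the sketch below. This is a minimal illustration, not a production tool (Maatkit’s mk-slave-prefetch is the real-world incarnation of the idea): it assumes mysqlbinlog is on the path and that a Python client library such as pymysql is available, and the UPDATE-to-SELECT rewrite is deliberately naive.

```python
# Sketch of a relay-log prefetcher. Illustrative only: the UPDATE -> SELECT
# rewrite is naive, and pymysql is just an assumed client library.
import re
import subprocess

import pymysql  # assumed client library


def rewrite_to_select(stmt):
    """Turn an UPDATE into a SELECT over the same rows, so executing it
    pulls the needed pages into the buffer pool ahead of the SQL thread."""
    m = re.match(r"UPDATE\s+(\S+)\s+SET\s+.+?\s+(WHERE\s+.+)", stmt,
                 re.IGNORECASE | re.DOTALL)
    return "SELECT * FROM %s %s" % (m.group(1), m.group(2)) if m else None


def prefetch(relay_log, conn):
    # mysqlbinlog decodes the relay log into SQL text on stdout.
    proc = subprocess.Popen(["mysqlbinlog", relay_log],
                            stdout=subprocess.PIPE, universal_newlines=True)
    cur = conn.cursor()
    for line in proc.stdout:
        select = rewrite_to_select(line.strip())
        if select:
            try:
                cur.execute(select)  # the side effect we want: page reads
                cur.fetchall()
            except pymysql.Error:
                pass  # best effort; a failed prefetch must never matter
```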

I’m not criticizing anyone who does these things — there really isn’t much of a good choice. It’s all a matter of picking the solution that’s least evil.

Why isn’t there multi-threaded replication? I think that a) it’s not as easy as it seems, and b) there are a hundred edge cases that make it nearly impossible to choose a solution that works for all situations. For example, mixtures of transactional and non-transactional tables are a nightmare.

Here are a few ideas I’ve either heard, or discussed, or thought of myself:

  1. One thread per database. If the application is built such that each database is fully independent, then on the replica, we could start up a thread for each database we see in the binary log, and simply pass the replication events to the appropriate thread.
  2. A pool of threads on the replica, and a coordinating thread that hands work to each of them. The coordinator would read from the relay log until it has a complete transaction’s worth of events (not a single statement), and hand the whole transaction to a worker thread. The worker thread would run up until the COMMIT, but not actually commit, and then report back to the coordinator. The coordinator would ensure that all of the transactions begin in the same order as defined in the relay log, and commit in the same order. If any error occurred, such as a deadlock or lock wait timeout, the coordinator would instruct the workers to roll back, and either retry or execute the problematic transactions serially instead of concurrently. (A rough sketch of this coordinator follows the list.)
  3. Multiple binary logs on the primary, one per database, and one replication process per binary log on the replica. This would have the advantage of allowing a replica to subscribe to multiple masters, which is currently impossible.
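
To make solution 2) concrete, here is a rough sketch of the coordinator and workers. Everything in it is hypothetical and heavily simplified: relay-log parsing is stubbed out, there is no deadlock retry or serial fallback, and a real implementation would also have to serialize the BEGINs, not just the COMMITs.

```python
# Sketch of solution 2: workers execute whole transactions concurrently,
# but the coordinator releases COMMITs strictly in relay-log order.
# All names here are made up for illustration.
import threading
import queue


class Txn:
    """Stand-in for a parsed relay-log transaction (stub)."""
    def __init__(self, statements):
        self.statements = statements
        self.ready = None

    def execute_without_commit(self):
        pass  # would apply self.statements over a worker's connection

    def commit(self):
        pass  # would issue COMMIT on that connection


class Worker(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()

    def run(self):
        while True:
            txn = self.inbox.get()
            txn.execute_without_commit()  # run everything up to the COMMIT
            txn.ready.set()               # report back to the coordinator


def coordinate(relay_log_txns, workers):
    in_flight = []
    for i, txn in enumerate(relay_log_txns):      # relay-log order
        txn.ready = threading.Event()
        workers[i % len(workers)].inbox.put(txn)  # naive round-robin
        in_flight.append(txn)
    # Commit strictly in the original order, no matter which worker
    # finishes first; a real version would also retry deadlocks and
    # fall back to serial execution on repeated errors.
    for txn in in_flight:
        txn.ready.wait()
        txn.commit()


workers = [Worker() for _ in range(4)]
for w in workers:
    w.start()
coordinate([Txn(["..."]) for _ in range(10)], workers)
```

The essential property is that workers may execute concurrently, but the commit order, and therefore the replica’s visible state, matches the relay log.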

These solutions represent different trade-offs. Solution 1) only works for specific uses, and I don’t think it’s general-purpose enough. Solution 2) has potentially complex behavior that might not work well in some cases, such as when deadlocks are common; but it is, overall, the least disruptive or different from the user’s point of view. Solution 3) requires modifying the binary logging code, which is risky. It also requires maintaining many master.info files on the replica, plus new SQL syntax for administering replicas, and is generally not something I personally want to administer (replication is fragile enough… imagine recovering after a crash when you have to fix multiple threads that have each forgotten where they should be reading in multiple binlogs).

Regardless of the solution, it’s certain that nothing is going to work in all cases; at a minimum, the most common cases will require InnoDB with the proper transaction isolation level. Any implementation will have to default to single-threaded replay, as replication does today, and enable the multi-threaded behavior only if the user explicitly configures it.

I would be in favor of solution 2) with an option to configure it to behave as solution 1).

The other big question is who wants to sponsor its development? You know who to contact if you do!

17 Comments
Morgan Tocker

I think it’s important to point out why statement-based replication has to be single-threaded: you essentially need serializable isolation to ensure that the slaves produce the same results when just replaying statements. It’s very difficult to ensure consistency otherwise.
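
A contrived illustration of the point: two perfectly ordinary statements that don’t commute will leave two replicas in different states if each happens to replay them in a different order.

```python
# Two non-commutative "statements" replayed in different orders: each
# replica applies both, yet their end states diverge.
add_ten = lambda x: x + 10  # stands in for UPDATE t SET x = x + 10
double = lambda x: x * 2    # stands in for UPDATE t SET x = x * 2

replica_a = double(add_ten(5))  # (5 + 10) * 2 = 30
replica_b = add_ten(double(5))  # (5 * 2) + 10 = 20
assert replica_a != replica_b   # same statements, different results
```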

I like your suggestion (1), but you’re right – it’s not generic enough. If users modify data across databases, they will eventually have inconsistency.

Moving to row-based replication (how most other databases do it) is the first step toward safely replaying in parallel.

LenZ

Hi Baron! Have you seen this ongoing effort yet? http://forge.mysql.com/wiki/ReplicationFeatures/ParallelSlave

Mark Callaghan

LenZ – one of the possibly permanent limitations for that feature is “All transactions on slave should use SERIALIZABLE isolation level”. Who uses SERIALIZABLE with InnoDB?

Brian

I think I could jump on board with multiple-database replication, when MySQL actually implements multiple databases in its design. Right now, what MySQL calls databases are really only schemas. Granted, for a lot of people it would mean just shoveling all their data into the default schema or database, but it would make things more consistent with the other major vendors.

Tom Hanlon

Baron,

I am just thinking out loud, so there may be holes in my logic, but here is a thought anyhow.

Would replication be more scalable and easier to manage if the replication happened *before* the updates?

Meaning… instead of
1. SQL event is received
2. Changes are made to the tables
3. Event is written to the binlog
4. Replica reads the binlog

What if, at the parse level of the query, we also sent the query off to the replica? Something like:

1. SQL event is received
2. Statement is parsed and sent to the slave as well as executed locally.

This provides a solution similar to the “build replication into your application” approach.

Now that I think more about it, I see some holes in my thinking involving race conditions and consistency.

Oh well, I leave my comment as a suggestion to perhaps look at the problem in another way: a sort of pre-action replication rather than post-action replication.

Another suggestion would be to solve single-threaded replication for InnoDB only. Dealing with non-transactional tables leads to a “race to the bottom” in terms of concurrency.


Tom Hanlon

Tyrael

http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html

“The root of these problems lies in the isolation property within ACID. In particular, the serializability property (which is the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized.”

“This obviously causes problems for replication. If a set of (potentially non-commutative) transactions is sent to two replicas of the same system, the two replicas might each execute the transactions in a manner equivalent to a different serial order, allowing the replicas’ states to diverge.”

“If the isolation property were to be strengthened to guarantee equivalence to a predetermined serial order (while still allowing high levels of concurrency), and if a layer were added to the system which accepts transaction requests, decides on a universal order, and sends the ordered requests to all replicas, then problems (a) and (c) are eliminated.”

“The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. (Why distributed commit protocols can be omitted in distributed systems is non-obvious, and will be discussed in a future blog post; the topic is also addressed at length in our paper.)”

Patrick Casey

Option #4: Punt and do what Oracle does… move redo log information across the wire. That should work like gangbusters for InnoDB, although it’s a non-starter for MyISAM.
Fundamentally, if InnoDB can restore a database from the .ibd files plus the logs, and I start the master and slave with the same .ibd files, then all I need to ship over the wire is the log blocks. That takes care of the whole synchronization process, and the slave can treat it much like a restore.

Fundamentally, I think that to solve this problem, it’s going to have to happen at a lower level than statement- or row-based operations.

Patrick Casey

Great minds think alike I suppose :).

Shlomi Noach

Baron,

I was having similar thoughts with regard to solution #2. Moreover, I suggest it can actually be solved outside the server, using a script, given enough privileges and access to mysqlbinlog.
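
Something along those lines might look like the rough sketch below, which splits the decoded binlog into per-database streams and applies each over its own connection. It is hypothetical and heavily simplified: real binlog parsing, transaction boundaries, error handling, and position tracking are all glossed over, and pymysql is just an assumed client library.

```python
# Rough sketch of an out-of-server parallel applier: bucket the decoded
# binlog by database and apply each bucket over its own connection.
import re
import subprocess
import threading
import queue

import pymysql  # assumed client library


def apply_stream(db, statements):
    # One connection per database; assumes per-database independence.
    conn = pymysql.connect(host="replica", db=db)  # assumed params
    cur = conn.cursor()
    while True:
        cur.execute(statements.get())
        conn.commit()


def split_binlog(binlog_file):
    streams = {}
    proc = subprocess.Popen(["mysqlbinlog", binlog_file],
                            stdout=subprocess.PIPE, universal_newlines=True)
    current_db = None
    for line in proc.stdout:
        m = re.match(r"use\s+(\w+)", line, re.IGNORECASE)
        if m:
            current_db = m.group(1)
            if current_db not in streams:
                q = queue.Queue()
                streams[current_db] = q
                threading.Thread(target=apply_stream,
                                 args=(current_db, q), daemon=True).start()
        elif current_db and not line.startswith(("#", "/*")):
            streams[current_db].put(line.strip())
```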

Dathan Pattishall

As part of your serious-limitations series, I would also mention the SQL parser. In addition, I would mention the benefits of using epoll over the current implementation.

Nickolay Ihalainen

Baron,
You mean a scenario like this? (A toy version of the classifier appears below.)

1) On the master, classify each statement:
if table a is modified, write the binlog event to file 1 (the main thread)
if table b is modified, write it to file 2
if table b is modified using table a’s data, write it to file 1 (the main thread) and write a WAIT MASTER POS=XXXXX statement to all other threads
if table n is modified, write to file n % max_slave_threads
if table x’s engine is something like MyISAM, write to file 1 (the main thread) and write a WAIT MASTER POS=YYYYY statement

2) On the slave:
connect to the main binlog
check whether the server supports additional binlogs, and connect to them
when an additional binlog stream finds a WAIT statement, it sleeps until that master position is reached

Also, a dedicated piece of software could:
1) connect to the MySQL master
2) read and classify the binlog
3) execute statements on the slave over one or n connections
4) if the slave crashes or disconnects there is no way to detect the slave’s exec position, but we could test/debug all the logic and optionally replace the default slave algorithm
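
The classification step in 1) could be as simple as hashing the table name to a stream and forcing anything cross-table or non-transactional onto the main one. A toy version follows; every name in it is made up, and a real implementation would need a stable hash, since Python’s built-in hash is randomized per process.

```python
# Toy version of the master-side classifier described above: route each
# event to a binlog stream by table, serializing unsafe cases on the
# main stream. Illustrative only.
from collections import namedtuple

# tables: tables the statement touches; non_transactional: e.g. MyISAM
Event = namedtuple("Event", ["tables", "non_transactional"])

MAX_SLAVE_THREADS = 4
MAIN = 0  # stream 0 plays the role of "file 1 (the main thread)"


def classify(event):
    """Pick the binlog stream an event is written to on the master."""
    if event.non_transactional or len(event.tables) > 1:
        # Cross-table and non-transactional writes go to the main
        # stream; the real scheme would also emit WAIT MASTER POS
        # markers so the other streams pause until this event applies.
        return MAIN
    return 1 + hash(event.tables[0]) % MAX_SLAVE_THREADS


# e.g. classify(Event(tables=["b"], non_transactional=False)) returns a
# stream in 1..MAX_SLAVE_THREADS, stable within one process.
```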
