Brian Aker recently published good write up about using MySQL replication. The piece I find missing however is good description of warning about limits of this approach as well as things you need to watch out for. You can call me person with negative mind but I tend to think about limits as well. So what would be my list of limits and watch outs in this case ?
– Limited write scaling. As Brian writes this works best for read intensive envinronments. Write scaling however can be real problem. Not moderate increase of write ratio can dramatically reduce performance but you also need to watch out for slave being able to keep up at all. Remember writes to slave are serialized so if you master happens to be 8 core system with 20 hard drives you may be in trouble even if writes only consume 10% of resources as almost only one core and only one disk will be working on slave side.
Now let me clarify what I mean by moderate increasing of write load can cause the trouble. Assume you have same master and your write load loading master 60%, almost same 60% of resources used by the slaves to handle writes. You might have 10 slaves so read load might not be the issue. Now if your write load growths 50% – Master and slaves will start use 90% of their resources for replication (forget for the second replication being single thread, which just makes things worse), which means only 10% will remain for anything else. So your “read” performance will drop 4 times if write load growth by only 50%.
– Delays in replication MySQL replication is fast which means in average delay is small. Typically microseconds. However if there happens to be some bulk query on the master, even on completely unrelated table, replication lag may increase dramatically. Besides monitoring which Brian mentioned three other approaches could be used. First is to use master for real time reads when you want to have 100% guaranty you’re reading latest data. Second to use MASTER_POS_WAIT on the slave(s) to make sure update made it to them – this one has large overhead so should be used only for absolutely critical updates. Third one is to use cache – something like memcached, besides pure caching it can be used to store object version information so you can quickly check if version you read from the slave is current one or if you need to do master read.
– Waste of space Disk is cheap… if it is not a lot of high performance storage. If you would be replicating 1TB of data in 20 copies I guess you would understand me 🙂 For sake of high availability you probably need 2-3 copies. If you need significantly more slaves and they need a lot of storage you might need other solution.
– Waste of cache Memory is much faster than disk and you generally want to have data you’re accessing to be cached in memory by MySQL cache or OS cache. As slaves replicate the same data and often queried in random or round robin fashion their cache would have about same content, so if you would have 10 16GB boxes your “effective cache” would be still about 14GB not even close 140GB as total amount of cache memory you have. Using memcached or similar solutions on the top of things for caching as well as mapping of queries to different slaves helps but only up to certain point. Different solutions are needed if that becomes the problem.
So am I against simple replication ? Not really this is great solution which offers great MySQL Performance in many cases. It is good however to understand if this solution is right for you so you do not have to jump to other schemes without planning it. The simplicity which this solution delivers is more important than its problems for many small-medium sized application.
So in which case I would go using simple master slave replication ?
– Small database size. Small meaning working set fits in memory. This reduces waste of storage, removes problem with cache duplication (as each box can cache everything) as well as normally makes write load to consume small portion of resources.
– Light write load This is needed to avoid scalability problems of single thread and resource waste. I would call load light if it takes no more than 30% of resources if you only leave single disk and single core on the master. So you it can grow well enough before gibing you trouble.
– Few replicas – If you have 3-5 replicas it is fine. Up to 10 probably OK but more than that may mean other solutions would work better. Sometimes even 100 replicas can be OK if database is small and load is virtually read only. you may however want to structure replication with median servers to avoid master overload in such case.
– No long queries – As I already wrote these cause delay in replication even if they work on completely different (but replicated) data sets. Sometimes you do not care about stale data so you’re fine, somethimes you can work it around by skipping replication on of tables affected by long running queries.
That is all for today but far from being all what can be said about MySQL Replication.
Shouldnt large transactions (that is those that make a lot of changes) have increasing delay as well?
An alternative to mysql replication might be to use a filesystem replication instead such as http://www.drbd.org
“As slaves replicate the same data and often queried in random or round robin fashion their cache would have about same content, so if you would have 10 16GB boxes your “effective cacheâ€? would be still about 14GB not even close 140GB as total amount of cache memory you have.”
YES. Thank you! I was waiting form SOMEONE to bring up the problem with MySQL not using a larger effective cache.
If MySQL implemented a MyISAM row cache and implemented direct support for Memcached you could literally use all this memory and you’d see a HUGE boost in performance.
Were’nt you JUST saying the other day that the Linux page cache wasn’t very high performance?
I know of one major site which uses DRDB (with Postgres) and they seem to like it. It seems that it can only be used for failover though. Note that the IO for DRDB happens BEFORE the transaction completes (within the kernel) so if the master crashes you can promote your backup master. (or at least should be able to do it).
Another point…. yes large transactions (on MyISAM) will use locks longer but this can be faster than doing smaller transactions. Bulk inserting records with mulitiple values is about 11x faster in my app…
Kevin
Another thing I just realized… Rumor has it that MySQL 5.1 will support NDB backed by disk. This is essentially what I’m waiting for because not everyone can afford to throw EVERYTHING in memory. In this situation there would be no cache sharing between MySQL instances so maybe my memcached row cache idea is a good one.
Lucas,
Yes transactions with many changes also can cause spike in delay of MySQL replication as all statements from transaction will have to be accomulated and copied to binary log at once and than executed on the slave without being interleaved with other statements, so case of complex statement is just simple case for complex transaction.
In fact in both cases similar workaround of “chopping” can be applied – for long statements you can chop them to multiple statements. same applies to transaction. One very commonly used case is hanling bulk DELETES by using DELETE … LIMIT 1000 and executing it in the loop until no more loops can be deleted. If delete is using index to find rows it is very efficient.
Apachez,
DRBD is great solution but for a bit different problem. It also does replication f course but unlike with MySQL replication only one of the nodes can be used at the same time. So it gives you high availability but not much of extra performance.
DRBD however could be used together with MySQL Replication to increase High Availability – MySQL Replicaion is asynchronous so if master is lost, some transactions could be lost. However if master is replicated by DRBD you can avoid loosing any tranactions. It also greatly simplifies slave fallback which can be as simple as moving master IP to the new box.
The other way you can apply DRBD is to create network backup of your MySQL server completely live.
Kevin,
Yes this problem is often overlooked for some reason. MyISAM row cache in memcached could be interesting and actually not that hard to implement. We’ve recently implemented IO via mmap instead of read/write so writethrough cache for memcached should not be that hard. If would not do it writeback at least initially as memcached is not designed to be overly highly available. It is possible to do some clustering with updating multiple nodes but it would reduce performance.
Other thing – technically distributed file systems should do well for this purpose, of course if they optimize cache to avoid caching same data by same nodes. I have not benchmarked it however – should do once have a chance.
Speaking about NDB cluster – yes it has data on the disk in MySQL 5.1 – I however have not benchmarked it that seriously. I also guess it would take at least a year before it will be ready for serious production use.
Speaking about Linux page cache efficiency – yes, thanks for catching me. However it is all relative – for single disk you can do about 100-150 random IOs per second. For Page cache you can do some 1000 times more which is gread, however when data is in process memory you can do some 10 times more operations still.
So if your load is IO bound it is most important to get it cached ether in OS memory or in MySQL cached. If you’re getting CPU bound you should look for caching data in process memory.
Kevin,
Oh. I still forgot to comment on your bulk inserts thing. Yes that is kind of typical challange of throughtput and latency. There are very many cases when improving one hurts another and you need to find proper ballance.
Especially with MyISAM and good sized bulk_insert_buffer_size performance improvement can be dramatic. Sometimes I’ve got 100 times increase (using 100.000 values per insert statement or so) – The tricks are however to increse max_allowed_packet as well and keep key_buffer_size larger than bulk_insert_buffer_size otherwise performance could go down.
Kevin, LiveJournal uses DRBD with MySQL, for failover master server pairs. InnoDB and if one dies the other is started up, does file system recovery, then does InnoDB recovery.
Sometimes the cache efficiency can be addressed as Wikipedia does it. Splits groups of languages into sets of servers. Within each set, all servers get all writes via replication. Each language in the group is assigned to one or more slaves for the read part of the load. Because each slave is serving and caching only a portion of the work the effective cache size ends up closer to the sum of the cache sizes of all of the slaves. It made a major performance difference.
James Day, Support Engineer, MySQL AB and first DBA, Wikipedia.
James,
You’re right. DRBD makes great addition to MySQL replication, however it is not alternative if you’re also looking for increased performance, not only high availability.
Partitioning on read workload is good way to increase cache efficiency. In Wikipedia case it is not hard as languages are pretty much independent. As you however already mention you assign group of slaves for single language… this is where you’ve got to have this problem again. The reason it worked well for Wikipedia is likely – single language is small enough so it fits well in memory in single server, in this case it does not matter much. If your working set is 10G it does not matter if your effective cache size is 12G or 120G.
I’m quite curious how further growth would be archived. I guess you will still end up with partitioning and having different languages in different replication groups otherwise at some point in time replication will not be able to keep up. This would of course require very large amount of contributors to provide such write traffic.
Same about language – if working set of single language would grow to 100G or more before it would become typical size of memory on the server there may be troubles.
In general even though Wikipedia is very popular web site it is conceptually very good for scaling. The traffic must be reads in astinishing proportion, with much of it comming from anonymous users which can be served by caching proxies not even hitting web servers. Language provide very good partitioning as there is very limited dependence between them. Working set for each language is small.
I’m trying to sort out the different options, but I’m no DBA, and I am not sure I got them all on my list and where to start comparing…
There’s regular replication, built-into the system, which can be aided by Maatkit or Google MMM. Is that what you call “simple replication”? what’s the “non-simple” way?
There’s the problematic DRBD way of doing it but it seems cumbersome and limited, but still chosen by some, I can’t find any docs about the pros of this approach.
Maybe other systems I’m not aware of?
Are there any others to consider? is there a definite-best howto for building a master-slave duo? I sort of understand how the failover works, just not sure about how the fallen master becomes slave and resyncs and all… Any chance you would be posting a lower level introduction to this? I can’t find one site that lists them all, with either a list of the differences or pros and cons one needs to know about.
Thanks!
Ira
Because replication is a serial playback of SQL statements, one query can delay replication until it completes. The main issue with this is that the master allows updating unique databases simultaneously (or as concurrent as the operating system will allow). Because replication is serial playback only and runs only one query at a time, this can lead to delays and backlogs and it can also prevent a slave from catching up in a timely fashion.
The MySQL team needs to redesign replication to scan ahead from the binlog and groups the queries into queues of like databases. Then, allow up to X threads all pulling from the queues simultaneously to update each unique database (only one thread pulling from each specific database queue). This technique effectively replicates the simultaneous query nature of the master’s original queries and allows for much more timely replication on the slave.
This approach prevents a single query from holding back the entire replica. The issues would be how to resolve if a query fails (crashed table, aborted or other issue). My suggestion would be to write the query to a log file (or a failed queue) and state why the query failed. Should an error stop replication? Perhaps that’s a user setting and also depends on the failure type (i.e., crashed table). I’ve never liked that replication stops with aborted queries on the master. I understand the reasoning behind it, but an aborted query should warn the DBA and allow replication to continue… or least have a user configuration setting to let the DBA choose which to fail and which to skip.
The other issue is how to resolve dependencies between tables and data (table A requires data from table B, but only after table B has been updated in the proper order). On the master, though, there’s really no guarantee you’re going to get the most up-to-date data from a SELECT as an UPDATE query could follow right behind your SELECT. So, this may not be a major issue. As I said, I’d prefer to have multiple replication queries running on the slave simultaneous rather than as today (running them serially). Multithreaded replication ensures the slave is always up-to-date in a timely fashion.
Brian,
I guess a lot of talks are done about this multi thread replication but it is far from trivial when you want it to work for general case.
First failed queries – you can now set replication to skip errors but this is very dangerous thing to do, unless you know what exactly you’re speaking as it is very easy to get the copy which is completely out of sync with no error messages.
Now grouping queries by tables requires dependency tracking (though it is much easier with row level replication) and it also changes how data on the slave can look like.
Right now there is a consistence guaranty – the data on the slave always looks like data on the master sometime in the past. If you have multiple (per table per database etc threads) it is possible to see the data state which never existed on the master (when one thread runs ahead of the other)
There is solution to this too – to have syncronized commits which make data visible in batches to have things consistent but this requires multi-thread transactions and will not work with all storage engines.
Peter,
I think we need to define what ‘guaranteed consistency’ really means in terms of a slave. Does it mean making sure everything comes into the slave in the same exact order it was placed into the master’s binlog? Or, does it mean that each query gets properly inserted into the database successfully and timely? Or does it mean both?
Clearly, a query ends up in the master’s binlog based on, hopefully, when the query succeeds. However, that said, there appears to be no dependency tracking on the order in which the query was placed into the binlog. The order they are placed into the binlog, then, appears to be based strictly on the order of arrival + success. Clearly, doing a serial playback on the slave recognizes some semblance of order, but this is not optimized for speed or performance. It only takes one very long slow query to throw replication behind. The question then, is it more important to have replication out of sync for the sake of alleged consistency, or is it more important to keep replication up-to-date in spite of slow queries?
Granted, there might be other data behind a slow query that may be dependent on completion of that slow query. Again, this can be solved by grouping these dependent queries together behind the slow query and executing them in order. It’s fairly easy to recognize a dependent query simplistically.. 1) the query arrived after another query and 2) it references the same table as that previous query. Unrelated and non-dependent queries shouldn’t have to wait behind an extremely slow, unrelated query. Note that there may be system processes outside of the database that are dependent on that unrelated data that has been delayed. Is it right then, for alleged consistency purposes, to delay those non-related queries?
In other words, is the slave’s database considered consistent by being outdated compared to the master because of a slow query (or a series of them)? For this reason, I believe a redefinition of consistent is also in order when discussing a slave’s database.
Thanks.