Unfortunately it is not possible to point to a specific category of applications and say, “PBXT will be better here, so try it”. PBXT is a general purpose transactional storage engine, designed to perform well on a broad range of tasks, much like InnoDB. However, PBXT’s log-based architecture makes its performance characteristics different from both MyISAM and InnoDB/XtraDB. Tests show that PBXT’s performance is similar to InnoDB’s but, depending on your database design and the application, it can be faster.
PBXT is a community project and, of course, we depend on users trying it out. In the long run, this will determine to what extent we are able to continue to develop and improve the engine. So, despite this rather vague answer, we are hoping that more people try it out, and work with us to improve the engine as necessary. My thanks to all who are already doing this!
This is no longer necessarily the case. For example, a test (http://mysqlha.blogspot.com/2009/03/pbxt-is-fast-no-kidding.html) by Mark Callaghan shows that PBXT can actually outperform InnoDB on SELECTs under certain circumstances.
The implementation of full-durability has changed the performance characteristics of PBXT from “MyISAM-like” to more InnoDB-like. Originally PBXT was conceived as an engine that would be somewhere between MyISAM and InnoDB in both performance and features. The early version of PBXT was not fully durable (equivalent to innodb_flush_log_at_trx_commit=2).
A major change was completed at the beginning of last year with the implementation of full-durability. In doing this it was important to keep the log-based architecture which was the reason for the high write performance of earlier versions.
Traditional transactional implementations suffer from the problem that a backlog of asynchronous writes accumulates until it swamps the engine. There has been a lot of work on both InnoDB and XtraDB to solve this problem. The key words here are fuzzy and adaptive checkpointing (the former originally implemented by Heikki for InnoDB, and the latter an excellent addition to XtraDB).
Both methods improve the management of the asynchronous writes. The idea behind the log-based solution, on the other hand, is to avoid accumulating a backlog of asynchronous writes in the first place, by writing synchronously.
Although write performance is comparable with InnoDB, I am not entirely convinced that PBXT’s implementation of the log-based I/O is optimal at this stage. This is ongoing work for PBXT 1.5.
Morgan notes: As well as Adaptive Checkpointing, Oracle has also been working on Adaptive Flushing for the InnoDB Plugin. The engine being ’swamped’ problem that Paul is referring to is best described visually – see this post for more info.
If you read the white paper from 2006 (http://primebase.org/download/pbxt_white_paper.pdf) you will notice that the original design was uncompromisingly MVCC-based. Some of this has been changed to make PBXT more InnoDB-like, but other principles have remained.
Pure-MVCC does not do any locking. Read locks are not required because each transaction effectively gets its own snapshot of the database. And write locks are not acquired when updating. Instead, the application can hit an “optimistic lock error” if a record is updated by another user.
Now PBXT does acquire locks for 2 reasons: to support SELECT FOR UPDATE, and to avoid optimistic locking errors. This makes PBXT’s behavior identical to InnoDB in REPEATABLE READ mode.
On the other hand, there are currently no plans to implement InnoDB style “gap locking”. Gap locking effectively involves locking rows that do not exist. This, in turn, means that PBXT transactions are not SERIALIZABLE. A result of this is that statement-based replication is not supported by the engine.
Another hard decision was not to implement clustered indexes, which I discuss in more detail later.
A recent version of PBXT (1.0.09) supports the MySQL Backup API which was originally implemented in MySQL 6.0. This feature is now scheduled for an upcoming version of MySQL 5.4.
The Backup API makes it possible to pull a consistent snapshot of an entire database even when tables use different engine types. The API does not yet support incremental backup, but this is planned.
Internally this feature is implemented by PBXT using an MVCC-based consistent snapshot.
PBXT has several system threads that are responsible for various maintenance tasks. The most important of these are the “Writer”, the “Sweeper” and the “Checkpointer”.
No, currently it does not. This is one of the original design decisions (as raised by a previous question). Two things contributed to this decision:
PBXT uses 16K pages for the index data and (approximately) 32K pages for the table data. Both sizes can be set using compile time switches. However, if the index page size is changed, then the indices need to be rebuilt, which can be done by REPAIR TABLE. The table data page size does not require a rebuild because a page of records in the table is just a group of records (not an actual fixed length page).
If you are using InnoDB in REPEATABLE READ mode, then there is essentially no difference in the isolation paradigm between the two engines.
REPEATABLE READ is often preferred over SERIALIZABLE mode because it allows a greater degree of concurrency while still providing the necessary transaction isolation. So I do not consider the lack of serializability a serious deficit in the engine. And, fortunately, MySQL 5.1 supports row-based replication, which makes it possible to do replication while using REPEATABLE READ.
PBXT does use MVCC to do index scans. Basically this means that all types of SELECTs can be done without locking.
Morgan notes: Indexes not using MVCC is one of the main differences in the Falcon storage engine.
Firstly, PBXT does not acquire read locks. A normal SELECT does not lock at all. In addition, an UPDATE or DELETE only acquires a temporary row-lock. This lock is released when the row is updated or deleted because the MVCC system can detect that a row has been changed (and is therefore write locked).
This means that PBXT does not normally need to maintain long lists of row-level locks. This is also the case when a foreign key leads to cascading operations which can affect thousands of rows.
The only case you need to be aware of is SELECT FOR UPDATE. This operation acquires and holds a row-level lock for each row returned by the SELECT. These locks are all stored in RAM. The format is quite compact (especially when row IDs are consecutive) but this can become an issue if millions of rows are selected in this manner.
I should also mention that the consequence of this is that SELECT … LOCK IN SHARE MODE is currently not supported.
Yes, I think you have mentioned the most important criteria. What I can add to this list are 3 things that make developing a storage engine extremely demanding: performance, stability and data integrity.
Of course, as a DBA or database user these aspects are so basic that they are taken for granted.
But engine developers need to keep performance, stability and data integrity in mind constantly. The problem is, they compete with each other: increasing performance often causes instabilities that then have to be fixed. How to optimize the program without compromising data integrity is a constant question.
Relative to maintaining performance, stability and data integrity, adding features to an engine is easy. So I would say that these are the criteria that concern a developer the most.
Unfortunately the “war” continues. I have already received several e-mails that PBXT does not compile with the recently released MySQL 5.1.41!
Any dot release can lead to this problem, and I think PBXT is fairly moderate with its integration into MySQL.
My main advantage: I have been able to avoid modifying any part of MySQL to make the engine work. This means that PBXT runs with the standard MySQL/MariaDB distribution.
But this has required quite a bit of creative work, in other words, hacks.
One of the main problems has been running into global locks when calling back into MySQL to do things like open a table, create a session structure (THD) or create a .frm file.
One extreme example of this is PBXT recovery. When MySQL calls the engine “init” method on startup it is holding the global LOCK_plugin lock. In init, the engine needs to do recovery. In PBXT’s case this means opening tables (reading a .frm file), which requires creating a THD. The code to create a THD in turn tries to acquire LOCK_plugin!
Unfortunately a thread hangs if it tries to acquire the same mutex twice, so this just does not work!
We went through quite a few iterations (MySQL code was also changing during the development of 5.1) before we came up with the current solution: create a background thread to do recovery asynchronously. So the thread can wait for the LOCK_plugin to be unlocked before it continues.
The effect is that the init function returns quickly, but the first queries that access PBXT tables may hang waiting for recovery to complete.
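The shape of that workaround can be sketched in a few lines. This is only an illustration in Python, not PBXT's code: `plugin_lock` stands in for LOCK_plugin, and `recovery_done`, `engine_init` and `first_query` are hypothetical names. It shows why init cannot take the lock itself, and how a background thread lets init return while early queries wait.

```python
import threading

def simulate_startup():
    """Simulate init() being called while the plugin lock is held."""
    plugin_lock = threading.Lock()       # stand-in for LOCK_plugin (non-recursive)
    recovery_done = threading.Event()    # hypothetical "recovery finished" flag
    events = []

    def recover():
        # Blocks here until the caller of init() has released the lock.
        with plugin_lock:
            events.append("recovery ran")
        recovery_done.set()

    def engine_init():
        # Re-acquiring a non-recursive lock we already hold would hang, so
        # only probe it without blocking to show that it is unavailable.
        got_it = plugin_lock.acquire(blocking=False)
        events.append("lock free inside init: %s" % got_it)
        worker = threading.Thread(target=recover)
        worker.start()
        events.append("init returned")
        return worker

    def first_query():
        recovery_done.wait()             # early queries wait for recovery
        events.append("query ran")

    plugin_lock.acquire()                # the server holds the lock around init()
    worker = engine_init()
    plugin_lock.release()
    first_query()
    worker.join()
    return events

for line in simulate_startup():
    print(line)
```

The ordering is the point: init returns first, recovery runs only once the lock is free, and the first query runs only after recovery.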
No, this is not by design.
While I try to only add tuning parameters that are absolutely necessary, PBXT is not specifically designed to be self-tuning, because I believe that is a very hard problem to solve in general.
Tuning parameters are often added to an engine in response to performance problems in particular configurations. This is not necessarily a bad thing because it provides DBAs with the tools they need.
My goal for PBXT in this regard is twofold:
Morgan notes: There are more in InnoDB/XtraDB now than there were three years ago. This is probably something that emerges over time as we get to understand more about an engine.
Entry posted by Morgan Tocker | No comment
When this bug happens, you will see the MySQL error log flooded with messages like this:
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
091119 23:03:34 [ERROR] Error in accept: Resource temporarily unavailable
These messages keep coming and fill up disk space.
Depending on the case you may be able to connect to MySQL through Unix Socket or TCP/IP or neither.
It also looks like there is a correlation between having a lot of tables and this condition.
Previously I only ran into this issue in production, where we had to restart MySQL quickly; now I have a test MySQL instance showing the same behavior.
Here is what strace tells me:
[percona@test9 msb_5_4_2]$ strace -f -p 19229
Process 19229 attached with 23 threads - interrupt to quit
[pid 19286] rt_sigtimedwait([HUP QUIT ALRM TERM TSTP],
[pid 19285] futex(0x165962ec, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19284] select(0, NULL, NULL, NULL, {0, 765000}
[pid 19283] select(0, NULL, NULL, NULL, {0, 412000}
[pid 19248] futex(0x1781193c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19247] futex(0x178118bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19246] futex(0x1781183c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19245] futex(0x178117bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19244] futex(0x1781173c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19232] futex(0x1781113c, FUTEX_WAIT_PRIVATE, 165, NULL
[pid 19229] accept(16392,
[pid 19241] futex(0x178115bc, FUTEX_WAIT_PRIVATE, 319, NULL
[pid 19243] futex(0x178116bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19242] futex(0x1781163c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19240] futex(0x1781153c, FUTEX_WAIT_PRIVATE, 3, NULL
[pid 19239] futex(0x178114bc, FUTEX_WAIT_PRIVATE, 7, NULL
[pid 19238] futex(0x1781143c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19237] futex(0x178113bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19236] futex(0x1781133c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19235] futex(0x178112bc, FUTEX_WAIT_PRIVATE, 7, NULL
[pid 19234] futex(0x1781123c, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19233] futex(0x178111bc, FUTEX_WAIT_PRIVATE, 207, NULL
[pid 19231] futex(0x178110bc, FUTEX_WAIT_PRIVATE, 1, NULL
[pid 19229] <... accept resumed> 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] accept(16392, 0x7fffffac70b0, [18423225245113516048]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 19229] fcntl(16392, F_SETFL, O_RDWR) = 0
[pid 19229] fcntl(16392, F_SETFL, O_RDWR) = 0
[pid 19229] select(16394, [1025 1040 1042 1044
....
8 9709 9714 9716 9717 15368 15369], NULL, NULL, NULL*** buffer overflow detected ***: strace terminated
======= Backtrace: =========
/lib64/libc.so.6(__chk_fail+0x2f)[0x35f36e6aff]
/lib64/libc.so.6[0x35f36e5ad3]
strace[0x408b60]
As you can see, accept() is called on a pretty high socket number (16392), probably because of the large innodb_open_files value I tested with in this case; all of those descriptors can be in use during recovery.
Note that accept() on the socket fails with EAGAIN (which is not well described in the manual), and the process later issues a select() call with nfds=16394. This does not seem to be a healthy number to use with select(), and it is quite possible that something goes wrong because of it.
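One documented constraint that may be relevant here: select() works on fd_set bitmaps, which can only represent descriptor numbers below FD_SETSIZE. I am assuming the usual Linux value of 1024 below; I have not checked what this mysqld binary was built with, so treat this as a sketch of the check rather than a diagnosis. The helper name is made up for the illustration.

```python
FD_SETSIZE = 1024  # assumption: the usual value on Linux/glibc builds

def fds_outside_select_range(fds, fd_setsize=FD_SETSIZE):
    """Return the descriptors that cannot be represented in a select() fd_set."""
    return [fd for fd in fds if fd < 0 or fd >= fd_setsize]

# 1025 and 16392 appear in the strace output above; 16 and 1023 are
# illustrative low-numbered descriptors added for contrast.
print(fds_outside_select_range([16, 1023, 1025, 16392]))
```

If that assumption holds, every descriptor visible in the select() call above is out of range.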
If you have any ideas about what could be going wrong in this case, please share them.
Entry posted by peter | 2 comments
Unlike MyISAM, Innodb does not have to keep an open file descriptor when a table is open: an open table is a purely logical state, and the corresponding .ibd file may be open or closed. Furthermore, besides the MySQL table_cache, Innodb maintains its own cache (called the data dictionary), which keeps all tables accessed since server start. There is no variable to control its size, and it can take a significant amount of memory in some edge cases. Percona patches, though, provide innodb_dict_size_limit to restrict the growth of the data dictionary.
So I started with the same series of tests, creating 100.000 tables with a single integer column. The process of creating the tables took about 45 minutes, which is a lot more than with MyISAM, and the total size on disk was 12GB in .ibd files plus some space allocated in the system tablespace. So if you create Innodb tables you had better store some data in them, otherwise there will be a huge waste of space.
I used MySQL 5.4.2 for tests which should be as good as it gets in terms of optimizations in this space.
To keep the test aligned with my previous experiments I was running with table_open_cache=64 and tried innodb_open_files=64 and 16384.
Reading 100.000 tables for the first time after MySQL start takes about 500 seconds (some 200 tables/sec), because the first time Innodb actually populates the data dictionary. The second time, the same operation takes about 25 seconds (4.000 tables/sec), which is quite a difference. As we can see, even when a table is fully in the Innodb data dictionary, the operation is slower than with MyISAM tables, though the difference may be related to the on-disk size of the set of empty tables, which is about 10 times smaller for MyISAM.
I found no significant difference whatever the open files limit was, which is not surprising: the logical operation of opening a file is rather fast on a local file system, and one can open and close a file hundreds of thousands of times per second.
To verify this I tried the “open table” test for only 10K out of the 100K tables. The performance was about the same, taking 1 sec (on the second run). Whether innodb_open_files was 64 (virtually all misses) or 16384 (all hits), performance was the same.
As I mentioned, the data dictionary can take a considerable amount of memory. In my case, after reading all tables I got “Dictionary memory allocated 392029720”, which means each very simple table takes about 4KB of space in the data dictionary. More complicated tables can take a bit more.
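The per-table figure is just the reported total divided by the number of tables. The sketch below redoes that arithmetic and adds a linear projection to a larger table count; the projection is my assumption for illustration (real tables vary, and the total presumably includes some fixed overhead), not something I measured.

```python
DICT_BYTES = 392029720   # "Dictionary memory allocated" reported above
TABLES = 100000          # tables read in this test

def bytes_per_table(total_bytes, tables):
    """Average data dictionary footprint of one table."""
    return total_bytes / tables

def projected_dictionary_mb(tables, per_table_bytes):
    """Linear extrapolation; an assumption, since real tables vary in size."""
    return tables * per_table_bytes / (1024 * 1024)

per_table = bytes_per_table(DICT_BYTES, TABLES)
print(round(per_table))   # 3920 bytes, i.e. roughly 4KB
print(round(projected_dictionary_mb(1000000, per_table)), "MB for a million such tables")
```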
So innodb_open_files does not affect read performance much. What about writes? I again tried a very simple test, inserting a row into each of the 100K tables. This test took about 180 sec the first time and about 260 sec the second time (with innodb_flush_log_at_trx_commit=0), giving about 550 and 380 inserts/sec respectively. Why was the second run slower and not faster? Because on the second run there were a lot of dirty pages in the innodb buffer pool which had to be flushed before they could be recycled. The first run, however, was done with a clean buffer pool (after reading all tables once).
As with the select case, I could not see any measurable difference between the two tested innodb_open_files values.
Finally I decided to test the crash recovery – does it make any difference ?
The crash recovery in Innodb is nasty if you have a lot of tables:
091118 18:43:36 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Doing recovery: scanned up to log sequence number 12682768136
091118 18:47:44 InnoDB: Starting an apply batch of log records to the database...
If Innodb detects it was not shut down properly, it scans all .ibd files, which took a bit over 4 minutes for 100K tables but can obviously take a lot longer if there are more tables or they are less well cached than in this case. This part of recovery does not depend on the number of records which need to be applied, only on the number of tables.
With innodb_open_files=64 I got a bunch of warning messages during recovery:
091118 18:47:44 InnoDB: Warning: too many (67) files stay open while the maximum
InnoDB: allowed value would be 64.
InnoDB: You may need to raise the value of innodb_max_files_open in
InnoDB: my.cnf.
InnoDB: fil_sys open file LRU len 0
091118 18:47:44 InnoDB: Warning: too many (68) files stay open while the maximum
InnoDB: allowed value would be 64.
InnoDB: You may need to raise the value of innodb_max_files_open in
InnoDB: my.cnf
So we can see Innodb may wish to have more files open during the recovery stage, and it will open more files than allowed if needed.
Scanning the .ibd files and applying the logs together took about 9 minutes in this setup. This number can of course change a lot depending on hardware, log file size, workload and even when the crash happens (how many unflushed changes we had).
Repeating the test with innodb_open_files=16384 I got about the same crash recovery speed, though with no warnings.
So it looks like innodb_open_files=300 is not that large a liability even with a large number of tables, and you can also safely increase this number if you like: there are no surprises such as unexpected slowdowns when replacing open files in the list. I guess Heikki knows how to implement LRU in the end.
Entry posted by peter | 2 comments
We are pleased to present the 20th build of MySQL server with Percona patches.
Compared to the previous release, it has the following new features:
<rpm name>-<mysql version>-<percona build version>.<buildnumber>.<redhat version>.<architecture>.rpm
Example:
MySQL-server-percona-5.0.87-b20.29.rhel5.x86_64.rpm
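If you need to pick these fields out of a package name in a script, the template translates into a regular expression fairly directly. The pattern below is a sketch derived only from the template and the single example above; in particular, the "b" plus digits form of the Percona build version is an assumption, and the function name is made up for the illustration.

```python
import re

# Pattern derived from the naming template above. The "b<digits>" form of the
# Percona build version is an assumption based on the single example shown.
RPM_NAME = re.compile(
    r"^(?P<name>.+)-(?P<mysql_version>\d+\.\d+\.\d+)"
    r"-(?P<percona_build>b\d+)\.(?P<build_number>\d+)"
    r"\.(?P<redhat_version>[^.]+)\.(?P<architecture>[^.]+)\.rpm$"
)

def parse_rpm_name(filename):
    """Split a package file name into its template fields, or return None."""
    match = RPM_NAME.match(filename)
    return match.groupdict() if match else None

fields = parse_rpm_name("MySQL-server-percona-5.0.87-b20.29.rhel5.x86_64.rpm")
for key, value in fields.items():
    print(key, "=", value)
```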
See release notes for earlier changes.
Since build 20, MySQL server with Percona patches is available in the Percona RPM repository via YUM. To make it work, add a file Percona.repo in /etc/yum.repos.d with the following content
As usual you can download binaries and sources with the patches here
http://www.percona.com/mysql/5.0.87-b20/
A Debian packages repository is also available. See the release page for configuration and usage guidance.
The Percona patches live on Launchpad: https://launchpad.net/percona-patches and you can report bugs in the Launchpad bug system:
https://launchpad.net/percona-patches/+filebug. The documentation is available on our Wiki.
For general questions use our Percona-discussions group, and for development questions the Percona-dev group.
For support, commercial and sponsorship inquiries contact Percona.
innodb_rw_lock.patch
Entry posted by Aleksandr Kuzminsky | 2 comments
The "common sense" approach to tuning caches is to make them as large as you can if you have enough resources (such as memory). With MySQL, however, common sense does not always work: we've seen performance issues with a large query_cache_size, and sort_buffer_size and read_buffer_size may not give you better performance if you increase them either. I found this also applies to some other buffers.
Even with previous experience of surprising behavior, I did not expect such a table_cache issue: LRU for cache management is a classic, and there are scalable algorithms to deal with it. I would have expected Monty to implement one of them.
To do the test I created 100.000 empty tables containing a single integer column and no indexes, and then ran SELECT * FROM tableN in a loop. Each table in this case is accessed only once, and on any run but the first each access requires a table replacement in the table cache based on LRU logic.
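A workload like this is easy to generate with a small script. The sketch below is not the script I used; the column name `i`, the numbering from 1 and the explicit ENGINE clause are assumptions made for the illustration. It only produces the SQL text, which you can then pipe into the mysql client.

```python
def create_statements(count, engine="MyISAM"):
    """Yield CREATE TABLE statements for `count` empty single-column tables."""
    for n in range(1, count + 1):
        # Column name "i" is hypothetical; the test only needs one INT column.
        yield "CREATE TABLE table%d (i INT) ENGINE=%s;" % (n, engine)

def select_statements(count):
    """Yield one full-table SELECT per table, touching each table once."""
    for n in range(1, count + 1):
        yield "SELECT * FROM table%d;" % n

# Print a tiny sample; use count=100000 to reproduce the size of the test.
for statement in list(create_statements(2)) + list(select_statements(2)):
    print(statement)
```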
MySQL Sandbox helped me to test this with different servers easily.
I did test on CentOS 5.3, Xeon E5405, 16GB RAM and EXT3 file system on the SATA hard drive.
MySQL 5.0.85 created 100.000 tables in around 3 min 40 sec, which is about 450 tables/sec. This indicates that "fsync" is lying on this test system, as the default sync_frm option is used.
With the default table_cache=64, accessing all tables takes 12 sec, which is roughly 8300 tables/sec and a great speed. We can note significant writes to the disk during this read-only benchmark. Why? Because for MyISAM tables the table header has to be modified each time the table is opened. In this case the performance was so great because the data for all 100.000 tables (the first block of the index) was placed close together on disk as well as fully cached, which made updates to the headers very fast. In production systems with table headers not in the OS cache you will often see significantly lower numbers: 100 tables/sec or less.
With a significantly larger table_cache=16384 (and an appropriately adjusted number of open files) the same operation takes 660 seconds, which is 151 tables/sec, around 50 times slower. Wow. That is quite a slowdown. We can see the load becomes very CPU bound in this case, and it looks like some of the table_cache algorithms do not scale well.
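For reference, here is the arithmetic behind those rates, using only the timings quoted above. Rounding explains the small differences from the figures in the text.

```python
TABLES = 100000

def tables_per_second(tables, seconds):
    """Throughput of the open-table scan."""
    return tables / seconds

small_cache = tables_per_second(TABLES, 12)    # table_cache=64
large_cache = tables_per_second(TABLES, 660)   # table_cache=16384

print(round(small_cache))                 # 8333
print(round(large_cache))                 # 152
print(round(small_cache / large_cache))   # 55
```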
The absolute numbers are also very interesting: 151 tables/sec is not that bad in absolute terms. So if you tune the table cache in the "normal" case and are able to bring your miss rate (opened_tables) down to 10/sec or less by using a large table_cache, you should do so. However, if you have so many tables that you still see 100+ misses/sec, while your data (at least the table headers) is well cached so the cost of a table cache miss is not very high, you may be better off with a significantly reduced table cache size.
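To apply this rule of thumb you need the miss rate, which you can get from two samples of the Opened_tables status counter taken some seconds apart. The sketch below is only an illustration: the sample values are made up, the thresholds are the rough figures from the paragraph above rather than hard limits, and the "in between" answer is my own filler for the range the text does not cover.

```python
def table_cache_miss_rate(opened_before, opened_after, interval_seconds):
    """Misses/sec from two samples of the Opened_tables status counter."""
    return (opened_after - opened_before) / interval_seconds

def suggestion(miss_rate):
    # Thresholds are the rough figures from the text, not hard limits.
    if miss_rate <= 10:
        return "a large table_cache is paying off"
    if miss_rate >= 100:
        return "consider a much smaller table_cache if table headers are well cached"
    return "in between: measure both settings"

# Hypothetical samples taken 60 seconds apart.
rate = table_cache_miss_rate(1200000, 1209060, 60)
print(rate, "misses/sec:", suggestion(rate))
```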
The next step for me was to see if the problem was fixed in MySQL 5.1 - in this version table_cache was significantly redone and split in table_open_cache and table_definition_cache and I assumed the behavior may be different as well.
MySQL 5.1.40
I started testing with default table_open_cache=64 and table_definition_cache=256 - the read took about 12 seconds very close to MySQL 5.0.85.
As I increased table_definition_cache to 16384 the result remained the same, so this variable is not causing the bottleneck. However, increasing table_open_cache to 16384 causes the scan to take about 780 sec, which is a bit worse than MySQL 5.0.85. So the problem is not fixed in MySQL 5.1; let's see how MySQL 5.4 behaves.
MySQL 5.4.2
MySQL 5.4.2 has a higher default table_open_cache, so I took it down to 64 so we can compare apples to apples. It performs the same as MySQL 5.0 and MySQL 5.1 with a small table cache.
With table_open_cache increased to 16384 the test took 750 seconds, so the problem exists in MySQL 5.4 as well.
So the problem is real, and it is not fixed even in the performance-focused MySQL 5.4. As we can see, large table_cache (or table_open_cache) values can indeed cause significant performance problems. Interestingly enough, Innodb has a very similar task of managing its own cache of file descriptors (set by innodb_open_files). As time allows, I should test whether Heikki knows how to implement LRU properly so it does not have problems with large numbers. We'll see.
Entry posted by peter | 13 comments
While in Atlanta I'll be giving a talk at the Atlanta PHP User Group on Optimizing MySQL Performance (details to be posted to their website shortly). If you're in Chicago and would like me to speak at your group on 7th-8th December, let me know!
Entry posted by Morgan Tocker | No comment
What I didn't mention was that if you've established that you will eventually need to shard, is it better to just get it out of the way early? My answer is almost always no. That is to say, I disagree with a statement I've been hearing recently: "shard early, shard often". Here's why:
Or to phrase that another way:
I would never recommend sharding to a customer until I had at least reviewed their slow query log with mk-query-digest and understood exactly why each of the queries in that report was slow. While we have some customers who have managed to create their own tools for shard automation, it's always easier to propose major changes to how data is stored before you have a cluster of 50+ servers.
Entry posted by Morgan Tocker | 5 comments
$ make
g++ -g -O0 -c -Wall -fno-rtti -fno-exceptions -I/usr/include -I../../../../include -I../../../../storage/ndb/include -I../../../../storage/ndb/include/ndbapi ndbapi_scan.cpp
ndbapi_scan.cpp: In constructor ‘Car::Car()’:
ndbapi_scan.cpp:111: error: ‘memset’ was not declared in this scope
ndbapi_scan.cpp: In function ‘void drop_table(MYSQL&)’:
ndbapi_scan.cpp:124: error: ‘exit’ was not declared in this scope
ndbapi_scan.cpp: In function ‘void create_table(MYSQL&)’:
...
Hmmm, some header files are missing, let's add them
#include <string.h>
#include <stdlib.h>
With the added header files, the program compiles flawlessly. Let's try it!
$ LD_LIBRARY_PATH=/usr/local/mysql-cluster-gpl-7.0.8a/lib/mysql ./ndbapi_scan /usr/local/mysql-cluster-gpl-7.0.8a/mysql.sock 127.0.0.
Unable to connect with connect string: nodeid=0,127.0.0.:1186
Retrying every 5 seconds. Attempts left: 4^C
My bad... I made a typo in the IP of the MGM server. Easy to correct...
yves@yves-laptop:/opt/mysql-cluster-gpl-7.0.8/storage/ndb/ndbapi-examples/ndbapi_scan$ LD_LIBRARY_PATH=/usr/local/mysql-cluster-gpl-7.0.8a/lib/mysql ./ndbapi_scan /usr/local/mysql-cluster-gpl-7.0.8a/mysql.sock 127.0.0.1
MySQL Cluster already has example table: GARAGE. Dropping it...
MySQL Cluster already has example table: GARAGE. Dropping it...
MySQL Cluster already has example table: GARAGE. Dropping it...
MySQL Cluster already has example table: GARAGE. Dropping it...
^C
What's that? Wow... look at the create_table code...
void create_table(MYSQL &mysql)
{
  while (mysql_query(&mysql,
                     "CREATE TABLE"
                     " GARAGE"
                     " (REG_NO INT UNSIGNED NOT NULL,"
                     " BRAND CHAR(20) NOT NULL,"
                     " COLOR CHAR(20) NOT NULL,"
                     " PRIMARY KEY USING HASH (REG_NO))"
                     " ENGINE=NDB"))
  {
    if (mysql_errno(&mysql) != ER_TABLE_EXISTS_ERROR)
      MYSQLERROR(mysql);
    std::cout << "MySQL Cluster already has example table: GARAGE. "
              << "Dropping it..." << std::endl;
    /******************
     * Recreate table *
     ******************/
    drop_table(mysql);
    create_table(mysql);
  }
}
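That one is sweet, no need to be a C guru to catch it... it is fairly obvious where the problem is, isn't it? The recursive call recreates the table, and then the while loop tries to create it again, finds it already exists and drops it, forever. Let's remove the recursive "create_table" call.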
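To make the intended control flow explicit, here is a language-neutral sketch of the corrected logic in Python. `FakeServer` is a hypothetical in-memory stand-in for the cluster, and the `max_attempts` bound is my addition (the example program simply loops); it is not the NDB API example itself.

```python
class FakeServer:
    """Hypothetical in-memory stand-in for the MySQL server."""

    def __init__(self, tables=()):
        self.tables = set(tables)

    def create(self, name):
        if name in self.tables:
            return False          # plays the role of ER_TABLE_EXISTS_ERROR
        self.tables.add(name)
        return True

    def drop(self, name):
        self.tables.discard(name)

def create_table(server, name="GARAGE", max_attempts=3):
    """Create the table, dropping a stale copy first if needed.

    Returns the number of CREATE attempts that were required.
    """
    for attempt in range(1, max_attempts + 1):
        if server.create(name):
            return attempt
        server.drop(name)         # drop only; the loop itself retries the create
    raise RuntimeError("could not create %s" % name)

print(create_table(FakeServer({"GARAGE"})))   # 2: one failed create, one retry
print(create_table(FakeServer()))             # 1: table did not exist
```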
$ LD_LIBRARY_PATH=/usr/local/mysql-cluster-gpl-7.0.8a/lib/mysql ./ndbapi_scan /usr/local/mysql-cluster-gpl-7.0.8a/mysql.sock 127.0.0.1
Connecting...Connected!
Initializing NDB...Done
Looking for table GARAGE...Got it! = GARAGE
Getting the columns...
Done
Populating...populate: Success!
Scanning and printing...0 Mercedes Blue
12 Toyota Pink
5 BMW Black
7 BMW Black
3 Mercedes Blue
9 BMW Black
10 Toyota Pink
14 Toyota Pink
11 Toyota Pink
1 Mercedes Blue
13 Toyota Pink
6 BMW Black
2 Mercedes Blue
4 Mercedes Blue
8 BMW Black
scan_print: Success!
Going to delete all pink cars!
Deleting pink cars...No error
Done
0 Mercedes Blue
5 BMW Black
7 BMW Black
3 Mercedes Blue
9 BMW Black
1 Mercedes Blue
6 BMW Black
2 Mercedes Blue
4 Mercedes Blue
8 BMW Black
scan_print: Success!
Going to update all Blue cars to Black cars!
No error
0 Mercedes Black
5 BMW Black
7 BMW Black
3 Mercedes Black
9 BMW Black
1 Mercedes Black
6 BMW Black
2 Mercedes Black
4 Mercedes Black
8 BMW Black
scan_print: Success!
I know that these programs are just examples. Yet they are in some way part of the documentation, and the quality is, let's say, open to question.
Entry posted by yves | 4 comments
This HA solution is the easiest to implement and to manage. You basically need to set up MySQL replication between a master and one or more slaves. Upon failure of the master, one of the slaves is manually promoted to the master role and replication on the other slaves is re-adjusted to point to the new master. This solution works well with all the MySQL storage engines including MyISAM (NDB is a special case discussed later) but it suffers from the limitations of MySQL replication. The main limitation, in terms of HA, is the asynchronous design of MySQL replication, which does not allow the master to be sure the slave has been updated before returning from a commit statement. There is a window in time where it is possible that a fully committed transaction has not been pushed to the slave(s), leading to data loss. Many large websites that are fine with some data loss rely on replication for HA and for read scaling.
In addition to hardware failure, the level of availability of this solution is affected by the availability of the MySQL replication link between the servers. Replication often breaks for various reasons and, while replication is broken, there is no High-Availability. Also, the availability of this solution is affected by how far the slaves were behind the master when the outage occurred. So, if you want to have a good level of availability, you need a good monitoring and alerting system to react quickly to replication issues and you need a rather small write load so that the slaves do not lag behind the master too much. To maximize the level of availability, recovery should be automatic.
Apart from its simplicity, an HA solution based on replication has many interesting properties, so it is no wonder it is so popular. First, if the application is well designed and has specific database handles for read and write operations, this HA solution can scale read operations to a high level. Using the slaves for reads has a second interesting side effect: the caches of the slaves are hot, so failing over to a slave means no degraded performance associated with cache warm-up. Finally, it is well known that with MySQL, altering a table means recreating the whole table, and it is a blocking operation. Altering a large table may take many hours. The trick here is to run the ALTER TABLE on a slave and then, once done, let the slave catch up with the master using the new table schema, fail over to the slave and repeat the ALTER TABLE on the other server. Such online schema changes are easier when a master-to-master topology is used.
The following figure summarizes the simplest HA architecture using MySQL replication. All writes go to the master while reads are spread between the master and the slave. Upon failure of the master, replication is stopped on the slave and all traffic is redirected to the slave, which now handles reads and writes.

| Pros | Cons |
| --- | --- |
| Simple | Variable level of availability (98-99.9+%) |
| Inexpensive | Not suitable for high write loads |
| All the servers can be used, no idle standby | Read scaling only if application splits reads from writes |
| Supports MyISAM | Can lose data |
| Caches on failover slave are not cold | |
| Online schema changes | |
| Low impact backups | |
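
To put the availability range in the table into perspective, it helps to convert percentages into downtime per year. The sketch below is plain arithmetic on the two ends of the range quoted above, assuming a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24

def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100

for pct in (98, 99.9):
    print(pct, round(downtime_hours_per_year(pct), 2))
```

In other words, 98% allows a bit more than a week of downtime per year, while 99.9% allows less than nine hours.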
I already mentioned that for best HA levels, failover or recovery should be automatic. There are tools to manage automatic failover with replication like MMM, Flipper and Tungsten. Here, I will quickly describe the most popular one, MMM.
With MMM, you need to add a separate server, the Manager, that, as the name implies, manages the availability of the MySQL service. A high availability solution based on MMM requires at least 2 MySQL servers configured in a Master to Master topology. Additional slaves can also be added. An MMM agent runs on all the MySQL servers and is used to do OS level operations. The principle of operation of MMM is based on VIPs. There is one write VIP, where write operations are sent, and as many read VIPs as the number of MySQL servers. For the write VIP, MMM monitors the state of the current master and, upon failure, tries to kill all the connections to the failing server and transfers the write VIP to the other master. For the read VIPs, MMM monitors the state of the slaves and removes the read VIP of a slave if it has failed or is lagging behind the master by more than a defined threshold. One of the main limitations of MMM is its lack of fencing capability. It is important to stop all the connections to the failing master and, if that server is not responding, maybe because of a network problem, a stonith device must be used to fence it. I am far from being an expert with MMM, other guys on my team are way better than me, but I heard that the MMM v1 code base had some deficiencies. MMM v2 is a complete rewrite that addresses some of the shortcomings of v1. Walter Heck from OpenQuery gave an excellent webinar on it recently.
The architecture of a highly available setup using MMM and Master-Master replication is presented in the figure below. Apart from the minimum requirement of two MySQL servers replicating each other, there is a third server, called the manager, that controls both MySQL servers through an agent that is running on each server. The manager controls and monitors the state of the replication and assigns virtual IPs for specific roles. There is one VIP where write operations are sent and two or more VIPs where read operations are sent. If replication on one of the MySQL servers lags behind too much, its read VIP will be moved to another server.
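The decision the manager has to make can be sketched as a small function. To be clear, this is not MMM's code or configuration: the function name, the data layout and the 60-second threshold are all hypothetical, it treats the current writer as always eligible for reads, and it ignores the hard parts (killing connections, fencing). It only illustrates the rules described above.

```python
def assign_vips(servers, current_writer, max_lag_seconds=60):
    """Decide who holds the write VIP and which servers keep a read VIP.

    servers: {name: {"alive": bool, "lag": seconds behind its master}}
    Returns (writer, readers). Names and threshold are hypothetical.
    """
    healthy = [name for name, state in sorted(servers.items()) if state["alive"]]
    if not healthy:
        return None, []
    # Keep the write VIP where it is unless that master has failed.
    writer = current_writer if current_writer in healthy else healthy[0]
    # A read VIP is withdrawn from failed or badly lagging servers.
    readers = [name for name in healthy
               if name == writer or servers[name]["lag"] <= max_lag_seconds]
    return writer, readers

cluster = {
    "db1": {"alive": True, "lag": 0},
    "db2": {"alive": True, "lag": 300},   # lagging far behind db1
}
print(assign_vips(cluster, "db1"))
```

With these made-up values db1 keeps the write VIP and is the only server left serving reads, until db2 catches up.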

In conclusion, replication can be used in many cases to build effective and scalable highly available solutions, but it has some limitations. In my next blog post, I'll present another HA solution built around Heartbeat and DRBD.
Entry posted by yves | No comment