My first thought was that the customer had deleted from the table, which leaves "holes" in the middle of it and prevents concurrent inserts. (You can configure the server to permit concurrent inserts even when there are holes, but it's disabled by default.) However, that turned out not to be the cause; the table was only inserted into (and selected from). Instead, the blocked statements were because of INSERT... SELECT statements that were running against the table, selecting data from it and inserting into another table.
Let's look at what happens here: suppose you have two tables, tbl1 and tbl2, and concurrent inserts into tbl2 are running fine. If you now run an INSERT... SELECT that selects from tbl2 and inserts into tbl1,
the concurrent inserts into tbl2 can block. This happens if you have the binary log enabled. If you think about it, this makes sense and is correct behavior. The statements have to be serialized for the binary log; otherwise replaying the binary log can result in a different order of execution.
The MySQL manual actually says this, but not in the clearest way. It just says:
"If you are using the binary log, concurrent inserts are converted to normal inserts for CREATE ... SELECT or INSERT ... SELECT statements."
If you use mysqladmin debug, you'll see an ordinary SELECT gets a lock on the table like this:
But on INSERT...SELECT, you'll see this:
That read lock is what's blocking the concurrent inserts from happening.
There's no solution to this, if you need the binary log enabled. (It needs to be enabled for replication.) There are workarounds, though. You can use the old trick of SELECT INTO OUTFILE followed by LOAD DATA INFILE. You can use InnoDB instead. Or you can do something more elaborate and application-specific, but that's a topic for another post.
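The SELECT INTO OUTFILE workaround can be sketched as two separate statements instead of one INSERT... SELECT. This is only a sketch: the table names follow the tbl1/tbl2 example above, while the helper name and the file path are hypothetical. The dump step is a plain SELECT, so it should not take the read lock that blocks concurrent inserts; the file is written and read on the database server host.

```python
def outfile_workaround(source_table, target_table, path):
    """Build the two statements that replace a single INSERT ... SELECT.

    Hypothetical helper: it only builds SQL strings; running them is
    left to whatever MySQL client you already use.
    """
    # Step 1: dump the rows to a file on the server with a plain SELECT.
    dump_sql = f"SELECT * INTO OUTFILE '{path}' FROM {source_table}"
    # Step 2: load that file into the target table.
    load_sql = f"LOAD DATA INFILE '{path}' INTO TABLE {target_table}"
    return [dump_sql, load_sql]


# The path below is an assumption for illustration only.
statements = outfile_workaround("tbl2", "tbl1", "/tmp/tbl2_dump.txt")
for sql in statements:
    print(sql + ";")
```

Keep in mind that the two steps are no longer one atomic statement, so rows inserted into the source table between the dump and the load are not copied.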
Entry posted by Baron Schwartz | 3 comments
I am extremely happy to hear this news! This is good for MySQL as a company, MySQL customers and MySQL users.
I'm hoping Community feedback was a serious contributor to this decision, though I know there were a lot of internal discussions as well. In any case this sends a great message to the community - speak up and you may be heard.
I also hope Marten Mickos took this decision because he was convinced rather than because he got an order from the top, as this is only one battle in the "what is going to be open source" war.
Anyway, thank you to everyone who made this happen, in particular Monty, who I know fought a lot for this.
P.S. This is great news but I'd like to see and know more. It looks like the server is being left alone as Open Source, but what about MySQL WorkBench, MySQL Proxy Extensions and MySQL Monitoring software?
Entry posted by peter | 2 comments
What are we expecting from a MySQL Performance Engineer?
This is a remote position which is open to candidates worldwide. We also do not require perfect spoken English for this position, though the candidate should be able to read and write English pretty well.
If you are such a person or know someone who might be interested, drop us a note.
You can learn more about our company at Percona web site.
Entry posted by peter | No comment
But let's first start with a feature request for the InnoDB team: all the ways I mention here are hacks and they can't be as efficient as native support. It would be great if InnoDB implemented a command to preload a table into the InnoDB buffer pool, which would simply go through the .ibd file sequentially and inject pages into the buffer pool. This would let the preload run as a sequential file scan even if the indexes have suffered a lot of page splits.
Now let's continue to the hacks.
As I mentioned, you can load an InnoDB table's clustered index into the buffer pool pretty efficiently by using something like SELECT count(*) FROM tbl WHERE non_index_col=0. This works relatively well (though it can be slow for fragmented tables) but it preloads neither the secondary indexes nor externally stored objects - BLOB and TEXT fields.
If you would like some non-PRIMARY indexes preloaded, you can use something like SELECT count(*) FROM tbl WHERE index_col LIKE "%0%" for each index. One such query per index is enough even if it is a multiple-column index.
To fetch externally stored BLOB/TEXT columns you can use a similar query: SELECT count(*) FROM tbl WHERE blob_col LIKE "%0%". Note that if you are preloading BLOB/TEXT columns you do not need the first query I mentioned, because scanning potentially externally stored blobs will also scan the clustered key anyway.
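The three kinds of preload query above can be generated from a short description of the table. This is a minimal sketch, not a tool: the function name and its parameters are hypothetical, and it only builds the query strings described in the text. It skips the clustered index query when BLOB/TEXT columns are preloaded, since that scan covers the clustered key anyway.

```python
def preload_queries(table, non_index_col=None, index_cols=(), blob_cols=()):
    """Build the preload queries for one table.

    Hypothetical helper. index_cols holds one column per secondary index;
    blob_cols holds the BLOB/TEXT columns to pull into the buffer pool.
    """
    queries = []
    # The clustered index scan is only needed when no BLOB/TEXT scan runs,
    # because the BLOB/TEXT scan reads the clustered key as well.
    if non_index_col is not None and not blob_cols:
        queries.append(
            f"SELECT count(*) FROM {table} WHERE {non_index_col}=0")
    # One query per secondary index is enough, even for multi-column indexes.
    for col in index_cols:
        queries.append(
            f'SELECT count(*) FROM {table} WHERE {col} LIKE "%0%"')
    # One query per externally stored BLOB/TEXT column.
    for col in blob_cols:
        queries.append(
            f'SELECT count(*) FROM {table} WHERE {col} LIKE "%0%"')
    return queries


queries = preload_queries("tbl", non_index_col="non_index_col",
                          index_cols=["index_col"], blob_cols=["blob_col"])
for q in queries:
    print(q + ";")
```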
Now, say you have a bunch of tables with a few indexes each - should you run multiple queries in parallel to get the best preload speed?
It depends - depending on key/clustered key fragmentation it may be faster to run the queries one by one (keeping IO more sequential) or to run multiple queries at once to get more outstanding requests at the same time - benchmark to find out.
If you just need to preload a single large table you can chop it into several ranges and preload them in parallel, such as SELECT count(*) FROM tbl WHERE id BETWEEN 1 AND 10000000 AND non_index_col=0
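Chopping a table into ranges can be sketched as follows. The helper below is hypothetical and assumes an integer primary key named id with known minimum and maximum values; it only builds the per-range queries, each of which you would then run on its own connection.

```python
def range_preload_queries(table, min_id, max_id, chunks,
                          non_index_col="non_index_col"):
    """Split [min_id, max_id] into at most `chunks` ranges, one query each.

    Hypothetical helper; assumes chunks >= 1 and min_id <= max_id.
    """
    total = max_id - min_id + 1
    step = -(-total // chunks)  # ceiling division: ids per range
    queries = []
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        queries.append(
            f"SELECT count(*) FROM {table} "
            f"WHERE id BETWEEN {start} AND {end} AND {non_index_col}=0")
        start = end + 1
    return queries


range_queries = range_preload_queries("tbl", 1, 40000000, 4)
for q in range_queries:
    print(q + ";")
```

Equal-width id ranges only give equal amounts of work if the ids are spread evenly, so with large gaps in the key some ranges will finish much sooner than others.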
Entry posted by peter | 6 comments
To do it I run the query: "SELECT count(*) FROM tbl WHERE non_idx_col=0". I use this particular form of query because it will do a full table scan - running count(*) without a WHERE clause may pick some small index to scan instead.
If your table is not fragmented one of two things should happen - either you should be reading at your hard drive's sequential read rate, or you would see MySQL becoming CPU bound if the IO subsystem is too fast.
In this case however I saw neither - vmstat showed a read speed of less than 10MB/sec, which is very low for this system, which had six 15K SAS hard drives in RAID10.
Another indication of bad fragmentation was the average IO size seen in SHOW INNODB STATUS output. It was around 20KB, which means most reads are single-page (16KB) reads. With a non-fragmented table you would see InnoDB sequential read-ahead kick in, which does reads in 1MB blocks, and so you would see an average IO size in the hundreds of KB.
Now it is worth noting that you can see poor sequential scan performance even if the table is not logically fragmented and InnoDB is reading data in large blocks - this can happen when the InnoDB table file is itself fragmented.
To check if this is the case I usually do "cat table.ibd > /dev/null" and watch IO statistics. If you see small IO request sizes in iostat and a read speed well below what the drives can deliver, the file itself is fragmented. For the customer in question I saw a file read speed of about 50MB/sec, which is of course much better than 10MB/sec but well below the RAID array's capacity.
To check whether file fragmentation is the issue or whether it is a poor or misconfigured IO subsystem, I do another check by running cat /dev/sdb1 > /dev/null - a physical hard drive should never suffer fragmentation, so you get as much sequential IO as the device can deliver (for the IO pattern "cat" uses). In this case I got about 300MB/sec, which confirmed file fragmentation is also an issue.
Interestingly enough, the "cure" for both fragmentation issues is the same - OPTIMIZE TABLE tbl. This command recreates the table by writing a new .ibd file (if you're using innodb_file_per_table=1), which normally would be much less fragmented because it is written at once. Too bad, however, that it requires the table to be locked while it is being rebuilt, and it really only defragments the clustered key but not the secondary indexes.
P.S. It would be cool to get fragmentation statistics for InnoDB objects (data and indexes), which actually should not be that hard to implement.
Entry posted by peter | 9 comments
But the thing a lot of people miss is - being scalable is not enough - you need to scale from a reasonable base to claim good performance, and this is where the T2000 performs subpar in many cases.
I often hear people complaining that queries take much longer on the T2000 compared to recent Intel or AMD CPUs when there is no concurrent load - it is reported the T2000 can be as much as 5-15 times slower in this case, depending on the workload.
Here is an example run of the purely CPU consuming "Benchmark" function for a 2.6GHz Intel Xeon vs the T2000:
As you can see this is a hell of a lot of difference!
Depending on your application, single-thread performance may or may not be important for you - it is surely important for the slave if you have active replication, if you're running time-sensitive long-running CPU-bound queries, or if queries contribute significant time to generating a web page.
For example, if on the Xeon queries take 50ms to generate the page, the MySQL latency you see on the T2000 may be as high as 500ms, which would be well above performance guidelines for many web applications.
I'm hearing Sun is working on new CPUs which would offer significantly higher single-thread performance, but at this time I have to be very careful about advising this platform to customers.
Entry posted by peter | 12 comments
Sure they do. There are people searching for advice who find it on MySQL Performance Blog and so do not need to purchase any commercial consulting. However these people are not likely clients in the first place - if not on our blog they would find the information in the MySQL Manual or on thousands of other sites, or maybe they would read some books.
Seriously speaking, for the vast majority of problems you can run into there is information out there - you only need to find it, filter out what is authoritative and matches your case, and then figure out how to apply it.
In general "Googling" usually works for simple issues which can be solved by applying ready recipes, for example basic MySQL configuration tuning. If the problem you're having is complex, it may require experience and background to quickly find the right solution.
We know extensive knowledge is important for complex cases. If we get a sore throat (a simple problem) we just go to the store and get an over-the-counter drug. However if we get red dots on the skin of a kind we have not seen before, we would likely go to the doctor for advice. There are many things which cause such symptoms and special skill is required to come up with a diagnosis. There is surely a lot of information on the Internet which would fit the description "red dots on the skin", but special skill is required to tell what you're dealing with.
Happily computers are not humans and they do not break that badly, so you can try different "medicine" for your problem, though you still are unlikely to be able to discover and try all possible recipes.
Besides experience you really need constant practice. Doing MySQL optimization I have to keep thousands of facts in my head which apply to the area. Many of them are complex, something like "Innodb uses row level locks, unless you're doing insert in table with auto increment which uses table level auto increment locks unless you're running MySQL 5.1 with..." To analyze problems efficiently you need your brain to have these facts available and apply them automatically, to provide you with a "blink", intuitive decision regarding what the likely cause of the problem is or what the best decision would be.
If you sat down and read the MySQL manual you would get a large number of these facts into your head; however, without constant use they get quickly forgotten. This is another reason why having information available does not create competition.
These arguments, though, apply to good consultants - skilled and experienced. You may not be hiring a consultant but a robot which executes "internal instructions" when working on your issue. In this case having such instructions public can surely hurt business badly. In other cases the value could be access to information which is not easily available, though I think that does not apply to the industry of Open Source software consulting we work in.
In general I would encourage all consultants to write more about their experience and not be scared of diminishing their value by disclosing secrets. It does not happen, and the return from publishing is so great!
Entry posted by peter | 8 comments
Enjoy!
Entry posted by peter | One comment
I personally think DRBD has its place, but there are far more cases when other techniques would work much better for a variety of reasons.
First let me start with Florian's comments on the issue, as I think they are the most interesting ones.
First let's get to the point of what we're comparing here - it is mainly DRBD versus MySQL Replication based techniques (let's leave MySQL Cluster and Continuent alone for a while as these are in a bit of a different league). The question is not whether DRBD is better than SAN - it offers more independence compared to SAN and in my view is surely superior from an HA point of view, but this is not the point of comparison.
“Failback could destroy the original master too”, however, is plain false. DRBD won’t “destroy the original master” any more than it already was if the filesystem on top of DRBD was fried beforehand.
Let us again compare MySQL Replication to DRBD in this case - in both cases, for some reason, you can have systems run out of sync and have conflicting updates applied to them. With DRBD your choice is killing one of the nodes and re-syncing it from the other one, while with MySQL Replication you can use Maatkit to merge the changes after all, and you can also review the binary logs to see which updates were applied to the different nodes.
Transaction log replay, yes. But fsck? These days this amounts to running a journal replay. Takes under a second in most circumstance
I would put it at 10 seconds, but it does not matter. The transaction log replay is likely to take much longer than that. This is a very bad property of DRBD - besides the well understood overhead of committing on both nodes instead of one, you also meet a tough choice - you've got to pick either long recovery time or further degraded performance. In the large databases I run in production relying on MySQL Replication for HA I often have 15+ minutes of InnoDB transaction log replay, which would be a huge bummer with DRBD.
I would also say this implies a hidden danger - the time it will take your database to do transaction log recovery is invisible until you get a crash, meaning that if your production database size grows, the load changes, or you happen to have a failure during activity of a certain kind, recovery might take much longer than expected. Recovery time depends on a lot of variables.
The side question here is of course the fact that you have to be picky about the storage engines you're using - DRBD does not work with MyISAM (check required), so you need processes to ensure your application does not use this storage engine, which may be hard to guarantee in many environments where development has too much autonomy.
I must note that in this aspect, however, DRBD is on par with MySQL Statement Based Replication - it is also all too easy to use MySQL features which break replication.
The failover node is a hot standby, it’s just not a running slave node from the database’s standpoint. And, nothing stops you from running two databases on two servers on two DRBD devices laid out in a “criss-cross” fashion, converging on one node in case of node failure.
This actually goes to two topics. First - hot vs cold. If you're using decent hardware and care about performance you use O_DIRECT with InnoDB, which makes it bypass the file cache. If you do, the DRBD slave will be fully cold. But let's assume you're ready to pay yet another penalty DRBD introduces and do not use this option, wasting memory and CPU cycles on double caching. Even in this case the DRBD slave node can't be called hot, because the write load often does not touch the same data as the read load. Here is a simple example - assume you're inserting data at the same time as running reporting queries on the last month. All of last month will be hot on a slave which is doing reads, but only the last few hours will be hot on the standby box.
Running two instances on a server allows you to reduce hardware waste with DRBD, though not eliminate it, because you get some disks which you can't really use for anything other than HA. Two instances also complicate things - depending on the infrastructure this can be seen as almost no complication or quite a serious complication.
"Cannot do maintenance on cold standby database."
But you can do anything you want with a database that you run off a DRBD LVM snapshot. Works on a Secondary node too.
I'm not sure Florian understood what was meant here. With MySQL Master-Master replication I can add an index on the passive node, wait for it to catch up and switch the roles (see another post). You can't really do this with DRBD, as this requires a logical level of operation to work.
This is not to mention other things you can do with MySQL Replication, such as filtered replication or cross storage engine replication, though these are not typically used for HA purposes. Time-delayed replication, however, is something quite helpful for some environments, though DRBD could also be extended to support it if needed.
Now, do not get me wrong, DRBD is great, and thanks to Florian for following up and making sure myths about DRBD do not spread too wide.
So when would I recommend using DRBD with MySQL?
There are some good reasons to use DRBD with MySQL, though as I mentioned I do not view it as a first choice solution.
First, it is a good choice for organizations which are used to SAN based high availability solutions with active-passive management software. Quite often these guys are familiar with such an HA concept, and it would be very natural for them to use the same approach for MySQL as they use for PostgreSQL, for example, instead of investing time to learn about MySQL Replication, or they are just looking to keep the MySQL infrastructure as close as possible to the one for the other databases in use.
Second - it is often the inevitable choice when you can't afford to lose any transaction - period. Some people would rather accept a longer failover time (as with DRBD) than have lost transactions, which may happen with async replication. Another similar case is when you're looking for ensured consistency - MySQL Replication can get out of sync, and there is a bunch of tips in the documentation on how that can happen. With DRBD the chance of nodes running out of sync is minimal and can be caused by software and hardware bugs rather than known limitations.
How many cases these reasons cover is arguable and depends on the systems you spend most of your time working with. Some people mainly deal with systems which can't accept any transaction loss, and for these DRBD often comes as a first choice. If you have more experience with traditional web shops, these usually would prefer to lose one user comment a year instead of paying the extra performance costs.
It is worth noting DRBD also allows building very nice mixed environments with MySQL - for example you can replicate binary logs using DRBD, so if the master node fails you have not lost transactions - then you can use such logs to do point in time recovery or to catch up the last few transactions not yet committed on the slave. We should spend some time implementing such a script sometime, which could help to get the best of both worlds.
But currently - You can't have it all
The state of High Availability solutions for MySQL these days is - you can't have it all. There is no Open Source solution out there which would offer you full redundancy, use of both nodes at least for reads, no transaction loss and automated failover. Whether you're using MySQL Cluster, DRBD or MySQL Replication, you have to make some compromises.
Entry posted by peter | 11 comments
In short, a query that could take several days to run with one join order takes an hour with another, and the optimizer chose the poorer of the two join orders. Why is one join order so much slower than the other, and why did the optimizer not choose the faster one? That's what this post is about.
Let's start with the MySQL query optimizer. The optimizer tries to choose the best join order based on its cost metric; it tries to estimate the cost for a query, then choose the query plan that has the lowest cost. The unit of cost for the MySQL query optimizer is a single random 4k data page read. In general, it's a pretty good metric, but it has one major weakness: the server doesn't know whether a read will be satisfied from the operating system cache, or whether it'll have to go to disk. (This distinction is abstracted away by the storage engine; the optimizer doesn't know how a given storage engine stores its data).
I'll try to omit the details and keep this clean. Let's take a look at the tables.
It's a big fact table and two fairly small dimension tables, which is normal. Here is the query:
There are indexes on all the columns in all the ways you'd expect: all the dimension columns are indexed on every table, and there's a separate index on every column in the WHERE clause. Here's the query plan initially.
This query will run for days and never complete. No one ever let it finish to see how long it would run.
How do I know it will run for days? Here's my train of thought:
Ouch! That's slow.
Now let's look at the alternative: table-scan the fact table, and do index lookups in the two dimension tables. MySQL doesn't want to choose this join order, so we'll force it with STRAIGHT_JOIN:
As we saw in the previous post, which I linked at the top of this post, we can scan the fact table in less than an hour. And it turns out that joining to the dimension tables doesn't slow the query perceptibly, because these tables are small and they stay in memory, in the OS cache. (They don't get evicted from memory by the cache's LRU policy, because they are frequently used -- once per row in the fact table. The LRU policy evicts old blocks from the fact table instead, which is perfect -- these blocks are used only once and not needed again, so they can be replaced).
The difference between the two queries -- 55 minutes and 2.6 days -- is basically the difference between scanning data sequentially on disk and random disk I/O.
So now you know why one join order is faster than the other. But why didn't the optimizer know this, too? The optimizer does know that random access is slower than sequential access, but it doesn't know that the dimension tables will stay in memory, and this is an important distinction.
Let's put ourselves into the mindset of the optimizer. We'll assume that every join to the dimension tables will go to disk instead of being read from cache. Now the STRAIGHT_JOIN becomes a table scan of about 4.3 million sequential reads (150 million rows * 117 bytes per row / 4096 bytes per read), plus about 150 million random I/Os for the first dimension table, plus 150 million random I/Os for the second dimension table. That's 300 million random I/O operations.
In contrast, the optimizer chose a plan that it thought would cause only 11.3 million random I/O operations.
The optimizer was being smart, given its lack of knowledge about the OS cache. This is why an expert is sometimes needed to provide the missing information. If the MySQL optimizer were right and each of these had to go to disk, our STRAIGHT_JOIN plan would take more than a month to complete! Good thing we know the difference between cache and disk!
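The "more than a month" figure can be checked with a rough back-of-the-envelope calculation. This sketch assumes that the 2.6 days quoted above corresponds to the 11.3 million random I/O operations of the plan the optimizer chose, and that an uncached STRAIGHT_JOIN would do its 300 million random I/Os at the same rate; both are assumptions made for illustration, not measurements.

```python
# Figures taken from the text above.
CHOSEN_PLAN_RANDOM_IOS = 11.3e6       # random I/Os in the optimizer's plan
CHOSEN_PLAN_DAYS = 2.6                # estimated run time of that plan
STRAIGHT_JOIN_RANDOM_IOS = 2 * 150e6  # one lookup per fact row, per dimension

SECONDS_PER_DAY = 86400

# Random I/O rate implied by the figures (assumption: they describe the
# same plan on the same hardware).
ios_per_second = CHOSEN_PLAN_RANDOM_IOS / (CHOSEN_PLAN_DAYS * SECONDS_PER_DAY)

# Time the STRAIGHT_JOIN plan would need if every dimension lookup
# really went to disk, as the optimizer has to assume.
uncached_days = STRAIGHT_JOIN_RANDOM_IOS / ios_per_second / SECONDS_PER_DAY

print(f"implied rate: {ios_per_second:.1f} random I/Os per second")
print(f"uncached STRAIGHT_JOIN: {uncached_days:.0f} days")
```

Under these assumptions the uncached plan lands well beyond a month, which is why the optimizer's choice was reasonable given what it knew.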
Entry posted by Baron Schwartz | 13 comments