May 9, 2007

Merge Tables Gotcha

Posted by peter |

I had the interesting customer case today which made me to do a bit research on the problem.

You can create merge table over MyISAM tables which contain primary key and global uniqueness would not be enforced in this case, this is as far as most people will think about it. In fact however it is worse than that – if you have same key values in underlying base tables some of the queries may give you wrong results:
[read more...]

How to estimate time it takes Innodb to Recover ?

Posted by peter |

Today seems to be Innodb day in our Blog, but well this is the question which pops ups quite frequently in Innodb talks and during consulting engagements.

It is well known to get better performance you should normally set innodb_log_file_size large. We however usually recommend caution as it may significantly increase recovery time if Innodb needs to do crash recovery.

I’m often something like what recovery time would expect for 512MB total log file size. This is however question with no exact answer as there are many things which affect recovery time.

For example Paul mentioned YouTube has something like 4 hours recovery time even if logs are of reasonable size and wikipedia has 40 minutes or so with innodb_log_file_size=256MB. In other cases I know same log file time may have recovery time of 5-10 minutes.

To understand things affecting recovery time you better to understand how Innodb recovery works.

When Innodb recovers from logs it finds the last checkpoint time. Innodb has fuzzy checkpointing and yet there is particular log sequence number up to which all modifications are already reflected in the database. Changes which have longer log sequence number may be already in the database and may not be as buffer pool flushes are not predictable.

As Innodb is scanning log files from last checkpoint time it looks at records and checks if these are already applied to the pages (Innodb log format has page numbers stored together with operation info), if it has not it performs operation specified in the log applying changes in the buffer pool.

Such process means the following variables are important.

Checkpointing interval This is mainly controlled by size of log files as Innodb needs to flush database pages to the tablespace before it can overwrite matching log records in the log file (which is circular).

Size of log records While we speak about log file size it is number of page updates operations which is important and it can be different depending on your record size and workload. If you have tables with short rows your log records will likely be more compact and so same log length will contain more log records which will need to be replayed.

Data Access Locality This is another extremely important variable – if updates you’re performing are happening to some small amount of pages (for example you’re doing batch update for relatively small table) less IO will be required as pages already will be in the buffer pool. I think the reason why YouTube had so long recovery time is because their application was well optimized to accumulate updates in the cache and so having pretty scattered updates instead updating view counter for popular movie hundreds of times per second.

Database Size This comes back to the data locality but with same data access distribution the larger database you’ll have the more scattered records will be in the end thus larger database tend to recover longer.

Buffer Pool Size This affects recovery time from two parts aspects – first if you have small buffer pool you will have a lot of pages being flushed from buffer pool and so you may end up with flushes being more frequent. I’m not however exactly sure if checkpoints will become more intensive in this case. It also affects recovery speed during recovery time – the larger buffer pool you have the more pages it can hold and so less IOs will be needed. But even if you have enough buffer pool to cache all database it will not magically make recovery time instant because to populate buffer pool still a lot of random IO will be required.

Number of dirty buffers during the crash Even though this looks like something which has serious impact in reality it is not that significant. If checkpoints are not made more often Innodb will still need to scan same portion of log files and perform same amount of checks if page was already flushed since last checkpoint. True – smaller amount of pages will need changes to be applied to them, but this is usually minor part of the workload compared to anything else as page is already in the buffer pool at this stage. Of course it affects time needed to flush buffer pool when recovery is complete but again this does not normally take major part of recovery time.

It is worth to note Innodb optimizes log recovery by reading portions of log records and sorting them to apply them to the pages in more sequential order.

Besides log replay phase there is also phase of flushing dirty buffers to the disk and undo phase – rolling back uncommitted transactions. As MySQL 5.0 these seems to be done in the background.

In the future to increase recovery time even more I hope Heikki will enable redo and undo phases to be performed in the parallel, which should help a lot on systems with large number of hard drives.

So how do you estimate how long it takes Innodb to Recover ? Now you know there are multiple of variables which all may be hard to take into account on paper so in my experience best time is to run some benchmarks – apply the load you expect system to handle and crash Innodb few times to see how long it takes Innodb to recover. I’d recommend to crash by power cycling if possible because this makes sure all OS/RAID caches are clean when recovery happens and also may help you to catch write cache bugs.

MySQL Users Conference – Innodb

Posted by peter |

It might look like it is too late to write about stuff happened at Users Conference but I’m just starting find bits of time from processing accumulated backlog. The Theme of this Users Conference was surely Storage Engines both looking at number of third party storage engine presented, main marketing message – Storage Engine partnership with IBM, “Clash of Database Egos” – Storage Engine developers showcase as one of keynotes.

Today let me start with most popular transactional storage engine for MySQL – Innodb.

Innodb Storage Engine was covered in a lot of talks, many of them done by Innodb users. I found these items forth interest

Innodb Zip Page Compression This feature was in development for a while. I believe it is at least 3rd Users Conference it is being talked about, but now it is working enough to do some tests. There seems to be still fair amount of work required to make it work well such as having IO and uncompression happening in parallel, improving buffer pool usage (currently you may need to both store Compressed and Uncompressed versions of the page in the buffer pool in many cases).

I also should mention this feature is not that easy to use as it is implemented now – you have to specify compressed block size, this is the size to which you expect each page to compress. If you choose this number too high the compression ratio will not be good if it is too low there will be a lot of pages which can’t be compressed to specified page size and will have to consume more than one page thus reducing performance. This is especially the problem as you have to specify single page size for Innodb table, while data pages and pages for different indexes may well have very different compression ratio. As Innodb indexes are not prefix compressed they have very good compression potential.

I also would like to see some stats about compressed tables, such as what is the average size pages were compressed to for data and for various indexes. How many “fragmented” pages do we have in the index with current compression settings. The same utility could be used to advice on the compression page settings by looking at compressed page size distribution you can tell which page size will be optimal – will allow to get maximum compression while limiting amount of pages which do not fit in compressed space.

Fast Index Creation This is another feature which I was asking Heikki about probably since my first days as Innodb user when I found out ALTERing Innodb tables may take a lot of time. Fast Index creation will allow Innodb to build indexes by sorting rather by row by row insertions as it currently does which can be multiple orders of magnitude faster for large tables.

There is fair number of questions about this feature which I do not yet know answer for – for example will LOAD DATA INFILE be optimized same way as for MyISAM tables by separate phase of building Indexes ? Will be UNIQUE indexes built by sorting in case of ALTER TABLE (the big Gotcha for MyISAM tables). Are there any plans to be able to build indexes in parallel by running sorts in different threads – using multiple cores becomes increasingly important as their number per CPU just keeps doubling. It is also rather interesting how is sort file structured for building the indexes – does it uses dynamic or static record format ? (static record format for sort used by MyISAM may cause sort files to be much larger than table itself)

The index build speed is however not only reason I can’t wait for this feature to be available. Building index by sort allows to build them physically sorted so they should be much faster for IO Bound Index scans and large range scans. We surely should benchmark this aspect separately.

Improved Auto Increment handling Yet another known Innodb problem is table locks which are taken for statement duration for Inserts with auto-increment column. This starts giving you problems if you have a lot of concurrent inserts happening to the same table.

This was born same as bunch of others due to statement level MySQL Replication which among other requirements, needs all auto-increment values in multiple value insert to be sequential.

To be honest Heikki probably took a bit of shortcut and kept code simple and same for all insert cases – in fact if the insert is single value insert or the number of values in the bulk is known you can allocate all values at once when statement is started instead of handling auto increment lock for full statement duration.

The problem with multi value inserts however is the following – even if you know number of values in insert statement you do not know if all of them need autoincrement value allocated, some may already have values specified. I’m not sure if Heikki will find a way to count exact number of auto increment values needed or if we get behavior change by having potential “holes” when auto increment values are assigned but never used. There are probably similar problems in INSERT IGNORE and ON DUPLICATE KEY UPDATE cases.

Now In MySQL 5.1+ it is also possible to fix this problem for cases when number of rows in auto-increment batch insert is not known as we do not have to allocate sequential auto increment values if row level replication is used. This however will be again behavior change to watch out for.

So these are all possibilities but are we expected to have the problem fixed soon ? Yes indeed there is patch out where already which may come in MySQL 5.2 or even MySQL 5.1. I surely would like to see it sooner at least in more flexible community version.

Relaxing Locks Row level replication allows to get relaxed locks in many other cases. Gap or next-key locks which limit concurrency in many cases can be removed for READ-COMMITED isolation mode which will start offering more performance benefits in MySQL 5.1 It also should be possible to start unlocking rows if they do not match where clause (think for example about running not-indexed UPDATE which has to lock all rows in Innodb right now) I however did not get exact plans on this one.

In general I should mention Innodb is being blamed for a lot of “Gotches” while many of them come from MySQL limits at the time when Innodb was first implemented. Many other transactional Storage Engines do not have the problems because they target at least MySQL 5.1 with row level replication readily available.

Speaking about community items the following were most interesting:

Further Scalability improvements Yasufumi Kinoshita did number of interesting benchmarks stress testing other areas in Innodb than buffer-pool which had it locking relaxed in latest MySQL 5.0 Hopefully the patches he has provided will be merged soon if not we should test them and make binaries with them available. With current MySQL 5.0 I still see Innodb having serious scalability problems in some environments with 4+ cores so it is still important.

IO Scalability patches Mark Callaghan from Google
released so many MySQL patches so we can call patched version GoogleSQL. Among them there are number of patches to make Innodb IO scheduling more aggressive which is important for serious IO subsystems. Many Innodb IO settings were optimized on systems with 1-2 SATA drives so they are far from optimal for server grade IO subsystems.

These are for MySQL 4.0 at this point but we hope to port some of them to MySQL 5.0+

Heikki also mentioned there are more things inside Innodb to unleash performance such as multiple purge threads or parallel log recovery features, which he however needs to test properly before really enabling it.

Innodb IO Tuning I should also mention Paul Tuckfield Keynote about Scaling Innodb at YouTube. There were many other interesting points in his talk besides Innodb but he also very well described (with pictures) while RAID10 is better than RAID5, why large stripes are better and why battery backed up cache is so important. I also usually speak about these in my talks but I’m very bad with pictures.