May 29, 2006

Join performance of MyISAM and InnoDB

Posted by peter

We had a discussion today that involved benchmarks of join speed for the MyISAM and InnoDB storage engines under a CPU-bound workload, that is, when the data is small enough to fit in memory and therefore in the buffer pool.

I tested a very simple table with about 20,000 rows on 32-bit Linux. The columns "id", "i" and "c" were populated with the same integers, so the same job can be done using different kinds of columns: the primary key, an indexed integer column and an indexed char column. The query is also trivial. The point was to make sure it is not an index-covered query, so it has to read the rows, and that it does not return many rows. I varied the join clause to use the id, i and c columns respectively.

SQL:
  CREATE TABLE `t1` (
    `id` int(10) UNSIGNED NOT NULL DEFAULT '0',
    `i` int(10) UNSIGNED NOT NULL DEFAULT '0',
    `c` char(15) DEFAULT NULL,
    `pad` char(8) DEFAULT NULL,
    PRIMARY KEY  (`id`),
    KEY `i` (`i`),
    KEY `c` (`c`)
  ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

  SELECT count(t1.pad),count(t2.pad) FROM t1,t1 t2 WHERE t1.id=t2.id;
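
The other two variants are not shown above. A plausible sketch, assuming only the join column changes and that the InnoDB run used the same table converted in place, might look like this (the exact statements used in the test are an assumption):

```sql
-- Assumed form of the "I" test: join on the indexed integer column.
SELECT count(t1.pad), count(t2.pad) FROM t1, t1 t2 WHERE t1.i = t2.i;

-- Assumed form of the "C" test: join on the indexed char column.
SELECT count(t1.pad), count(t2.pad) FROM t1, t1 t2 WHERE t1.c = t2.c;

-- Assumed way to get the InnoDB numbers: convert the same table and rerun.
ALTER TABLE t1 ENGINE=InnoDB;

-- EXPLAIN shows which index the join uses; since `pad` is not part of
-- any secondary index, the rows themselves have to be read.
EXPLAIN SELECT count(t1.pad), count(t2.pad) FROM t1, t1 t2 WHERE t1.i = t2.i;
```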

The results I got are as follows:

Storage Engine   ID      I       C
MyISAM           0.24s   0.27s   1.19s
InnoDB           0.07s   0.30s   0.38s

As you can see, in these circumstances InnoDB is actually faster than MyISAM in 2 cases out of 3. I guess the reasons are the following:

  • InnoDB primary key joins are very fast because the data is clustered together with the index, and this path is generally highly optimized
  • InnoDB builds hash indexes, which speed up index lookups by bypassing the BTREE index and using the hash, which is faster
  • MyISAM compresses character keys, which makes random lookups slower
  • MyISAM generally has lower processing overhead due to its simplicity
  • MyISAM is still a bit faster for the primary key join than for the secondary key join. I guess this is because it knows for sure that no more than one row matches the index, so MySQL does not need to request the next matching row
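
These are guesses, and two of them can be probed directly. The sketch below is one way to do it, not something that was run for this post. It assumes the `PACK_KEYS` table option for MyISAM and a server version that exposes the `innodb_adaptive_hash_index` variable (older servers may not have it):

```sql
-- Rebuild the MyISAM table without packed (compressed) keys,
-- then rerun the join on the c column and compare the timing.
ALTER TABLE t1 PACK_KEYS=0;

-- On the InnoDB copy of the table, switch the adaptive hash index off
-- (assumes the variable exists in your version) and rerun the joins.
SET GLOBAL innodb_adaptive_hash_index = OFF;
```

If the char key join on MyISAM gets noticeably faster with `PACK_KEYS=0`, that supports the key compression explanation.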

Note: This applies to a CPU-bound workload with all content fitting in memory. In other cases the situation is very different, and MyISAM compression for char keys can frequently have a positive impact on performance.

INSERT ON DUPLICATE KEY UPDATE and summary counters.

Posted by peter

INSERT ... ON DUPLICATE KEY UPDATE is a very powerful but often forgotten MySQL feature. It was introduced in MySQL 4.1, but I still constantly see people who are unaware of it.

I like this feature a great deal because it is designed in true MySQL style: a very efficient solution for a frequent task that stays elegant and easy to use.

So what is this feature great for? Well, any kind of counter maintenance. If you're writing traffic accounting, it could be the traffic and number of packets passed for a given port or IP address. For web applications, it could be counting the number of visits per page or IP address, the number of times a particular keyword was searched, and so on.

This functionality also makes it very easy to do incremental single-pass log file processing and to build summary tables.

Here is an example:

SQL:
  CREATE TABLE ipstat(ip int UNSIGNED NOT NULL PRIMARY KEY,
                      hits int UNSIGNED NOT NULL,
                      last_hit timestamp);

  INSERT INTO ipstat VALUES(inet_aton('192.168.0.1'),1,now())
                     ON DUPLICATE KEY UPDATE hits=hits+1;

This example actually shows one more neat feature of MySQL: the inet_aton and inet_ntoa functions, which convert IP address strings to integers and back. This saves significantly on field length by using 4 bytes instead of up to 15.
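
A quick round trip shows the conversion; the report query at the end is only an illustration of reading the stored integers back as dotted strings:

```sql
-- String to integer: 192*2^24 + 168*2^16 + 0*2^8 + 1
SELECT INET_ATON('192.168.0.1');   -- 3232235521

-- Integer back to string
SELECT INET_NTOA(3232235521);      -- 192.168.0.1

-- Illustrative report: top addresses by hit count, shown as strings
SELECT INET_NTOA(ip) AS ip_address, hits, last_hit
  FROM ipstat
 ORDER BY hits DESC
 LIMIT 10;
```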

The third feature this example takes advantage of is the TIMESTAMP field. By default, the first TIMESTAMP column has its value automatically set to the current timestamp on insert and update. We could actually have omitted now() in the insert clause, but this would require specifying the list of columns, which we skipped for the sake of the example.
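
For reference, the variant with an explicit column list and no now() would look roughly like this. It relies on the automatic TIMESTAMP behavior described above, which is an assumption about server configuration: newer MySQL versions may require the column to be declared with DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP to get the same effect.

```sql
-- last_hit is left out; the server fills it in on insert
-- and refreshes it when the row is updated.
INSERT INTO ipstat (ip, hits)
VALUES (INET_ATON('192.168.0.1'), 1)
    ON DUPLICATE KEY UPDATE hits = hits + 1;
```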

So how would this example work? Well, just as you would expect. If there is no such IP address in the table, it will be added with hits=1. If it is already there (note that ip is the PRIMARY KEY), hits will simply be incremented and the last visit timestamp updated.
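
The same statement extends to the single-pass log processing mentioned earlier. A minimal sketch, assuming the application has already counted hits per address for one chunk of the log (the addresses and counts below are made up for illustration), uses a multi-row insert and the VALUES() function to add each pre-aggregated count:

```sql
-- One statement per chunk of the log; each row carries the number of
-- hits seen for that address in the chunk.
INSERT INTO ipstat (ip, hits) VALUES
    (INET_ATON('192.168.0.1'), 3),
    (INET_ATON('192.168.0.2'), 1),
    (INET_ATON('10.0.0.7'), 12)
ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);
```

VALUES(hits) refers to the value that would have been inserted for that row. Recent MySQL versions deprecate this form in favor of a row alias, so check what your server supports.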

The benefit of using this feature instead of INSERT + UPDATE varies depending on the number of new rows and the data set size. A 30% speedup should be typical. The performance increase is not the only benefit. What is even more important is that the application code becomes simpler: less error prone and easier to read.
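
For comparison, here is a sketch of the two-statement approach it replaces. The control flow lives in the application, so it appears here only as comments:

```sql
-- Step 1: try to bump the counter for an existing row.
UPDATE ipstat SET hits = hits + 1
 WHERE ip = INET_ATON('192.168.0.1');

-- Step 2: only if step 1 affected no rows, add the row.
-- The application has to check the affected row count itself, and
-- another connection may insert the same ip between the two steps.
INSERT INTO ipstat (ip, hits)
VALUES (INET_ATON('192.168.0.1'), 1);
```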