Do you always need index on WHERE column ?
I believe we wrote about this before, but this topic popups again and again.
Today I've read opinion that if we have clause WHERE has_something=1 we should have index on column `has_something` (the column has two values 0 and 1).
In reality the right answer is not so simple.
Let's look next table
-
CREATE TABLE `testr` (
-
`id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
-
`name` varchar(32) NOT NULL,
-
`has_something` tinyint(3) UNSIGNED NOT NULL,
-
PRIMARY KEY (`id`),
-
KEY `has_something` (`has_something`)
-
) ENGINE=MyISAM
with 20.000.000 records.
And in first case has_something=0 for 90% of rows (with random distribution)
-
mysql> SELECT cnt0/cnt FROM (SELECT count(*) cnt0 FROM testr WHERE has_something=0) t, (SELECT count(*) cnt FROM testr) t1;
-
+----------+
-
| cnt0/cnt |
-
+----------+
-
| 0.9001 |
-
+----------+
-
1 row IN SET (7.56 sec)
Let's check execution time with and without index
mysql> select count(name) from testr force key (has_something) where has_something=0;
+-------------+
| count(name) |
+-------------+
| 18001245 |
+-------------+
1 row in set (35.96 sec)
mysql> select count(name) from testr ignore key (has_something) where has_something=0;
+-------------+
| count(name) |
+-------------+
| 18001245 |
+-------------+
1 row in set (10.46 sec)
As you see with index the time is by 3.5 times slower.
Good that mysql in this case choose do not use index
-
mysql> EXPLAIN SELECT count(name) FROM testr WHERE has_something=0;
-
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
-
| id | select_type | TABLE | type | possible_keys | KEY | key_len | ref | rows | Extra |
-
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
-
| 1 | SIMPLE | testr | ALL | has_something | NULL | NULL | NULL | 15000000 | USING WHERE |
-
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
-
1 row IN SET (0.00 sec)
Let look the case when has_something = 0 for 50% of rows.
mysql> select count(name) from testr force key (has_something) where has_something=0;
1 row in set (20.27 sec)
mysql> select count(name) from testr ignore key (has_something) where has_something=0;
1 row in set (10.62 sec)
query with index is still 2 times slower.
and this time mysql is going to use index in execution plan:
-
mysql> EXPLAIN SELECT count(name) FROM testr WHERE has_something=0; +----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
-
| id | select_type | TABLE | type | possible_keys | KEY | key_len | ref | rows | Extra |
-
+----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
-
| 1 | SIMPLE | testr | ref | has_something | has_something | 1 | const | 8890716 | |
-
+----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
-
1 row IN SET (0.00 sec)
What about 30% rows with has_something=0 ?
mysql> select count(name) from testr force key (has_something) where has_something=0;
1 row in set (12.36 sec)
mysql> select count(name) from testr ignore key (has_something) where has_something=0;
1 row in set (10.51 sec)
Still query without index is faster.
And finally for case with 20% rows with has_someting=0
mysql> select count(name) from testr force key (has_something) where has_something=0;
1 row in set (8.39 sec)
mysql> select count(name) from testr ignore key (has_something) where has_something=0;
1 row in set (10.43 sec)
So only in the last case we really need the index on column `has_something`
14 Comments











del.icio.us
digg
Very surprising for me! Can you explain the result?
Comment :: August 28, 2007 @ 12:17 pm
Christoph,
The reason why index scan is slower than table scan is that in case of the index scan the MySQL needs to perform more operations like :
read index, get pointer to row, get data from row,
that adds overhead to the operations.
In case of table scan MySQL just goes trough the table and read all rows in continuous mode.
When count of scanned rows more than 20% (in our example) the overhead of index scan makes it non-effective.
In real queries the right percent of rows depends on many factors, but for random uniformly distributed values I would say it is in 20-30% interval.
Comment :: August 28, 2007 @ 1:39 pm
What do you find surprising ?
Full table scan takes constant time no matter the cardinality, index traversing speed depends on the index cardinality.
Note in this case it is not pure index scan (”Using index” in explan) but access of the large portion of the rows in the table in non sequential order.
The 3.5 difference we see here matches given case - when data fits in memory so we’re mainly measuring CPU overhead if we would get random disk IO performance difference can be as large as 100 or even 1000 times !
There is other interesting issue - note name is declared NOT NULL so count(name) should be same as count(*) and so covering index could be used while MySQL is not smart enough for this.
Comment :: August 28, 2007 @ 1:39 pm
Does anyone know if there are plans to include index statistics in the optimizer decisions in the future? Seems like this is a clear case where an optimizer with index cardinality stats should have recognized that the index wasn’t helpful and done a full table scan instead.
Comment :: August 28, 2007 @ 4:21 pm
Thanks for covering this Peter. This is one area where many people get confused and stay confused.
Indexes are of no use if selectivity is down (which happens when there are a few unique values for the column. Anyone confused on this should read more on selectivity.
Frank
Comment :: August 28, 2007 @ 8:38 pm
Thanks a lot for the detailed answers! Now it’s much clearer for me.
Comment :: August 29, 2007 @ 12:13 am
There is a sniplet on MySQL forge to find these useless/harmfull indexes. http://forge.mysql.com/snippets/view.php?id=85
Comment :: August 29, 2007 @ 5:53 am
Thanks Mark,
Even thought this is post by Vadim not me
The other important thing to remember about selectivity - it is actually selectivity of value what matters. It is well possible to have column with 2 values indexes if one of them is in 99% of values and other is in 1% and have index well used to retrieve the data for that 1%
Too bad MySQL does not have partial indexes though (I mean so only that 1% could really be indexed)
Comment :: August 29, 2007 @ 6:46 am
[...] is another blog written by MySQL consultants based in Europe. This blog is focussed more on the performance tuning tips for MySQL. Great resource to learn the various settings and variables in MySQL that can be used to [...]
Pingback :: August 29, 2007 @ 6:10 pm
[...] is all about adding indexes on your WHERE clause columns. Think again! The MySQL Performance Blog posts about cases where an index might not be such a hot idea. This applies to Oracle as well of course, where sometimes a b-tree index just won’t cut [...]
Pingback :: August 31, 2007 @ 9:03 am
[...] 2. Vadim Xaprb,What about checking selectivity of indexes ? http://www.mysqlperformanceblog.com/2007/08/28/do-you-always-need-index-on-where-column/ [...]
Pingback :: October 29, 2007 @ 6:37 am
[...] Usually one of the first things I read about on how to speed up ActiveRecord is to index my columns to speed up the lookup of items. “Of course!” But could indexing too much be harmful? [...]
Pingback :: April 29, 2008 @ 10:00 am
Very interesting website. Keep up the outstanding work and thank you…
Comment :: May 23, 2008 @ 1:59 am
What are the possible alternatives to this particular example? When skipping index is faster but not the best? Creating a summary table? Or something else? I’d like to learn from the experts.
Comment :: July 3, 2008 @ 4:02 pm