I believe we have written about this before, but the topic pops up again and again.
Today I read an opinion that if we have the clause WHERE has_something=1, we should have an index on the column has_something
(a column with only two values, 0 and 1).
In reality the right answer is not so simple.
Let's look at the following table:
CREATE TABLE `testr` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `name` varchar(32) NOT NULL,
  `has_something` tinyint(3) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `has_something` (`has_something`)
) ENGINE=MyISAM
with 20,000,000 records.
In the first case, has_something=0 for 90% of rows (randomly distributed):
mysql> select cnt0/cnt from
    ->   (select count(*) cnt0 from testr where has_something=0) t,
    ->   (select count(*) cnt from testr) t1;
+----------+
| cnt0/cnt |
+----------+
|   0.9001 |
+----------+
1 row in set (7.56 sec)
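The post does not show how the table was loaded; one hypothetical sketch for building a test set with roughly this distribution is to seed the table and repeatedly double it while randomizing the flag:

```sql
-- Seed one row, then double the table until it is large enough;
-- RAND() < 0.9 gives about 90% of rows has_something = 0.
INSERT INTO testr (name, has_something) VALUES ('seed', 0);
INSERT INTO testr (name, has_something)
  SELECT name, IF(RAND() < 0.9, 0, 1) FROM testr;
-- repeat the previous INSERT ... SELECT until ~20,000,000 rows
```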
Let's check the execution time with and without the index:
mysql> select count(name) from testr force key (has_something)
    -> where has_something=0;
+-------------+
| count(name) |
+-------------+
|    18001245 |
+-------------+
1 row in set (35.96 sec)
mysql> select count(name) from testr ignore key (has_something)
    -> where has_something=0;
+-------------+
| count(name) |
+-------------+
|    18001245 |
+-------------+
1 row in set (10.46 sec)
As you can see, with the index the query is 3.5 times slower.
It is good that in this case MySQL chooses not to use the index:
mysql> explain select count(name) from testr where has_something=0;
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | testr | ALL  | has_something | NULL | NULL    | NULL | 15000000 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
1 row in set (0.00 sec)
Let's look at the case when has_something=0 for 50% of rows.
mysql> select count(name) from testr force key (has_something)
    -> where has_something=0;
1 row in set (20.27 sec)
mysql> select count(name) from testr ignore key (has_something)
    -> where has_something=0;
1 row in set (10.62 sec)
The query with the index is still 2 times slower,
and this time MySQL does use the index in the execution plan:
mysql> explain select count(name) from testr where has_something=0;
+----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
| id | select_type | table | type | possible_keys | key           | key_len | ref   | rows    | Extra |
+----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
|  1 | SIMPLE      | testr | ref  | has_something | has_something | 1       | const | 8890716 |       |
+----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
1 row in set (0.00 sec)
What about 30% of rows with has_something=0?
mysql> select count(name) from testr force key (has_something)
    -> where has_something=0;
1 row in set (12.36 sec)
mysql> select count(name) from testr ignore key (has_something)
    -> where has_something=0;
1 row in set (10.51 sec)
The query without the index is still faster.
And finally, the case with 20% of rows with has_something=0:
mysql> select count(name) from testr force key (has_something)
    -> where has_something=0;
1 row in set (8.39 sec)
mysql> select count(name) from testr ignore key (has_something)
    -> where has_something=0;
1 row in set (10.43 sec)
So only in the last case do we really need the index on the column has_something.
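As a practical takeaway, the value distribution can be checked before deciding on such an index; a quick sketch against the same testr table:

```sql
-- How are the values distributed? A skew like 90/10 suggests the
-- index will only help queries on the rare value.
SELECT has_something, COUNT(*) AS cnt
FROM testr
GROUP BY has_something;
```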
Very surprising to me! Can you explain the result?
Christoph,
The reason why an index scan is slower than a table scan is that the index scan requires more operations per row:
read the index, get a pointer to the row, fetch the data from the row.
This adds overhead to the operation.
In a table scan, MySQL just goes through the table and reads all rows sequentially.
When the fraction of scanned rows is more than about 20% (in our example), the overhead of the index scan makes it ineffective.
In real queries the break-even percentage depends on many factors, but for random, uniformly distributed values I would say it is in the 20-30% range.
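Readers who want to see where their own queries fall relative to that break-even point can compare the optimizer's row estimate with the table size; a sketch, reusing the article's testr table:

```sql
-- The "rows" column of EXPLAIN is the optimizer's estimate of how many
-- rows the condition matches:
EXPLAIN SELECT count(name) FROM testr WHERE has_something = 0;
-- Compare it with the total; if the fraction is above roughly 20-30%,
-- expect the full table scan to win.
SELECT COUNT(*) FROM testr;
```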
What do you find surprising ?
A full table scan takes constant time no matter what the selectivity is, while the cost of traversing the index depends on how many rows match.
Note that in this case it is not a pure index scan ("Using index" in EXPLAIN) but access to a large portion of the rows in the table in non-sequential order.
The 3.5x difference we see here matches this case: the data fits in memory, so we are mainly measuring CPU overhead. If we were hitting random disk IO, the performance difference could be as large as 100 or even 1000 times!
There is another interesting issue: note that name is declared NOT NULL, so count(name) is the same as count(*), and a covering index could be used, but MySQL is not smart enough to do this rewrite itself.
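The covering-index point can be checked directly (a sketch assuming the same testr table): count(*) needs no column values at all, so the has_something index alone can answer it.

```sql
-- Rewritten with count(*), the query can be satisfied from the index
-- alone; EXPLAIN should show "Using index" in the Extra column.
EXPLAIN SELECT COUNT(*) FROM testr WHERE has_something = 0;
```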
Does anyone know if there are plans to include index statistics in the optimizer decisions in the future? Seems like this is a clear case where an optimizer with index cardinality stats should have recognized that the index wasn’t helpful and done a full table scan instead.
Thanks for covering this Peter. This is one area where many people get confused and stay confused.
Indexes are of no use if selectivity is low (which happens when there are only a few unique values in the column). Anyone confused about this should read more on selectivity.
Frank
Thanks a lot for the detailed answers! Now it’s much clearer for me.
There is a snippet on MySQL Forge to find these useless/harmful indexes: http://forge.mysql.com/snippets/view.php?id=85
Thanks Mark,
Even though this is a post by Vadim, not me 🙂
The other important thing to remember about selectivity: it is actually the selectivity of the value that matters. It is quite possible to have an indexed column with 2 values where one of them appears in 99% of rows and the other in 1%, and have the index used effectively to retrieve the data for that 1%.
Too bad MySQL does not have partial indexes, though (I mean, so that only that 1% could really be indexed).
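To illustrate with the hypothetical 99%/1% distribution described above: a query for the rare value touches only a small fraction of rows through the index, so the index serves it well despite the column having just two distinct values.

```sql
-- With has_something = 1 in only ~1% of rows, the index is highly
-- selective for this query, even though the column has two values.
SELECT count(name) FROM testr WHERE has_something = 1;
```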
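For comparison, this is roughly what a partial index looks like in a system that supports them (PostgreSQL syntax; not valid in MySQL):

```sql
-- PostgreSQL-style partial index: only the rare 1% of rows is indexed,
-- so the index stays tiny and is only consulted for the rare value.
CREATE INDEX testr_rare_idx ON testr (has_something)
  WHERE has_something = 1;
```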
Very interesting website. Keep up the outstanding work and thank you…
What are the possible alternatives in this particular example, when skipping the index is faster but still not the best? Creating a summary table? Or something else? I'd like to learn from the experts.
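One of the alternatives asked about, a summary table, could look like this minimal sketch (table and column names are made up for illustration, and the counts must be kept in sync with the base table on every write):

```sql
-- Hypothetical summary table holding pre-aggregated counts.
CREATE TABLE testr_counts (
  has_something tinyint unsigned NOT NULL PRIMARY KEY,
  cnt int unsigned NOT NULL
);
INSERT INTO testr_counts
  SELECT has_something, COUNT(*) FROM testr GROUP BY has_something;
-- The expensive scan becomes a single-row lookup:
SELECT cnt FROM testr_counts WHERE has_something = 0;
```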
That's why my queries took some time (with and without an index).
I have a tinyint column that takes only the values 1 and 0.
There are about 164,000+ rows. Almost all have the value 0; only a few rows have 1 (in general fewer than 10).
To be more specific, the column name is "file_hidden". If the value is 1, then that row is not shown to the public:
SELECT * FROM files WHERE file_hidden = 0
I guess the only solution in this situation is to use two tables with the same structure: one for the "1" (hidden from public) rows, one for the "0" (visible to public) rows.
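Given the point made above about value selectivity, splitting into two tables may not be necessary; a sketch, assuming an index named file_hidden exists on that column:

```sql
-- The rare value (< 10 of ~164,000 rows) is exactly where the index
-- shines; the optimizer should use it here:
SELECT * FROM files WHERE file_hidden = 1;
-- For the common value a full scan is near-optimal; the index can be
-- skipped explicitly if the optimizer picks it anyway:
SELECT * FROM files IGNORE INDEX (file_hidden) WHERE file_hidden = 0;
```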
Thanks for sharing.
R&T SOlutions
I couldn't replicate your results on my machine (MySQL 5.1.61). I have a modest DB with about 1M records, with a table in which one column indicates whether the row is_deleted or not. Usually the is_deleted value is 0, and I index it. count(*) with IGNORE INDEX takes about 0.16 s. With the index, searching for 0 takes about 0.13 s and for 1 about 0.0 s. If I increase the number of 1s to about 50%, counting 1s or 0s takes approximately 0.6-0.7 s. So indexing is actually faster in all cases.
Maybe it is the size of the DB.