I believe we wrote about this before, but this topic popups again and again.
Today I’ve read opinion that if we have clause WHERE has_something=1 we should have index on column has_something (the column has two values 0 and 1).

In reality the right answer is not so simple.

Let’s look next table

with 20.000.000 records.

And in first case has_something=0 for 90% of rows (with random distribution)

Let’s check execution time with and without index

As you see with index the time is by 3.5 times slower.

Good that mysql in this case choose do not use index

Let look the case when has_something = 0 for 50% of rows.

query with index is still 2 times slower.

and this time mysql is going to use index in execution plan:

What about 30% rows with has_something=0 ?

Still query without index is faster.

And finally for case with 20% rows with has_someting=0

So only in the last case we really need the index on column has_something

13 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Christoph

Very surprising for me! Can you explain the result?

Peter Zaitsev

What do you find surprising ?

Full table scan takes constant time no matter the cardinality, index traversing speed depends on the index cardinality.

Note in this case it is not pure index scan (“Using index” in explan) but access of the large portion of the rows in the table in non sequential order.

The 3.5 difference we see here matches given case – when data fits in memory so we’re mainly measuring CPU overhead if we would get random disk IO performance difference can be as large as 100 or even 1000 times !

There is other interesting issue – note name is declared NOT NULL so count(name) should be same as count(*) and so covering index could be used while MySQL is not smart enough for this.

pcasey

Does anyone know if there are plans to include index statistics in the optimizer decisions in the future? Seems like this is a clear case where an optimizer with index cardinality stats should have recognized that the index wasn’t helpful and done a full table scan instead.

Frank Mashraqi

Thanks for covering this Peter. This is one area where many people get confused and stay confused.

Indexes are of no use if selectivity is down (which happens when there are a few unique values for the column. Anyone confused on this should read more on selectivity.

Frank

Christoph

Thanks a lot for the detailed answers! Now it’s much clearer for me.

Arnold Daniels

There is a sniplet on MySQL forge to find these useless/harmfull indexes. http://forge.mysql.com/snippets/view.php?id=85

Peter Zaitsev

Thanks Mark,

Even thought this is post by Vadim not me 🙂

The other important thing to remember about selectivity – it is actually selectivity of value what matters. It is well possible to have column with 2 values indexes if one of them is in 99% of values and other is in 1% and have index well used to retrieve the data for that 1%

Too bad MySQL does not have partial indexes though (I mean so only that 1% could really be indexed)

Johhny

Very interesting website. Keep up the outstanding work and thank you…

Son Nguyen

What are the possible alternatives to this particular example? When skipping index is faster but not the best? Creating a summary table? Or something else? I’d like to learn from the experts.

Sergio

That’s why my queries took some time (with & without index)..

I have a tinint column that takes only 1 and 0 values.
There are about 164000+ rows. Almost all have 0 value, just some of the rows have 1 (in general < 10).

To be more specific, the column name is “file_hidden”. If the value is 1, then that row is not shown to public..

SELECT * FROM files WHERE file_hidden = 0

I guess the only solution in this situation is to use two tables with the same structure.. one for “1” (hidden from public), one for “0” (visible to public) rows

RNT

Thanks for sharing.

R&T SOlutions

Nognir

Couldn’t replicate your results on my machine (MySQL 5.1.61). Have a modest DB with about 1M records. I have a table in which one column points to whether the line is_deleted or not. Usually the is_deleted value is 0 and I index it. count(*) with ignore index takes abt 0.16 s. With an index, searching for 0 takes abt .13 s and for 1 takes abt 0.0s. If I increase the number of 1s to abt 50%, the time taken to count 1s and 0s is approx 0.6-0.7s. So indexing is actually faster in all cases.
Maybe it is the size of DB