I believe we have written about this before, but the topic pops up again and again.
Today I read an opinion that if we have the clause WHERE has_something=1, we should have an index on the column has_something
(a column with only two values, 0 and 1).
In reality the right answer is not so simple.
Let's look at the following table:
CREATE TABLE `testr` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `name` varchar(32) NOT NULL,
  `has_something` tinyint(3) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `has_something` (`has_something`)
) ENGINE=MyISAM
with 20,000,000 records.
In the first case, has_something=0 for 90% of rows (randomly distributed):
mysql> select cnt0/cnt from
    ->   (select count(*) cnt0 from testr where has_something=0) t,
    ->   (select count(*) cnt from testr) t1;
+----------+
| cnt0/cnt |
+----------+
|   0.9001 |
+----------+
1 row in set (7.56 sec)
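The post does not show how the table was loaded; one hypothetical sketch for building a test set with roughly this distribution is to seed the table and repeatedly double it while randomizing the flag:

```sql
-- Seed one row, then double the table until it is large enough;
-- RAND() < 0.9 gives about 90% of rows has_something = 0.
INSERT INTO testr (name, has_something) VALUES ('seed', 0);
INSERT INTO testr (name, has_something)
  SELECT name, IF(RAND() < 0.9, 0, 1) FROM testr;
-- repeat the previous INSERT ... SELECT until ~20,000,000 rows
```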
Let's check the execution time with and without the index:
mysql> select count(name) from testr force key (has_something)
    -> where has_something=0;
+-------------+
| count(name) |
+-------------+
|    18001245 |
+-------------+
1 row in set (35.96 sec)
mysql> select count(name) from testr ignore key (has_something)
    -> where has_something=0;
+-------------+
| count(name) |
+-------------+
|    18001245 |
+-------------+
1 row in set (10.46 sec)
As you can see, with the index the query is 3.5 times slower.
It is good that in this case MySQL chooses not to use the index:
mysql> explain select count(name) from testr where has_something=0;
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | testr | ALL  | has_something | NULL | NULL    | NULL | 15000000 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
1 row in set (0.00 sec)
Let's look at the case when has_something=0 for 50% of rows.
mysql> select count(name) from testr force key (has_something)
    -> where has_something=0;
1 row in set (20.27 sec)
mysql> select count(name) from testr ignore key (has_something)
    -> where has_something=0;
1 row in set (10.62 sec)
The query with the index is still 2 times slower,
and this time MySQL does use the index in the execution plan:
mysql> explain select count(name) from testr where has_something=0;
+----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
| id | select_type | table | type | possible_keys | key           | key_len | ref   | rows    | Extra |
+----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
|  1 | SIMPLE      | testr | ref  | has_something | has_something | 1       | const | 8890716 |       |
+----+-------------+-------+------+---------------+---------------+---------+-------+---------+-------+
1 row in set (0.00 sec)
What about 30% of rows with has_something=0?
mysql> select count(name) from testr force key (has_something)
    -> where has_something=0;
1 row in set (12.36 sec)
mysql> select count(name) from testr ignore key (has_something)
    -> where has_something=0;
1 row in set (10.51 sec)
The query without the index is still faster.
And finally, the case with 20% of rows with has_something=0:
mysql> select count(name) from testr force key (has_something)
    -> where has_something=0;
1 row in set (8.39 sec)
mysql> select count(name) from testr ignore key (has_something)
    -> where has_something=0;
1 row in set (10.43 sec)
So only in the last case do we really need the index on the column has_something.
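As a practical takeaway, the value distribution can be checked before deciding on such an index; a quick sketch against the same testr table:

```sql
-- How are the values distributed? A skew like 90/10 suggests the
-- index will only help queries on the rare value.
SELECT has_something, COUNT(*) AS cnt
FROM testr
GROUP BY has_something;
```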
Very surprising to me! Can you explain the result?
Christoph,
The reason why an index scan is slower than a table scan is that the index scan requires more operations per row:
read the index, get a pointer to the row, fetch the data from the row.
This adds overhead to the operation.
In a table scan, MySQL just goes through the table and reads all rows sequentially.
When the fraction of scanned rows is more than about 20% (in our example), the overhead of the index scan makes it ineffective.
In real queries the break-even percentage depends on many factors, but for random, uniformly distributed values I would say it is in the 20-30% range.
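Readers who want to see where their own queries fall relative to that break-even point can compare the optimizer's row estimate with the table size; a sketch, reusing the article's testr table:

```sql
-- The "rows" column of EXPLAIN is the optimizer's estimate of how many
-- rows the condition matches:
EXPLAIN SELECT count(name) FROM testr WHERE has_something = 0;
-- Compare it with the total; if the fraction is above roughly 20-30%,
-- expect the full table scan to win.
SELECT COUNT(*) FROM testr;
```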
What do you find surprising ?
A full table scan takes constant time no matter what the selectivity is, while the cost of traversing the index depends on how many rows match.
Note that in this case it is not a pure index scan ("Using index" in EXPLAIN) but access to a large portion of the rows in the table in non-sequential order.
The 3.5x difference we see here matches this case: the data fits in memory, so we are mainly measuring CPU overhead. If we were hitting random disk IO, the performance difference could be as large as 100 or even 1000 times!
There is another interesting issue: note that name is declared NOT NULL, so count(name) is the same as count(*), and a covering index could be used, but MySQL is not smart enough to do this rewrite itself.
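The covering-index point can be checked directly (a sketch assuming the same testr table): count(*) needs no column values at all, so the has_something index alone can answer it.

```sql
-- Rewritten with count(*), the query can be satisfied from the index
-- alone; EXPLAIN should show "Using index" in the Extra column.
EXPLAIN SELECT COUNT(*) FROM testr WHERE has_something = 0;
```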
Does anyone know if there are plans to include index statistics in the optimizer decisions in the future? Seems like this is a clear case where an optimizer with index cardinality stats should have recognized that the index wasn’t helpful and done a full table scan instead.
Thanks for covering this Peter. This is one area where many people get confused and stay confused.
Indexes are of no use if selectivity is low (which happens when there are only a few unique values in the column). Anyone confused about this should read more on selectivity.
Frank
Thanks a lot for the detailed answers! Now it’s much clearer for me.
There is a snippet on MySQL Forge to find these useless/harmful indexes: http://forge.mysql.com/snippets/view.php?id=85
Thanks Mark,
Even though this is a post by Vadim, not me 🙂
The other important thing to remember about selectivity: it is actually the selectivity of the value that matters. It is quite possible to have an indexed column with 2 values where one of them appears in 99% of rows and the other in 1%, and have the index used effectively to retrieve the data for that 1%.
Too bad MySQL does not have partial indexes, though (I mean, so that only that 1% could really be indexed).
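To illustrate with the hypothetical 99%/1% distribution described above: a query for the rare value touches only a small fraction of rows through the index, so the index serves it well despite the column having just two distinct values.

```sql
-- With has_something = 1 in only ~1% of rows, the index is highly
-- selective for this query, even though the column has two values.
SELECT count(name) FROM testr WHERE has_something = 1;
```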
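For comparison, this is roughly what a partial index looks like in a system that supports them (PostgreSQL syntax; not valid in MySQL):

```sql
-- PostgreSQL-style partial index: only the rare 1% of rows is indexed,
-- so the index stays tiny and is only consulted for the rare value.
CREATE INDEX testr_rare_idx ON testr (has_something)
  WHERE has_something = 1;
```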
Very interesting website. Keep up the outstanding work and thank you…
What are the possible alternatives in this particular example, when skipping the index is faster but still not the best? Creating a summary table? Or something else? I'd like to learn from the experts.
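One of the alternatives asked about, a summary table, could look like this minimal sketch (table and column names are made up for illustration, and the counts must be kept in sync with the base table on every write):

```sql
-- Hypothetical summary table holding pre-aggregated counts.
CREATE TABLE testr_counts (
  has_something tinyint unsigned NOT NULL PRIMARY KEY,
  cnt int unsigned NOT NULL
);
INSERT INTO testr_counts
  SELECT has_something, COUNT(*) FROM testr GROUP BY has_something;
-- The expensive scan becomes a single-row lookup:
SELECT cnt FROM testr_counts WHERE has_something = 0;
```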
That's why my queries took some time (with and without an index).
I have a tinyint column that takes only the values 1 and 0.
There are about 164,000+ rows. Almost all have the value 0; only a few rows have 1 (in general fewer than 10).
To be more specific, the column name is "file_hidden". If the value is 1, then that row is not shown to the public:
SELECT * FROM files WHERE file_hidden = 0
I guess the only solution in this situation is to use two tables with the same structure: one for the "1" (hidden from public) rows, one for the "0" (visible to public) rows.
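Given the point made above about value selectivity, splitting into two tables may not be necessary; a sketch, assuming an index named file_hidden exists on that column:

```sql
-- The rare value (< 10 of ~164,000 rows) is exactly where the index
-- shines; the optimizer should use it here:
SELECT * FROM files WHERE file_hidden = 1;
-- For the common value a full scan is near-optimal; the index can be
-- skipped explicitly if the optimizer picks it anyway:
SELECT * FROM files IGNORE INDEX (file_hidden) WHERE file_hidden = 0;
```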
Thanks for sharing.
R&T SOlutions
I couldn't replicate your results on my machine (MySQL 5.1.61). I have a modest DB with about 1M records, with a table in which one column indicates whether the row is_deleted or not. Usually the is_deleted value is 0, and I index it. count(*) with IGNORE INDEX takes about 0.16 s. With the index, searching for 0 takes about 0.13 s and for 1 about 0.0 s. If I increase the number of 1s to about 50%, counting 1s or 0s takes approximately 0.6-0.7 s. So indexing is actually faster in all cases.
Maybe it is the size of the DB.