Choosing Column Order in Indexes

I wanted to share a little rule of thumb I sometimes use for choosing column order in indexes. This is not specific to MySQL; it's generally applicable to any database server with b-tree indexes. There are a bunch of subtleties, but I'll ignore those for the sake of simplicity.

Choosing Column Order

Let’s start with this query, which returns zero rows but does a full table scan. EXPLAIN says there are no possible_keys.
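A sketch of the kind of query I mean, reconstructed from the criteria discussed below; the table name q and the literal values are placeholders, not the real ones:

select * from q
where status = 'waiting'
    and source = 'Queue'
    and no_send_before <= unix_timestamp()
    and tries <= 20;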

Don’t try to figure out the meaning of the query, because that’ll add complexity to the example 😉

In the simplest case, we want to put the most selective column first in the index, so that the number of possible matching rows is smallest, i.e. we find the rows as quickly as possible. Assuming that all the columns have an even distribution of values, we can just count the number of matching rows for each criterion.
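A minimal sketch of that counting query, reusing the placeholder names and values from above:

select
    sum(status = 'waiting') as status_matches,
    sum(source = 'Queue') as source_matches,
    sum(no_send_before <= unix_timestamp()) as no_send_before_matches,
    sum(tries <= 20) as tries_matches
from q;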

This is pretty simple: all I did was wrap each WHERE criterion in a SUM() function. In MySQL, a true comparison evaluates to 1 and a false one to 0, so summing a criterion counts the rows for which it is true. It looks like the most selective criterion is “status=waiting”, so let’s put that column first in the index. Now, pull it out of the SELECT list, put it into the WHERE clause, and run the query again to get the counts within the subset of rows that match:
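Again in sketch form with the placeholder values:

select
    sum(source = 'Queue') as source_matches,
    sum(no_send_before <= unix_timestamp()) as no_send_before_matches,
    sum(tries <= 20) as tries_matches
from q
where status = 'waiting';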

So we’re down to a reasonable number of rows (the counts are changing because I’m running this on live data, by the way). It looks like the ‘source’ criterion won’t filter out any more rows within this set, so adding that column to the index would not be useful. We can filter this set further by either the ‘no_send_before’ or the ‘tries’ column, and doing so on either one reduces the count of matches for the other to zero:
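In sketch form, adding either criterion to the WHERE clause zeroes out the count for the other:

select sum(no_send_before <= unix_timestamp()) as no_send_before_matches
from q
where status = 'waiting' and tries <= 20;

select sum(tries <= 20) as tries_matches
from q
where status = 'waiting' and no_send_before <= unix_timestamp();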

That means we can add an index on either (status, tries) or (status, no_send_before) and we will find the zero rows pretty efficiently. Which is better depends on what this table is really used for, which is a question I’m avoiding.
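Whichever you choose, creating it is a one-liner; for example, assuming the placeholder table name q and picking the first option:

alter table q add index (status, tries);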

18 Comments
Cary Millsap

Baron, if you haven’t seen “Relational Database Index Design and the Optimizers” by Tapio Lahdenmäki and Michael Leach, you should find a copy.

—Cary

Nicolae Namolovan

That’s a cool technique, thanks for sharing 😉

Roman Petrichev

Live data can change dramatically within a few minutes, and a compound index that previously looked efficient can easily turn quite inefficient for the same query on a changed data set. So this technique is more suitable for immutable tables, IMHO.

Rob Wultsch

I have done similar things in the past, though I have mostly looked at cardinality, as I assumed this would be the better general-purpose approach. (Note: I think this is a foolish assumption when most queries are looking for the same values…)

I have wondered if there should be some accounting for the size of the columns. For example, let’s say an int is slightly less selective than a varchar(50). I would think that the smaller datatype should be somewhat preferred, as a smaller index would be created and, I imagine, traversed more quickly.

Also, I think worth noting is the effect of range-type restrictions, for users who are unfamiliar with their effects on queries that might make use of composite indexes.

Any thoughts?

Anthony Linton

Thanks for the post, a great help. A lot of the recent ones have seemed to be at a more complex level, which isn’t so useful to me =)

Peter Zaitsev

Baron,

I think it also makes sense to look at things in a cost/benefit way. Extending the index makes it larger. In your case, adding an extra column which halves the number of rows to be traversed for the query probably makes sense. In other cases, when it is only a 10% gain, it does not make sense.

Another thing which often makes sense to consider is covering index potential, in particular when trying to target several query patterns with a single index.

A covering index slows down queries which can’t use it as a covering index, because it is longer, but it dramatically speeds up queries which can, in particular in I/O-bound scenarios.
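For instance, with the (status, tries) index from the post, a query that reads only those columns can be satisfied from the index alone, while one that selects other columns still has to visit the table rows (a sketch, same placeholder table q):

select status, tries from q where status = 'waiting';  -- covered: index-only read
select * from q where status = 'waiting';              -- not covered: must read the rows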

ibrahim oguz

Is it possible to use two servers, one for inserting data and the other for reporting (running SELECT queries), where the first server replicates to the second, the first one has no indexes, but the second one does?
Indexes slow down inserts, updates, and deletes, but they make selects fast.
How can I reorganize the indexes when the data is replicated?

thanks

Stefan Stubbe

Wouldn’t it be better to check the cardinality of the columns instead of the selectivity of specific values?
When you want to select the rows with another STATUS, another column order in the index might be better.

Rob Wultsch

For whatever it is worth, I regret my comment. It is easy to be a critic (which is a hair away from what I did: asking questions I knew were hard to answer).

This technique is very good for beginners and a very good option for dealing with an unfamiliar DBMS. One way or another, seat time with, and understanding of, whatever optimizer you are dealing with cannot easily be replaced with a single blog post. Understanding the optimizer is really key, and that cannot be easily summarized.

It is a pipe dream for many (most?) databases to see a process list whose queries hit all portions of the tables evenly. Much more likely is getting hit with many queries of the same form (possibly exactly the same) on a table that changes just enough for the query cache to be a bad idea. In that case the above technique, with some knowledge of the nuances, is I think very useful…

Thank you for your time.

John Larsen

I ran into an interesting multi key issue last week, wondering if anyone had any suggestions.
Table sent_items looked like
id int(10) autoincrement unsigned
sender_id int(11) unsigned
receiver_id int(11) unsigned
item_id tinyint(3)
sent_time int(11) (stored unix timestamp, stamped by application not server)
in InnoDB
This issue started popping up somewhere around 20 million rows in the table, possibly earlier though we didn’t notice.
There was a single key on receiver_id and a combo key on (sender_id, sent_time), due to a frequent query
to get the 50 most recently sent items.
Average use had about a dozen items sent.
select * from sent_items where sender_id = '12345678' limit 50;
would give 12 rows in 0.5 seconds under heavy server load.
select * from sent_items where sender_id = '12345678' ORDER BY sent_time DESC limit 50;
would give the same 12 rows in 5 full seconds under heavy load.
Both queries would be nearly instant under light server load.
A query on
select * from sent_items where receiver_id = '12345678' ORDER BY sent_time DESC limit 50;
With or without ORDER would be 0.5 seconds under load.
Now the sender_id has a very different cardinality than the receiver_id and records are generally grouped by a sender as a person can send many items to a number of different receivers in an instant.
Going to try changing the multi_key to a single key later, but very strange results. Wondering if you have any ideas?

Nikolay

I knew this but I love the way you describe it.

Dawn Green

What a fantastic article worthy of my retweet. Thank you!

praji venu

I thought even compound indexes are a single b-tree, so the column order in the index should not matter if all columns of the index are present in the WHERE clause…

Yzmir Ramirez

In Cary Millsap’s comment above he recommended “Relational Database Index Design and the Optimizers” by Tapio Lahdenmäki and Michael Leach, but that was back in 2009. Is that book still a good reference?

CHINMAY

Thanks!