About year ago Peter wrote about redundant indexes and mentioned sometimes it is good to leave two indexes, even one is first part of another. I’m speaking about BTREE indexes, for example, KEY (A), and KEY (A,B). From SQL point of view KEY(A) is not needed, as for queries like WHERE A=5 the index (A,B) also can be used.

But there is case when for performance it would be good to have both

Let we have the table

with 1,000,000 rows, and for each state_id there are about 20,000 records.

In the benchmark the query (Q1)

which uses KEY (state_id), shows the result 115 queries/sec

And now we have the task to execute the query (Q2)

which extracts additional info – city and address, both long strings columns

In the benchmark for this query we have 10 queries per sec.

Usuall technique we use for such queries is covering index, which store all needed columns in the index, and there is no need to read row data from the table.

So – we can extend index state_id_idx (state_id) by two columns:

And now for Q2 we have 16.34 q/s, that is almost +100% comparing with non-convering index.

But if run the test for Q1 we have: 25.40 q/s, that in 4 times slower than in first run.
(This is for MyISAM, for InnoDB difference will be less. MyISAM uses key-compression for VARCHAR columns, so makes things worse).

So to have good performance for both case we should leave both indexes:
KEY state_id (state_id) and
KEY state_id_2 (state_id, city, address)

The drawback is worse INSERT performance, here is time to INSERT 1million records:
only with state_id KEY – 72 sec
with state_id and state_id_2 KEYs – 470sec
(Here I limited available memory to fit only one index, that shows how can degrade performance if we balance on memory limits)

If you are interested in, the numbers for InnoDB in the same tests:
only state_id KEY:
Q1 – 108 q/s
Q2 – 12.12 q/s
only state_id_2 KEY:
Q1 – 100 q/s
Q2 – 24.08 q/s

And INSERT of 1mil records:
only with state_id KEY – 80 sec
with state_id and state_id_2 KEYs – 136sec
(but in this case the memory was enough for both indexes)

7 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Xaprb

For those who want to find duplicate indexes, check out MySQL Toolkit: http://mysqltoolkit.sourceforge.net/

Ryan

The main reason I keep around duplicate indexes is because the merge intersect will not use just the first key part of an index. But this also has the disadvantage that it will often use both indexes in the intersect which is a bug in my opinion.

Peter Zaitsev

Vadim,

Actually speaking about selectivity I’d like to see distribution rather than pure selectivity.
If index has 2 values and one of them is in 99% and other in 1% it well may be index is a good idea to look up for second value – status field is typical type of such field.

Peter Zaitsev

Ryan,

Are you speaking about duplicate indexes or redundant (as vadim is speaking) ?
What problem do you mean ?

Ryan

I mean redundant indexes. Say one with ‘cola’ and another with ‘(cola, colb)’. The merge intersect will not work in a query that doesn’t use ‘colb’ so you need to add both indexes if you want to use the merge intersect. Also I don’t want to optimize by adding muti-key indexes to every possible combinations of filters a user might have and would rather use the merge intersect for many things. Then this gets the the secondary problem that if you have two redundant indexes and you are filtering on ‘cola’ and ‘colb’ it will often choose to use both indexes with merge intersect which is completly a waste of time as only the muti-key index should be used and not the single ‘cola’ index.

bob

I would be very upset if someone wanted to index on (state,city,address). Address is too big to be useful in an index (it will limit the # records/page and just doesn’t make sense.

It would be interesting to repeat the tests with just (state,city).

This of course ignores all international issues which should impose an index of (country_id,state,city).

note that this is also messy since it makes for a very deep index as opposed to a wide index (i.e. you have to traverse more btree nodes to get to your data).

Also the inserts should take longer, you’re building a much larger index in the 2nd case and causing many more page balances just due to the fact that each page will hold fewer index items.