July 29, 2014

Multi Column indexes vs Index Merge

The mistake I commonly see among MySQL users is how indexes are created. Quite commonly people just index individual columns as they are referenced in where clause thinking this is the optimal indexing strategy. For example if I would have something like AGE=18 AND STATE=’CA’ they would create 2 separate indexes on AGE and STATE columns.

The better strategy is often to have combined multi-column index on (AGE,STATE). Lets see why it is the case.

MySQL indexes are (with few exceptions) BTREE indexes – this index type is very good to be able to quickly lookup the data on any its prefix and traversing ranges between values in sorted order. For example when you query AGE=18 with single column BTREE index MySQL will dive into the index to find first matching row and when will continue scanning index in order until it runs into the value of AGE more than 18 when it stops doing so assuming there are no more matching. The RANGES such as AGE BETWEEN 18 AND 20 work similar way – MySQL just stops at different value.

The enumerated ranges such as AGE IN (18,20,30) are more complicated as this is in fact several separate index lookups.

So we spoke about how MySQL uses the index but not exactly what it gets from the index – typically (unless it is covering index) MySQL gets a “row pointer” which can be primary key value (for Innodb tables) physical file offset (for MyISAM tables) or something else. It is important internally storage engine can use that value to lookup the full row data corresponding to the given index entry.

So what choices does MySQL have if you have just 2 separate indexes ? I will either use just one of them to look up the data (and check remaining portion on WHERE clause after data is read) or it can lookup the row pointers from all indexes, intersect them and when lookup the data.

Which method works better depends on selectivity and correlation. If where clause from first column selects 5% of the rows and applying where clause on second column brings it down to 1% using intersection makes sense. If second where clause just brings it down to 4.5% it is most likely better to use single index only and simply filter out rows we do not need after lookup.

Lets look at some examples now:

I made columns i1 and i2 independent in this case each selecting about 1% rows from this table which contains about 10M rows.

As you can see MySQL picks to use combined index and indeed the query completes in less than 10ms

Now what if we pretend we only have single column indexes (by hinting optimizer to ignore combined index)

As you can see in this case MySQL does the Intersection for the index values (the process a mentioned above) and the query now runs in about 70 ms which is quite a difference.

Now lets see if using just single index and post filtering is any better:

Now this query scans a lot of rows and completed in about 290ms
So we can see index merge indeed improves performance compared to single index only but it is by far better to use multi column indexes.

But the problems with Index Merge do not stop there. It is currently rather restricted in which conditions it would do the index merge:

As soon as any of the columns has enumeration instead of equality index merge is not selected any more. Even though in this case it should be good idea as i2 in (49,50) is still rather selective.

Now lets do another test. I’ve changed the table to make columns i1 and i2 highly correlated. In fact they are now the same:

Lets see what happens in this case

Hm… Optimizer decides to use index merge in this case which is probably the worse decision it could do. Indeed the query takes 360ms Note also the estimated values of “rows” is very wrong here.

This happens because Optimizer assumes columns i1 and i2 are independent estimating selectivity for the intersection. This is as good as it can to because there is no correlation statistics available.

Now if we do not allow MySQL optimizer to use second index and hence index merge, what does it turn to ? It is not combined index but single index on another column. This is because MySQL is able to estimate number of rows it will find using both indexes and as they are about the same it picks smaller index. The query takes 290ms which is exactly what we’ve seen before.

What if we leave MySQL no choice but only to use combined index:

We can see here MySQL estimates 20% more rows to traverse, which is wrong of course – it can’t be more than if only index prefix is used. MySQL does not know it as it looks at stats from different indexes independently not trying to reconcile them some way.

Because index is longer query execution takes a bit longer – 300ms

So in this case we see index merge is chosen even though it turns out to be the worst plan. Though technically it is right plan considering the statistics MySQL had available.

It is very easy to disable index merge if you do not want it to run, however I do not know of the hint in MySQL which would allow forcing using index merge when MySQL does not think it should be used. I hope hint would be added at some point.

Finally let me mention the case when Index merge works much better than multiple column indexes. This is in case you’re using OR between the columns. In this case the combined index is useless and MySQL has an option of doing full table scan or doing the Union (instead of intersection) on values it gets from the single table.

I have reverted table to have i1 and i2 as independent columns in this case to look at more typical case:

This query takes 660ms to execute. now if we change it:

If we disable index on the second column we get a full table scan:

Note MySQL puts i1 and combined into “possible_key” while it has no way to use them for this query.

The query takes 3370ms if this plan is used.

Note the query takes about 5 times longer even though in case of full table scan about 50 times more rows are scanned. This reflects very large performance difference between full table scan and access through the index, which seems to be about 10x (in terms of cost access per row) even though it is in memory workload.

For UNION case the optimizer is more advanced and it is able to deal with ranges:

As a summary: Use multi column indexes is typically best idea if you use AND between such columns in where clause. Index merge does helps performance but it is far from performance of combined index in this case. In case you’re using OR between columns – single column indexes are required for index merge to work and combined indexes can’t be used for such queries.

P.S The tests were done in MySQL 5.4.2

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Chris says:

    I realize that it was probably just done for clarity in this example, but creation of both i1 and combined indexes is not necessary (and seems a little redundant). A combined index for columns (a,b,c) is the same as also having indexes for (a,b) and (a).

  2. peter says:

    Chris,

    Yes this is right. In reality you would not have such table structure. I created it just work with it easily by allowing/disallowing indexes instead of adding/dropping them.

  3. Alex says:

    How did you get the interesting MySQL prompt? That’d be very useful to me!

  4. Vadim says:

    Alex,

    There is prompt command in mysql client, you can change that in my.cnf or in env variable MYSQL_PS1

    i.e:

    mysql>prompt (\u@\h) [\d]>
    PROMPT set to ‘(\u@\h) [\d]>’
    (root@localhost) [test]>

    see more
    http://dev.mysql.com/doc/refman/5.1/en/mysql-commands.html
    prompt section

  5. Matthew Montgomery says:

    @peter

    In your summary, “In case you’re using OR between columns – single column indexes are required for index merge to work and combined indexes can’t be used for such queries.”

    Are you saying that if index (a) were not created, MySQL could not perform an index merge (union) with (a,b) and (b) during an OR operation? Would this condition require the creation of (a),(b) and (a,b) indexes if you need to perform AND and OR queries.

  6. peter says:

    Matt,

    Indeed if you want queries which conditions A=5 and B=5 as well as A=5 OR B=5 ran most efficiently you need indexes
    (A), (A,B) and (B) but as (A) is prefix of (A,B) it will be enough only to have indexes (A,B) and (B)

  7. Lance Li says:

    Peter,

    Did you ever do the same testing for MySQL 5.0.x instead of 5.1/5.4? As in my website, the MySQL query optimizer in different version work quite differently and 5.0.x’s is much better, and sometimes I have to add some query hints to help 5.1 using the right index. Thus, until now, although new features in innodb plugin and xtradb are quite attractive for me, but I cannot push it to our production servers .

  8. peter says:

    Lance,

    All new MySQL versions break some queries in regards to Optimizer. I is good to check on individual query basics and it is well possible there is a workaround.

    The behavior I described here should be same starting 5.0 I just used latest 5.4 to ensure I’m not speaking about something which have been long fixes in the meanwhile

    Regarding XtraDB – if you’re looking for scalability aspects a lot of the changes also available in Percona 5.0 patchset.

  9. Danny says:

    Hi Peter,

    Interesting article! I tried to use your examples on one of my already existing table. On that table I had only 2 seperate indexes, so I created a 3th wich is the combination of these 2. But this combined index seems no te be used.

    The sql I executed:
    explain SELECT * FROM myTableName where publisher_entity_id=0 AND vurl_id=0;

    The result:
    type=’ref’, posible_keys=’index_vurl_id, index_publisher_id, index_combined’, key=’index_vurl_id’

    Any idea why the combine key is not used?

  10. Lance, we’ve seen a lot of that, too. This is why we’re working hard on a new Maatkit tool, called mk-upgrade. I think that one of the side effects of creating this tool will be a quicker feedback loop for the optimizer team — if users are finding the bugs quickly, they will have a much better chance of being fixed quickly.

  11. peter says:

    Danny,

    What kind of selectivity do you have ?

    How many records match vurl_id=0 vs vurl_id=0 AND publisher_entry_id=0 ? MySQL may decide selectivity for longer index is not better enough to use longer and so larger index.

  12. Danny says:

    I only have a few records in my test table (about 20). So the combined key should work if a I have enough data in the table?

    If so, I will give it a try on some of my production tables, they have >100K records.

  13. peter says:

    Danny,

    20 records is too small of data set. But main point is selectivity – if you get much less records matching combined where clause in this case MySQL normally would use index.

  14. Danny says:

    I tried it on one of my prodoction tables (2.3 million records) and MySQL now used the combined key. Thanks

  15. Dragos says:

    I was dealing with the same problem today and got to the same conclusion by my self in the end. I wish I checked this page two hours ago :)
    Anyway, I have a burning question.

    I have a table with about 2M rows and only 2 queries:
    Query A) SELECT id, title, body FROM messages WHERE type_id = __TYPEID__ AND is_active = __IS_ACTIVE__ ORDER BY date_created DESC LIMIT __PAGE__,__OFFSET__
    and Query B) SELECT id, title, body FROM messages WHERE type_id = __TYPEID__ ORDER BY date_created DESC LIMIT __PAGE__,__OFFSET__

    The only difference between the two queries is that the second lacks the is_active flag. I have a combined index on (type_id, is_active, date_created).

    For this query: SELECT id, title, body FROM messages WHERE type_id = 2 AND is_active = 1 ORDER BY date_created DESC LIMIT 0,20; This is the explain result:
    ———————————–
    select_type: SIMPLE
    type: ref
    key: multi
    key_len: 7
    ref: const,const,const
    rows: 104
    Extra: Using where

    But for SELECT id, title, body FROM messages WHERE type_id = 2 ORDER BY date_created DESC LIMIT 0,20; it’s:
    ———————————–
    select_type: SIMPLE
    type: ref
    key: multi
    key_len: 6
    ref: const,const
    rows: 108
    Extra: Using where; Using filesort

    Now, what can I do to avoid the filesort?

  16. peter says:

    Dragos,

    Good catch. For sorting to work using index you have to have it following = exactly so if you have (A,B,C) index you will be able to use index for sort in case A=5 ORDER BY B
    or in case A=5 AND B=5 ORDER BY C but not in case A=5 ORDER BY C.

    So in your case if you want to have sorting done using index by second query you will need index (type_id,date_created)

  17. Danny says:

    I have a question about the previous post. An index (A,B,C) would work when the query looks like ..WHERE A=5 AND B=5 ORDER BY C. Would it also work when the A and B are ordered differently, thus the query looks like WHERE B=5 AND A=5 ORDER BY C ?

  18. peter says:

    Danny,

    It does not matter in which order your WHERE clause constraints are mentioned. Order of columns in the index does. Order of columns in ORDER BY clause and sort direction does.

  19. Danny says:

    Oke thank you. Few minutes ago I recieved your book ‘High Perfomance MySQL’. So I can start reading this weekend and hope all about the indexes becomes clear to me.

  20. chad says:

    I have an issue that I can’t figure out. I have a huge query that is like: where a,b,c,d,e,f pulling from a DB of nearly 5 million records. I can make a multi column index on a,b,c,d,e,f but the query is dynamic. It may search on a,b,e,f or b,d,f. The amount of indexes I would need to create for this would probably be 36 (because of the leftmost prefix rule). I know that is rather absurd. So, how can I make sure everything is indexed so searches are fast on this large query and I don’t have to make all the indexes?

  21. peter says:

    Chad,

    Indeed this is a hard choice. You may end up having a lot of indexes if you need to have a lot of different search combinations. Some of them (for example double ranges) may not be handled effectively by MySQL at all.

    One of solutions we used when indexes can’t be constructed is using Sphinx (www.sphinxsearch.com) which can be thrown on the cluster easily and which can serve queries very fast even without having perfect indexes.

  22. Daniel Blackhurst says:

    Hi,

    I have a table with over 80,000,000 rows containing GPS information latitude, longitude, date_time, vehicle_id and a few more columns.

    I can’t seem to find a good index combination for a query bringing back around 7 days of results for a vehicle. Tried (date_time,vehicle_id) and MYSQL won’t use it, (date_time) and its bringing back thousands of rows under explain. Using innodb. Inserting up to 100 per second.

    Query

    select
    g.date_time, g.speed,
    g.odometer, gl.building_number_name,
    gl.street, gl.town_city, gl.region, gl.postcode, io.high, i.id as is_idle, m.id as is_moving, o.id as over_speed,
    l.name as location_name
    from gps_positions g

    inner join geocoded_locations gl on g.geocoded_location_id = gl.id
    left join io_events io on g.id = io.gps_position_id
    left join idling_events i on g.id = i.gps_position_id
    left join moving_events m on g.id = m.gps_position_id
    left join overspeed_events o on g.id = o.gps_position_id
    left join locations l on l.id = gl.location_id
    left join ios on io.io_id = ios.id

    where
    g.date_time between ’2009-10-15 00:00:00′ and ’2009-10-15 23:59:58′
    and g.vehicle_id = 6261
    and ( (io.high is not null and ios.gpio = 8) or i.id is not null or m.id is not null or o.id is not null )
    order by g.date_time;

    Thanks, Dan

  23. Manik says:

    Hi Peter,

    I have a similar question as to what Chad asked about having multiple column keys. We need to build up queries based on columns a,b,c,d and in those queries the column ‘a’ will always be there.

    So the queries can be like

    WHERE a= AND b= AND c= AND d=
    OR
    WHERE a= AND c= AND d=
    OR
    WHERE a= AND b= AND d=

    Would you recommend one composite index like KEY ‘my_key(‘a’,’b’,’c’,’d’) which can server my purpose efficiently?

    Thanks in Advance
    Manik

  24. rulix says:

    Hi Peter,

    I have a scenario here on multiple column keys. I created an partial index for person table of 2M+ records:

    CREATE INDEX firstlastname_idx ON person(firstname(2), lastname(2));

    The requirement is for wildcard searching on the first character on the firstname and lastname.

    select firstname, lastname from person where firstname like ‘p%’ and lastname like ‘s%’;

    It’s working fine but I also notice that when the server is idle for some time and then execute a query it takes more than a minute to finished the query. All subsequent query is a snap. So the problem is on the first query execution.

    I do run explain and it using the partial index. I know there is a issue about mysql on warming up but I haven’t found any concrete solution/advice to this.

    The issue in my application is I am getting “504 Gateway Time-out” error because of this slow first query.

    Any thoughts about myqsl warm up? and how to improve it.

    Thanks

    Rulix

  25. wenhai says:

    Hi Peter,

    For the same query ” EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 AND i2=50;” , why there is different explain result after execute query “UPDATE idxtest SET i2=i1;”.
    I was confused.

    1.
    mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 AND i2=50;
    2.
    +—-+————-+———+——+—————-+———-+———+————-+——+——-+
    3.
    | id | select_type | TABLE | type | possible_keys | KEY | key_len | ref | rows | Extra |
    4.
    +—-+————-+———+——+—————-+———-+———+————-+——+——-+
    5.
    | 1 | SIMPLE | idxtest | ref | i1,i2,combined | combined | 8 | const,const | 665 | |
    6.
    +—-+————-+———+——+—————-+———-+———+————-+——+——-+
    7.
    1 row IN SET (0.00 sec)

    1.
    mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 AND i2=50;
    2.
    +—-+————-+———+————-+—————-+——-+———+——+——+————————————-+
    3.
    | id | select_type | TABLE | type | possible_keys | KEY | key_len | ref | rows | Extra |
    4.
    +—-+————-+———+————-+—————-+——-+———+——+——+————————————-+
    5.
    | 1 | SIMPLE | idxtest | index_merge | i1,i2,combined | i2,i1 | 4,4 | NULL | 959 | USING intersect(i2,i1); USING WHERE |
    6.
    +—-+————-+———+————-+—————-+——-+———+——+——+————————————-+
    7.
    1 row IN SET (0.00 sec)

  26. wenhai says:

    Hi peter,

    For the same query”EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 AND i2=50;”, why the explain result is different after you set i1=i2.
    I think mysql should still choose multi col index not index merge.

  27. Claudiu Filip says:

    Hi peter,

    I have the following problem:

    PLEASE EXPLAIN SELECT ‘guys commenting above that they need to pay for support, because you are not fed by International Red Cross database department and the cheapest way to pay is buying your book. If that is too much for them, they should at least buy some bananas, rent some monkeys and start sorting their 20M rows table on plain paper.’;

  28. Ricardo says:

    I dont think the conclusion of this post is right. I have a myisam table with a primary key spanning 5 columns. I do a select using a WHERE on every of those 5 columns ANDed. Using the primary key (multicolumn index) it takes 25s, using a single index in one of the columns it takes 1 sec. I did a profiling and most of the 25s is taken in “Sending data” stage. The primary key has cardinality of about 7M and the single column about 80. Am i missing somehting?

  29. Julio says:

    Hi Peter. Your Article its too interesting.

    I have a question, i know that index or many index on a table improve high perfomance in select queries, but worsen perfomance in Insert, Update and Delete queries. Is it true?
    And if is it, how could i balance the choice??? too many indexes and high performance select or high perfomance update with few indexes??? how balance it???

    I wait for a your answer!

    Thank u.

    All regard

  30. Muhammad Omair says:

    HI Peter

    I had a question regarding index merge (intersect). Below are two explains that I executed for the same SQL query with a couple of mins gap on the same mysql prod instance . The first SQL query executed in 0.03 secs whereas the second SQL query when using index merge executes in 3.8 secs. I dont understand why MySQL uses index merge after a couple of mins when previously it was using a single index. I would greatly appreciate your answer.

    First explain
    —————–
    explain SELECT LinksTbl2.id, LinksTbl2.title, LinksTbl2.url, PersonTbl2.firstname, PersonTbl2.lastname, LinksTbl2.prjid, LinksTbl2.tstamp, ProjectTbl2.title, LinksTbl2.treeid, FileTbl.g_access, FileTbl.userid, FileTbl.folder_desc, LinksTbl2.description, LinksTbl2.userid, FileTbl.customize_access, FileTbl.fromdate, FileTbl.todate,LinksTbl2.main_version_id,LinksTbl2.major_version,LinksTbl2.minor_version, LinksTbl2.target, LinksTbl2.item_desc FROM LinksTbl2, ProjectTbl2, FileTbl, PersonTbl2 WHERE (LinksTbl2.prjid = ProjectTbl2.id) AND LinksTbl2.prjid=1614994645 AND (ProjectTbl2.namespace = 0) AND (LinksTbl2.tstamp > ’1333967623′) AND (FileTbl.id=LinksTbl2.treeid) AND (PersonTbl2.id=LinksTbl2.userid) AND (LinksTbl2.deactive != 1) AND (LinksTbl2.item_desc =1) AND (LinksTbl2.hidden_element!=1) AND (FileTbl.hidden_element!=1) AND (FileTbl.folder_desc ’1333967623′) AND (FileTbl.id=LinksTbl2.treeid) AND (PersonTbl2.id=LinksTbl2.userid) AND (LinksTbl2.deactive != 1) AND (LinksTbl2.item_desc =1) AND (LinksTbl2.hidden_element!=1) AND (FileTbl.hidden_element!=1) AND (FileTbl.folder_desc<2) AND (FileTbl.deactive!=1) AND (LinksTbl2.converted_from_id IS NULL OR LinksTbl2.converted_from_id=0) ORDER BY LinksTbl2.tstamp DESC, LinksTbl2.id DESC LIMIT 505 ;
    +—-+————-+————-+————-+————————————————————–+——————–+———+———————————-+——+————————————————–+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+————-+————-+————————————————————–+——————–+———+———————————-+——+————————————————–+
    | 1 | SIMPLE | ProjectTbl2 | const | PRIMARY,namespace_ix | PRIMARY | 4 | const | 1 | Using filesort |
    | 1 | SIMPLE | LinksTbl2 | index_merge | tstamp,userid,converted_from_id,item_desc,treeid_ix,prjid_ix | prjid_ix,item_desc | 4,1 | NULL | 294 | Using intersect(prjid_ix,item_desc); Using where |
    | 1 | SIMPLE | PersonTbl2 | eq_ref | PRIMARY | PRIMARY | 4 | oslogs_jan02_db.LinksTbl2.userid | 1 | |
    | 1 | SIMPLE | FileTbl | eq_ref | PRIMARY,folder_desc_ix | PRIMARY | 4 | oslogs_jan02_db.LinksTbl2.treeid | 1 | Using where |
    +—-+————-+————-+————-+————————————————————–+——————–+———+———————————-+——+————————————————–+
    4 rows in set (0.00 sec)

  31. J. Ketting says:

    Hi,

    I have a table with two text fields, both having an index of length = 4.

    SELECT COUNT(*) cnt FROM mytable WHERE field1 = ‘blue’;

    shows me 4 records and

    SELECT COUNT(*) cnt FROM mytable WHERE field2 = ‘blue’;

    shows me another 4 records (different result)
    But:

    SELECT COUNT(*) cnt FROM mytable WHERE field1 = ‘blue’ OR field2 = ‘blue’;

    shows me only 5 records! Why???
    When I use DESCRIBE SELECT … it tells me the following:

    id select_type table type possible_keys key key_len ref rows Extra
    1 SIMPLE mytable index_merge field1,field2 field1,field2 6,6 NULL 495 Using sort_union(field1,field2); Using where

    How is this possible? Why can’t I see all 8 records?

Speak Your Mind

*