July 26, 2014

Multiple column index vs multiple indexes

(There is an updated version of the content in this post by Percona’s Stephane Combaudon available here.)
After my previous post there were questions raised about Index Merge on Multiple Indexes vs Two Column Index efficiency. I mentioned in most cases when query can use both of the ways using multiple column index would be faster but I also went ahead to do some benchmarks today.

I’m using couple of simple tables:

I have populated this table with random data for i and j having both of them having 1000 of distinct values, independent on each other. I also created couple of other tables with same data just with really low cardinality with i and j having just 3 values each. The table contained about 18M rows though was small enough to fit in the systems memory.

I’ve benchmarked simple queries using where clause which covers multiple columns:

As some of them there way too fast if run once I ran them multiple times and measured time appropriately. To remove the overhead of starting MySQL etc from equation I also measured execution of “SELECT 1″ query using same script and subtracted this time from result in the table.

time for ((i=0;i<100;i+=1)); do mysql test -e “SELECT sum(length(val)) FROM t1000idxmerge WHERE i=2 AND j=1″; done > /dev/null

In the result table I compute per query results and present results in milliseconds.

Query1000 – 2 indexes1000 – 2 columns3 – 2 indexes3 – 2 columns
Q1200.269402530
Q2250.374007500
Q3253072003830
Q470380047004700
Q5252980--

Note1: Q1 will not use Index Merge technique for low cardinality table but instead pick to do single index scan. I’m not aware of the optimizer hint which would allow to force index merge as you can do with index accesses in general.

Note2 Q2/Q3 can’t use Index Merge however as it is currently implemented so they would use single index range scan. The 2 indexes however benefits to Q3 because it can only use first keypart of index (j,i) to resolve BETWEEN part of the clause which may not be very selective.

Note3 You may be surprised why 2 column index is faster for Q3 in case of low cardinality even though MySQL can’t use index well. You’re right MySQL can’t and MySQL does not – Full table scan is performed and in this case turns to be faster than scanning 1/5th of the table using index. Also Full Table Scan is preferred for Q4 in all cases but in case of high cardinality multiple index configuration.

Note4 Q5 was just run on high cardinality tables to show what difference large BETWEEN can make.

Conclusion: For benchmarked queries we can see Multiple Column index beats Index Merge in all cases when such index can be used. It is also worth to watchout a MySQL may decide not to do Index merge (either intersection or union) but instead do full table scan or access table picking only one index on the pair.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Multi column indexes may be faster in some cases, however, you also need to balance the performance of write operations, and disk consumption.

    As a rule of thumb, the more indexes you have, the slower write operations are. DELETEs can be especially costly, under the right circumstances. Also, I’ve seen situations where indexes use more disk space than the actual table.

  2. tgabi says:

    Peter,
    very intersting. Please post a histogram of query time versus range of BETWEEN of queries Q2 and Q3/Q5. I did my tests on a table crafted on the specifications you posted but I’d like to see your data before I post my comments. I’m not interested in the case where cardinality is 3 since it’s too low to have a reliable index usage.

  3. peter says:

    Tgabi,

    I did not do the timing for histogram. You have the table information and so can do the run yourself and share results :)

  4. tgabi says:

    Very well. Here’s my data on the table with cardinality 1000 and 18M records – the query is SELECT SQL_NO_CACHE sum(length(val)) FROM t1000c WHERE j=2 AND i BETWEEN 1 AND x:

    x time
    ————–
    2 48.57
    3 97.82
    4 147.25
    5 197.1
    10 443.31
    15 691.06

    If goes downhill from there. So, very limited range of aplicability.
    On the other hand, I’ve got consistent results from 2 index table, surprisingly, much faster than your results. This is why I wanted to see your data. I’m not sure why my results are so much faster:

    x time
    ————–
    2 46.08
    10 46.14
    20 46.24

    Anyway, the 2 column index is only useful for a limited range of values. It can only be used as performance boost in same specific cases in addition to single column indexes.
    I’ve used a 8 GB RAM computer dual core CPU. Tables are less than 1 GB in size. The index cache was set at 4 GB. Mysql version 5.0.67 ( i was doing testing for bug 29857).

    mysql> show index from t1000;
    +——-+————+———-+————–+————-+———–+————-+———-+——–+——+————+———+
    | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
    +——-+————+———-+————–+————-+———–+————-+———-+——–+——+————+———+
    | t1000 | 1 | i | 1 | i | A | 1001 | NULL | NULL | | BTREE | |
    | t1000 | 1 | j | 1 | j | A | 1001 | NULL | NULL | | BTREE | |
    +——-+————+———-+————–+————-+———–+————-+———-+——–+——+————+———+
    2 rows in set (0.00 sec)

    mysql> select count(*) from t1000;
    +———-+
    | count(*) |
    +———-+
    | 18000000 |
    +———-+
    1 row in set (0.00 sec)

    (the other table, t1000c, has same data only combined index instead of simple index).

  5. tgabi says:

    Update: for whatever reason the multi-column index is buggered by BETWEEN. See below:

    mysql> explain SELECT SQL_NO_CACHE sum(length(val)) FROM t1000c WHERE j=2 AND i BETWEEN 1 AND 2;
    +—-+————-+——–+——-+—————+——+———+——+——-+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+——–+——-+—————+——+———+——+——-+————-+
    | 1 | SIMPLE | t1000c | range | i | i | 8 | NULL | 14729 | Using where |
    +—-+————-+——–+——-+—————+——+———+——+——-+————-+
    1 row in set (0.00 sec)

    mysql> explain SELECT SQL_NO_CACHE sum(length(val)) FROM t1000c WHERE j=2 AND i IN (1,2);
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    | 1 | SIMPLE | t1000c | range | i | i | 8 | NULL | 33 | Using where |
    +—-+————-+——–+——-+—————+——+———+——+——+————-+
    1 row in set (0.00 sec)

    Peter, what version of mysql are you using ?

  6. peter says:

    Tgabi,

    The query you’re running is exactly query which can’t use multi column index completely – BETWEEN makes it to scan only first key part this is why with it the performance is bad. If you have query as you described the index should be (j,i) not (i,j)

    Now you also see two more things with your explain – first Index Merge is not really used here – it is just range scan by one or other index. Second there is a huge difference between IN and BETWEEN – IN allows scanning both keyparts while BETWEEN does not.

  7. tgabi says:

    Yet, in your data shows as quick. Look at Q3:

    Q3 25 0.3

    was this on a “switched” index that doesn’t show on the description ?

    Regarding BETWEEN: shouldn’t the optimized transform this ?

  8. peter says:

    tgabi,

    Ah… This is typo it should be 30ms not 0.3ms So pretty much it works as good as BETWEEN 1 and 2 query would :)
    BETWEEN is never optimized to IN even though it is often possible – you can do it manually though

  9. jocelyn says:

    What about the impact of multiple column index on ORDER BY speed ?
    e.g. :

    SELECT * FROM a WHERE b=1 AND c=2 ORDER BY d

    Here, an index on (b,c,d) will avoid MySQL using filesorting.
    If three indexes are put on b, c and d, will MySQL be able to use index merge on d to avoid filesorting ?

  10. nguyen says:

    hi all,
    i have a table:
    CREATE TABLE yellow_pages (
    bid int(11) NOT NULL auto_increment,
    bname varchar(100) character set utf8 NOT NULL,
    address varchar(50) character set utf8 default NULL,
    city varchar(30) character set utf8 default NULL,
    state varchar(2) default NULL,
    zip int(5) unsigned NOT NULL,
    country varchar(20) default NULL,
    website varchar(50) default NULL,
    phone varchar(10) default NULL,
    fax varchar(10) default NULL,
    latitude decimal(10,6) default NULL,
    longitude decimal(10,6) default NULL,
    contact varchar(50) default NULL,
    title varchar(50) default NULL,
    sic int(10) unsigned default NULL,
    sic_des varchar(100) default NULL,
    uid int(10) default NULL,
    PRIMARY KEY (bid),
    KEY latitude (latitude),
    KEY longitude (longitude),
    KEY zip (zip),
    FULLTEXT KEY bname (bname),
    FULLTEXT KEY sic_des (sic_des)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
    and the query:

    select (IFNULL(ACOS(0.796641758191*COS(RADIANS(l.latitude))*(-0.525545896347*COS(RADIANS(l.longitude)) + -0.850765250132*SIN(RADIANS(l.longitude))) + 0.604451742578*SIN(RADIANS(l.latitude))), 0.00000)*6370298.85223) as dis, l.* from yellow_pages l WHERE match(sic_des) against(‘”hotel*” ‘ in boolean mode) and ( longitude BETWEEN -122.159246013 AND -121.250753987 and latitude BETWEEN 36.827530043 AND 37.551269957 or zip in(95050,95051,95052,95053,95054,95055,95056)) order by dis LIMIT 0, 10

    i use php(v5.2) to connect to mysql(v5.02) but it run sooooo long
    any idea??

    thanks

  11. praveen says:

    Hi, I am pasting my query. It takes some 1 min 22 sec to execute.
    TBL_RES_USER_SKILLS has some 5 lacks data. If I remove “AND SKILL_NAME=’JAVA’” condition from the below query then it gets result in 5 seconds. I am not able to figure out the problem.Please guide me.

    select count(*) from (select ID from ((SELECT ID FROM TBL_RES_USR) ORDER BY SCORE DESC) a where a.ID=(SELECT DISTINCT(USER_ID) FROM TBL_RES_SCORE_DATA WHERE COMPLETENESS>=80 and USER_ID=a.ID ) AND NOT((select COUNT(*) from TBL_USR_ORGANIZATION_INFO b where TO_DATE =(SELECT max(TO_DATE) from TBL_USR_ORGANIZATION_INFO where USER_ID=a.ID AND DEL_IND =’n’) AND USER_ID=a.ID AND DEL_IND =’n’ and ORGAN_NAME In(‘SAP’))) and a.ID=(select distinct(USER_ID) from TBL_RES_USER_SKILLS where USER_ID=a.ID and SKILL_NAME=’JAVA’ )and a.ID=(select distinct(USER_ID) from TBL_RES_SUMMARY where USER_ID = a.ID and TOTAL_EXPERIENCE<=100)and (a.ID=(select distinct(USER_ID) from TBL_RES_SUMMARY where (timediff(date_add((select curdate()),INTERVAL -’22:12′ HOUR_MINUTE),date_add(LAST_LOGIN,INTERVAL -’22:12′ HOUR_MINUTE))/24)<=365 and USER_ID=a.ID)))as Counts;

    +———-+
    | count(*) |
    +———-+
    | 4333 |
    +———-+
    1 row in set, 8666 warnings (1 min 22.93 sec)

  12. @praveen:
    Do an “Explain ” in both the cases and check the difference

  13. excellent!

    thanks for the post.

    I am wondering what is the best practice from performance point of view to index enumerations/chars

    If you have a lot of flags like: has_garden, has_pool, has_whatever. Its just an example from top of my head i am not actually writing property selling site ;)

    Mysql does not have bitmap indexes as far as i know and creating index with all of them does not seem like a good idea or does it? Columns order matters so if you search for house without considering first flag you are doomed.

    what is your take on that situation with millions of rows?

    Thanks : )

  14. we are working on a product.it has 93 tables,so how to optimize the performance..i am a complete fresher and i know nothing apart from the basic concepts of dbms.

Speak Your Mind

*