August 23, 2014

The Optimization That (Often) Isn’t: Index Merge Intersection

Prior to version 5.0, MySQL could only use one index per table in a given query without any exceptions; folks that didn’t understand this limitation would often have tables with lots of single-column indexes on columns which commonly appeared in their WHERE clauses, and they’d wonder why the EXPLAIN plan for a given SELECT would show N possible index choices but only one index actually used.

To some extent, MySQL 5.0 and later changed this situation with the introduction of the “index merge” optimizer plan. The basic idea behind index merge is that for certain types of queries which contain WHERE clauses with columns that had single-column indexes on them, MySQL could sometimes make use of the multiple indexes. For instance, “SELECT foo FROM bar WHERE indexed_colA = X OR indexed_colB = Y” might use the index merge union algorithm, which would *simultaneously* (this is important, as we will see below) scan the indexes on indexed_colA and indexed_colB and then do a set-theoretic union of the two result sets. Replace that “OR” with an “AND” and you might see the set-theoretic intersection of the result sets. Sounds like this could be a significant performance win, perhaps. Sometimes it is. Other times, it is a major performance killer.

It’s fairly straightforward to tell when MySQL is doing this; run an EXPLAIN on your SELECT and you’ll see “index_merge” as the type of query and the type of index merge algorithm in the “Extra” field. Here’s an example of a table and a query which I ran into recently that did indeed attempt to use index_merge, but with disappointing results.

The users table in question here had approximately 4.5M rows, and an EXPLAIN produced the following execution plan:

At first glance, this might not look too bad. MySQL is using three different indexes to search approximately 8100 out of 4.5M rows to return the data we want. Since we’re only requesting the user_id column, which is the PK for this table and implicitly appended to each secondary index, our merged indexes are covering for the query, so that’s also good. When we ran it, however, it took 3 seconds. Not only that, but it came up at the top of a pt-query-digest report which showed that it was the most commonly executed query on the server by two orders of magnitude. Other queries were running maybe 1000 or 2000 times, but this one?

Note the low V/M here. This query *always* takes about 3.3 seconds to execute. Yet it appears so benign. What gives?

One word: selectivity. Out of 4.5M rows in the table, 4M of them had status = 1, and 4.4M of them had parent_id = 0. Only 17K, however, matched user_type = 2. It’s worth noting here that the (parent_id, status, user_type) = (0, 1, 2) tuple chosen by pt-query-digest as the sample query wasn’t a fluke or unfortunately-chosen representative; in talking to the customer, I discovered that it’s this particular set of conditions which was *always* used by this query. We wouldn’t see, for instance, this query looking for a different status or a different user type. When we think about what this means in the context of index merge, it means that MySQL is really looking at a search space closer to 8.5M rows, not 8100, because it has to read the index entries for each of the columns in the WHERE clause that are included in this merge operation and then perform a set intersection on the results. Here’s one situation where trying to do multiple things at the same time ends up taking longer – if we stopped after retrieving the 17k rows matching user_type=2, we’d save ourselves a lot of work. Ooops.

The good news is that once identified, there are multiple ways to fix this issue, and they are all straightforward to implement. The bad (sort of) news is that each situation is going to be different, and there isn’t likely to be a one-size-fits-all silver bullet solution.

For example, one option would be to simply adjust the optimizer_switch configuration setting and disable index_merge_intersection. It’s a dynamically-adjustable setting and can be modified globally or per-session:

This is the database equivalent of kicking the door in when you’ve forgotten your keys. Yes, it’s quick, and it will get you back in the house, but now you have no door, and your cats might get out. Or your neighbors will see you walking around in your underwear. Or any of a panoply of other unintended and potentially undesirable consequences, many of which you might not see right away. Disabling this setting might speed up this query (it did), but it might also negatively impact other queries which were benefitting from this optimization (it did that, too). Effective for quick testing (e.g., to find out if index merge is actually a problem) but probably not the best way to go for a permanent solution.

Another option is, of course, to change the indexing on the table. For instance, if 4.4M out of 4.5M rows have parent_id = 0, we might be tempted to remove the index on parent_id if queries for parent_id != 0 could be satisfied in some other way. After all, if an index isn’t there, MySQL can’t attempt to use it, but this has the same challenges that are present with the previous solution. What we do in order to help one query execute more quickly may end up creating unintended negative consequences elsewhere.

We could rewrite the query as a sub-select. This will work properly, and it won’t interfere with other queries, but IMHO it’s not very elegant and it’s a bit messy:

Finally, we can use index hints. Index hints are exactly what they sound like – “help” for the optimizer to favor or disfavor a particular index or set of indexes so that the query executes in a specific fashion. Index hints are fairly unintrusive and can be applied on a query-by-query basis, so there’s no need to go mucking about with server configuration, altering the table structure, or creating messy sub-selects. In cases where the optimizer just can’t get things right, and you, as the developer or DBA, know exactly how the server should be finding and manipulating data, they can be quite effective.

In the case of this query, since we know that it’s the index on (user_type) that is the one that MySQL should be looking at, we can rewrite the query like so, with the following result:

Making this change turns what was a 3 second query into a millisecond query. It’s hard to argue with success, although in many cases I recommend against using index hints for one very simple reason: stuff happens. Just a few possibilities:

  1. A new version of MySQL is released that has some new optimizer functionality that you end up missing out on.
  2. Junior developers might see USE INDEX() and suddenly start believing that they can always outsmart the optimizer by manually manipulating indexes in every query they write (not that I have ever seen this happen or anything).
  3. The person that originally added the USE INDEX() leaves the company and take the knowledge and rationale behind it out the door.
  4. The data change so much such that the index hint is now doing more harm than good.

Caveats aside, there are some very good reasons to use index hints for this particular query, and it was ultimately the approach I suggested as a short/medium-term solution while they did some re-work on the process as a whole. When I came back the next day and we looked at the Cacti reports, the “InnoDB rows examined” graph had dropped several orders of magnitude and CPU utilization on the machine had dropped by about 2/3. Not a bad outcome at all.

And so, what do we learn here?

  • Just because the output from EXPLAIN “appears to look good” doesn’t mean that it actually is good.
  • Whenever you see index_merge_intersection, it’s a good time to make sure that the indexes involved have good selectivity (a good practice in general, to be sure).
  • As they say in the Perl community, there’s more than one way to do it. Otherwise known as beware the law of unintended consequences.
  • Sometimes we have to sacrifice “elegance” (IMHO, index hints are inelegant and hackish) to produce the best possible outcome. C’est la vie.
About Ernie Souhrada

Ernie joined Percona in April 2012 as a Senior Consultant. In his previous lives, he has been everything from a Perl/Java developer to a Linux sysadmin, a MySQL DBA to a Cisco network engineer, and a security auditor to an IT engineering manager, many of these things all at the same time. When not working on MySQL, he might be found on the ski slope, at a psytrance festival, or at the nearest sushi bar.

Comments

  1. Regina says:

    Haven’t really done too much heavy work with MySQL, but I would expect MySQL to utilize table stats to arrive at if there is enough selectivity of an index for a particular query to make it worth using.

    PostgreSQL has similar functionality to the index merge strategy it calls bitmap scan that it’s had for a while. In these cases even if there are multiple indexes it can merge, it usually does a stats analysis to determine if an index is selective enough for a filter condition to warrant the cost of using it especially in a bitmap scan. Does MySQL have similar checks?

  2. How do you get data points of InnoDB rows examined? I can’t find anything in MySQL’s status outputs that provides that on a global level; do you fetch it from the queries individually?

  3. Phil Jensen says:

    James Pearson; try:
    show global status like ‘innodb_rows_read’;

  4. Justin Swanhart says:

    James,

    InnoDB reports the # of rows examined in SHOW ENGINE INNODB STATUS. The InnoDB plugin also has status counters:
    mysql> show global status like ‘%Innodb_rows%’;
    +———————-+———-+
    | Variable_name | Value |
    +———————-+———-+
    | Innodb_rows_deleted | 159629 |
    | Innodb_rows_inserted | 438 |
    | Innodb_rows_read | 27879241 |
    | Innodb_rows_updated | 42 |
    +———————-+———-+
    4 rows in set (0.00 sec)

  5. @Regina–

    MySQL does maintain statistics, but that’s not really the issue. The problem is that MySQL doesn’t correlate any of the statistics from one index with those from another, and this ends up leading to a value for “rows_examined” which is grossly underestimated. If we go back to the original query and the conditions on those columns, we can easily see that the indexes on status and parent_id are effectively useless, particularly when compared to the third index on user_type, but MySQL isn’t able to arrive at that same conclusion.

    As I understand it, the optimizer simply sees “hey, we have these three indexes and they fit the conditions for an index_merge, so let’s do it” – and when that comes back with an estimated search space of 8100 rows, that looks a lot cheaper (remember, the optimizer is cost-based) than my index hint query which has an estimated search space of about 32K. AFAIK, index hints (or multi-column indexes, when it makes sense to use them) are really the only decent way around this issue.

  6. We can also simply create a “covered” index on (user_type, status, parent_id, user_id), right? The index will be larger but MySQL will be able to satisfy the query out of the index itself.
    One more downside of “use index” hint is this – if, in the future, someone will remove or rename the index the query will start failing.

Speak Your Mind

*