April 23, 2014

How adding another table to JOIN can improve performance ?

JOINs are expensive and it most typical the fewer tables (for the same database) you join the better performance you will get. As for any rules there are however exceptions :)

The one I’m speaking about comes from the issue with MySQL optimizer stopping using further index key parts as soon as there is a range clause on the previous key part. So if you have INDEX(A,B) and have a where clause A BETWEEN 5 and 10 AND B=6 only the first part (A) of the index will be used which can be seriously affect performance. Of course in this example you can use index (B,A) but there are many similar cases when it is not possible.

I have described couple of solutions to this problem – using IN list instead of range or UNION which however require rather serious application changes and also can result in huge IN lists and suboptimal execution for large ranges.

Lets take a look at very typical reporting query which queries data for date range for multiple of groups (these can be devices, pages, users …. etc)

As you can see from the EXPLAIN this query is expected to analyze over 300.000 of rows which is relatively fast for this (in memory) table but will become unacceptable as soon as you get to do random disk IO.

Note this is also interesting case of EXPLAIN being wrong – it shows key_len=7 which corresponds to the full key while only first key part is used.

Let us now replace the range with IN list in this query:

So we get same result but approximately 50 times faster. In this report we had just one month worth of data – what if you would have a year ? 5 years ? What if you get say thousands of groups at the same time ? Performing such query MySQL has to build (and do lookups) for all combinations which is 31*10=310 in this case. But if it gets to hundreds of thousands this method starts to break (and newer MySQL versions will stop using this optimization method if there are too many combinations to check).

Instead you could use JOIN to get list of days matching range from some pre-generated table and use the join to retrieve the rows from original table:

As you can see it does not work while I know I used exactly this trick to optimize some nasty queries.
It looks like equality propagation is working here (note the number of rows for second table in join is estimated same in original query) and we get the range clause on “info” table instead nested loops join – exactly what we tried to avoid.

It is easy to block equality propagation by using some trivial function:

So we stopped equality propagation but now have another problem – for some reason MySQL decides to only do “ref” on the date only instead of using range on day and list of groups for each join iteration.
This does not make sense but this is how it is.

I also tried to increase cardinality by having all rows to have different group_id and it still does not work.

The trick however does work if you have just one group_id (and in this case you do not even need to trick around equity propagation to make it work)

For original query form with single group_id query was taking 0.95 sec. The query with BETWEEN range replaced with IN list was instant 0.00 sec same as the query using join with day list table.

So we finally managed to get better performance by joining data to yet another table though why it does not work for multiple group remains question to check with MySQL Optimizer team :)

UPDATE: I just heard back from Igor Babaev saying it was designed this way (because the first component can run through very many values). The second component is simply not considered for range unless it is equality. You always have something to learn about MySQL Optimizer gotchas :)

At the same time I figured out how to make MySQL Optimizer to do what we want to do – Just add yet another table to the join so the info table just has bunch of ref lookups:

This query looks very scary but in fact perform much better than original one. In the real queries you can use table with ids just as we had table of days with a where clause instead of precreated table.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. ryan says:

    Saying that “it was designed that way” is a bit of a cop-out, isn’t it? You’ve clearly established that the optimizer doesn’t properly optimize between/ranges or in/lists compared with how it can optimize table joins.

  2. Scott says:

    How about this for a schema hack:
    Modify the table to have two date columns, d1 and d2
    Make the software insert the same date into each column

    The query where d1 > lowerBound and d2 < upperBound

  3. peter says:

    Ryan,

    With optimizer we always find issues. Some of them are bugs others are just limitations in logic. In this case it is exactly this case – Optimizer could consider and use such plans but it does not do it, as it can’t do hash join, loose index scan etc.

  4. peter says:

    Scott,

    Why would you expect having ranges on different columns to help in this case ?

  5. Scott says:

    Maybe I misunderstood the nature of the problem here – is it not just between clauses, but any where clause that isn’t direct equality or IN ?

  6. peter says:

    Right. It does not only applies to between < , > complex combinations are all have the same issue.

    As you can see the trick is – if we have AND for different index part we can put selections on different tables instead and just use EQ join on the main fact table.

  7. Rob says:

    I go to “a whole ‘nother level” with this. I include a table in my databases that includes :

    dt date
    yy smallint # YEAR()
    qq tinyint # QUARTER()
    yyqq int #
    mm tinyint #
    dd tinyint
    wknum1 tinyint # WEEK()
    yywknum1 int # YEARWEEK()
    wknum2 tinyint # some things define the first wk of the yr as the first wk with Sunday in that yr,
    # so you sometimes end up with 2 definitions for what week of the year it is.
    yywknum2 int # yy * 1000 + wknum2
    dow tinyint # DAYOFWEEK()
    dom tinyint # DAYOFMONTH()
    doy tinyint # DAYOFYEAR()
    yymmdd int
    # flag fields
    wkend tinyint # handy, but not necessary –
    # makes queries more natural language
    # select gross_rev from rev, days where rev.dt = days.dt and days.wkend = 1 and days.yy = 2008
    # no where clause function calls, OMG!
    fdom tinyint # first day of the month
    ldom tinyint # last day of the month
    fwdom tinyint # first work day of the month
    lwdom tinyint # last work day of the month
    fwdoy tinyint # first work day of the yr
    lwdoy tinyint # last work day of the yr
    usholiday tinyint # is this a US holiday?wh
    rain tinyint # did it rain on this day?
    snow tinyint # did it snow on this day?
    over85 tinyint # was it over 85F on this day?
    fire tinyint # were there wild fires in the area on this day?

    I spin up records for as much of the past as needed and for the next 10 – 50 years.

    This gives you the ability to quickly do reporting that includes very complex date requirements:
    show me the coffee shop gross revenues for Q1 of the past 3 years for the first workday of each month during those periods.

    For business that have business that changes substantially based on certain events, you can have flag fields for those. Screamin’ Mimi’s Ice Cream might want to have a report comparing Summer Saturday revenue on day hotter than 85F vs days cooler than that. Outdoor restaurants, boat rentals, downtown store that caters to the work day crowds, etc.

    For other queries, having an easy way to check if something is the last day of the month is a blessing. It is massively useful and I can’t imagine design a system without it.

  8. peter says:

    Rob,

    Sure. This is a bit different though. You are adding whole another dimension tables for things which can’t be checked efficiently using index in original table (such as what day of the week it is) or which are not in the table (such as if it rained) – in this case you’ve got to add columns to fact table or dimension table as you did. That is a great thing to do.

    My main point was a bit different one would be perfectly find to check range on date and group_id from the index on the table but MySQL Optimizer/Execution engine just not capable of doing so yet.

  9. Ruslan Zakirov says:

    Very surprised to see this in my RSS reader right after submitting http://bugs.mysql.com/bug.php?id=38550 that covers later cases in your research.

  10. peter says:

    Ruslan,

    Why surprising ? Is it the first optimizer gotcha with MySQL you ran into ?

  11. Ruslan Zakirov says:

    Coincidence is surprising. Tired of workarounding so decided to push them into the tracker.

  12. Howard says:

    Hi I’m curious, wouldn’t the performance of the query become unacceptable again with any manner of sorting (resulting in filesort/temp)? You would almost certain sort from the info table rather than the joined date table.

Speak Your Mind

*