JOINs are expensive and it most typical the fewer tables (for the same database) you join the better performance you will get. As for any rules there are however exceptions 🙂

The one I’m speaking about comes from the issue with MySQL optimizer stopping using further index key parts as soon as there is a range clause on the previous key part. So if you have INDEX(A,B) and have a where clause A BETWEEN 5 and 10 AND B=6 only the first part (A) of the index will be used which can be seriously affect performance. Of course in this example you can use index (B,A) but there are many similar cases when it is not possible.

I have described couple of solutions to this problem – using IN list instead of range or UNION which however require rather serious application changes and also can result in huge IN lists and suboptimal execution for large ranges.

Lets take a look at very typical reporting query which queries data for date range for multiple of groups (these can be devices, pages, users …. etc)

As you can see from the EXPLAIN this query is expected to analyze over 300.000 of rows which is relatively fast for this (in memory) table but will become unacceptable as soon as you get to do random disk IO.

Note this is also interesting case of EXPLAIN being wrong – it shows key_len=7 which corresponds to the full key while only first key part is used.

Let us now replace the range with IN list in this query:

So we get same result but approximately 50 times faster. In this report we had just one month worth of data – what if you would have a year ? 5 years ? What if you get say thousands of groups at the same time ? Performing such query MySQL has to build (and do lookups) for all combinations which is 31*10=310 in this case. But if it gets to hundreds of thousands this method starts to break (and newer MySQL versions will stop using this optimization method if there are too many combinations to check).

Instead you could use JOIN to get list of days matching range from some pre-generated table and use the join to retrieve the rows from original table:

As you can see it does not work while I know I used exactly this trick to optimize some nasty queries.
It looks like equality propagation is working here (note the number of rows for second table in join is estimated same in original query) and we get the range clause on “info” table instead nested loops join – exactly what we tried to avoid.

It is easy to block equality propagation by using some trivial function:

So we stopped equality propagation but now have another problem – for some reason MySQL decides to only do “ref” on the date only instead of using range on day and list of groups for each join iteration.
This does not make sense but this is how it is.

I also tried to increase cardinality by having all rows to have different group_id and it still does not work.

The trick however does work if you have just one group_id (and in this case you do not even need to trick around equity propagation to make it work)

For original query form with single group_id query was taking 0.95 sec. The query with BETWEEN range replaced with IN list was instant 0.00 sec same as the query using join with day list table.

So we finally managed to get better performance by joining data to yet another table though why it does not work for multiple group remains question to check with MySQL Optimizer team 🙂

UPDATE: I just heard back from Igor Babaev saying it was designed this way (because the first component can run through very many values). The second component is simply not considered for range unless it is equality. You always have something to learn about MySQL Optimizer gotchas 🙂

At the same time I figured out how to make MySQL Optimizer to do what we want to do – Just add yet another table to the join so the info table just has bunch of ref lookups:

This query looks very scary but in fact perform much better than original one. In the real queries you can use table with ids just as we had table of days with a where clause instead of precreated table.

12 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
ryan

Saying that “it was designed that way” is a bit of a cop-out, isn’t it? You’ve clearly established that the optimizer doesn’t properly optimize between/ranges or in/lists compared with how it can optimize table joins.

Scott

How about this for a schema hack:
Modify the table to have two date columns, d1 and d2
Make the software insert the same date into each column

The query where d1 > lowerBound and d2 < upperBound

Scott

Maybe I misunderstood the nature of the problem here – is it not just between clauses, but any where clause that isn’t direct equality or IN ?

Rob

I go to “a whole ‘nother level” with this. I include a table in my databases that includes :

dt date
yy smallint # YEAR()
qq tinyint # QUARTER()
yyqq int #
mm tinyint #
dd tinyint
wknum1 tinyint # WEEK()
yywknum1 int # YEARWEEK()
wknum2 tinyint # some things define the first wk of the yr as the first wk with Sunday in that yr,
# so you sometimes end up with 2 definitions for what week of the year it is.
yywknum2 int # yy * 1000 + wknum2
dow tinyint # DAYOFWEEK()
dom tinyint # DAYOFMONTH()
doy tinyint # DAYOFYEAR()
yymmdd int
# flag fields
wkend tinyint # handy, but not necessary –
# makes queries more natural language
# select gross_rev from rev, days where rev.dt = days.dt and days.wkend = 1 and days.yy = 2008
# no where clause function calls, OMG!
fdom tinyint # first day of the month
ldom tinyint # last day of the month
fwdom tinyint # first work day of the month
lwdom tinyint # last work day of the month
fwdoy tinyint # first work day of the yr
lwdoy tinyint # last work day of the yr
usholiday tinyint # is this a US holiday?wh
rain tinyint # did it rain on this day?
snow tinyint # did it snow on this day?
over85 tinyint # was it over 85F on this day?
fire tinyint # were there wild fires in the area on this day?

I spin up records for as much of the past as needed and for the next 10 – 50 years.

This gives you the ability to quickly do reporting that includes very complex date requirements:
show me the coffee shop gross revenues for Q1 of the past 3 years for the first workday of each month during those periods.

For business that have business that changes substantially based on certain events, you can have flag fields for those. Screamin’ Mimi’s Ice Cream might want to have a report comparing Summer Saturday revenue on day hotter than 85F vs days cooler than that. Outdoor restaurants, boat rentals, downtown store that caters to the work day crowds, etc.

For other queries, having an easy way to check if something is the last day of the month is a blessing. It is massively useful and I can’t imagine design a system without it.

Ruslan Zakirov

Very surprised to see this in my RSS reader right after submitting http://bugs.mysql.com/bug.php?id=38550 that covers later cases in your research.

Ruslan Zakirov

Coincidence is surprising. Tired of workarounding so decided to push them into the tracker.

Howard

Hi I’m curious, wouldn’t the performance of the query become unacceptable again with any manner of sorting (resulting in filesort/temp)? You would almost certain sort from the info table rather than the joined date table.