How adding another table to JOIN can improve performance ?

JOINs are expensive and it most typical the fewer tables (for the same database) you join the better performance you will get. As for any rules there are however exceptions 🙂

The one I’m speaking about comes from the issue with MySQL optimizer stopping using further index key parts as soon as there is a range clause on the previous key part. So if you have INDEX(A,B) and have a where clause A BETWEEN 5 and 10 AND B=6 only the first part (A) of the index will be used which can be seriously affect performance. Of course in this example you can use index (B,A) but there are many similar cases when it is not possible.

I have described couple of solutions to this problem – using IN list instead of range or UNION which however require rather serious application changes and also can result in huge IN lists and suboptimal execution for large ranges.

Lets take a look at very typical reporting query which queries data for date range for multiple of groups (these can be devices, pages, users …. etc)

CREATE TABLE `info` (

`id` int(10) unsigned NOT NULL auto_increment,

`d` date NOT NULL,

`group_id` int(10) unsigned NOT NULL,

`events` int(10) unsigned NOT NULL,

PRIMARY KEY (`id`),

KEY `d` (`d`,`group_id`)

) ENGINE=MyISAM AUTO_INCREMENT=18007591 DEFAULT CHARSET=latin1

mysql> select sum(events) from info where d between '2007-01-01' and '2007-01-31' and group_id in (10,20,30,40,50,60,70,80,90,100);

+-------------+

| sum(events) |

+-------------+

| 3289092 |

+-------------+

1 row in set (1.04 sec)

mysql> explain select sum(events) from info where d between '2007-01-01' and '2007-01-31' and group_id in (10,20,30,40,50,60,70,80,90,100) \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: info

type: range

possible_keys: d

key: d

key_len: 7

ref: NULL

rows: 355213

Extra: Using where

1 row in set (0.00 sec)

As you can see from the EXPLAIN this query is expected to analyze over 300.000 of rows which is relatively fast for this (in memory) table but will become unacceptable as soon as you get to do random disk IO.

Note this is also interesting case of EXPLAIN being wrong – it shows key_len=7 which corresponds to the full key while only first key part is used.

Let us now replace the range with IN list in this query:

mysql> explain select sum(events) from info where d IN('2007-01-01','2007-01-02','2007-01-03','2007-01-04','2007-01-05','2007-01-06','2007-01-07','2007-01-08','2007-01-09','2007-01-10','2007-01-11','2007-01-12','2007-01-13','2007-01-14','2007-01-15','2007-01-16','2007-01-17','2007-01-18','2007-01-19','2007-01-20','2007-01-21','2007-01-22','2007-01-23','2007-01-24','2007-01-25','2007-01-26','2007-01-27','2007-01-28','2007-01-29','2007-01-30','2007-01-31')  and group_id in (10,20,30,40,50,60,70,80,90,100) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: range
possible_keys: d
          key: d
      key_len: 7
          ref: NULL
         rows: 3681
        Extra: Using where
1 row in set (0.01 sec)

mysql> select sum(events) from info where d IN('2007-01-01','2007-01-02','2007-01-03','2007-01-04','2007-01-05','2007-01-06','2007-01-07','2007-01-08','2007-01-09','2007-01-10','2007-01-11','2007-01-12','2007-01-13','2007-01-14','2007-01-15','2007-01-16','2007-01-17','2007-01-18','2007-01-19','2007-01-20','2007-01-21','2007-01-22','2007-01-23','2007-01-24','2007-01-25','2007-01-26','2007-01-27','2007-01-28','2007-01-29','2007-01-30','2007-01-31')  and group_id in (10,20,30,40,50,60,70,80,90,100);
+-------------+
| sum(events) |
+-------------+
|     3289092 |
+-------------+
1 row in set (0.02 sec)

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: info

type: range

possible_keys: d

key: d

key_len: 7

ref: NULL

rows: 3681

Extra: Using where

1 row in set (0.01 sec)

+-------------+

| sum(events) |

+-------------+

| 3289092 |

+-------------+

1 row in set (0.02 sec)

So we get same result but approximately 50 times faster. In this report we had just one month worth of data – what if you would have a year ? 5 years ? What if you get say thousands of groups at the same time ? Performing such query MySQL has to build (and do lookups) for all combinations which is 31*10=310 in this case. But if it gets to hundreds of thousands this method starts to break (and newer MySQL versions will stop using this optimization method if there are too many combinations to check).

Instead you could use JOIN to get list of days matching range from some pre-generated table and use the join to retrieve the rows from original table:

mysql> show create table dl \G
*************************** 1. row ***************************
       Table: dl
Create Table: CREATE TABLE `dl` (
  `myday` date NOT NULL,
  PRIMARY KEY  (`myday`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> select  * from dl limit 5;
+------------+
| myday      |
+------------+
| 2001-01-01 |
| 2001-01-02 |
| 2001-01-03 |
| 2001-01-04 |
| 2001-01-05 |
+------------+
5 rows in set (0.00 sec)

mysql> explain select sum(events) from info,dl where myday between '2007-01-01' and '2007-01-31' and myday=d and group_id in (10,20,30,40,50,60,70,80,90,100) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: NULL
         rows: 30
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: range
possible_keys: d
          key: d
      key_len: 7
          ref: NULL
         rows: 355213
        Extra: Using where
2 rows in set (0.00 sec)

mysql> show create table dl \G

*************************** 1. row ***************************

Table: dl

Create Table: CREATE TABLE `dl` (

`myday` date NOT NULL,

PRIMARY KEY (`myday`)

) ENGINE=MyISAM DEFAULT CHARSET=latin1

1 row in set (0.00 sec)

mysql> select * from dl limit 5;

+------------+

| myday |

+------------+

| 2001-01-01 |

| 2001-01-02 |

| 2001-01-03 |

| 2001-01-04 |

| 2001-01-05 |

+------------+

5 rows in set (0.00 sec)

mysql> explain select sum(events) from info,dl where myday between '2007-01-01' and '2007-01-31' and myday=d and group_id in (10,20,30,40,50,60,70,80,90,100) \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: dl

type: range

possible_keys: PRIMARY

key: PRIMARY

key_len: 3

ref: NULL

rows: 30

Extra: Using where; Using index

*************************** 2. row ***************************

id: 1

select_type: SIMPLE

table: info

type: range

possible_keys: d

key: d

key_len: 7

ref: NULL

rows: 355213

Extra: Using where

2 rows in set (0.00 sec)

As you can see it does not work while I know I used exactly this trick to optimize some nasty queries.
It looks like equality propagation is working here (note the number of rows for second table in join is estimated same in original query) and we get the range clause on “info” table instead nested loops join – exactly what we tried to avoid.

It is easy to block equality propagation by using some trivial function:

mysql> explain select sum(events) from info,dl where myday between '2007-01-01' and '2007-01-31' and d=date(myday) and group_id in (10,20,30,40,50,60,70,80,90,100) \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: dl

type: range

possible_keys: PRIMARY

key: PRIMARY

key_len: 3

ref: NULL

rows: 30

Extra: Using where; Using index

*************************** 2. row ***************************

id: 1

select_type: SIMPLE

table: info

type: ref

possible_keys: d

key: d

key_len: 3

ref: func

rows: 17990

Extra: Using where

2 rows in set (0.00 sec)

So we stopped equality propagation but now have another problem – for some reason MySQL decides to only do “ref” on the date only instead of using range on day and list of groups for each join iteration.
This does not make sense but this is how it is.

I also tried to increase cardinality by having all rows to have different group_id and it still does not work.

The trick however does work if you have just one group_id (and in this case you do not even need to trick around equity propagation to make it work)

mysql> explain select sum(events) from info,dl where myday between '2007-01-01' and '2007-01-31' and d=myday and group_id in (10) \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: dl

type: range

possible_keys: PRIMARY

key: PRIMARY

key_len: 3

ref: NULL

rows: 30

Extra: Using where; Using index

*************************** 2. row ***************************

id: 1

select_type: SIMPLE

table: info

type: ref

possible_keys: d

key: d

key_len: 7

ref: test.dl.myday,const

rows: 18

Extra:

2 rows in set (0.00 sec)

For original query form with single group_id query was taking 0.95 sec. The query with BETWEEN range replaced with IN list was instant 0.00 sec same as the query using join with day list table.

So we finally managed to get better performance by joining data to yet another table though why it does not work for multiple group remains question to check with MySQL Optimizer team 🙂

UPDATE: I just heard back from Igor Babaev saying it was designed this way (because the first component can run through very many values). The second component is simply not considered for range unless it is equality. You always have something to learn about MySQL Optimizer gotchas 🙂

At the same time I figured out how to make MySQL Optimizer to do what we want to do – Just add yet another table to the join so the info table just has bunch of ref lookups:

mysql> select * from g;
+-----+
| gr  |
+-----+
|  10 |
|  20 |
|  30 |
|  40 |
|  50 |
|  60 |
|  70 |
|  80 |
|  90 |
| 100 |
+-----+
10 rows in set (0.00 sec)

mysql> EXPLAIN SELECT sum(events) FROM g,info,dl WHERE myday BETWEEN '2007-01-01' AND '2007-01-31' AND myday=d AND group_id=g.gr \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: NULL
         rows: 30
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: g
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 10
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: ref
possible_keys: d
          key: d
      key_len: 7
          ref: test.dl.myday,test.g.gr
         rows: 18
        Extra:
3 rows in set (0.00 sec)

mysql> select * from g;

+-----+

| gr |

+-----+

| 10 |

| 20 |

| 30 |

| 40 |

| 50 |

| 60 |

| 70 |

| 80 |

| 90 |

| 100 |

+-----+

10 rows in set (0.00 sec)

mysql> EXPLAIN SELECT sum(events) FROM g,info,dl WHERE myday BETWEEN '2007-01-01' AND '2007-01-31' AND myday=d AND group_id=g.gr \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: dl

type: range

possible_keys: PRIMARY

key: PRIMARY

key_len: 3

ref: NULL

rows: 30

Extra: Using where; Using index

*************************** 2. row ***************************

id: 1

select_type: SIMPLE

table: g

type: index

possible_keys: PRIMARY

key: PRIMARY

key_len: 4

ref: NULL

rows: 10

Extra: Using index

*************************** 3. row ***************************

id: 1

select_type: SIMPLE

table: info

type: ref

possible_keys: d

key: d

key_len: 7

ref: test.dl.myday,test.g.gr

rows: 18

Extra:

3 rows in set (0.00 sec)

This query looks very scary but in fact perform much better than original one. In the real queries you can use table with ids just as we had table of days with a where clause instead of precreated table.

12 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

ryan

15 years ago

Saying that “it was designed that way” is a bit of a cop-out, isn’t it? You’ve clearly established that the optimizer doesn’t properly optimize between/ranges or in/lists compared with how it can optimize table joins.

Scott

15 years ago

How about this for a schema hack:
Modify the table to have two date columns, d1 and d2
Make the software insert the same date into each column

The query where d1 > lowerBound and d2 < upperBound

Peter Zaitsev

Author

15 years ago

Ryan,

With optimizer we always find issues. Some of them are bugs others are just limitations in logic. In this case it is exactly this case – Optimizer could consider and use such plans but it does not do it, as it can’t do hash join, loose index scan etc.

Peter Zaitsev

Author

15 years ago

Scott,

Why would you expect having ranges on different columns to help in this case ?

Scott

15 years ago

Maybe I misunderstood the nature of the problem here – is it not just between clauses, but any where clause that isn’t direct equality or IN ?

Peter Zaitsev

Author

15 years ago

Right. It does not only applies to between <, > complex combinations are all have the same issue.

As you can see the trick is – if we have AND for different index part we can put selections on different tables instead and just use EQ join on the main fact table.

Rob

15 years ago

I go to “a whole ‘nother level” with this. I include a table in my databases that includes :

dt date
yy smallint # YEAR()
qq tinyint # QUARTER()
yyqq int #
mm tinyint #
dd tinyint
wknum1 tinyint # WEEK()
yywknum1 int # YEARWEEK()
wknum2 tinyint # some things define the first wk of the yr as the first wk with Sunday in that yr,
# so you sometimes end up with 2 definitions for what week of the year it is.
yywknum2 int # yy * 1000 + wknum2
dow tinyint # DAYOFWEEK()
dom tinyint # DAYOFMONTH()
doy tinyint # DAYOFYEAR()
yymmdd int
# flag fields
wkend tinyint # handy, but not necessary –
# makes queries more natural language
# select gross_rev from rev, days where rev.dt = days.dt and days.wkend = 1 and days.yy = 2008
# no where clause function calls, OMG!
fdom tinyint # first day of the month
ldom tinyint # last day of the month
fwdom tinyint # first work day of the month
lwdom tinyint # last work day of the month
fwdoy tinyint # first work day of the yr
lwdoy tinyint # last work day of the yr
usholiday tinyint # is this a US holiday?wh
rain tinyint # did it rain on this day?
snow tinyint # did it snow on this day?
over85 tinyint # was it over 85F on this day?
fire tinyint # were there wild fires in the area on this day?

I spin up records for as much of the past as needed and for the next 10 – 50 years.

This gives you the ability to quickly do reporting that includes very complex date requirements:
show me the coffee shop gross revenues for Q1 of the past 3 years for the first workday of each month during those periods.

For business that have business that changes substantially based on certain events, you can have flag fields for those. Screamin’ Mimi’s Ice Cream might want to have a report comparing Summer Saturday revenue on day hotter than 85F vs days cooler than that. Outdoor restaurants, boat rentals, downtown store that caters to the work day crowds, etc.

For other queries, having an easy way to check if something is the last day of the month is a blessing. It is massively useful and I can’t imagine design a system without it.

Peter Zaitsev

Author

15 years ago

Rob,

Sure. This is a bit different though. You are adding whole another dimension tables for things which can’t be checked efficiently using index in original table (such as what day of the week it is) or which are not in the table (such as if it rained) – in this case you’ve got to add columns to fact table or dimension table as you did. That is a great thing to do.

My main point was a bit different one would be perfectly find to check range on date and group_id from the index on the table but MySQL Optimizer/Execution engine just not capable of doing so yet.

Ruslan Zakirov

15 years ago

Very surprised to see this in my RSS reader right after submitting http://bugs.mysql.com/bug.php?id=38550 that covers later cases in your research.

Peter Zaitsev

Author

15 years ago

Ruslan,

Why surprising ? Is it the first optimizer gotcha with MySQL you ran into ?

Ruslan Zakirov

15 years ago

Coincidence is surprising. Tired of workarounding so decided to push them into the tracker.

Howard

15 years ago

Hi I’m curious, wouldn’t the performance of the query become unacceptable again with any manner of sorting (resulting in filesort/temp)? You would almost certain sort from the info table rather than the joined date table.

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

How adding another table to JOIN can improve performance ?

Related

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Sets and Sorted Sets

Hello World… Hello Valkey! Let’s Get Started!

Valkey/Redis Configurations and Persistent Setting of the Key Parameters

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

How adding another table to JOIN can improve performance ?

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Sets and Sorted Sets

Hello World… Hello Valkey! Let’s Get Started!

Valkey/Redis Configurations and Persistent Setting of the Key Parameters

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation