This is the third in a series on what’s seriously limiting MySQL in certain circumstances (links: part 1, 2). This post is about subqueries, which in some cases execute outside-in instead of inside-out as users expect.

It’s easy to pick on subqueries in MySQL, so I’ll try to be gentle. The following query will surprise users unpleasantly:
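select * from a where a.id in (select id from b);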

Users expect the inner query to execute first, then the results to be substituted into the IN() list. But what happens instead is usually a full scan or index scan of table a, followed by N queries to table b. This is because MySQL rewrites the query to make the inner query dependent on the outer query, which could be an optimization in some cases, but de-optimizes the query in many other cases. NOT IN(SELECT …) queries execute badly, too. (Note: putting a literal list of items in the IN() clause performs fine. It’s only when there is a SELECT inside it that it works poorly.)
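Conceptually, the rewritten query behaves roughly like the following correlated form, evaluated once per row of table a (this is a sketch of the dependent form, not the exact internal rewrite):

select * from a
where exists (select 1 from b where b.id = a.id);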

The fix for this has been in progress for a few years, and Sergey Petrunia committed working code to the stalled 6.0 release. But it’s not quite clear whether that code was a complete solution. It has not been in any GA or RC release, so it hasn’t been used widely.

To be fair, many other database servers also have poor subquery performance, or have had it in the past and have fixed it. And many MySQL users have learned to simply write JOINs instead, so it isn’t that much of a limitation. But it would be a big improvement if it were fixed.
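For the example query above, the JOIN form looks something like this (if b.id is not unique, a DISTINCT or GROUP BY is needed to preserve the IN() semantics):

select a.* from a inner join b on b.id = a.id;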

See if you can guess what limitation number 4 will be!

21 Comments
Yves Trudeau

Hi Baron,
Look at the execution plans for 6.0.x and 5.0.x in http://www.bigdbahead.com/?p=187

Dathan Pattishall

Guess for #4 is the Query Cache.

Patrick Casey

My guess for the "normal case" #4 is the threading model and coarse-grained memory locks inside the system. It's most prevalent in things like the MyISAM key buffer cache, but there are still a lot of nasty big locks inside InnoDB that limit practical concurrency. Given that a modern server-class machine has a minimum of 8/16 cores (physical/HT), a database which can only use, say, 4-6 of them effectively is leaving a lot of horsepower on the table.

My personal vote for #4, because it impacts me but probably not most workloads, is:

#4 Inability to span InnoDB data files across disks or LUNs. Even if I use file-per-table, it's not practical to put, say, tables apple.ibd, pear.ibd and banana.ibd on one disk and taco.ibd, burrito.ibd, and burger.ibd on another. On my Oracle servers I can (and do) provision a lot of LUNs and put redo on two LUNs, data files on multiple LUNs, etc. On my InnoDB servers it's really not practical, so we just end up with single big LUNs and pray the SAN can handle it. I'm reasonably sure I could shut down the database, move the files manually, and leave symlinks in their place, but that's an extreme non-starter for me, since the production databases do not come down.

Patrick Casey

One interesting point on the subqueries:

Yes, I would dearly like to see this fixed as well. I’d even take a query hint that told the optimizer “eval the inner loop first” although I’d prefer the optimizer to be brighter.

As the OP said, a lot of the time you are better off with a join, but not every case converts into join semantics easily or precisely, and, frankly, a lot more people can write a logically correct inner select than can write a logically correct join.

tobi

It seems to me that many common subqueries can be transformed to joins very easily. The difficult cases can be left aside for the moment. Why not transform the simple cases?

tobi

Limitation 4 is that hash and merge joins are not supported, which seriously hurts large query performance for reporting. It can also lead to a runtime that is exponential in the number of joins.

tobi

It also does not have a hash aggregate, so big data sets have to be sorted in order to be grouped. Sorting big data sets is not a good idea when you can just put the data in a hash table instead. Hash joins and hash aggregates are what drive large-scale queries in Postgres, SQL Server, and Oracle.

Patrick Casey

Tobi:
I believe the challenge with transforming inner selects into joins is that doing it accurately (without changing the results) requires information that, while readily available to a human, isn't available to an optimizer.

Consider:

select * from food where id in (select food_id from meals where user = me)

The trivial rewrite is:

select food.* from food inner join meals on food.id = meals.food_id where meals.user = me

The problem is the result above *might* change the answer.

Consider:

Foods
id  name
1   taco
2   burger
3   apple

meals
meal       user  food_id
breakfast  me    3
lunch      me    1
dinner     me    1

It seems I had a taco for lunch and for dinner (not unreasonable).

Our original query gives back:
1 taco
3 apple

our rewritten query gives back
3 apple
1 taco
1 taco

In order to know whether we can safely rewrite, we have to know whether the inner select will contain any duplicates. An application designer knows this. The optimizer cannot, although in certain special cases it can figure it out.

Likewise, in a query like this, in the absence of an order by clause, the mere fact that we did a rewrite will likely change the row retrieval order. Technically this is legit, since in the absence of order by the user has no right to expect a particular order. In practice, though, there's lots of bad code out there that makes assumptions like this, and a global change like that would break it.

Dmitri Mikhailov

The correct rewrite is:

select distinct foods.*
from foods
inner join meals
  on meals.food_id = foods.id
where meals.user = 'me'
;

– or –

select distinct foods.*
from foods
inner join meals
  on meals.food_id = foods.id
  and meals.user = 'me'
;

– or –

select foods.*
from
  ( select distinct food_id
    from meals
    where meals.user = 'me'
  ) meals
inner join foods
  on foods.id = meals.food_id
;

Michael Peters

Does this limitation on subqueries also affect subqueries that are used in JOIN clauses? For instance

SELECT a.* FROM a JOIN (SELECT b.a_id FROM b WHERE b.blah = 'asdf') AS d ON a.id = d.a_id

Shlomi Noach

Next limitation might be: only one CPU per query.

Shlomi Noach

(make that one core)

Ryan Huddleston

MariaDB 5.3 has the latest work on subquery optimization:

http://kb.askmonty.org/v/mariadb-53-todo

A huge number of optimizer improvements are being put into MariaDB 5.3, and it will be exciting to see it get released, as it's been a long road to get some of these features.

https://code.launchpad.net/~maria-captains/maria/5.3

Not sure when they plan to release for beta though.

Daniël van Eeden

My guesses for limitation number 4:
1. single-threaded dump/restore (mysqldump), which is partly solved by mk-parallel-dump/mk-parallel-restore
2. single-threaded restore from a crashed database
3. performance within a stored procedure (which Roland Bouman explained during FOSDEM)
4. the optimizer choosing the wrong index (esp. with prepared statements, but that's not really the optimizer's fault)
5. TCP scalability issues (SCTP could solve that)

Matt

It is my understanding that some of the subquery fixes were put into 5.4. Materialization, to be specific. Doesn't that solve this case?

Fedya

There is a problem where I don't see any solution without using subqueries:

For example:

Suppliers
id, name

Documents
id (document version id)
document_id (document id)

Suppliers Documents
supp_id
doc_id (version id)
verified

One document can have many versions (ids). The latest id for a given document_id is the current document version.

How do you count, per supplier, the total number of current documents and how many of them are verified?

You just have to use a clause like "documents.id IN (SELECT MAX(id) FROM documents)", as in the sketch below.
Of course we could run "SELECT MAX(id) FROM documents" separately and then just construct and insert a string like (2,6,7,…) into the main query, but that's not the point.
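The full query I have in mind is roughly like this (a rough sketch only: the GROUP BY inside the IN() gives the latest id per document_id, verified is assumed to be a 0/1 flag, and the link table is assumed to be named suppliers_documents):

select s.id, s.name,
       count(*) as total_documents,
       sum(sd.verified) as verified_documents
from suppliers s
inner join suppliers_documents sd on sd.supp_id = s.id
inner join documents d on d.id = sd.doc_id
where d.id in (select max(id) from documents group by document_id)  -- latest version per document_id
group by s.id, s.name;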

If anyone has ideas on how to rewrite this query…
Thanks

Rand

Why not use an EXISTS query?

select *
from food
where exists
  (select 1
   from meals
   where food.id = meals.food_id
     and user = me)

I have had no issues with EXISTS/NOT EXISTS queries on mysql 5.0.x. They work quite well.

Foobarista

One thing that helps is to make sure you use a range variable (alias) on the subquery if any columns in the outer query's tables have the same name. I think what's going on is that MySQL's parser doesn't do scoping very well and thinks the subquery may be correlated if there's any ambiguity in column names. So, rewrite the above to something like

select * from a where a.id in (select x.id from b x);

York

I agree with Tobi that 'hash join' / 'hash aggregate' / 'hash join hints' are among the most-wanted features in MySQL.