April 25, 2014

MySQL Limitations Part 3: Subqueries

This is the third in a series on what’s seriously limiting MySQL in certain circumstances (links: part 1, 2). This post is about subqueries, which in some cases execute outside-in instead of inside-out as users expect.

It’s easy to pick on subqueries in MySQL, so I’ll try to be gentle. The following query will surprise users unpleasantly:

Users expect the inner query to execute first, then the results to be substituted into the IN() list. But what happens instead is usually a full scan or index scan of table a, followed by N queries to table b. This is because MySQL rewrites the query to make the inner query dependent on the outer query, which could be an optimization in some cases, but de-optimizes the query in many other cases. NOT IN(SELECT ...) queries execute badly, too. (Note: putting a literal list of items in the IN() clause performs fine. It's only when there is a SELECT inside it that it works poorly.)

The fix for this has been in progress for a few years, and Sergey Petrunia committed working code to the stalled 6.0 release. But it's not quite clear whether that code was a complete solution. It has not been in any GA or RC release, so it hasn't been used widely.

To be fair, many other database servers also have poor subquery performance, or have had it in the past and have fixed it. And many MySQL users have learned to simply write JOINs instead, so it isn't that much of a limitation. But it would be a big improvement if it were fixed.

See if you can guess what limitation number 4 will be!

About Baron Schwartz

Baron is the lead author of High Performance MySQL. He maintains a personal blog at Xaprb. Follow him at @xaprb or connect with him on LinkedIn.

Comments

  1. Foobarista says:

    One thing that helps is to make sure you use a range variable on the subquery if there’s any columns in tables in the outer query that have the same name. I think what’s going on is mysql’s parser doesn’t do scoping very well and thinks the subquery may be correlated if there’s any ambiguity in column names. So, rewrite the above to something like

    select * from a where a.id in (select x.id from b x);

  2. Hi Baron,
    Look at the execution plans for 6.0.x and 5.0.x in http://www.bigdbahead.com/?p=187

  3. Guess for #4 is the Query Cache.

  4. Patrick Casey says:

    My guess for the “normal case” #4 is the threading model and coarse grained memory locks inside the system. Its most prevelent in thinks like the myisam key buffer cache, but there are still a lot of nasty big locks inside innodb that limit practical concurrency. Given that a modern server class machine has a minimum of 8/16 cores (physical/ht), a database which can only use, say, 4-6 of them effectively is leaving a lot of horsepower on the table.

    My two personal vote for #4 because it impacts me but probably not most workload are:

    #4 Inability to span innodb data files across disks or luns. Even if I use file per table, its not practical to put, say, tables apple.idb, pear.idb and banana.idb on one disk and taco.idb, burrito,idb, and burger.idb on another. On my oracle servers I can (and do) provision a lot of luns and put redo on two luns, data files on multiple luns, etc. On my INNODB servers its really not practical so we just end up with single big luns and pray the san can handle it. I’m reasonably sure I could shut down the database, move the file manually, and leave sym-links in their place, but that’s an extreme non starter for me where the production databases do no come down.

  5. Patrick Casey says:

    One interesting point on the subqueries:

    Yes, I would dearly like to see this fixed as well. I’d even take a query hint that told the optimizer “eval the inner loop first” although I’d prefer the optimizer to be brighter.

    As the OP said, a lot of times you are better off with a join, but not every case converts into join semantics easily or precisely and, frankly, a lot more people can write a logically correct inner select than will get a logically correct join.

  6. tobi says:

    It seems to me that many common subqueries can be transformed to joins very easily. The difficult cases can be left aside for the moment. Why not transform the simple cases?

  7. tobi says:

    Limitation 4 is that hash and merge joins are not supported which seriously destroys large query performance for reporting. It can also lead to a runtime that is exponential in the number of joins.

  8. tobi says:

    It also does not have a hash aggregate so that big data sets have to be sorted to be grouped. Sorting big data sets is not a good idea when you can just stuff the data in a hash table instead. Hash joins and aggregates are what drive large scale queries in postgres, sql server and oracle.

  9. Patrick Casey says:

    Tobi:
    I believe the challenge with transforming inner selects into joins is that to do it accurately (without changing the results) requires information that, while readily available to a human, isn’t available to an optimizer.

    Consider:

    select * from food where id in (select food_id from meals where user = me)

    The trivial rewrite is:

    select food.* from food inner join meals on food.id = meals.food_id where meals.user = me

    The problem is the result above *might* change the answer.

    Consider:

    Foods
    1 taco
    2 burger
    3 apple

    meals
    breakfast me 3
    lunch me 1
    dinner me 1

    It seems I had a taco for lunch and for dinner (not unreasonable).

    Our original query gives back:
    1 taco
    3 apple

    our rewritten query gives back
    3 apple
    1 taco
    1 taco

    In order to know if we can safely rewrite we have to know if the inner select will contain any duplicates. An application designer knows this. The optimizer cannot, although in certain special cases he can figure it out.

    Likewise, in a query like this in the absence of an order by clause, the mere fact that we did a rewrite will likely change the row retrieval order. Technically this is legit since, in the absence of order by, the user has no right to expect a particular order. In practise though there’s lots of bad code out there that makes assumptions like this and global changes like that would break it.

  10. Dmitri Mikhailov says:

    The correct rewrite is:

    select distinct foods.*
    from foods
    inner join meals
    on meals.food_id = foods.id
    where meals.user = ‘me’
    ;

    - or -

    select distinct foods.*
    from foods
    inner join meals
    on meals.food_id = foods.id
    and meals.user = ‘me’
    ;

    - or -

    select foods.*
    from
    ( select distinct food_id
    from meals
    where meals.user = ‘me’
    ) meals
    inner join foods
    on foods.id = meals.food_id
    ;

  11. Michael Peters says:

    Does this limitation on subqueries also affect subqueries that are used in JOIN clauses? For instance

    SELECT a.* FORM a JOIN (SELECT b.a_id FROM b WHERE b.blah = ‘asdf’)

  12. Next limitation might be: only one CPU per query.

  13. (make that one core)

  14. Ryan Huddleston says:

    MariaDB 5.3 has the latest work on subquery optimization:

    http://kb.askmonty.org/v/mariadb-53-todo

    A huge amount of optimizer improvements are being put into MariaDB 5.3 and it will be exciting to see it get released, as it’s been a long road to get at some of these features.

    https://code.launchpad.net/~maria-captains/maria/5.3

    Not sure when they plan to release for beta though.

  15. Daniël van Eeden says:

    My guesses for limitation number 4:
    1. single-threaded dump/restore (mysqldump). Which is partly solved by mk-parallel-dump/mk-parallel-restore
    2. single-threaded restore from a crashed database
    3. performance within a stored procedure (Which Roland Bouman explained during fosdem)
    4. the optimizer choosing the wrong index (esp. with prepared statements, but that’s not really the optimizers fault)
    5. TCP scalability issues (SCTP could solve that)

  16. Ryan, I agree 100% that the MariaDB team is doing great work and it’s exciting to see it happening and becoming available for people to use.

  17. Matt says:

    It is my understanding that some of they subquery fixes were put into 5.4. Materialization to be specific. Doesn’t that solve this case?

  18. Fedya says:

    There is a problem where i dont see any solution without using sub-queries:

    For example:

    Suppliers
    id, name

    Documents
    id (document version id)
    document_id (document id)

    Suppliers Documents
    supp_id
    doc_id (version id)
    verified

    One document can have many versions (ids). The last id for there document_id is current document version;

    How to count supplier total last documents and documents verified ?

    just have to use join clause like “documents.id IN (SELECT MAX(id) FROM documents)”
    Of course we can run “SELECT MAX(id) FROM documents” and the just construct and insert string like (2,6,7..) in the main query, but that’s not the case

    if any one have ideas, how to re wright this query…
    thanks

  19. Rand says:

    Why not use an EXISTS query?

    select *
    from food
    where exists
    (select 1
    from meals
    where food.id = meals.food_id
    and user = me)

    I have had no issues with EXISTS/NOT EXISTS queries on mysql 5.0.x. They work quite well.

  20. EXISTS subqueries generally work as expected in MySQL.

  21. York says:

    I agree with Tobi that ‘hash join’ / ‘hash aggregate’ / ‘hash join hints’ are one of the most-wanted features in MySQL.

Speak Your Mind

*