April 21, 2014

Getting around optimizer limitations with an IN() list

There was a discussion on LinkedIn one month ago that caught my eye:

Database search by “within x number of miles” radius?

Anyone out there created a zipcode database and created a “search within x numer of miles” function ?
Thankful for any tips you can throw my way..

J

A few people commented that some solutions wouldn’t scale. To understand why these sorts of geographic search queries are problematic in MySQL, it’s best to show some execution plans on dummy data:

Did you notice that we estimate just as many rows on the first EXPLAIN as the second one? That doesn’t make any sense! The index covers x,y and col_a and should be eliminating a lot of searching, since there is only one row which meets this condition!

The reason for this is simply a missing feature of the MySQL optimizer – and it has to do with using x BETWEEN 30 and 40 (and it’s also true with x >= 30 AND x <= 40). Using a range like this prevents us from using the rest of the index. There is a workaround, but it’s not pretty:

The ugliest thing about this, is that in real life you wouldn’t know all your possible values of X or Y, and so you may end up with a very big IN list. The workaround to this, is to create steppings of the value X and Y that we can use for indexes:

Fantastic! The only remaining problem with this query is that it’s not quite identical to our original. In this query 60.79 will be floored to 60 (and erroneously included in our results):

However, this is a quick fix by re-including the original WHERE conditions (we are just no longer using an index on them):

Conclusion:
My examples were only on a small amount of data (16 000 rows) that fitted in memory, but the original query would have full table scanned if I didn’t use a FORCE INDEX hint. Add more data, and if X can’t filter out enough rows by itself this can create a real problem.

Workarounds are all very good, but they also make applications more difficult to maintain. If you really want to do these types of queries, you should give Sphinx a try.

About Morgan Tocker

Morgan is the Director of Training at Percona. He was formerly a Technical Instructor for MySQL and Sun Microsystems. He has also previously worked in the MySQL Support Team, and provided DRBD support.

Comments

  1. Mark Rose says:

    Why not do the above with the GIS stuff? Is there a particular reason for not doing it?

  2. Mark – The queries above work with InnoDB tables. It seems a shame to loose crash recovery just to be able to use the GIS features of MyISAM.

    [Edit: Replied a bit too quick. While both InnoDB and MyISAM support spatial data types, only MyISAM supports indexing of them. You could index an individual point, but you wouldn't be able to have an index covering the point and the col_a, as I needed in this example (results in an error: ERROR 1210 (HY000): Incorrect arguments to SPATIAL INDEX). GIS functions like DISTANCE, WITHIN are not supported by MySQL yet. See here ]

  3. Mark Rose says:

    Ahh, yes, that makes sense.

  4. Hi Morgan,

    The results of this are actually a bit confusing to me. Shouldn’t the BETWEEN queries be optimized by the Range Access Method?

    i.e., with x_y_col_a (x, y, col_a): (30, 50) <= (x, y) <= (40, 60)

    Only thing I'm not sure about is the "col_a" at the end of the query which could be messing with that specific optimization.

    Furthermore, aren't IN () statements technically considered to be range (because of the sorting and binary tree search)? They aren't noted as being able to be optimized by the "Range Access Method for Multi-Part Keys" either, so I don't see how they would perform like they are.

    Darius

  5. peter says:

    Morgan,

    One thing I would notice the combinatorial nature of nested IN’s so you can have steps small enough so it can be looked up efficiently.

    If you have 100 elements in IN list for both X and Y MySQL will end up evaluating 10000 (X,Y) combinations doing pretty much this number of point lookups

    1000 or 10000 are reasonable values for most cases if your data fits in memory but be careful if you get into the millions.

    Also in the newer MySQL versions there is a certain cut-off – if you get way too large IN list MySQL may start using index only for certain prefix. This is a good change as previous versions could be caused OOM by creating too large cascaded IN lists :)

  6. Darius,

    The confusion is justified ;) It makes perfect sense that you should be able to do what I described with BETWEEN, but past the first column (x), MySQL won’t use the index any more. Hence the workaround.

    “Furthermore, aren’t IN () statements technically considered to be range”

    Range is a very generic term. Think about the differences between these two queries:

    mysql> explain select * from coordinates where id between 10 and 30;
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    | 1 | SIMPLE | coordinates | range | PRIMARY | PRIMARY | 4 | NULL | 24 | Using where |
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    1 row in set (0.00 sec)

    mysql> explain select * from coordinates where id between 10 and 20 or id between 90 and 100;
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    | 1 | SIMPLE | coordinates | range | PRIMARY | PRIMARY | 4 | NULL | 25 | Using where |
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    1 row in set (0.00 sec)

    Clearly they are not the same, but EXPLAIN is unable to really distinguish. The second one is really two smaller ranges.

  7. Robert says:

    Morgan,
    you said there must be a solution to cover not just a single point but rather a whole circle around a point. I developed something exactly covering the problem a few month ago: http://www.xarg.org/2009/12/people-near-you-with-mysql/

    Robert

  8. Robert says:

    Morgan,

    you said there must be a solution to cover not just a single point but rather a whole circle around a point. I’ve developed something exactly covering the problem a few month ago: http://www.xarg.org/2009/12/people-near-you-with-mysql/

    Sure, it is also bound to the r-tree indexes and spatial extension but maybe it will be full supported by innodb later. In the meantime you can use a denormalized myisam table for that job.

    Robert

  9. Bob says:

    Spatial search = Postgres.

    SELECT count(*) FROM archive_data;
    count
    ———
    3063044

    EXPLAIN ANALYZE SELECT count(*) FROM archive_data WHERE coords && ‘(45.64870078164,4.7909580578643),(45.68463321836,4.8423759421357)’::BOX;
    QUERY PLAN
    ——————————————————————————————————————————————–
    Aggregate (cost=36651.82..36651.83 rows=1 width=0) (actual time=7.104..7.104 rows=1 loops=1)
    -> Bitmap Heap Scan on archive_data (cost=756.21..36613.53 rows=15315 width=0) (actual time=3.197..5.828 rows=11686 loops=1)
    Recheck Cond: (coords && ‘(45.68463321836,4.8423759421357),(45.64870078164,4.7909580578643)’::box)
    -> Bitmap Index Scan on archive_data_coords (cost=0.00..752.38 rows=15315 width=0) (actual time=3.102..3.102 rows=11686 loops=1)
    Index Cond: (coords && ‘(45.68463321836,4.8423759421357),(45.64870078164,4.7909580578643)’::box)
    Total runtime: 6.009 ms
    (6 lignes)

Speak Your Mind

*