There was a discussion on LinkedIn one month ago that caught my eye:

Database search by “within x number of miles” radius?

Anyone out there created a zipcode database and created a “search within x numer of miles” function ?
Thankful for any tips you can throw my way..

J

A few people commented that some solutions wouldn’t scale. To understand why these sorts of geographic search queries are problematic in MySQL, it’s best to show some execution plans on dummy data:

Did you notice that we estimate just as many rows on the first EXPLAIN as the second one? That doesn’t make any sense! The index covers x,y and col_a and should be eliminating a lot of searching, since there is only one row which meets this condition!

The reason for this is simply a missing feature of the MySQL optimizer – and it has to do with using x BETWEEN 30 and 40 (and it’s also true with x >= 30 AND x <= 40). Using a range like this prevents us from using the rest of the index. There is a workaround, but it’s not pretty:

The ugliest thing about this, is that in real life you wouldn’t know all your possible values of X or Y, and so you may end up with a very big IN list. The workaround to this, is to create steppings of the value X and Y that we can use for indexes:

Fantastic! The only remaining problem with this query is that it’s not quite identical to our original. In this query 60.79 will be floored to 60 (and erroneously included in our results):

However, this is a quick fix by re-including the original WHERE conditions (we are just no longer using an index on them):

Conclusion:
My examples were only on a small amount of data (16 000 rows) that fitted in memory, but the original query would have full table scanned if I didn’t use a FORCE INDEX hint. Add more data, and if X can’t filter out enough rows by itself this can create a real problem.

Workarounds are all very good, but they also make applications more difficult to maintain. If you really want to do these types of queries, you should give Sphinx a try.

9 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Mark Rose

Why not do the above with the GIS stuff? Is there a particular reason for not doing it?

Mark Rose

Ahh, yes, that makes sense.

Darius Jahandarie

Hi Morgan,

The results of this are actually a bit confusing to me. Shouldn’t the BETWEEN queries be optimized by the Range Access Method?

i.e., with x_y_col_a (x, y, col_a): (30, 50) <= (x, y) <= (40, 60)

Only thing I'm not sure about is the "col_a" at the end of the query which could be messing with that specific optimization.

Furthermore, aren't IN () statements technically considered to be range (because of the sorting and binary tree search)? They aren't noted as being able to be optimized by the "Range Access Method for Multi-Part Keys" either, so I don't see how they would perform like they are.

Darius

Peter Zaitsev

Morgan,

One thing I would notice the combinatorial nature of nested IN’s so you can have steps small enough so it can be looked up efficiently.

If you have 100 elements in IN list for both X and Y MySQL will end up evaluating 10000 (X,Y) combinations doing pretty much this number of point lookups

1000 or 10000 are reasonable values for most cases if your data fits in memory but be careful if you get into the millions.

Also in the newer MySQL versions there is a certain cut-off – if you get way too large IN list MySQL may start using index only for certain prefix. This is a good change as previous versions could be caused OOM by creating too large cascaded IN lists 🙂

Robert

Morgan,
you said there must be a solution to cover not just a single point but rather a whole circle around a point. I developed something exactly covering the problem a few month ago: http://www.xarg.org/2009/12/people-near-you-with-mysql/

Robert

Robert

Morgan,

you said there must be a solution to cover not just a single point but rather a whole circle around a point. I’ve developed something exactly covering the problem a few month ago: http://www.xarg.org/2009/12/people-near-you-with-mysql/

Sure, it is also bound to the r-tree indexes and spatial extension but maybe it will be full supported by innodb later. In the meantime you can use a denormalized myisam table for that job.

Robert

Bob

Spatial search = Postgres.

SELECT count(*) FROM archive_data;
count
———
3063044

EXPLAIN ANALYZE SELECT count(*) FROM archive_data WHERE coords && ‘(45.64870078164,4.7909580578643),(45.68463321836,4.8423759421357)’::BOX;
QUERY PLAN
——————————————————————————————————————————————–
Aggregate (cost=36651.82..36651.83 rows=1 width=0) (actual time=7.104..7.104 rows=1 loops=1)
-> Bitmap Heap Scan on archive_data (cost=756.21..36613.53 rows=15315 width=0) (actual time=3.197..5.828 rows=11686 loops=1)
Recheck Cond: (coords && ‘(45.68463321836,4.8423759421357),(45.64870078164,4.7909580578643)’::box)
-> Bitmap Index Scan on archive_data_coords (cost=0.00..752.38 rows=15315 width=0) (actual time=3.102..3.102 rows=11686 loops=1)
Index Cond: (coords && ‘(45.68463321836,4.8423759421357),(45.64870078164,4.7909580578643)’::box)
Total runtime: 6.009 ms
(6 lignes)