There was a discussion on LinkedIn one month ago that caught my eye:
Database search by “within x number of miles” radius?
Anyone out there created a zipcode database and created a “search within x numer of miles” function ?
Thankful for any tips you can throw my way..J
A few people commented that some solutions wouldn’t scale. To understand why these sorts of geographic search queries are problematic in MySQL, it’s best to show some execution plans on dummy data:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | EXPLAIN SELECT * FROM coordinates FORCE INDEX (x_y_col_a) WHERE x BETWEEN 30 and 40 AND y BETWEEN 50 AND 60 AND col_a = 'set1'; +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ | 1 | SIMPLE | coordinates | range | x_y_col_a | x_y_col_a | 38 | NULL | 4032 | Using where | +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ 1 row in set (0.00 sec) EXPLAIN SELECT * FROM coordinates FORCE INDEX (x_y_col_a) WHERE x BETWEEN 30 and 40; +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ | 1 | SIMPLE | coordinates | range | x_y_col_a | x_y_col_a | 3 | NULL | 4032 | Using where | +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ 1 row in set (0.01 sec) SELECT count(*) FROM coordinates FORCE INDEX (x_y_col_a) WHERE x BETWEEN 30 AND 40 AND y BETWEEN 50 AND 60 AND col_a = 'set1'; +----------+ | count(*) | +----------+ | 1 | +----------+ 1 row in set (0.00 sec) SELECT count(*) FROM coordinates FORCE INDEX (x_y_col_a) WHERE x BETWEEN 30 and 40; +----------+ | count(*) | +----------+ | 1664 | +----------+ 1 row in set (0.01 sec) |
Did you notice that we estimate just as many rows on the first EXPLAIN as the second one? That doesn’t make any sense! The index covers x,y and col_a and should be eliminating a lot of searching, since there is only one row which meets this condition!
The reason for this is simply a missing feature of the MySQL optimizer – and it has to do with using x BETWEEN 30 and 40 (and it’s also true with x >= 30 AND x <= 40). Using a range like this prevents us from using the rest of the index. There is a workaround, but it’s not pretty:
1 2 3 4 5 6 7 8 9 10 11 | EXPLAIN SELECT * FROM coordinates WHERE x IN (30.30,30.61,31.18,31.26,31.72,32.11,32.25,32.30,32.42,32.91,33.27, 33.69,33.79,33.93,34.62,34.78,35.10,35.41,36.62,36.93,37.17,38.93,39.20, 39.56,39.84,39.87) AND y IN (59.58,56.81,57.27,54.14,56.43,51.87,54.59, 59.56,57.42,54.13,56.79,59.45) AND col_a = 'set1'; +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ | 1 | SIMPLE | coordinates | range | x_y_col_a | x_y_col_a | 38 | NULL | 312 | Using where | +----+-------------+-------------+-------+---------------+-----------+---------+------+------+-------------+ 1 row in set (0.00 sec) |
The ugliest thing about this, is that in real life you wouldn’t know all your possible values of X or Y, and so you may end up with a very big IN list. The workaround to this, is to create steppings of the value X and Y that we can use for indexes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | ALTER TABLE coordinates ADD x_floor INT NOT NULL, ADD y_floor INT NOT NULL, DROP INDEX x_y_col_a, ADD INDEX x_floor_y_floor_col_a (x_floor, y_floor, col_a); UPDATE coordinates SET x_floor = FLOOR(x), y_floor = FLOOR(y); EXPLAIN SELECT * FROM coordinates WHERE x_floor in (30,31,32,33,34,35,36,37,38,39,40) AND y_floor in (50,51,52,53,54,55,56,57,58,59,60) AND col_a = 'set1'\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: coordinates type: range possible_keys: x_floor_y_floor_col_a key: x_floor_y_floor_col_a key_len: 40 ref: NULL rows: 121 Extra: Using where 1 row in set (0.00 sec) |
Fantastic! The only remaining problem with this query is that it’s not quite identical to our original. In this query 60.79 will be floored to 60 (and erroneously included in our results):
1 2 3 4 5 6 7 8 9 | SELECT * FROM coordinates WHERE x_floor in (30,31,32,33,34,35,36,37,38,39,40) AND y_floor in (50,51,52,53,54,55,56,57,58,59,60) AND col_a = 'set1'; +-----+-------+-------+-------+-------+---------+---------+ | id | x | y | col_a | col_b | x_floor | y_floor | +-----+-------+-------+-------+-------+---------+---------+ | 144 | 33.79 | 54.59 | set1 | NULL | 33 | 54 | | 38 | 39.20 | 60.79 | set1 | NULL | 39 | 60 | +-----+-------+-------+-------+-------+---------+---------+ 2 rows in set (0.00 sec) |
However, this is a quick fix by re-including the original WHERE conditions (we are just no longer using an index on them):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | EXPLAIN SELECT * FROM coordinates WHERE x_floor in (30,31,32,33,34,35,36,37,38,39,40) AND y_floor in (50,51,52,53,54,55,56,57,58,59,60) AND col_a = 'set1' AND x BETWEEN 30 and 40 AND y BETWEEN 50 AND 60\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: coordinates type: range possible_keys: x_floor_y_floor_col_a key: x_floor_y_floor_col_a key_len: 40 ref: NULL rows: 121 Extra: Using where 1 row in set (0.00 sec) SELECT * FROM coordinates WHERE x_floor in (30,31,32,33,34,35,36,37,38,39,40) AND y_floor in (50,51,52,53,54,55,56,57,58,59,60) AND col_a = 'set1' AND x BETWEEN 30 and 40 AND y BETWEEN 50 AND 60; +-----+-------+-------+-------+-------+---------+---------+ | id | x | y | col_a | col_b | x_floor | y_floor | +-----+-------+-------+-------+-------+---------+---------+ | 144 | 33.79 | 54.59 | set1 | NULL | 33 | 54 | +-----+-------+-------+-------+-------+---------+---------+ 1 row in set (0.00 sec) |
Conclusion:
My examples were only on a small amount of data (16 000 rows) that fitted in memory, but the original query would have full table scanned if I didn’t use a FORCE INDEX hint. Add more data, and if X can’t filter out enough rows by itself this can create a real problem.
Workarounds are all very good, but they also make applications more difficult to maintain. If you really want to do these types of queries, you should give Sphinx a try.
Why not do the above with the GIS stuff? Is there a particular reason for not doing it?
Mark – The queries above work with InnoDB tables. It seems a shame to loose crash recovery just to be able to use the GIS features of MyISAM.
[Edit: Replied a bit too quick. While both InnoDB and MyISAM support spatial data types, only MyISAM supports indexing of them. You could index an individual point, but you wouldn’t be able to have an index covering the point and the col_a, as I needed in this example (results in an error: ERROR 1210 (HY000): Incorrect arguments to SPATIAL INDEX). GIS functions like DISTANCE, WITHIN are not supported by MySQL yet. See here ]
Ahh, yes, that makes sense.
Hi Morgan,
The results of this are actually a bit confusing to me. Shouldn’t the BETWEEN queries be optimized by the Range Access Method?
i.e., with x_y_col_a (x, y, col_a): (30, 50) <= (x, y) <= (40, 60)
Only thing I'm not sure about is the "col_a" at the end of the query which could be messing with that specific optimization.
Furthermore, aren't IN () statements technically considered to be range (because of the sorting and binary tree search)? They aren't noted as being able to be optimized by the "Range Access Method for Multi-Part Keys" either, so I don't see how they would perform like they are.
Darius
Morgan,
One thing I would notice the combinatorial nature of nested IN’s so you can have steps small enough so it can be looked up efficiently.
If you have 100 elements in IN list for both X and Y MySQL will end up evaluating 10000 (X,Y) combinations doing pretty much this number of point lookups
1000 or 10000 are reasonable values for most cases if your data fits in memory but be careful if you get into the millions.
Also in the newer MySQL versions there is a certain cut-off – if you get way too large IN list MySQL may start using index only for certain prefix. This is a good change as previous versions could be caused OOM by creating too large cascaded IN lists 🙂
Darius,
The confusion is justified 😉 It makes perfect sense that you should be able to do what I described with BETWEEN, but past the first column (x), MySQL won’t use the index any more. Hence the workaround.
“Furthermore, aren’t IN () statements technically considered to be range”
Range is a very generic term. Think about the differences between these two queries:
mysql> explain select * from coordinates where id between 10 and 30;
+—-+————-+————-+——-+—————+———+———+——+——+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+————-+——-+—————+———+———+——+——+————-+
| 1 | SIMPLE | coordinates | range | PRIMARY | PRIMARY | 4 | NULL | 24 | Using where |
+—-+————-+————-+——-+—————+———+———+——+——+————-+
1 row in set (0.00 sec)
mysql> explain select * from coordinates where id between 10 and 20 or id between 90 and 100;
+—-+————-+————-+——-+—————+———+———+——+——+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+————-+——-+—————+———+———+——+——+————-+
| 1 | SIMPLE | coordinates | range | PRIMARY | PRIMARY | 4 | NULL | 25 | Using where |
+—-+————-+————-+——-+—————+———+———+——+——+————-+
1 row in set (0.00 sec)
Clearly they are not the same, but EXPLAIN is unable to really distinguish. The second one is really two smaller ranges.
Morgan,
you said there must be a solution to cover not just a single point but rather a whole circle around a point. I developed something exactly covering the problem a few month ago: http://www.xarg.org/2009/12/people-near-you-with-mysql/
Robert
Morgan,
you said there must be a solution to cover not just a single point but rather a whole circle around a point. I’ve developed something exactly covering the problem a few month ago: http://www.xarg.org/2009/12/people-near-you-with-mysql/
Sure, it is also bound to the r-tree indexes and spatial extension but maybe it will be full supported by innodb later. In the meantime you can use a denormalized myisam table for that job.
Robert
Spatial search = Postgres.
SELECT count(*) FROM archive_data;
count
———
3063044
EXPLAIN ANALYZE SELECT count(*) FROM archive_data WHERE coords && ‘(45.64870078164,4.7909580578643),(45.68463321836,4.8423759421357)’::BOX;
QUERY PLAN
——————————————————————————————————————————————–
Aggregate (cost=36651.82..36651.83 rows=1 width=0) (actual time=7.104..7.104 rows=1 loops=1)
-> Bitmap Heap Scan on archive_data (cost=756.21..36613.53 rows=15315 width=0) (actual time=3.197..5.828 rows=11686 loops=1)
Recheck Cond: (coords && ‘(45.68463321836,4.8423759421357),(45.64870078164,4.7909580578643)’::box)
-> Bitmap Index Scan on archive_data_coords (cost=0.00..752.38 rows=15315 width=0) (actual time=3.102..3.102 rows=11686 loops=1)
Index Cond: (coords && ‘(45.68463321836,4.8423759421357),(45.64870078164,4.7909580578643)’::box)
Total runtime: 6.009 ms
(6 lignes)