We have an application which stores massive amount of urls. To save on indexes instead of using URL we index CRC32 of the URL which allows to find matching urls quickly. There is a bit of chance there would be some false positives but these are filtered out after reading the data so it works all pretty well.

If we just process urls one by one it works great:

Handling URLs one by one is however not efficient if you’re processing millions of them so we tried to do bulk fetches:

As you can see just using multiple column IN makes MySQL to pick doing full table scan in this case, even though the cardinality on the first column is almost perfect. I did some more testing and it looks like a bug or missing optimizer feature.

I should not be surprised though as multi-column in is not the most used MySQL feature out there.

For given application case we could simply rewrite query using more standard single column IN clause:

Theoretically speaking this query is not equivalent to the first one – because row having url_crc=2752937066 and url=’http://www.coxandforkum.com/’ would match it, while it should not. It however does not happen in our case as url_crc is functionally dependent on url so both queries are equivalent.

So we’ve got our work around and can forget about the issue and MySQL team gets yet another bug to deal with.
What worries me again is – this is very simple case which seems to to be generally broken which raises a question how good coverage MySQL tests have.

21 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Petya

How much faster this works than more common idea
of index on url column? Did you make any tests?

Mark Robson

Perhaps the optimiser could do a better job if you used the more conventional

WHERE (url_crc=12345 AND url=’http://something’) OR (url_crc=56789 AND url=’http://else’)

OR

Use a derived table, fetch the rows with the right url_crc first:

SELECT whatever FROM (SELECT cols FROM tbl WHERE url_crc IN (12345,56789, … )) WHERE (url_crc,url) IN ( … )

Just an idea.

Mark

tom

just avoid to put AND url=’http://something’ in your query; it’s quite useless as there are very few false positive and it’s efficient to handle them in the application logic

Xaprb

Peter, it’s not quite the same thing, but see also

http://dev.mysql.com/doc/refman/5.0/en/row-subqueries.html

I have a draft post about 2 years old on this topic, warning people about this 🙂 I thought such row constructors were never optimized, though the manual says that 5.0.26 and newer does optimize this (I have not tested). Your bug report says you see this on 5.0.54, so I guess the optimizer is filling in this functionality slowly.

tom

Hello Peter, I agree in part with you; this is the same as to use or not stored procedures 🙂
is it better to have the logic in mysql or in the application? I think the the answer is “depends”, as perhaps it’s in this case.
cpu time spent in scripting languages is usually worse then cpu time spent in mysql; but if using odd queries avoid correct using of indexes or incur in more disk seekes perhaps it’s better application logic postprocessing.

As in many other cases the best things is to make tests against your own data

Ruslan Zakirov

I think that at some number prepared statement will be faster than bulk.

May be UNION ALL is more native representation.

Also, wonder why mysql even doesn’t consider index merge.

Roland Bouman

Hi Peter,

I know it shouldn’t matter, but have you tried:

EXPLAIN SELECT url FROM 124pages.124pages WHERE (url_crc,url) = (484036220,’http://www.dell.com/’);

to see if it is due to IN or rather due to multiple columns?

If there is a difference, it seems likely that it is a bug, which may turn out to be trivial to fix.

ries

I am sorry but what is the use on having an CRC in your table?? You could only really figure out IF a URL exists, and only with some certainty so you need to check your result set . USE a SP for that, don’t even think to do this on the application level.

What if your boss comes to you and asks you : hey dude, how many URL’s do you have in your database that points to the website http://members.aye.net/

For sure you HAVE to do something like this SELECT * FROM table WHERE url LIKE (‘http://members.aye.net/%’);

What I would do is may be create a domain table to store domains only based on CRC32,
then create 26 tables for your URL’s and partition your URI’s based on the URI

so table domain stores : http://www.dell.com, http://www.vantwisk.nl, http://www.mysqlperformanceblog,com
then table URI stores : 2008/04/04/multi-column-in-clause-unexpected-mysql-issue/#more-361, blabla.html, myblog/entry/bla.html

On the doamin table you can create an idnex by CRC, but I would just use the.
On table URI you simple create an index on the first XX characters to keep index size low.

Data retrieval is done using a couple of stored procedures to get the right data and filter any duplicates.

Ries
PS: you could partition your tables if you need more. Since you didn’t mention how many rows you ant to store it’s hard to guess….
You did mention massive, how much is massive for you??? in the order of 1000mil records???

Ries

Ken

Peter,
Why use CRC32 at all… have your app use a common compression algorithm for the url, store the result… it will be unique, save space, etc….

Abhishek Soni

Hi all

I’ve a solution for above mentioned queries. After hard digging I found out that when we use IN clause, MySQL first executes the outer query i.e. the PRIMARY query as shown in the EXPLAIN for the query and then it executes the inner subquery.

To bypass this problem I used JOIN and the execution time reduced drastically from few seconds to few msec’s

I don’t know what is schema of your table. But here is what I would like to propose as a possible solution

EXPLAIN SELECT url FROM 106pages.106pages as t1 left outer join 106pages.106pages as t2 on t1.url = t2.url WHERE t2.url_crc
IN ((2752937066,3799762538);

I have used self join in this case. I haven’t executed the query because of lack of knowledge about table schema. hope it works.

Rick

I have a similar issue with the in clause (5.0.77). It seems that if there are more than 2 items in the list, the optimizer abandons use of an existing index.

Any ideas (besides moving to 5.1.xx)?

mysql> explain select * from projects where project_phase_id in (2,3);
+—-+————-+———-+——-+—————+—————+———+——+——+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+———-+——-+—————+—————+———+——+——+————-+
| 1 | SIMPLE | projects | range | proj_phase_id | proj_phase_id | 4 | NULL | 137 | Using where |
+—-+————-+———-+——-+—————+—————+———+——+——+————-+
1 row in set (0.00 sec)

mysql> explain select * from projects where project_phase_id in (2,3,4);
+—-+————-+———-+——+—————+——+———+——+——+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+———-+——+—————+——+———+——+——+————-+
| 1 | SIMPLE | projects | ALL | proj_phase_id | NULL | NULL | NULL | 757 | Using where |
+—-+————-+———-+——+—————+——+———+——+——+————-+
1 row in set (0.00 sec)

Saru

Index Performance :

CREATE TABLE Persons (
id int(11) NOT NULL DEFAULT 20,
name varchar(20) NOT NULL DEFAULT ” “,
age int(100) NOT NULL,
PRIMARY KEY (id,name)
) ENGINE=InnoDB;

insert into Persons (id,name) values (1,”name1″),(2,”name2”);

1. select id from Persons where id = 1; -> this query use PRIMARY KEY index
2. select id from Persons name = “name1″; -> internal index
3. select name from Persons where id=1 and name=”name1″ and age=20; -> ??
4. select name from Persons where id=1 and name=”name1” and age = 30; -> PRIMARY KEY index
5. select name from Persons where (id,name) =(1 , “name1”) and age = 30 ; -> PRIMARY KEY index

Please explain index lookups for above queries.
Query 4 and 5 are same.But which performance is best query 4 and query 5?