Doing performance analyzes today I wanted to count how many hits come to the pages which get more than couple of visits per day. We had SQL logs in the database so It was pretty simple query:

Unfortunately this query ran for over half an hour badly overloaded server and I had to kill it in the end.

The reason for slowness was of course huge temporary table was required (there were about 5 million of distinct pages visited during that day) which resulted in on disk temporary table which as we know quite slow.

Of course it would be possible to allocate more memory to the temporary table or switch to filesort method and get result faster.

I however picked another road which is quite helpful in similar cases – I did not need exact result but approximate figure so I could trick MySQL to do group by a hash of the page instead of page itself:

As you can see now it completes in about 30 seconds – quite handy.

Another trick I want to share which I use a lot when I want to analyze data distribution but table is to large is to just limit it to first number of rows:

Again this is not exact value but normally close enough to make a decision.

22 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Dale Johnson

Peter, that’s a great result. But I don’t really understand though why sorting an int is over 120 times faster than sorting the site name.

Milos

Was this a typo:
HAVING cnt>2) pv;
on the first query.
And on the second:
HAVING cnt>3) pv

Could this be the performance gain? 🙂

Robert Synnott

I suppose that it depends on your application, but I couldn’t possibly agree with the second example in the general case. I’ve seen plenty of applications where the size of DB output (especially logs) is EXTREMELY variable, and where logs emitted by the startup code might be very different to those emitted by service X.

Tobias Petry

Dale, the problem is String-operations are really slow.
if you want to compare two strings all chars have to be the same, so it is very hard to compare two Strings. And if you want to compare 5million Strings this will be much more to do.
Comparing integers is not very hard, crc32 generates an 32-bit integer which can be compared by the CPU directly, Strings have to be compared in the application, there’s no native Method

Tobias

Jay

If the execution engine is clever enough, then you should be able to clear this up by placing an index over the ‘page’ column; the sub-query should turn into a simple index scan (rather than full table scan, which I assume is happening?)

If it’s even cleverer, then it might realise that it can ignore any entries where there aren’t 2 successive entries for each ‘page’ value during the index scan.

Dale Johnson

Maybe I’m giving too much credit to MySQL here, but I assume it uses a heap table to hold the internal list of items, then does a sort, and groups the results to get the counts, but if the heap table size gets too large then it would switch over to a temporary table, probably MyISAM.

If this were the case, then if the field is a varchar(256), for example, then each row would use 256 bytes instead of the 71 bytes, since memory tables need to be fixed. This would make more sense in terms of a multiplier, since the int is 4 bytes, the ratio of int to char becomes 64:1. I guess the swap to disk becomes the real trigger to slow this one down.

Arjen Lentz

Hi Peter… the method is a good hint, but the function used (CRC) is not.
CRC is not a hash function, it does not provide a good spread, and you will find a very high collision rate. CRCs are intended to identify bit-errors when transporting a block of data; it is *not* intended to compare different blocks of data, and it is in fact completely unsuited for that task.

Diego

Hi Peter,

Would adding a ORDER BY NULL improve even better the speed of your query?

Andrew Sutherland

I’ve been playing with crc32 and saw some very high collision rates, so I investigated a bit further and discovered GROUP BY was only taking 2^32 as a max. If I cast crc32 to BINARY though, my results worked perfectly.

mysql> SELECT tag, COUNT(*) AS count, crc32(tag), BINARY crc32(tag)
-> FROM tags
-> GROUP BY BINARY crc32(tag)
-> ORDER BY count DESC
-> LIMIT 10
-> ;
+————+——-+————+——————-+
| tag | count | crc32(tag) | BINARY crc32(tag) |
+————+——-+————+——————-+
| spanish | 4576 | 874050868 | 874050868 |
| vocab | 4103 | 1178479308 | 1178479308 |
| vocabulary | 2786 | 2147483647 | 2425997691 |
| french | 2247 | 2147483647 | 2943733342 |
| english | 2087 | 746783232 | 746783232 |
| science | 1957 | 1729573288 | 1729573288 |
| latin | 1411 | 1421320458 | 1421320458 |
| chapter | 1274 | 2147483647 | 4186027310 |
| history | 1171 | 666529867 | 666529867 |
| words | 939 | 1904025228 | 1904025228 |
+————+——-+————+——————-+
10 rows in set (0.32 sec)

Notice that both chapter, vocabulary, and french come up with the same crc32 values, but different BINARY crc32 values. So if you GROUP BY with regular crc32, those will group together.

Thanks for the initial tip though, it’s made a 30 second query into a .32 second query.

Andrew Sutherland

Hmm – I initially thought it was signed, because PHP’s crc32 is signed. That is interesting though, the mysql docs definitely say unsigned. Do you think there’s any danger in using my method? I haven’t found any collisions yet.

Andrew Sutherland

MySQL 5.0.41

BINARY crc32(tag) is still much faster than tag. Is there a different way I can cast it to work with integers possibly?

Benni

Hi all,

I wonder if this procedure would seed up a very slow query in my application: There is a value ‘x’ in each entry in my database between 0.0 and 100.0 (float). I want to divide this value into N “bins” and make mysql count how many entries are in each bin. My approach for e.g. 2 bins would be an INTERVAL query:

SELECT INTERVAL(x,50,101), count(*) FROM table GROUP BY 1;

Problem is: mysql can’t use an index on x because of the INTERVAL function. Is there a way I can use BINARY CRC32() to speed things up? Are there better ways to solve this issue than using an INTERVAL?

Thank you!

Benni