Speeding up GROUP BY if you want aproximate results

Doing performance analyzes today I wanted to count how many hits come to the pages which get more than couple of visits per day. We had SQL logs in the database so It was pretty simple query:

select sum(cnt) from (select count(*) cnt from performance_log_080306 group by page having cnt>2) pv;

1	select sum(cnt) from (select count(*) cnt from performance_log_080306 group by page having cnt>2) pv;

Unfortunately this query ran for over half an hour badly overloaded server and I had to kill it in the end.

The reason for slowness was of course huge temporary table was required (there were about 5 million of distinct pages visited during that day) which resulted in on disk temporary table which as we know quite slow.

Of course it would be possible to allocate more memory to the temporary table or switch to filesort method and get result faster.

I however picked another road which is quite helpful in similar cases – I did not need exact result but approximate figure so I could trick MySQL to do group by a hash of the page instead of page itself:

mysql> select sum(cnt) from (select count(*) cnt from performance_log_080306 group by crc32(page) having cnt>3) pv
    -> ;
+----------+
| sum(cnt) |
+----------+
|  1127031 |
+----------+
1 row in set (31.22 sec)

mysql> select sum(cnt) from (select count(*) cnt from performance_log_080306 group by crc32(page) having cnt>3) pv

-> ;

+----------+

| sum(cnt) |

+----------+

| 1127031 |

+----------+

1 row in set (31.22 sec)

As you can see now it completes in about 30 seconds – quite handy.

Another trick I want to share which I use a lot when I want to analyze data distribution but table is to large is to just limit it to first number of rows:

mysql> select avg(length(page)) from (select page from performance_log_080306 limit 10000) tmp;
+-------------------+
| avg(length(page)) |
+-------------------+
|           70.0444 |
+-------------------+
1 row in set (0.08 sec)

mysql> select avg(length(page)) from (select page from performance_log_080306 limit 10000) tmp;

+-------------------+

| avg(length(page)) |

+-------------------+

| 70.0444 |

+-------------------+

1 row in set (0.08 sec)

Again this is not exact value but normally close enough to make a decision.

22 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Dale Johnson

16 years ago

Peter, that’s a great result. But I don’t really understand though why sorting an int is over 120 times faster than sorting the site name.

Milos

16 years ago

Was this a typo:
HAVING cnt>2) pv;
on the first query.
And on the second:
HAVING cnt>3) pv

Could this be the performance gain? 🙂

Robert Synnott

16 years ago

I suppose that it depends on your application, but I couldn’t possibly agree with the second example in the general case. I’ve seen plenty of applications where the size of DB output (especially logs) is EXTREMELY variable, and where logs emitted by the startup code might be very different to those emitted by service X.

Tobias Petry

16 years ago

Dale, the problem is String-operations are really slow.
if you want to compare two strings all chars have to be the same, so it is very hard to compare two Strings. And if you want to compare 5million Strings this will be much more to do.
Comparing integers is not very hard, crc32 generates an 32-bit integer which can be compared by the CPU directly, Strings have to be compared in the application, there’s no native Method

Tobias

Peter Zaitsev

Author

16 years ago

Milos,

HAVING is checked after temporary table is populated so it does not give a lot of difference. In reality I did bunch of such queries with different HAVING values they all took about same number so I just copied wrong one I guess 🙂

Peter Zaitsev

Author

16 years ago

Robert,

If you become picky indeed there are cases when it does not work well, for example if you have auto populated string lookup table – this one tends to get short typical strings in lower values.

I’m not giving it as general recipe but it is very handy in large number of cases. Especially if you consider you can use much larger number – having sample of 1mil on 50mil row table will be even more close.

Of course to make it classical Monte Carlo method we would need to pick random values instead of taking first ones but order by rand() would be too slow and other methods of picking large number of random values are not so general.

BTW – this is the real average page length value so you can see how far off it is. It is worse than expected if we would have completely random sample but probably better than Innodb row count estimate 🙂

mysql> select avg(length(page)) from performance_log_080306;
+——————-+
| avg(length(page)) |
+——————-+
| 71.2611 |
+——————-+
1 row in set (24.84 sec)

Peter Zaitsev

Author

16 years ago

Dale, Tobias

There are couple of things here. Most important is because of long strings it usually spills out to be on disk temporary table. Even though we just store first 255 characters of the URL (technically you’ve got to use longer values if you want full urls) temporary tables are created on disk because of the size – 5 mil rows * 260 bytes per row (MEMORY tables use static storage) is over 1GB which is way too much for temporary table.

Table with two integers per row is however much smaller and can fit in in memory temporary table.

Strings comparisons are also slower though it is lesser impact.

Jay

16 years ago

If the execution engine is clever enough, then you should be able to clear this up by placing an index over the ‘page’ column; the sub-query should turn into a simple index scan (rather than full table scan, which I assume is happening?)

If it’s even cleverer, then it might realise that it can ignore any entries where there aren’t 2 successive entries for each ‘page’ value during the index scan.

Dale Johnson

16 years ago

Maybe I’m giving too much credit to MySQL here, but I assume it uses a heap table to hold the internal list of items, then does a sort, and groups the results to get the counts, but if the heap table size gets too large then it would switch over to a temporary table, probably MyISAM.

If this were the case, then if the field is a varchar(256), for example, then each row would use 256 bytes instead of the 71 bytes, since memory tables need to be fixed. This would make more sense in terms of a multiplier, since the int is 4 bytes, the ratio of int to char becomes 64:1. I guess the swap to disk becomes the real trigger to slow this one down.

Peter Zaitsev

Author

16 years ago

Jay,

Indeed you can add the index but how long it is going to take ? You can also have index when you create the table but this will slow down inserts. This is log file and query is pretty much adhoc – next time I may be looking for average page generation time number of SQL queries or memcache efficiency.

This is pretty much classical log table which has no indexes to have inserts as fast as possible which of course affects query times.

Peter Zaitsev

Author

16 years ago

Dale,

There are two reasons for temporary tables to be on disk. First it is created on disk if there are any BLOB or TEXT columns because these are not supported by MEMORY storage engine. Second it can be converted to on disk table if it growths too large.

In this case it was converted.

Sorting is whole another beast it uses temporary file if sort size is too large. For some queries you get both temporary table and temporary file but these are very different.

Arjen Lentz

16 years ago

Hi Peter… the method is a good hint, but the function used (CRC) is not.
CRC is not a hash function, it does not provide a good spread, and you will find a very high collision rate. CRCs are intended to identify bit-errors when transporting a block of data; it is *not* intended to compare different blocks of data, and it is in fact completely unsuited for that task.

Peter Zaitsev

Author

16 years ago

Arjen,

I know in theory CRC32 is not a hash function… but MySQL does not offer any fast integer hash functions…. MD5/SHA1 are crypto hash functions which is completely different task.

Even though CRC32 is created for the different task it works pretty good as a hash function.

On what kind of task did you see collision rate which would be far more than for hash function of given bits ?

Here is example:

mysql> select count(distinct page) from (select page from performance_log_080306 limit 100000) x;
+———————-+
| count(distinct page) |
+———————-+
| 83102 |
+———————-+
1 row in set (0.65 sec)

mysql> select count(distinct crc32(lower(page))) from (select page from performance_log_080306 limit 100000) x;
+————————————+
| count(distinct crc32(lower(page))) |
+————————————+
| 83101 |
+————————————+
1 row in set (0.44 sec)

mysql> select count(distinct md5(lower(page))) from (select page from performance_log_080306 limit 100000) x;
+———————————-+
| count(distinct md5(lower(page))) |
+———————————-+
| 83102 |
+———————————-+
1 row in set (0.72 sec)

As you see on about 80K objects we have 1 collision for CRC32 which is quite in line with what you should expect for 32bit hash function.

What kind of practical data did you see high collision rate for.

P.S I wish MySQL would get some good 64bit Int hash function built in 🙂

Diego

16 years ago

Hi Peter,

Would adding a ORDER BY NULL improve even better the speed of your query?

Peter Zaitsev

Author

16 years ago

Yes indeed order by null can be help for group by queries. How good depends on the group by execution mode and amount of groups you get in result set.

Andrew Sutherland

16 years ago

I’ve been playing with crc32 and saw some very high collision rates, so I investigated a bit further and discovered GROUP BY was only taking 2^32 as a max. If I cast crc32 to BINARY though, my results worked perfectly.

mysql> SELECT tag, COUNT(*) AS count, crc32(tag), BINARY crc32(tag)
-> FROM tags
-> GROUP BY BINARY crc32(tag)
-> ORDER BY count DESC
-> LIMIT 10
-> ;
+————+——-+————+——————-+
| tag | count | crc32(tag) | BINARY crc32(tag) |
+————+——-+————+——————-+
| spanish | 4576 | 874050868 | 874050868 |
| vocab | 4103 | 1178479308 | 1178479308 |
| vocabulary | 2786 | 2147483647 | 2425997691 |
| french | 2247 | 2147483647 | 2943733342 |
| english | 2087 | 746783232 | 746783232 |
| science | 1957 | 1729573288 | 1729573288 |
| latin | 1411 | 1421320458 | 1421320458 |
| chapter | 1274 | 2147483647 | 4186027310 |
| history | 1171 | 666529867 | 666529867 |
| words | 939 | 1904025228 | 1904025228 |
+————+——-+————+——————-+
10 rows in set (0.32 sec)

Notice that both chapter, vocabulary, and french come up with the same crc32 values, but different BINARY crc32 values. So if you GROUP BY with regular crc32, those will group together.

Thanks for the initial tip though, it’s made a 30 second query into a .32 second query.

Peter Zaitsev

Author

16 years ago

Andrew,

Something wrong with your query result – note 2147483647 value for CRC32 for few columns – this is MAX value for SIGNED int while CRC32 should be unsigned. Also for BINARY CRC32 you can get above this value. You might have found a bug in MySQL 🙂

Andrew Sutherland

16 years ago

Hmm – I initially thought it was signed, because PHP’s crc32 is signed. That is interesting though, the mysql docs definitely say unsigned. Do you think there’s any danger in using my method? I haven’t found any collisions yet.

Peter Zaitsev

Author

16 years ago

CRC32 in PHP is crazy in the sense it gives results on 32bit and 64bit platforms…

In your case casting as BINARY you should get string which is slower. Typically just CRC32 should work. Your high collision rate was because it was running into the wall with max signed integer. What MySQL version is this ?

Andrew Sutherland

16 years ago

MySQL 5.0.41

BINARY crc32(tag) is still much faster than tag. Is there a different way I can cast it to work with integers possibly?

Peter Zaitsev

Author

16 years ago

Upgrade to latest 5.0

simply CRC32(tag) should work.

Benni

15 years ago

Hi all,

I wonder if this procedure would seed up a very slow query in my application: There is a value ‘x’ in each entry in my database between 0.0 and 100.0 (float). I want to divide this value into N “bins” and make mysql count how many entries are in each bin. My approach for e.g. 2 bins would be an INTERVAL query:

SELECT INTERVAL(x,50,101), count(*) FROM table GROUP BY 1;

Problem is: mysql can’t use an index on x because of the INTERVAL function. Is there a way I can use BINARY CRC32() to speed things up? Are there better ways to solve this issue than using an INTERVAL?

Thank you!

Benni

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Speeding up GROUP BY if you want aproximate results

Related

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Configuration Best Practices

Valkey/Redis: The Hash Datatype

Valkey/Redis Replication and Auto-Failover With Sentinel Service

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Speeding up GROUP BY if you want aproximate results

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Configuration Best Practices

Valkey/Redis: The Hash Datatype

Valkey/Redis Replication and Auto-Failover With Sentinel Service

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation