Quite typical query for reporting applications is to find top X values. If you analyze Web Site logs you would look at most popular web pages or search engine keywords which bring you most of the traffic. If you’re looking at ecommerce reporting you may be interested in best selling product or top sales people. This information may often need simple select query, however what if you would like to show percents not just absolute value ?
For illustration purposes I’ve created a syntetic table filled with some 30mil rows evenly spread in 10.000 groups.
1 2 3 4 | CREATE TABLE `dt` ( `grp` int(10) unsigned NOT NULL, `slack` varchar(50) DEFAULT NULL ) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
And I’m using some silly like query to illustrate some search is required, so count(*) can’t be optimized away for MyISAM Tables.
To show percents for the values we need to know total number of rows which matches our where clause.
Most Simple Way Number One is to simply run two queries:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by cnt desc limit 10; +------+-----+ | grp | cnt | +------+-----+ | 9879 | 300 | | 3888 | 298 | | 1793 | 297 | | 2082 | 294 | | 9729 | 293 | | 3760 | 292 | | 2588 | 290 | | 7918 | 290 | | 4055 | 290 | | 6950 | 290 | +------+-----+ 10 rows in set (21.12 sec) mysql> select count(*) cnt from dt where slack like "a%"; +---------+ | cnt | +---------+ | 2352996 | +---------+ 1 row in set (18.71 sec) |
This allows us to get results in about 40 seconds and as we can see is pretty inefficient – almost half of the time is spent counting the rows which are already traversed and counted for group by operation.
The obvious optimization is to get rid of LIMIT 10 and just fetch all groups and sum values on the application side. Which is the second way.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by cnt desc; +-------+-----+ | grp | cnt | +-------+-----+ | 9879 | 300 | | 3888 | 298 | | 1793 | 297 | ... | 7195 | 180 | | 6703 | 178 | | 10000 | 107 | | 0 | 105 | +-------+-----+ 10001 rows in set (21.30 sec) |
We even can got read of sorting and add ORDER BY NULL so MySQL does not bother to sort results, though this has a little difference in this case as number of groups is relatively small:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by null; +-------+-----+ | grp | cnt | +-------+-----+ | 2257 | 251 | | 2418 | 266 | | 5842 | 258 | | 90 | 276 | ... | 2595 | 243 | | 4446 | 239 | | 9802 | 233 | | 1861 | 227 | +-------+-----+ 10001 rows in set (21.39 sec) |
This method indeed works great if you have relatively small number of groups which you can fetch on the application side (you can store result in temporary table and run sum() and sort query on that table instead if amount of groups is much larger).
The other idea came to my mind is using GROUP BY WITH ROLLUP so I can get grouped result set together with total value for
the groups:
1 2 | mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup order by cnt desc limit 11; ERROR 1221 (HY000): Incorrect usage of CUBE/ROLLUP and ORDER BY |
Oops. Bad luck – for some reason you can’t use order by together with ROLLUP. This does not make much sense to me and I find it quite inconvenient whenever it is MySQL limitation or SQL Standard. This would be extremely useful feature and I do not see good technical reaso not allowing it either.
OK Lets see how fast it is without order by (and limit):
1 2 3 4 5 6 7 8 9 10 11 12 13 | mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup; +-------+---------+ | grp | cnt | +-------+---------+ | 0 | 105 | | 1 | 259 | | 2 | 200 | ... | 9999 | 235 | | 10000 | 107 | | | 2352996 | +-------+---------+ 10002 rows in set (28.79 sec) |
As you can see there is considerable penalty associalted with GROUP BY WITH ROLLUP in MySQL – it is significantly slower than standard group by even though the only thing it needs to do is maintain an extra count.
So if MySQL does not allow us to use ORDER BY together with GROUP BY WITH ROLLUP can we still do something to get the result set we want from single query ? Sure. Here is Way Number Three:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | mysql> select * from (select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup) t order by cnt desc limit 11; +------+---------+ | grp | cnt | +------+---------+ | NULL | 2352996 | | 9879 | 300 | | 3888 | 298 | | 1793 | 297 | | 2082 | 294 | | 9729 | 293 | | 3760 | 292 | | 7918 | 290 | | 4055 | 290 | | 3749 | 290 | | 6950 | 290 | +------+---------+ 11 rows in set (29.68 sec) |
Use of extra temporary table for buffering helps us to get result set we’re looking for and workaround this limitation, though it of course comes with performance penalty.
This approach is however still faster than running 2 queries completing in 30 seconds rather than 40. Though fetching all result set in this case is still significantly faster. So GROUP BY WITH Rollup is not the killer solution for this problem, while it well could be if well implemented inside MySQL.
Why am I looking on reporting performance optimization ? It is for ClickAider project which is growing rapidly and use of MySQL for reporting quite expectedly starts to become the bottleneck. We use a lot of other black magic to keep things as optimal as possible though performance is still far from our goal of real time reporting on 10.000.000+ recorded events.
UPDATE:
Looking a bit further into it I found why GROUP BY WITH ROLLUP is slower and why ORDER BY does not work with it. The thing is – it is using filesort as group by execution method, not temporary table as ordinary GROUP BY:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | mysql> explain select grp, count(*) cnt from dt where slack like "a%" group by grp \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: dt type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 37748736 Extra: Using where; Using temporary; Using filesort 1 row in set (0.01 sec) mysql> explain select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: dt type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 37748736 Extra: Using where; Using filesort 1 row in set (0.00 sec) |
As I forced FileSort execution method for GROUP BY by using SQL_BIG_RESULT hint I can see GROUP BY executing about same time as GROUP BY WITH ROLLUP.
Hi Peter,
Looks like you CAN order the results the way you want in a GROUP BY … WITH ROLLUP (at least in 5.1.11-beta – I haven’t checked other versions).
The key word is DESC after immediately after GROUP BY:
mysql> select IFNULL(date,’Total’) as ‘date_total’, no_of_files, raw_size as ‘raw_size’, compressed_size as ‘compressed_size’, avg_ratio as ‘avg_ratio’ from (select date(log_datestamp) as date, count(log_instance) as no_of_files, sum(log_uncompressed_size) as raw_size, sum(log_compressed_size) as compressed_size, round(avg(log_compress_ratio_percentage),3) as avg_ratio from logs_stats group by date desc with rollup) as a;
+————+————-+————–+—————–+———–+
| date_total | no_of_files | raw_size | compressed_size | avg_ratio |
+————+————-+————–+—————–+———–+
| 2007-09-21 | 16 | 2605136552 | 131024683 | 94.988 |
| 2007-09-20 | 171 | 26946514751 | 1724108146 | 93.246 |
| 2007-09-19 | 355 | 53270319908 | 11372204678 | 79.300 |
| 2007-09-18 | 375 | 57924126151 | 12854828516 | 80.481 |
……………………………………………………………….
| 2007-07-16 | 1 | 113834620 | 82608679 | 27.400 |
| 2007-06-08 | 1 | 132729275 | 12107490 | 90.900 |
| 2007-05-22 | 1 | 180724880 | 16671364 | 90.800 |
| 2007-05-03 | 1 | 180577870 | 16670719 | 90.800 |
| Total | 5641 | 602786174381 | 126053043926 | 80.638 |
+————+————-+————–+—————–+———–+
37 rows in set (0.04 sec)
“Sure, trust the gurus. Just don’t believe anything they say, especially when it comes to performance.”
– Steven Feuerstein
Cheers!
Daniel,
Please compare the query I’m saying which can’t be run and your query. They are VERY different.
Indeed you can get results sorted in any direction by columns you use for group by
but if you’re using group by with roll up you can’t sort by value of aggregate function such as count(), avg() sum() – hope it makes things a bit more clear.
Hello Peter-
I was surprised to see a last name such as yours.
I was told that some of my family routes could have
had a similar name. Father was from a town that sounded
like “Chipchevitz”. Relatives close by came from Sarne.
Indeed. Your last name directly translates as “Hare” in Russian. I guess it should be from some slavic country but probably not Russian directly as Russian last names usually formed using some apendix like “ov”, “ev” etc.
In MySQL 5 and up you could do the following. Since you already have the counts, as long as you grouop by the same fields, using the sum function in the upper query does not change the counts, and you gain speed by using the limit in the subquery.
SELECT grp
, sum(cnt)
FROM ( SELECT grp
, count(*) cnt
FROM dt
WHERE slack LIKE “a%”
GROUP BY grp
ORDER BY cnt DESC
LIMIT 10
) t1
GROUP BY grp WITH rollup;
-richard
Having same problem with count() for tagcloud generation. have 300.000 different tags (groups) and 1.5 mil records. takes 120 sec. to populate a tagcloud. Do you know any optimal ways to do tagclouds for huge websites?
I wanted to store total numebers and update time on each tag but I am using same tagcloud for different things. so I need to store several totals for each tag which will be 20 extra columns for each record.
vepa,
This may be long (though can be OK if the load is IO bound) – the first thing you should not run such complex queries in real time 🙂
The Tag Cloud only changes when an article is added or changed.
So geneate a normalised value tag information and store the information in a table.
Then when needed update the tag table with the tags used for the post.
Each day – re-normalise the tag table.
You’ll need to normalise the values otherwise you’ll end up with very large values over time – when what’s important is the relative values for the tags.
For optimum performance:
If you’re using PHP for the site – you can also use APC to store the site-wide TAG data (the HTML will be enough) – and then the code that updates the tag table – also updates the APC tag data.
Excellent article!!! Thank you ! It solved my issues with ORDER BY and GROUP BY ..WITH ROLLUP in a great way. I also noticed that this is significant faster than the regular WITH ROLLUP
Great work. I admire when someone sits down and explains the nitty gritty and simple and yet full true details about a technology such that it is understandable.
As an educator for more than 15 years, and in parallel herewith a professional in software with R&D and Software Production Line experience in Nokia, Ericsson, Motorola, CISCO etc., I greet you for your good work.
I was using your article to illustrate some ways of making summaries for my accountant (yes – she’s educating in MySQL! – why not – better than making software for her all the time !)
Humble Thanks,
Digital Age Institute Ltd.
David Svarrer
You have written this article about 5 years ago, but this is still worth to be recommended strongly!
It saved much times for me.
Thanks in advance,