Quite typical query for reporting applications is to find top X values. If you analyze Web Site logs you would look at most popular web pages or search engine keywords which bring you most of the traffic. If you’re looking at ecommerce reporting you may be interested in best selling product or top sales people. This information may often need simple select query, however what if you would like to show percents not just absolute value ?

For illustration purposes I’ve created a syntetic table filled with some 30mil rows evenly spread in 10.000 groups.

And I’m using some silly like query to illustrate some search is required, so count(*) can’t be optimized away for MyISAM Tables.

To show percents for the values we need to know total number of rows which matches our where clause.
Most Simple Way Number One is to simply run two queries:

This allows us to get results in about 40 seconds and as we can see is pretty inefficient – almost half of the time is spent counting the rows which are already traversed and counted for group by operation.

The obvious optimization is to get rid of LIMIT 10 and just fetch all groups and sum values on the application side. Which is the second way.

We even can got read of sorting and add ORDER BY NULL so MySQL does not bother to sort results, though this has a little difference in this case as number of groups is relatively small:

This method indeed works great if you have relatively small number of groups which you can fetch on the application side (you can store result in temporary table and run sum() and sort query on that table instead if amount of groups is much larger).

The other idea came to my mind is using GROUP BY WITH ROLLUP so I can get grouped result set together with total value for
the groups:

Oops. Bad luck – for some reason you can’t use order by together with ROLLUP. This does not make much sense to me and I find it quite inconvenient whenever it is MySQL limitation or SQL Standard. This would be extremely useful feature and I do not see good technical reaso not allowing it either.

OK Lets see how fast it is without order by (and limit):

As you can see there is considerable penalty associalted with GROUP BY WITH ROLLUP in MySQL – it is significantly slower than standard group by even though the only thing it needs to do is maintain an extra count.

So if MySQL does not allow us to use ORDER BY together with GROUP BY WITH ROLLUP can we still do something to get the result set we want from single query ? Sure. Here is Way Number Three:

Use of extra temporary table for buffering helps us to get result set we’re looking for and workaround this limitation, though it of course comes with performance penalty.

This approach is however still faster than running 2 queries completing in 30 seconds rather than 40. Though fetching all result set in this case is still significantly faster. So GROUP BY WITH Rollup is not the killer solution for this problem, while it well could be if well implemented inside MySQL.

Why am I looking on reporting performance optimization ? It is for ClickAider project which is growing rapidly and use of MySQL for reporting quite expectedly starts to become the bottleneck. We use a lot of other black magic to keep things as optimal as possible though performance is still far from our goal of real time reporting on 10.000.000+ recorded events.

UPDATE:
Looking a bit further into it I found why GROUP BY WITH ROLLUP is slower and why ORDER BY does not work with it. The thing is – it is using filesort as group by execution method, not temporary table as ordinary GROUP BY:

As I forced FileSort execution method for GROUP BY by using SQL_BIG_RESULT hint I can see GROUP BY executing about same time as GROUP BY WITH ROLLUP.

11 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Daniel Ciulinaru

Hi Peter,

Looks like you CAN order the results the way you want in a GROUP BY … WITH ROLLUP (at least in 5.1.11-beta – I haven’t checked other versions).
The key word is DESC after immediately after GROUP BY:

mysql> select IFNULL(date,’Total’) as ‘date_total’, no_of_files, raw_size as ‘raw_size’, compressed_size as ‘compressed_size’, avg_ratio as ‘avg_ratio’ from (select date(log_datestamp) as date, count(log_instance) as no_of_files, sum(log_uncompressed_size) as raw_size, sum(log_compressed_size) as compressed_size, round(avg(log_compress_ratio_percentage),3) as avg_ratio from logs_stats group by date desc with rollup) as a;
+————+————-+————–+—————–+———–+
| date_total | no_of_files | raw_size | compressed_size | avg_ratio |
+————+————-+————–+—————–+———–+
| 2007-09-21 | 16 | 2605136552 | 131024683 | 94.988 |
| 2007-09-20 | 171 | 26946514751 | 1724108146 | 93.246 |
| 2007-09-19 | 355 | 53270319908 | 11372204678 | 79.300 |
| 2007-09-18 | 375 | 57924126151 | 12854828516 | 80.481 |
……………………………………………………………….
| 2007-07-16 | 1 | 113834620 | 82608679 | 27.400 |
| 2007-06-08 | 1 | 132729275 | 12107490 | 90.900 |
| 2007-05-22 | 1 | 180724880 | 16671364 | 90.800 |
| 2007-05-03 | 1 | 180577870 | 16670719 | 90.800 |
| Total | 5641 | 602786174381 | 126053043926 | 80.638 |
+————+————-+————–+—————–+———–+
37 rows in set (0.04 sec)

“Sure, trust the gurus. Just don’t believe anything they say, especially when it comes to performance.”
– Steven Feuerstein

Cheers!

Larry Zaetz

Hello Peter-
I was surprised to see a last name such as yours.
I was told that some of my family routes could have
had a similar name. Father was from a town that sounded
like “Chipchevitz”. Relatives close by came from Sarne.

richard

In MySQL 5 and up you could do the following. Since you already have the counts, as long as you grouop by the same fields, using the sum function in the upper query does not change the counts, and you gain speed by using the limit in the subquery.

SELECT grp
, sum(cnt)
FROM ( SELECT grp
, count(*) cnt
FROM dt
WHERE slack LIKE “a%”
GROUP BY grp
ORDER BY cnt DESC
LIMIT 10
) t1
GROUP BY grp WITH rollup;

-richard

vepa

Having same problem with count() for tagcloud generation. have 300.000 different tags (groups) and 1.5 mil records. takes 120 sec. to populate a tagcloud. Do you know any optimal ways to do tagclouds for huge websites?

I wanted to store total numebers and update time on each tag but I am using same tagcloud for different things. so I need to store several totals for each tag which will be 20 extra columns for each record.

Dasher

The Tag Cloud only changes when an article is added or changed.

So geneate a normalised value tag information and store the information in a table.
Then when needed update the tag table with the tags used for the post.

Each day – re-normalise the tag table.

You’ll need to normalise the values otherwise you’ll end up with very large values over time – when what’s important is the relative values for the tags.

For optimum performance:
If you’re using PHP for the site – you can also use APC to store the site-wide TAG data (the HTML will be enough) – and then the code that updates the tag table – also updates the APC tag data.

sqlMontreal

Excellent article!!! Thank you ! It solved my issues with ORDER BY and GROUP BY ..WITH ROLLUP in a great way. I also noticed that this is significant faster than the regular WITH ROLLUP

David Svarrer

Great work. I admire when someone sits down and explains the nitty gritty and simple and yet full true details about a technology such that it is understandable.

As an educator for more than 15 years, and in parallel herewith a professional in software with R&D and Software Production Line experience in Nokia, Ericsson, Motorola, CISCO etc., I greet you for your good work.

I was using your article to illustrate some ways of making summaries for my accountant (yes – she’s educating in MySQL! – why not – better than making software for her all the time !)

Humble Thanks,
Digital Age Institute Ltd.
David Svarrer

Nahrae

You have written this article about 5 years ago, but this is still worth to be recommended strongly!
It saved much times for me.

Thanks in advance,