Using GROUP BY WITH ROLLUP for Reporting Performance Optimization

Quite typical query for reporting applications is to find top X values. If you analyze Web Site logs you would look at most popular web pages or search engine keywords which bring you most of the traffic. If you’re looking at ecommerce reporting you may be interested in best selling product or top sales people. This information may often need simple select query, however what if you would like to show percents not just absolute value ?

For illustration purposes I’ve created a syntetic table filled with some 30mil rows evenly spread in 10.000 groups.

 CREATE TABLE `dt` (
  `grp` int(10) unsigned NOT NULL,
  `slack` varchar(50) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1

CREATE TABLE `dt` (

`grp` int(10) unsigned NOT NULL,

`slack` varchar(50) DEFAULT NULL

) ENGINE=MyISAM DEFAULT CHARSET=latin1

And I’m using some silly like query to illustrate some search is required, so count(*) can’t be optimized away for MyISAM Tables.

To show percents for the values we need to know total number of rows which matches our where clause.
Most Simple Way Number One is to simply run two queries:

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by cnt desc limit 10;
+------+-----+
| grp  | cnt |
+------+-----+
| 9879 | 300 | 
| 3888 | 298 | 
| 1793 | 297 | 
| 2082 | 294 | 
| 9729 | 293 | 
| 3760 | 292 | 
| 2588 | 290 | 
| 7918 | 290 | 
| 4055 | 290 | 
| 6950 | 290 | 
+------+-----+
10 rows in set (21.12 sec)

mysql> select count(*) cnt from dt where slack like "a%";
+---------+
| cnt     |
+---------+
| 2352996 | 
+---------+
1 row in set (18.71 sec)

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by cnt desc limit 10;

+------+-----+

| grp | cnt |

+------+-----+

| 9879 | 300 |

| 3888 | 298 |

| 1793 | 297 |

| 2082 | 294 |

| 9729 | 293 |

| 3760 | 292 |

| 2588 | 290 |

| 7918 | 290 |

| 4055 | 290 |

| 6950 | 290 |

+------+-----+

10 rows in set (21.12 sec)

mysql> select count(*) cnt from dt where slack like "a%";

+---------+

| cnt |

+---------+

| 2352996 |

+---------+

1 row in set (18.71 sec)

This allows us to get results in about 40 seconds and as we can see is pretty inefficient – almost half of the time is spent counting the rows which are already traversed and counted for group by operation.

The obvious optimization is to get rid of LIMIT 10 and just fetch all groups and sum values on the application side. Which is the second way.

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by cnt desc; 
+-------+-----+
| grp   | cnt |
+-------+-----+
|  9879 | 300 | 
|  3888 | 298 | 
|  1793 | 297 | 
...
|  7195 | 180 | 
|  6703 | 178 | 
| 10000 | 107 | 
|     0 | 105 | 
+-------+-----+
10001 rows in set (21.30 sec)

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by cnt desc;

+-------+-----+

| grp | cnt |

+-------+-----+

| 9879 | 300 |

| 3888 | 298 |

| 1793 | 297 |

...

| 7195 | 180 |

| 6703 | 178 |

| 10000 | 107 |

| 0 | 105 |

+-------+-----+

10001 rows in set (21.30 sec)

We even can got read of sorting and add ORDER BY NULL so MySQL does not bother to sort results, though this has a little difference in this case as number of groups is relatively small:

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by null;    
+-------+-----+
| grp   | cnt |
+-------+-----+
|  2257 | 251 | 
|  2418 | 266 | 
|  5842 | 258 | 
|    90 | 276 | 
...
|  2595 | 243 | 
|  4446 | 239 | 
|  9802 | 233 | 
|  1861 | 227 | 
+-------+-----+
10001 rows in set (21.39 sec)

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp order by null;

+-------+-----+

| grp | cnt |

+-------+-----+

| 2257 | 251 |

| 2418 | 266 |

| 5842 | 258 |

| 90 | 276 |

...

| 2595 | 243 |

| 4446 | 239 |

| 9802 | 233 |

| 1861 | 227 |

+-------+-----+

10001 rows in set (21.39 sec)

This method indeed works great if you have relatively small number of groups which you can fetch on the application side (you can store result in temporary table and run sum() and sort query on that table instead if amount of groups is much larger).

The other idea came to my mind is using GROUP BY WITH ROLLUP so I can get grouped result set together with total value for
the groups:

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup order by cnt desc limit 11;
ERROR 1221 (HY000): Incorrect usage of CUBE/ROLLUP and ORDER BY

1 2	mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup order by cnt desc limit 11; ERROR 1221 (HY000): Incorrect usage of CUBE/ROLLUP and ORDER BY

Oops. Bad luck – for some reason you can’t use order by together with ROLLUP. This does not make much sense to me and I find it quite inconvenient whenever it is MySQL limitation or SQL Standard. This would be extremely useful feature and I do not see good technical reaso not allowing it either.

OK Lets see how fast it is without order by (and limit):

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup;                  
+-------+---------+
| grp   | cnt     |
+-------+---------+
|     0 |     105 | 
|     1 |     259 | 
|     2 |     200 | 
...
|  9999 |     235 | 
| 10000 |     107 | 
|       | 2352996 | 
+-------+---------+
10002 rows in set (28.79 sec)

mysql> select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup;

+-------+---------+

| grp | cnt |

+-------+---------+

| 0 | 105 |

| 1 | 259 |

| 2 | 200 |

...

| 9999 | 235 |

| 10000 | 107 |

| | 2352996 |

+-------+---------+

10002 rows in set (28.79 sec)

As you can see there is considerable penalty associalted with GROUP BY WITH ROLLUP in MySQL – it is significantly slower than standard group by even though the only thing it needs to do is maintain an extra count.

So if MySQL does not allow us to use ORDER BY together with GROUP BY WITH ROLLUP can we still do something to get the result set we want from single query ? Sure. Here is Way Number Three:

mysql> select * from (select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup) t order by cnt desc limit 11;
+------+---------+
| grp  | cnt     |
+------+---------+
| NULL | 2352996 | 
| 9879 |     300 | 
| 3888 |     298 | 
| 1793 |     297 | 
| 2082 |     294 | 
| 9729 |     293 | 
| 3760 |     292 | 
| 7918 |     290 | 
| 4055 |     290 | 
| 3749 |     290 | 
| 6950 |     290 | 
+------+---------+
11 rows in set (29.68 sec)

mysql> select * from (select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup) t order by cnt desc limit 11;

+------+---------+

| grp | cnt |

+------+---------+

| NULL | 2352996 |

| 9879 | 300 |

| 3888 | 298 |

| 1793 | 297 |

| 2082 | 294 |

| 9729 | 293 |

| 3760 | 292 |

| 7918 | 290 |

| 4055 | 290 |

| 3749 | 290 |

| 6950 | 290 |

+------+---------+

11 rows in set (29.68 sec)

Use of extra temporary table for buffering helps us to get result set we’re looking for and workaround this limitation, though it of course comes with performance penalty.

This approach is however still faster than running 2 queries completing in 30 seconds rather than 40. Though fetching all result set in this case is still significantly faster. So GROUP BY WITH Rollup is not the killer solution for this problem, while it well could be if well implemented inside MySQL.

Why am I looking on reporting performance optimization ? It is for ClickAider project which is growing rapidly and use of MySQL for reporting quite expectedly starts to become the bottleneck. We use a lot of other black magic to keep things as optimal as possible though performance is still far from our goal of real time reporting on 10.000.000+ recorded events.

UPDATE:
Looking a bit further into it I found why GROUP BY WITH ROLLUP is slower and why ORDER BY does not work with it. The thing is – it is using filesort as group by execution method, not temporary table as ordinary GROUP BY:

mysql> explain select  grp, count(*) cnt from dt where slack like "a%" group by grp  \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dt
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 37748736
        Extra: Using where; Using temporary; Using filesort
1 row in set (0.01 sec)

mysql> explain select  grp, count(*) cnt from dt where slack like "a%" group by grp with rollup  \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dt
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 37748736
        Extra: Using where; Using filesort
1 row in set (0.00 sec)

mysql> explain select grp, count(*) cnt from dt where slack like "a%" group by grp \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: dt

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 37748736

Extra: Using where; Using temporary; Using filesort

1 row in set (0.01 sec)

mysql> explain select grp, count(*) cnt from dt where slack like "a%" group by grp with rollup \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: dt

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 37748736

Extra: Using where; Using filesort

1 row in set (0.00 sec)

As I forced FileSort execution method for GROUP BY by using SQL_BIG_RESULT hint I can see GROUP BY executing about same time as GROUP BY WITH ROLLUP.

11 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Daniel Ciulinaru

16 years ago

Hi Peter,

Looks like you CAN order the results the way you want in a GROUP BY … WITH ROLLUP (at least in 5.1.11-beta – I haven’t checked other versions).
The key word is DESC after immediately after GROUP BY:

mysql> select IFNULL(date,’Total’) as ‘date_total’, no_of_files, raw_size as ‘raw_size’, compressed_size as ‘compressed_size’, avg_ratio as ‘avg_ratio’ from (select date(log_datestamp) as date, count(log_instance) as no_of_files, sum(log_uncompressed_size) as raw_size, sum(log_compressed_size) as compressed_size, round(avg(log_compress_ratio_percentage),3) as avg_ratio from logs_stats group by date desc with rollup) as a;
+————+————-+————–+—————–+———–+
| date_total | no_of_files | raw_size | compressed_size | avg_ratio |
+————+————-+————–+—————–+———–+
| 2007-09-21 | 16 | 2605136552 | 131024683 | 94.988 |
| 2007-09-20 | 171 | 26946514751 | 1724108146 | 93.246 |
| 2007-09-19 | 355 | 53270319908 | 11372204678 | 79.300 |
| 2007-09-18 | 375 | 57924126151 | 12854828516 | 80.481 |
……………………………………………………………….
| 2007-07-16 | 1 | 113834620 | 82608679 | 27.400 |
| 2007-06-08 | 1 | 132729275 | 12107490 | 90.900 |
| 2007-05-22 | 1 | 180724880 | 16671364 | 90.800 |
| 2007-05-03 | 1 | 180577870 | 16670719 | 90.800 |
| Total | 5641 | 602786174381 | 126053043926 | 80.638 |
+————+————-+————–+—————–+———–+
37 rows in set (0.04 sec)

“Sure, trust the gurus. Just don’t believe anything they say, especially when it comes to performance.”
– Steven Feuerstein

Cheers!

Peter Zaitsev

Author

16 years ago

Daniel,

Please compare the query I’m saying which can’t be run and your query. They are VERY different.

Indeed you can get results sorted in any direction by columns you use for group by
but if you’re using group by with roll up you can’t sort by value of aggregate function such as count(), avg() sum() – hope it makes things a bit more clear.

Larry Zaetz

16 years ago

Hello Peter-
I was surprised to see a last name such as yours.
I was told that some of my family routes could have
had a similar name. Father was from a town that sounded
like “Chipchevitz”. Relatives close by came from Sarne.

Peter Zaitsev

Author

16 years ago

Indeed. Your last name directly translates as “Hare” in Russian. I guess it should be from some slavic country but probably not Russian directly as Russian last names usually formed using some apendix like “ov”, “ev” etc.

richard

16 years ago

In MySQL 5 and up you could do the following. Since you already have the counts, as long as you grouop by the same fields, using the sum function in the upper query does not change the counts, and you gain speed by using the limit in the subquery.

SELECT grp
, sum(cnt)
FROM ( SELECT grp
, count(*) cnt
FROM dt
WHERE slack LIKE “a%”
GROUP BY grp
ORDER BY cnt DESC
LIMIT 10
) t1
GROUP BY grp WITH rollup;

-richard

vepa

15 years ago

Having same problem with count() for tagcloud generation. have 300.000 different tags (groups) and 1.5 mil records. takes 120 sec. to populate a tagcloud. Do you know any optimal ways to do tagclouds for huge websites?

I wanted to store total numebers and update time on each tag but I am using same tagcloud for different things. so I need to store several totals for each tag which will be 20 extra columns for each record.

Peter Zaitsev

Author

15 years ago

vepa,

This may be long (though can be OK if the load is IO bound) – the first thing you should not run such complex queries in real time 🙂

Dasher

15 years ago

The Tag Cloud only changes when an article is added or changed.

So geneate a normalised value tag information and store the information in a table.
Then when needed update the tag table with the tags used for the post.

Each day – re-normalise the tag table.

You’ll need to normalise the values otherwise you’ll end up with very large values over time – when what’s important is the relative values for the tags.

For optimum performance:
If you’re using PHP for the site – you can also use APC to store the site-wide TAG data (the HTML will be enough) – and then the code that updates the tag table – also updates the APC tag data.

sqlMontreal

13 years ago

Excellent article!!! Thank you ! It solved my issues with ORDER BY and GROUP BY ..WITH ROLLUP in a great way. I also noticed that this is significant faster than the regular WITH ROLLUP

David Svarrer

11 years ago

Great work. I admire when someone sits down and explains the nitty gritty and simple and yet full true details about a technology such that it is understandable.

As an educator for more than 15 years, and in parallel herewith a professional in software with R&D and Software Production Line experience in Nokia, Ericsson, Motorola, CISCO etc., I greet you for your good work.

I was using your article to illustrate some ways of making summaries for my accountant (yes – she’s educating in MySQL! – why not – better than making software for her all the time !)

Humble Thanks,
Digital Age Institute Ltd.
David Svarrer

Nahrae

11 years ago

You have written this article about 5 years ago, but this is still worth to be recommended strongly!
It saved much times for me.

Thanks in advance,

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Using GROUP BY WITH ROLLUP for Reporting Performance Optimization

Related

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Configuration Best Practices

Valkey/Redis: The Hash Datatype

Valkey/Redis Replication and Auto-Failover With Sentinel Service

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Using GROUP BY WITH ROLLUP for Reporting Performance Optimization

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Configuration Best Practices

Valkey/Redis: The Hash Datatype

Valkey/Redis Replication and Auto-Failover With Sentinel Service

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation