While doing performance analysis today, I wanted to count how many hits go to pages that get more than a couple of visits per day. We had the logs in an SQL database, so it was a pretty simple query:
select sum(cnt) from (select count(*) cnt from performance_log_080306 group by page having cnt>2) pv;
Unfortunately, this query ran for over half an hour, badly overloaded the server, and in the end I had to kill it.
The reason for the slowness was, of course, that a huge temporary table was required (about 5 million distinct pages were visited that day), which resulted in an on-disk temporary table, and those, as we know, are quite slow.
Of course, it would be possible to allocate more memory to the temporary table, or to switch to the filesort method, and get the result faster.
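For reference, those two alternatives might look roughly like the sketch below. The 1 GB limit is an illustrative value, not a recommendation, and whether either option actually helps depends on the MySQL version and on the available memory.

```sql
-- Option 1: allow a larger in-memory implicit temporary table.
-- Both limits apply to MEMORY-engine temporary tables; the smaller one wins.
-- 1 GB is an assumed, illustrative value only.
SET SESSION tmp_table_size      = 1024 * 1024 * 1024;
SET SESSION max_heap_table_size = 1024 * 1024 * 1024;

-- Option 2: hint the optimizer to prefer sorting for the GROUP BY
-- over a temporary table keyed on the GROUP BY column.
SELECT SUM(cnt)
FROM (SELECT SQL_BIG_RESULT COUNT(*) cnt
      FROM performance_log_080306
      GROUP BY page
      HAVING cnt > 2) pv;
```

Both settings are session-scoped here, so they do not affect other connections.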
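I, however, took another road, which is quite helpful in similar cases. I did not need an exact result, only an approximate figure, so I could trick MySQL into grouping by a hash of the page instead of the page itself. CRC32() returns a 32-bit integer, so the grouping key becomes much smaller than the page string, at the price of occasionally merging two different pages that happen to share a hash: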
mysql> select sum(cnt) from (select count(*) cnt from performance_log_080306 group by crc32(page) having cnt>3) pv
    -> ;
+----------+
| sum(cnt) |
+----------+
|  1127031 |
+----------+
1 row in set (31.22 sec)
As you can see, it now completes in about 30 seconds, which is quite handy.
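If you want a rough feel for how much the hash merges distinct pages, one possible check is to compare the two distinct counts on a sample. This is only a sketch: the sample size of 100000 rows is an arbitrary assumption, and a sample will understate the number of collisions in the full table.

```sql
-- Compare the number of distinct pages with the number of distinct hashes.
-- If the two counts match, no two pages in the sample share a CRC32 value.
SELECT COUNT(DISTINCT page)        AS distinct_pages,
       COUNT(DISTINCT CRC32(page)) AS distinct_hashes
FROM (SELECT page
      FROM performance_log_080306
      LIMIT 100000) tmp;  -- 100000 is an assumed sample size
```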
Another trick I want to share, which I use a lot when I want to analyze data distribution but the table is too large, is to limit the query to the first N rows:
mysql> select avg(length(page)) from (select page from performance_log_080306 limit 10000) tmp;
+-------------------+
| avg(length(page)) |
+-------------------+
|           70.0444 |
+-------------------+
1 row in set (0.08 sec)
Again, this is not an exact value, but it is normally close enough to make a decision.
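One caveat is that LIMIT without ORDER BY simply takes whichever rows the server reads first, which may not be representative if the data changes over the day. A variation worth considering, sketched here with an assumed sampling fraction of 0.2%, is to pick rows at random instead. It still scans the whole table, so it is slower than the LIMIT version, but it is a plain aggregate with no GROUP BY, so it does not need a large temporary table.

```sql
-- Average page length over a random sample of rows.
-- RAND() is evaluated per row; 0.002 is an assumed fraction, adjust to taste.
SELECT AVG(LENGTH(page))
FROM performance_log_080306
WHERE RAND() < 0.002;
```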