September 6, 2006

Wrong GROUP BY makes your queries fragile

Posted by peter

This is probably well known issue for everyone having some MySQL experience or experience with any other SQL database. Still I see this problem in many production applications so it is worth to mention it, especially as it is connected to MySQL Performance. No it might not affect MySQL Performance per say but it limits our ability tune MySQL Performance as queries become fragile - changing execution plan leads to different query results.

So what I’m speaking about ?

Lets say you have query something like SELECT A,B,MAX(C) FROM TBL GROUP BY A - what would you expect from such query ? Column A is part of group by so its value is same for whole group. MAX(C) is also particular value for each group, while B is not part of GROUP BY and may well correspond to different values. Which one are you looking to get ? In fact this is where results becomes non-deterministic and fragile - any B from the group could be returned, while you might be expecting some particular one.

Many other DBMS and ANSI SQL Simply forbids such queries, MySQL is more permissive and will return you first B it runs into. This value however can be dependent on selected execution plan and change if you add some indexes. Also plan may change as database statistics changes and your query may become broken without any reason.

My Advice is to stay away from such group by statements, however as usually there are exceptions and such GROUP BY statements may be faster than alternatives. Assume for example we have non-normalized table (or join result) where two columns have 1-1 relationship. For example we have user id and login which are both unique. In this case running

SELECT id,login,max(login_time) FROM log GROUP BY id,login May be replaced by SELECT id,login,max(login_time) FROM log GROUP BY id which may be faster especially if GROUP BY is executed via sorting.

In this case even though we do not know which value from the group will be returned we do not care as it will be same for all rows.

Slow Query Log analyzes tools

Posted by peter

MySQL has simple but quite handy feature - slow query log, which allows you to log all queries which took over define number of seconds to execute. There is also an option to enable logging queries which do not use indexes even if they take less time (–log-queries-not-using-indexes)

Slow query log is great to spot really slow queries which are often good candidates for optimization but it has few serious problems which limits extent to which it is helpful. First - it only allows you to set slow query time in seconds, having 1 second minimum value. For most of interactive applications this is way too large - if you’re developing Web application you probably want whole page to be generated less in 1 second, which issues many queries during generation. Second - if you enable option to log queries which do not use indexes it well can be flooded with fast and efficient queries, which just happen to do full table scans - for example if you would be having drop down list of states in your application and use SELECT * FROM STATES for that it would trigger and log the query.

Taking other Approach

For our clients we often need to find a queries which impact application the most. It does not always have to be slowest queries - query taking 10ms and run 1.000 times per second puts more load on server than 10 seconds query running once per second. We of course want to get rid of really slow queries but to really optimize application throughput queries which generate most of the load need to be investigated

[read more...]


This page was found by: mysql group by havin...