Over the last few days I have had a lot of questions at the MySQL Performance Forum, as well as from our customers, regarding query optimization… and they had one thing in common: it was not the query that needed to be optimized.
Way too frequently people design the schema first and only then think about how the queries they need answered can be expressed using that schema… which often just does not work, because a lot of rows need to be traversed to get the results you're looking for. Here are a couple of examples:
GROUP BY Consider SELECT COUNT(*) cnt, page FROM log GROUP BY page ORDER BY cnt DESC LIMIT 10, a query to display the most popular pages. There are very many queries of this type in web applications which group by one field, sort by another, and then use LIMIT to output the first few rows. Such a query is not going to be efficient, as it needs to traverse a lot of rows to perform the GROUP BY, so for a large number of rows you simply have to use a different solution, such as a summary table. With MySQL 5.0 it is also easy to implement one by means of triggers.
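To make the summary-table idea concrete, here is a minimal sketch using Python's sqlite3 module (table and trigger names are made up for the example, and MySQL 5.0 trigger syntax differs slightly, but the idea carries over): a trigger keeps a per-page counter up to date, so the top-pages query never has to GROUP BY over the full log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw log: one row per page view.
cur.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, page TEXT)")

# Summary table: running count per page.
cur.execute("CREATE TABLE page_counts (page TEXT PRIMARY KEY, cnt INTEGER NOT NULL)")

# Trigger keeps the summary in sync on every insert into the log.
cur.execute("""
    CREATE TRIGGER log_after_insert AFTER INSERT ON log
    BEGIN
        INSERT OR IGNORE INTO page_counts (page, cnt) VALUES (NEW.page, 0);
        UPDATE page_counts SET cnt = cnt + 1 WHERE page = NEW.page;
    END
""")

for page in ["/home", "/about", "/home", "/home", "/about", "/contact"]:
    cur.execute("INSERT INTO log (page) VALUES (?)", (page,))

# Top pages now come from the small summary table, no GROUP BY over the log.
top_pages = cur.execute(
    "SELECT page, cnt FROM page_counts ORDER BY cnt DESC LIMIT 10").fetchall()
print(top_pages)  # [('/home', 3), ('/about', 2), ('/contact', 1)]
```

The LIMIT 10 now scans a table with one row per distinct page rather than one row per page view.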
From a schema design standpoint it is best to make sure your most frequent queries do not need GROUP BY. This can be achieved by normalization or other techniques.
COUNT Many people think counting is fast. Some have heard that MyISAM is instant with COUNT(*) without reading enough of the details to know this only applies to a very special case – when there is no WHERE clause. Quite typically you need to page over results, and even if you just show the first 10 of them and do it by index, the COUNT(*) query can cause a lot of trouble. Imagine, for example, some sort of profile browsing at a social networking site – SELECT COUNT(*) FROM profiles WHERE STATE_ID=12 AND CITY_ID=234 AND GENDER='M'. If the number of rows is not too large, using a covering index, i.e. (STATE_ID, CITY_ID, GENDER), is great, because scanning the index frequently makes the COUNT 5-10 times faster, sometimes more. This, however, only works up to a certain number of rows in the result set; with result sets of 100,000+ rows it becomes pretty inefficient.
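Here is a runnable sketch of the covering-index point, using SQLite from Python for illustration (the table and index names are invented; MySQL behaves analogously): with a composite index on exactly the columns in the WHERE clause, the COUNT(*) is answered from the index alone, which SQLite's query plan reports as a covering index scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE profiles (
    id INTEGER PRIMARY KEY, state_id INTEGER, city_id INTEGER, gender TEXT)""")
cur.executemany(
    "INSERT INTO profiles (state_id, city_id, gender) VALUES (?, ?, ?)",
    [(12, 234, 'M'), (12, 234, 'F'), (12, 234, 'M'), (12, 999, 'M')])

# Composite index on exactly the columns the WHERE clause uses, so the
# count can be computed without touching the table rows at all.
cur.execute("CREATE INDEX idx_state_city_gender "
            "ON profiles (state_id, city_id, gender)")

query = ("SELECT COUNT(*) FROM profiles "
         "WHERE state_id=12 AND city_id=234 AND gender='M'")
(cnt,) = cur.execute(query).fetchone()
print(cnt)  # 2

# The plan confirms only the index is scanned ("COVERING INDEX").
plan = " ".join(row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + query))
print(plan)
```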
There is, of course, SQL_CALC_FOUND_ROWS, but it can often make the query slower than running two queries, because it will not be able to use a covering index to compute the row count. If the number of combinations is small, as in this example, you can use precomputed count values with great success. You can also cache them aggressively when they are high – really, who cares whether there are 100345 or 100360 matching profiles?
Yet another approach is to throw away the count and run the main query with LIMIT 1000 or so, cache the results for all pages, and, if 1000 rows were returned, write "1000 or more rows matched". Many people dislike this because exact match counts, as in Google, make the application look more professional, but in practice, for many applications, users do not really care.
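A sketch of that capped-count trick in Python with SQLite (the cap constant and function name are made up for the example): fetching at most cap+1 matching rows bounds the work no matter how many rows actually match.

```python
import sqlite3

COUNT_CAP = 1000  # stop counting past this point

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, state_id INTEGER)")
cur.executemany("INSERT INTO profiles (state_id) VALUES (?)",
                [(12,)] * 1500 + [(7,)] * 5)

def match_count_message(state_id):
    # LIMIT cap+1 bounds the scan: one extra row is enough to tell
    # "exactly N" apart from "cap or more".
    rows = cur.execute("SELECT id FROM profiles WHERE state_id=? LIMIT ?",
                       (state_id, COUNT_CAP + 1)).fetchall()
    if len(rows) > COUNT_CAP:
        return "%d or more rows matched" % COUNT_CAP
    return "%d rows matched" % len(rows)

print(match_count_message(12))  # 1000 or more rows matched
print(match_count_message(7))   # 5 rows matched
```

The fetched rows can double as the cached result pages, so the capped count comes essentially for free.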
In some cases you can avoid COUNT completely by storing count values in the tables; in others you can't, but you should at least try to make the counting queries efficient – make sure all WHERE clauses use indexes, try to have queries covered by an index, and avoid joins as much as possible.
ORDER BY I already wrote about ORDER BY LIMIT optimization, so I will not repeat it here. In a nutshell, ORDER BY … LIMIT can be very fast if it is done using an index, or it can be slow on large data sets if an external filesort is used, so make sure to check EXPLAIN for this one. Frequently I see dynamic query generation where some query variants fall back to using a filesort, and it does not take many such queries to overload the server.
Also watch out for high limit values and "post filtering" with index-based ORDER BY; these can also cause a lot of trouble.
So when designing the schema, make sure that for the queries you're going to run, ORDER BY … LIMIT can be performed using an index; for example, a simple JOIN is often a showstopper for this.
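You can see the difference directly in the query plan. Here is a small sketch using SQLite from Python (table and index names are invented; SQLite reports an external sort as "USE TEMP B-TREE FOR ORDER BY", the counterpart of MySQL's "Using filesort" in EXPLAIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE articles "
            "(id INTEGER PRIMARY KEY, views INTEGER, title TEXT)")

def plan_for(sql):
    # Join the detail column of EXPLAIN QUERY PLAN into one string.
    return " ".join(row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT title FROM articles ORDER BY views DESC LIMIT 10"

# No index on views: the whole table is scanned and sorted externally.
plan_before = plan_for(query)
print(plan_before)

cur.execute("CREATE INDEX idx_views ON articles (views)")

# With the index, rows come out already ordered and LIMIT can stop early.
plan_after = plan_for(query)
print(plan_after)
```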
“From a schema design standpoint it is best to make sure your most frequent queries do not need GROUP BY. This can be achieved by normalization or other techniques.”
Is that a typo and you meant to say “denormalization”?
Lukas,
Actually, in this case I meant normalization. You're right, however, that for very many queries data denormalization is what actually helps.
Let me give you an example. Assume someone has a denormalized schema of cities which contains the columns (City, State, Country, Population). Now, to display the list of cities one may do SELECT DISTINCT City FROM Cities, which is the same as SELECT City FROM Cities GROUP BY City. Now assume you want to display the top countries by population: you may do SELECT Country, SUM(Population) pop FROM Cities GROUP BY Country ORDER BY pop DESC LIMIT 10, but if you normalized your data you could simply store the list of countries separately and just add a country population field to that table.
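Here is that example as a runnable sketch (Python with SQLite, made-up city data): once countries live in their own table with a population column, the top-countries listing becomes a plain ORDER BY … LIMIT on a small table instead of a GROUP BY … SUM over every city row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: country repeated on every city row, so listing top
# countries requires a GROUP BY ... SUM over the whole table.
cur.execute("CREATE TABLE cities "
            "(city TEXT, state TEXT, country TEXT, population INTEGER)")
cur.executemany("INSERT INTO cities VALUES (?, ?, ?, ?)", [
    ("Springfield", "IL", "USA", 100), ("Portland", "OR", "USA", 600),
    ("Toronto", "ON", "Canada", 2700), ("Ottawa", "ON", "Canada", 900),
])

grouped = cur.execute(
    "SELECT country, SUM(population) pop FROM cities "
    "GROUP BY country ORDER BY pop DESC LIMIT 10").fetchall()

# Normalized: each country stored once with its own population column;
# the same listing is now a simple ORDER BY ... LIMIT on a small table.
cur.execute("CREATE TABLE countries (country TEXT PRIMARY KEY, population INTEGER)")
cur.execute("INSERT INTO countries "
            "SELECT country, SUM(population) FROM cities GROUP BY country")

normalized = cur.execute(
    "SELECT country, population FROM countries "
    "ORDER BY population DESC LIMIT 10").fetchall()
print(grouped)     # [('Canada', 3600), ('USA', 700)]
print(normalized)  # same result, without aggregating city rows
```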
I have had similar cases in my consulting too. I just recently wrote an article about it: http://www.oreillynet.com/pub/a/mysql/2007/03/01/optimize-mysql-rank-data.html
Great stuff, and an excellent point about MyISAM/InnoDB COUNT(*): the MyISAM benefit only comes into play when there's no WHERE condition, not when there is! 🙂
One other quick tip:
If you do use a denormalized structure and use summary tables to store often-queried COUNTs and SUMs and the like, make sure you use SELECT SQL_NO_CACHE … FROM my_summary_table; to prevent the query cache from storing results for data that will be very quickly invalidated.
Cheers, and keep up the fantastic work, PeterZ! 🙂
Thanks Jay,
Regarding your comment about SQL_NO_CACHE – you're right, however you should know what "quickly" really means. Is invalidation once per second quick? Well, it is if you only run the same query once during that time, but if you have 5,000 queries/sec and, let's say, 100 of them are the same, the query cache may be worth it even with such frequent invalidation.
Also in many cases I see summary tables updated periodically rather than on every update.
But anyway you may often do better by caching stuff in memcache 🙂
About “design schema first”, I think that’s the way to do it.
But what people miss is planning for table growth and sizes. That's where they fail.
Cesar,
Failing to plan for table growth is of course a typical mistake, but designing the schema without looking at the queries is another one. The same data may need to be laid out in different ways depending on which queries you're going to run – if high performance is the goal, of course. A lot of queries out there fail because schemas based on object relationships and ER diagrams can't run them efficiently as is.
If the query contains multiple "AND 1" clauses, e.g. 10-12 of them, will the query be significantly slower for a small or medium database of around 0.1-1 million records, with around 10-15 rows added daily?
Sometimes I generate a query for searching based on 10-12 or more possible keys, and use the above trick ("AND 1") in the dynamic query for the keys the user has not provided values for.
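One way to sketch that kind of dynamic query generation (Python with SQLite; the search function and column names are invented for the example) is to emit a predicate only for the keys the user actually supplied, rather than padding the missing ones with "AND 1":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, "
            "state_id INTEGER, city_id INTEGER, gender TEXT)")
cur.executemany(
    "INSERT INTO profiles (state_id, city_id, gender) VALUES (?, ?, ?)",
    [(12, 234, 'M'), (12, 234, 'F'), (7, 1, 'M')])

# Column names are fixed in code; only the values come from the user.
SEARCH_COLUMNS = ("state_id", "city_id", "gender")

def search(filters):
    # Build WHERE only from the keys the user actually filled in, so the
    # optimizer sees real predicates; placeholders keep the values safe.
    clauses, params = [], []
    for column in SEARCH_COLUMNS:
        value = filters.get(column)
        if value is not None:
            clauses.append(column + " = ?")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "1"
    return cur.execute("SELECT id FROM profiles WHERE " + where, params).fetchall()

print(search({"state_id": 12, "gender": "M"}))  # [(1,)]
print(search({}))  # no filters: all three rows
```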