This is a pretty simple approach I often use to optimize web application performance when the problem is limited to a few pages. If the problem is “everything is slow”, looking at the slow query log may be a better place to start.
So what can you do?
Look at the information shown on the page which comes from the database. This is important because this is the information you need to retrieve from the database one way or another if you want the page to keep looking the same. Your task is simply to have this information delivered from the database in the most efficient way.
Investigate how dynamic this information is. If the information rarely changes or is static, it may be better to pre-create or cache it. There are many caching and pre-creation techniques you can use. Just remember: avoiding database access entirely is the best way to optimize database access. The same applies to everything else – if you can avoid dynamic page generation altogether and have the server serve a cached page, it is even better.
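As a minimal sketch of this idea, here is a tiny in-process TTL cache in Python; the `cached` helper, the `"sidebar"` key and the `render_sidebar` function are hypothetical names for illustration, not from the post:

```python
import time

# key -> (expires_at, value); a process-local stand-in for memcached etc.
_cache = {}

def cached(key, ttl, generate):
    """Return a cached value, regenerating it only after `ttl` seconds."""
    now = time.time()
    hit = _cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]            # cache hit: no database work at all
    value = generate()           # the expensive part (e.g. a database query)
    _cache[key] = (now + ttl, value)
    return value

calls = []
def render_sidebar():
    calls.append(1)              # count how often we really regenerate
    return "<ul>...</ul>"

html_a = cached("sidebar", 60, render_sidebar)
html_b = cached("sidebar", 60, render_sidebar)  # served from the cache
```

The second call never touches `render_sidebar`, which is the whole point: the cheapest query is the one you never run.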
Check whether the information retrieved from the database matches the information you display. Way too often much more information is retrieved from the database than is used to generate the page. It can be as mild as SELECT * FROM tbl instead of listing the columns you actually need, or as bad as running SELECT * FROM tbl just to count the rows in the table (no joking). Sometimes you will see 100 stories selected so that one random story can be displayed, and similar filtering done at the application level. Occasionally you can do it more efficiently at the application level, but generally you should try to have your queries return only the information you need.
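Here is a small sketch of this point using SQLite from Python (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany("INSERT INTO tbl (title, body) VALUES (?, ?)",
                 [("t%d" % i, "x" * 1000) for i in range(100)])

# Bad: drag every row (including a large body column) to the app
# just to count them.
n_bad = len(conn.execute("SELECT * FROM tbl").fetchall())

# Good: let the database count; a single tiny row comes back.
n_good = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]

# Good: list only the columns the page actually displays.
titles = [t for (t,) in conn.execute("SELECT title FROM tbl LIMIT 10")]
```

Both counts are the same number, but the first one moved the whole table over the connection to compute it.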
Check the number of rows used to generate the result set. This is a very important and often forgotten step. Some people think a query is simple if it returns few rows, while what really matters is the number of rows the query has to analyze. For example, SELECT COUNT(*) FROM links WHERE domain = ‘mysql.com’; returns only one row but may need to scan hundreds of thousands of rows (or index entries) to do so. Other common killer queries are GROUP BY queries and sort queries – SELECT name,descr FROM titles ORDER BY rank DESC LIMIT 10 – if there is no proper index and the query has to use a “filesort”, it may become a big problem. JOINs are worth mentioning too – joins are (relatively speaking) expensive and certainly increase the number of rows needed to generate the result set; if you have to join 10 tables to compose an object, it is much slower than getting the same data with a single row read.
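You can see the rows-examined vs. rows-returned difference in a query plan. A sketch using the ORDER BY rank example above, with SQLite's EXPLAIN QUERY PLAN standing in for MySQL's EXPLAIN (the exact plan wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (name TEXT, descr TEXT, rank INTEGER)")
conn.executemany("INSERT INTO titles VALUES (?, ?, ?)",
                 [("n%d" % i, "d", i) for i in range(1000)])

q = "SELECT name, descr FROM titles ORDER BY rank DESC LIMIT 10"

# No index: full scan of all 1000 rows plus a sort
# ("filesort" in MySQL terms; a temp B-tree in SQLite).
before = " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + q))

conn.execute("CREATE INDEX idx_rank ON titles (rank)")

# With the index, only the 10 needed entries are read, already ordered.
after = " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + q))
```

Before the index the plan is a table scan; after it, the plan mentions `idx_rank`, meaning the query walks the index backwards and stops after 10 entries.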
Check the number of rows which actually need to be used to generate the result set. Sometimes a query has to use many rows only because the schema is not optimally indexed – that is the easy case. For example, our ORDER BY rank query is such a case: adding an index on the rank column makes this query use 10 rows to return 10 rows – exactly what we are looking for. Our COUNT(*) query is different – even with an index on domain it can still require a lot of rows to be scanned to produce the result set. Such queries need to be redesigned rather than simply adjusted – for example, a summary table holding the number of links per domain would do in this case.
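A sketch of that summary-table redesign for the COUNT(*) query, again in SQLite; the `domain_counts` table and `add_link` helper are hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, domain TEXT)")
conn.execute("CREATE TABLE domain_counts (domain TEXT PRIMARY KEY, cnt INTEGER)")

def add_link(domain):
    # The application maintains the summary alongside the base table.
    conn.execute("INSERT INTO links (domain) VALUES (?)", (domain,))
    conn.execute("INSERT OR IGNORE INTO domain_counts VALUES (?, 0)", (domain,))
    conn.execute("UPDATE domain_counts SET cnt = cnt + 1 WHERE domain = ?",
                 (domain,))

for d in ["mysql.com", "mysql.com", "mysql.com", "example.com"]:
    add_link(d)

# One single-row primary-key read instead of scanning all matching links.
cnt = conn.execute("SELECT cnt FROM domain_counts WHERE domain = ?",
                   ("mysql.com",)).fetchone()[0]
```

The count query now reads exactly one row no matter how many links the domain has.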
Check the number of queries. If you can retrieve the same data with one query, that is better than doing it with several queries, assuming the single query does not have to analyze many more rows because it is optimized differently. A typical example is SELECT * FROM tbl WHERE id=5 executed many times with different constants – it is worth replacing such queries with a single query using IN(5,7,4,56) instead. Do not become obsessed with this, however. I have seen people trying to join all queries into a single UNION (with padding to accommodate different types and numbers of columns) – this is not a good practice. But if you can reduce the number of queries without greatly complicating the application architecture, it is a good thing to do.
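A sketch of that batching in Python with SQLite (table name as in the example above), building the IN(...) placeholder list safely instead of interpolating values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)",
                 [(i, "v%d" % i) for i in range(1, 101)])

ids = [5, 7, 4, 56]

# Instead of one round trip per id ...
# rows = [conn.execute("SELECT id, val FROM tbl WHERE id = ?", (i,)).fetchone()
#         for i in ids]

# ... one round trip for all of them:
placeholders = ",".join("?" * len(ids))
rows = conn.execute("SELECT id, val FROM tbl WHERE id IN (%s)" % placeholders,
                    ids).fetchall()
```

Four lookups become one query; with a real client/server database the saved round trips matter even more than here.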
To gather application/database round-trip statistics: http://forge.mysql.com/snippets/view.php?id=15
I agree with most of what you’ve written, but I have to object to what I consider common FUD:
the comment about only selecting those columns that are actually required for some “display”
This generally has zero impact on the performance of a query of the form SELECT * FROM table WHERE primary_key=x (with the usual caveats about BLOBs and the like).
It is relatively easy to quickly enumerate literally dozens of factors that impact the performance of a query (as you did mention, what really matters is the # of rows that query processing READS), and this one would go at the bottom of the list. And this can easily be benchmarked – I have seen far too many ill-informed developers littering their code with attempts to “gain” this performance benefit.
Frankly, I’m surprised to see you propagate it.
Pedantic,
There are cases when SELECT * FROM table WHERE primary_key=x has limited overhead. There is always overhead (unless you list all the fields), but it can be too small to notice.
There are a lot of factors besides BLOB columns:
– The query can be index-covered if you select a limited number of columns.
– Selecting all columns may require more network traffic and more packets.
– Converting values to strings and back is expensive when many values are involved.
…
Try benchmarking it, for example, with a 1,000,000-row result set with 100 float columns versus selecting only a couple of them 🙂
So instead of wondering whether it would slow things down in your case or not, I would simply avoid it and use only the columns you need.
There are other reasons SELECT * is bad practice besides performance – for example, referring to columns by offset may break if a column is inserted between other columns by ALTER TABLE, and it may also break if a column is added and a JOIN now has duplicate columns, etc.
Sometimes you want to present all the data from the table and not think about which data is going to be used in each particular case – that may be reasonable in a particular case, but it does not mean it would not be slower.
Speaking about other factors which could be added to the list – of course. I write a great deal about query optimization on this blog. This list is meant to catch things that are implemented suboptimally; it does not mean that an application which passes this checklist will never have problems 🙂
Shame on me for not being clearer.
I wasn’t arguing that SELECT * is ideal practice (I never, ever use it). I was just using it as shorthand for “SELECT all columns from table” with all columns enumerated. The point of my response was that limiting the “projection” of a query has very little benefit (if any) in terms of performance, which is what your blog post had claimed.
Now, in response to your claims:
Remember, you titled your entry “Database Optimizations for Web Applications”
emphasis on Web.
So your example of selecting 1 million rows of 100 floats each is irrelevant, since if you’re returning 1 million rows to a web browser you’ve got bigger problems.
Index coverage is certainly a valid point–granted.
As for string conversion costs, these are completely obviated if you are using prepared statements.
It is true that returning more data over the wire is generally more expensive than returning less (who could argue with that?), but I’m guided by Pareto–spend your time and effort on things that make a difference. And I would venture to say that in the vast majority of cases, by limiting your projection you will achieve at most negligible benefits.
And my objection is more of an architectural/design one–as an OO programmer everything is NOT just a big associative array, and if you litter your code with all different permutations of which fields you might need (which is very likely to change), you’ve created a big mess for very little benefit. Yet very many software shops continue to do such things.
I just wanted to offer a different view since some might tend to take what you’ve written here as canon (which btw, I generally do 🙂
I don’t think views can be used for summary data very usefully in MySQL as of yet. Although that would be a really awesome thing to have on the list there. I guess it can be done with triggers in a somewhat sane way, or else it’s up to application logic to update the summary tables or fields.
Pedantic,
You seem to agree that in many cases it makes a difference. Again, I am not claiming it will always increase your performance dramatically. But if you select many columns which you do not use, there is a chance things can be optimized. Sometimes it is only a few percent; sometimes it may be dramatic, e.g. when you get index coverage for some tables.
The 1 million rows of 100 floats is of course simply to show that there is a difference and that string conversion is not free. Especially with connectors written in interpreted languages it may slow things down dramatically – we have seen a 5x+ difference compared to prepared statements. Prepared statements do help, but there is still overhead for sending extra columns.
Speaking about permutations to select only the columns you need… Of course I do not mean that. In fact, performance is not the only thing you typically care about. I frequently make decisions which are, e.g., 10% slower but allow simpler, clearer code, etc.
Of the optimizations mentioned this is the least important one, but I thought it was still worth having on the list – if I had not included it, someone else might have pointed it out 🙂
Thanks for comments 🙂
Felix,
Yes. MySQL does not have materialized views, but summary tables are a great way to limit the number of rows analyzed. This is one of the most common techniques – if you need a query which analyzes (aggregates) 1,000,000 rows to return 10, you are likely missing a pre-created summary table. It can be done with triggers or manually (sometimes you can do it much faster manually).
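A sketch of the trigger-maintained variant, using SQLite triggers from Python (the table and trigger names are illustrative; MySQL triggers would follow the same pattern):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE links (id INTEGER PRIMARY KEY, domain TEXT);
CREATE TABLE domain_counts (domain TEXT PRIMARY KEY, cnt INTEGER DEFAULT 0);

-- Keep the summary in sync on every insert into the base table.
CREATE TRIGGER links_ai AFTER INSERT ON links
BEGIN
    INSERT OR IGNORE INTO domain_counts (domain, cnt) VALUES (NEW.domain, 0);
    UPDATE domain_counts SET cnt = cnt + 1 WHERE domain = NEW.domain;
END;
""")
conn.executemany("INSERT INTO links (domain) VALUES (?)",
                 [("mysql.com",), ("mysql.com",), ("example.com",)])

# The aggregate is now a single-row primary-key lookup.
cnt = conn.execute("SELECT cnt FROM domain_counts WHERE domain = 'mysql.com'"
                   ).fetchone()[0]
```

The trade-off is a little extra write cost per insert in exchange for O(1) reads of the count.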
> Sometimes you would see 100 stories selected so one random of them will be displayed and similar things with filtering on application level.
Another very common bad practice is to do something like SELECT * FROM stories ORDER BY RAND() LIMIT 1;
Many programmers seem to do it, and I am not really sure it is faster than transferring all the rows (from the query cache, usually) to the application and randomly picking one of them.
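One common alternative, sketched here in Python with SQLite (table name as in the example above): pick a random id in the application, then fetch exactly one row, so the database never sorts the whole table. This assumes ids are roughly dense; the retry loop papers over gaps left by deletes.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO stories (title) VALUES (?)",
                 [("story %d" % i,) for i in range(100)])

lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM stories").fetchone()

row = None
while row is None:  # retry if the random id landed in a deleted gap
    rid = random.randint(lo, hi)
    row = conn.execute("SELECT id, title FROM stories WHERE id = ?",
                       (rid,)).fetchone()
```

Each attempt is a single primary-key read, versus ORDER BY RAND() which has to generate a random value for every row and sort them all.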
Great tip. I’m bookmarking this one. I think (hope?) we may need this soon.
I applaud the effort. I use summary tables to limit the number of rows analyzed. Peter, I am very thankful to you for sharing such quick and straightforward information with an open heart =). I am bookmarking this too.