This is a pretty simple approach I often use to optimize web application performance when the problem is limited to a few pages. If the problem is “everything is slow”, looking at the slow query log may be a better start.

So what can you do?

Look at the information shown on the page which comes from the database

This is important because this is the information you need to retrieve from the database one way or another if you want the page to keep looking the same. Your task is simply to have this information delivered from the database in the most efficient way.

Investigate how dynamic this information is

If the information rarely changes or is static, it may be better to pre-create or cache it. There are many caching and pre-creation techniques you can use. Just remember: avoiding database access altogether is the best way to optimize database access. The same applies to everything else – if you can avoid dynamic page generation entirely and serve the page from a server-side cache, it is even better.
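A minimal pre-creation sketch in SQL, for instance (the stories table, its columns, and the front_page_stories name are all hypothetical): a cron job rebuilds a small table periodically, so page views never run the expensive query.

-- Rebuild periodically (e.g. from cron), not on every page view.
DROP TABLE IF EXISTS front_page_stories;
CREATE TABLE front_page_stories
  SELECT id, title, intro
  FROM stories
  WHERE published = 1
  ORDER BY published_at DESC
  LIMIT 50;

-- Each page view then reads the small pre-built table:
SELECT id, title, intro FROM front_page_stories;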

Check if the information retrieved from the database matches the information you display

Way too often much more information is retrieved from the database than is used to generate the page. It can be as mild as SELECT * FROM tbl instead of listing the columns you actually need, or it can be as bad as running SELECT * FROM tbl just to count the number of rows in the table (no joking). Sometimes you will see 100 stories selected so that one random story can be displayed, and similar filtering on the application level. Sometimes you can do it more efficiently on the application level, but generally you should make your queries return only the information you need.
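A quick illustration of both patterns, reusing the tbl example from the text plus a hypothetical stories table:

-- Bad: fetches every column of every row just to count them.
SELECT * FROM tbl;

-- Good: let the server do the counting and return one row.
SELECT COUNT(*) FROM tbl;

-- Bad: SELECT * when the page only shows two columns.
SELECT * FROM stories WHERE published = 1;

-- Good: list only the columns the page actually displays.
SELECT id, title FROM stories WHERE published = 1;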

Check the number of rows used to generate the result set

This is a very important and often forgotten step. Some people think a query is simple if it returns few rows, while what really matters is the number of rows the query has to analyze. For example, SELECT COUNT(*) FROM links WHERE domain = 'mysql.com'; will return only one row, yet it may have to scan hundreds of thousands of rows (or index entries) to do so. Other common killer queries are GROUP BY queries and sort queries – SELECT name, descr FROM titles ORDER BY rank DESC LIMIT 10 – if there is no proper index and the query has to use “filesort”, it may become a big problem. JOINs are also worth mentioning – joins are expensive (relatively, of course) and certainly increase the number of rows that must be read to generate the result set; if you have to join 10 tables to compose an object, it is much slower than getting the same data with a single row read.
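A good way to check this is EXPLAIN, which shows how many rows MySQL estimates it will examine. Applied to the two example queries above:

EXPLAIN SELECT COUNT(*) FROM links WHERE domain = 'mysql.com';
-- The "rows" column may show hundreds of thousands even though
-- the query returns a single row.

EXPLAIN SELECT name, descr FROM titles ORDER BY rank DESC LIMIT 10;
-- "Using filesort" in the Extra column means the sort cannot use an index.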

Check the number of rows which actually need to be used to generate the result set

Sometimes a query has to read many rows to produce its result set just because the schema is not optimally indexed – this is the easy case. For example, our ORDER BY rank query is like that – adding an index on the rank column makes this query read 10 rows to return 10 rows, which is exactly what we’re looking for. Our COUNT(*) query, however, is different – even with an index on domain it can still require a lot of rows to be scanned to produce the result set. Such queries need to be redesigned rather than simply adjusted – for example, a summary table which holds the number of links per domain would do in this case.
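A sketch of both fixes, assuming the links and titles tables implied above (the index and table names here are made up):

-- Easy case: an index lets the ORDER BY query read only 10 rows.
ALTER TABLE titles ADD INDEX idx_rank (`rank`);

-- Redesign case: a summary table maintained as links are inserted.
CREATE TABLE domain_link_counts (
  domain VARCHAR(255) NOT NULL PRIMARY KEY,
  links  INT UNSIGNED NOT NULL
);

INSERT INTO domain_link_counts (domain, links) VALUES ('mysql.com', 1)
  ON DUPLICATE KEY UPDATE links = links + 1;

-- The page now reads one row instead of scanning many index entries:
SELECT links FROM domain_link_counts WHERE domain = 'mysql.com';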

Check the number of queries

If you can retrieve the same data with one query, it is better than doing so with several queries, assuming that query would not need to analyze many more rows because it is optimized differently. One typical example is SELECT * FROM tbl WHERE id=5 executed many times with different constants – it is worth replacing such queries with a single query using IN(5,7,4,56) instead. Do not be obsessed with this, however. I’ve seen people trying to join all queries into a single UNION (with padding to accommodate different types and numbers of columns) – this is not a good practice. But if you can reduce the number of queries without complicating the application architecture a lot, it is a good thing to do.
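For example, with the tbl lookups from the text, the batched form replaces four round trips with one:

-- Instead of one round trip per id...
SELECT * FROM tbl WHERE id = 5;
SELECT * FROM tbl WHERE id = 7;
SELECT * FROM tbl WHERE id = 4;
SELECT * FROM tbl WHERE id = 56;

-- ...fetch all of them in a single query:
SELECT * FROM tbl WHERE id IN (5, 7, 4, 56);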

Comments
Dmitri Mikhailov

To gather application–database round-trip statistics: http://forge.mysql.com/snippets/view.php?id=15

pedantic

I agree with most of what you’ve written, but I have to object to what I consider a common FUD:
the comment about only selecting those columns that are actually required for some “display”
This generally has zero impact on the performance of a query of the form SELECT * FROM table WHERE primary_key=x (with the usual caveats about BLOBs and the like).
It is relatively easy to quickly enumerate literally dozens of factors that impact the performance of a query (as you did mention, what really matters is the number of rows that query processing READS), and that one would have to go at the bottom of the list. And this can easily be benchmarked – I have seen far too many ill-informed developers littering their code with attempts to “gain” this performance benefit.
Frankly, I’m surprised to see you propagate it.

pedantic

Shame on me for not being clearer.
I wasn’t arguing that SELECT * is ideal practice (I never, ever use it). I was just using it as shorthand for “SELECT all columns from the table”, with all columns enumerated. The point of my response was that limiting the “projection” of a query has very little benefit (if any) in terms of performance, which is what your blog had claimed.
Now, in response to your claims:
Remember, you titled your entry “Database Optimizations for Web Applications”
emphasis on Web.
So your example of selecting 1 million rows of 100 floats each is irrelevant, since if you’re returning 1 million rows to a web browser you’ve got bigger problems.
Index coverage is certainly a valid point–granted.
As for string conversion costs, these are completely obviated if you are using prepared statements.
It is true that returning more data over the wire is generally more expensive than returning less (who could argue with that?), but I’m guided by Pareto–spend your time and effort on things that make a difference. And I would venture to say that in the vast majority of cases, by limiting your projection you will achieve at most negligible benefits.
And my objection is more of an architectural/design one–as an OO programmer everything is NOT just a big associative array, and if you litter your code with all different permutations of which fields you might need (which is very likely to change), you’ve created a big mess for very little benefit. Yet very many software shops continue to do such things.
I just wanted to offer a different view since some might tend to take what you’ve written here as canon (which btw, I generally do 🙂

Felix

I don’t think views can be used for summary data very usefully in MySQL as of yet, although that would be a really awesome thing to have on the list there. I guess it can be done with triggers in a somewhat sane way, or else it’s up to application logic to update the summary data tables or fields.
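A trigger-based sketch of the kind of thing Felix describes, reusing the hypothetical links table and domain_link_counts summary table from earlier:

-- Keep the summary table current without any application logic.
CREATE TRIGGER links_after_insert AFTER INSERT ON links
FOR EACH ROW
  INSERT INTO domain_link_counts (domain, links)
  VALUES (NEW.domain, 1)
  ON DUPLICATE KEY UPDATE links = links + 1;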

Alexey

> Sometimes you would see 100 stories selected so one random of them will be displayed and similar things with filtering on application level.

Another very common bad practice is to do something like SELECT * FROM stories ORDER BY RAND() LIMIT 1;
Most programmers seem to do it, and I’m not really sure that it’s faster than transferring all rows (from the query cache, usually) to the app, and randomly picking one of them.
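One common alternative, sketched here under the assumption of a dense auto-increment id column (gaps skew the distribution slightly), is to pick the random id first and then fetch a single row:

-- Pick a random position, then read one row by index instead of
-- sorting the whole table.
SET @r = FLOOR(1 + RAND() * (SELECT MAX(id) FROM stories));
SELECT * FROM stories WHERE id >= @r ORDER BY id LIMIT 1;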

Tim

Great tip. I’m bookmarking this one. I think (hope?) we may need this soon.

steve Mac

I applaud the effort. I use summary tables to limit the amount of rows analyzed. Peter, I am very thankful to you for sharing such quick and straightforward information with an open heart =). I am bookmarking this too.