When doing performance analyzes you often would want to see 95 percentile, 99 percentile and similar values. The “average” is the evil of performance optimization and often as helpful as “average patient temperature in the hospital”.
Lets set you have 10000 page views or queries and have average response time of 1 second. What does it mean ? Really nothing – may be one page view was 10000 seconds and the rest was in low milliseconds or may be you had every single page view taking 1 second, which are completely different.
You also do not really care about average performance – the goal of good user experience is majority of users to have good experience and average is not a good fit here. Defining your response time goal in 95 or 99 percentile is much better. Say you say 99 percentile response time should be one second, this means only 1 percent of queries/page views are allowed to take more than that. For larger systems defining (increasing) response times for 99.9 or even 99.99 percentile numbers often make sense.
It also often makes sense to define response time goals separately for different transactions – the AJAX widget response time requirements may be very different from the slow search page.
So you have defined your response time in terms of 95/99 percentile and get your logs in the table, so how to get the data if MySQL only provides you the avg:
mysql> select count(*),avg(wtime) from performance_log_081128 where page_type='search'; +----------+-----------------+ | count(*) | avg(wtime) | +----------+-----------------+ | 106859 | 1.4469140766532 | +----------+-----------------+ 1 row in set (2.08 sec)
The average response time here is for example; the real data what we need is number of rows which matches for given query type.
Dividing the count by 100 we get our 1% of values and dividing by 20 5% of values, now we can get the response time we concerned about simply by running following order-by queries:
mysql> select wtime from performance_log_081128 where page_type='search' order by wtime desc limit 1068,1; +---------+ | wtime | +---------+ | 10.1007 | +---------+ 1 row in set (2.06 sec) mysql> select wtime from performance_log_081128 where page_type='search' order by wtime desc limit 5342,1; +---------+ | wtime | +---------+ | 5.09297 | +---------+ 1 row in set (2.06 sec)
So for this system the 95 percentile is just over 5 sec (some 3 times more than the average) and 99% percentile is just a bit over 10 seconds (6 times more than average). The both numbers are horrible and system surely needs to be fixed.
These numbers are to illustrate – the percentile numbers can be quite different from average numbers (it is not rare to see 99 percentile to be order of magnitude different from the average) and this is what you really need to focus on.
Looking at the numbers from the business standpoint try to understand what these really are. In some cases I see rather bad percentile on the backend which are not really the problem for the business because there is a cache up front anyway. If 99% of requests are coming from the cache and you observe certain 99 percentile response time on the backend it is often 99.99 percentile response time which is a lot different – you often can afford 1/10000 requests to stall for few seconds, because things outside of your control (like packet loss at client side) would be responsible for larger amount of delays.
Be careful though – the “random” delays, for example if system was busy and delayed servicing request is one thing, “systematic” delays, when response time is always bad in given conditions can be much worse problems. You do not want your best client to suffer for example, even if he is the only one.