Your MySQL server is overloaded. You see hundreds of running queries in SHOW PROCESSLIST taking many seconds to run, or you can’t connect at all because all connection slots are busy. If you have worked with MySQL long enough you have surely seen it, probably more than once. This is what I would call “MySQL server overload”: having more work than the server can possibly handle. At this point people often jump to the conclusion that something went wrong with MySQL and focus all their effort on that belief. This is also often how the questions are framed when they come in to our Support or Emergency Consulting teams.
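For example, a quick way to gauge how close you are to that state is to look at the processlist and the standard connection counters:

```
-- What is the server busy with right now?
SHOW FULL PROCESSLIST;

-- How close are we to exhausting the connection slots?
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS WHERE Variable_name IN
 ('Threads_connected', 'Threads_running', 'Max_used_connections');
```

If Threads_running stays well above the number of CPU cores while Threads_connected is pushing against max_connections, you are typically looking at genuine overload rather than the occasional slow query.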

In fact there are two very distinct causes for such a situation, and to resolve it most effectively you need to understand which one you’re dealing with. These are:

1. MySQL gets more work than normal. More work may mean more queries or a different query mix: a shift towards more complicated queries increases the amount of work even if the query rate goes down. The reasons for this vary. It might be a load spike due to a variety of external factors; a Google bot showing up and starting to index heavy, uncached pages; it can also be caused by Memcache (or another cache) expiring entries or going down, as well as many other reasons.

2. MySQL processes the usual amount of work more slowly. There are several reasons why MySQL may start acting slower. It can be caused by an increased level of concurrency, reduced buffer pool efficiency (caching), changed query plans, or accumulated unpurged data in InnoDB tables, as well as by a variety of factors external to MySQL, such as a backup running concurrently, the battery on a RAID volume going bad, or the effects of co-tenants in virtualized environments (a couple of quick checks for the InnoDB-side signals are sketched below).
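Buffer pool misses show up in Innodb_buffer_pool_reads, and accumulated unpurged data shows up as a growing history list length. The INNODB_METRICS query assumes MySQL or Percona Server 5.6+; on older versions the same figure appears as “History list length” in SHOW ENGINE INNODB STATUS.

```
-- Buffer pool efficiency: requests served from memory vs. reads from disk
SHOW GLOBAL STATUS WHERE Variable_name IN
 ('Innodb_buffer_pool_read_requests', 'Innodb_buffer_pool_reads');

-- Purge lag: a steadily growing history list means unpurged data is accumulating
SELECT NAME, COUNT FROM INFORMATION_SCHEMA.INNODB_METRICS
 WHERE NAME = 'trx_rseg_history_len';
```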

For both of the points above I’ve listed some examples of WHY it could happen. The list is far from complete, but my main point is that each of these conditions can have many causes, and typically both of them need to be considered.

Now to the question: “How do you know which case you are dealing with?”

To be able to answer that, you need some historical data to serve as a baseline. If you’re running Percona Monitoring Plugins you can look at the graphs for queries (of different kinds), the number of rows traversed, and the IO happening. In many cases this will be enough to see whether MySQL is getting much more load, because there will be a spike in the number of SELECTs it is handling with a matching increase in the number of rows traversed and IO.
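Those graphs are drawn from the standard status counters, so even if you collect nothing else, a cron job that regularly stores the output of something like the following gives you a baseline to compare against later (or use mysqladmin -r -i 10 extended-status to watch per-interval deltas live):

```
-- Sample twice, some seconds apart, and diff the values to get per-second rates
SHOW GLOBAL STATUS WHERE Variable_name IN
 ('Questions', 'Com_select', 'Com_insert', 'Com_update', 'Com_delete',
  'Innodb_rows_read', 'Innodb_data_reads', 'Innodb_data_writes');
```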

Sometimes, though, indicators may point in different directions, in which case both explanations are possible. For example, a lower number of SELECT queries combined with a higher number of rows traversed per second may mean either that some queries in your normal workload changed their execution plans for the worse, or that the workload has changed and you’re getting a lot more complicated queries.
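To take made-up numbers: if your baseline is around 5,000 SELECTs per second reading 500,000 rows per second (roughly 100 rows per SELECT), and during the incident you see 3,000 SELECTs per second reading 30,000,000 rows per second (roughly 10,000 rows per SELECT), the work per query has grown a hundredfold even though the query rate dropped, and the counters alone cannot tell you whether the plans changed or the query mix did.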

To really understand this you need to look at historical information on a per-query basis. For a long time you could do this with the query history feature of the Percona Toolkit tool pt-query-digest, though few people would go through the process of setting up regular log parsing to build such a history.
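A minimal sketch of that setup, assuming Percona Toolkit 2.2 or later (where the history feature is the --history option; earlier releases used --review-history) and a hypothetical reporting host, database and table, is a cron job along these lines:

```
# Parse the slow log and store per-query-class metrics in a history table
pt-query-digest --history h=report-host,D=percona,t=query_history \
    --create-history-table --no-report /var/lib/mysql/slow.log
```

Run against each slow log on a schedule, the history table accumulates per-interval response time, rows and lock statistics for every query class, which is exactly the history needed for this kind of comparison.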

A more user-friendly tool that can help you answer this question is the Percona Cloud Tools Query Analytics tool (currently in beta). It analyzes your queries and keeps regular workload summaries over time. As such you can take a look at the queries the server was running at the time of the incident and compare them to the historical data. Are you running more queries or fewer? Did some queries become a lot more frequent? Did some start taking more time? You can find exact answers to all of those questions. If you find queries that have been taking longer than before, you will also be able to see where the increase in response time comes from: examining a lot more rows than usual (often pointing to changed plans), locking, or increased disk IO.
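If you are not using a hosted tool, a rough way to get at the same per-query-class numbers (without the history dimension, which is the whole point here) is the statement digest table in performance_schema, available in MySQL 5.6 and later:

```
-- Which query classes are doing the most work right now? (timers are in picoseconds)
SELECT DIGEST_TEXT,
       COUNT_STAR          AS execs,
       SUM_TIMER_WAIT/1e12 AS total_sec,
       SUM_ROWS_EXAMINED   AS rows_examined,
       SUM_LOCK_TIME/1e12  AS lock_sec
  FROM performance_schema.events_statements_summary_by_digest
 ORDER BY SUM_TIMER_WAIT DESC
 LIMIT 10;
```

Snapshot it periodically and diff the snapshots, and you can answer the same questions: which query classes became more frequent, which started examining more rows, and which started waiting on locks.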

Summary: Whatever tool you use, make sure you find out which case of MySQL overload you are really dealing with, as it helps you focus your efforts and get the best result with the least effort possible. And just because MySQL got more load and that caused the overload, and possibly downtime, it does not mean you should focus only on reducing the workload back to its previous level. Often enough an increase in work is a wonderful thing, coming from increasing service popularity, and in that case you need to focus on making MySQL handle the larger workload better. Even then it is much better to know that nothing “broke” compared to yesterday, but rather that there is more optimization work which needs to be done.