August 1, 2014

How Percona diagnoses MySQL server stalls

We receive many requests for help with server stalls. They come under various names: lockup, freeze, sudden slowdown. When something happens only once or twice a day, it can be difficult to catch it in action. Unfortunately, this often leads to trial-and-error approaches, which can drag on for days (or even months), and cause a lot of harm due to the “error” part of “trial-and-error.” At Percona we have become skilled at diagnosing these types of problems, and we can solve many of them quickly and conclusively with no guesswork. The key is to use a logical approach and good tools.

The process is straightforward:

  1. Determine what conditions are observably abnormal when the problem occurs.
  2. Gather diagnostic data when the conditions occur.
  3. Analyze the diagnostic data. The answer will usually be obvious.

Step 1 is usually pretty simple, but it’s the most important to get right. I estimate that about 80% of the cases I work on show up as abnormally high values of Threads_connected or Threads_running. That’s a good place to start for finding a symptom that can trigger the problem. If you aren’t sure what “normal” looks like, and therefore what’s abnormal, then gather samples of things you can observe (such as SHOW STATUS snapshots) and establish both baseline behavior and deviations from it.
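
For example, here is a minimal sketch of a baseline-gathering loop (assuming the mysql client can authenticate without prompting, e.g. via a .my.cnf file); the interval and the variables you watch are up to you:

    # Sample a few SHOW GLOBAL STATUS counters every 30 seconds so you can
    # see what "normal" looks like before you define an "abnormal" trigger.
    while true; do
        { date; mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_%'"; } >> status-baseline.log
        sleep 30
    done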

The hard part used to be step 2. However, over time the Percona team members have built a good set of tools for doing this on Linux. It is no harder in principle on other OSes, but we have so many customers on Linux that we end up scripting mostly Linux-centric things, so a little tweaking to the tools might be needed on other platforms. The rest of this blog post is about how to set up and use the tools.

The tools are all part of the Aspersa toolkit. There are three primary tools for troubleshooting intermittent problems: stalk, collect, and sift. As the names suggest, they are used to watch the server, gather diagnostic data when the condition occurs, and help you look through it quickly (there’s a lot of it). You can get the tools easily. The best thing to do is put them into root’s home directory; they will gather the most data possible if they’re run as root. Running under sudo is an option too. Assuming root, here’s how to do it:
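
(A sketch of that step follows; treat the download URL as a placeholder for wherever the Aspersa project is hosted at the time you read this.)

    # Fetch the three tools into root's home directory and make them executable.
    # The URL is a placeholder; substitute the Aspersa project's real download location.
    mkdir -p /root/bin && cd /root/bin
    for tool in stalk collect sift; do
        wget "http://example.com/aspersa/$tool" -O "$tool"
        chmod +x "$tool"
    done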

Now you need to edit the stalk file and tell it your definition of “abnormal” — by default it has some vanilla settings that aren’t useful. (There is no such thing as a good default, sorry.) Let’s assume that your server normally has only a few Threads_running, and when the problem happens, it spikes up into the range of 20 or higher. A safe trigger might be 15, if your server never reaches 15 under normal conditions. The more carefully you determine normal versus abnormal, the higher quality your results will be. Find and change the following lines:
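
(In the copy I'm describing, the defaults look roughly like the following; the exact variable names may differ slightly in your version, so check the comments at the top of the script.)

    # Default trigger settings inside the stalk script (approximate):
    THRESHOLD=100                # fire when the watched variable exceeds this value
    VARIABLE=Threads_connected   # which SHOW GLOBAL STATUS variable to watch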

You need to change the threshold to 15, and the variable to Threads_running. Check the other settings, but this is often all you need. Finally, make sure that the script can log into MySQL (if necessary, put a password into .my.cnf, or in the MYSQLOPTIONS variable). Now you can simply start running bin/stalk in a screen session. It shouldn’t do anything but print a timestamp line every 30 seconds. Come back later and see what it found, or set the EMAIL setting to notify you.
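
Putting that together, the edited settings and the startup step might look something like this (the EMAIL address and MYSQLOPTIONS values are just illustrations):

    # Inside the stalk script: collect data when Threads_running exceeds 15.
    THRESHOLD=15
    VARIABLE=Threads_running
    EMAIL=you@example.com          # optional: notification address (illustrative value)
    MYSQLOPTIONS="--user=root"     # credentials, if not supplied via /root/.my.cnf

    # Then, at the shell: start the watcher in a detached screen session.
    # Reattach with 'screen -r stalk' to see the timestamp it prints every 30 seconds.
    screen -dmS stalk /root/bin/stalk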

What’s so special about the collect tool? Isn’t [insert your favorite OS's awesome tool that Linux doesn't have] better? What’s nice about collect is that it gathers many different types of data, including different angles on the same data. No single tool, no matter how powerful, is an answer for every problem. For example, the collect tool makes it easy to gather snapshots of waiting (stack traces) and CPU consumption (oprofile). That’s important because sometimes the problem is blocking and waiting to do work, and sometimes it is doing too much work without blocking. Likewise, it gathers lots of I/O data, because sometimes the problem is that the disks are not responding within reasonable and expected tolerances, and sometimes the disks are responding very well but the server is abusing them and overloading them. And sometimes, of course, the CPUs and disks have nothing to do with anything, which is why the collect tool also gathers information about network connections, and so on. If you assume a particular type of trouble in advance, you’ll be wrong sometimes, and it’s back to trial-and-error. You need to gather all the data, and then look at it and see which bits are diagnostic and which bits show nothing relevant.
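
To make that concrete, here is a rough, hand-rolled approximation of a single sample of the kinds of data collect gathers automatically (assuming gdb, iostat, and netstat are installed, and keeping in mind that attaching gdb pauses mysqld briefly):

    # One manual "sample"; collect does this (and much more) repeatedly when triggered.
    pid=$(pidof mysqld)
    gdb -p "$pid" -batch -ex "thread apply all bt" > stacks.txt   # where is everything waiting?
    iostat -dxk 1 10 > iostat.txt &                               # are the disks slow, or just overloaded?
    mysql -e "SHOW ENGINE INNODB STATUS\G" > innodb-status.txt
    mysql -e "SHOW FULL PROCESSLIST\G" > processlist.txt
    netstat -antp > netstat.txt                                   # network connections
    wait                                                          # let the iostat run finish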

After you’ve gathered some data, you need to look through it. It is stored in /root/collected/ by default. Use the “sift” tool to examine it. I’ll demo on my laptop:
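
The invocation is roughly this (I'm pointing it at the default directory; the exact argument form may vary between versions):

    /root/bin/sift /root/collected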

When sift starts, it prompts you to choose which sample of data to analyze. In my case there are 7 samples, and it chooses the most recent by default, so you can just hit the Return key and start browsing the files. The tool is keystroke-driven, Vim-style, so you can navigate the samples with the ‘j’ and ‘k’ keys. I won’t repeat the official documentation here. Give it a try and see how it works; press the ‘?’ key for a list of the keystroke commands you can use. If you find that one of the samples looks suspicious, you can drill into it with a few of the other Aspersa tools, or ask sift to start ‘less’ on a sample’s files so you can look at them. You can analyze the files without sift, but it saves a lot of typing, especially when you have dozens or hundreds of samples to examine.

The sift tool isn’t the final step in analysis. By default it just shows you a few of the highest-level summary tidbits of information, and you often really need to drill into a particular sample’s files and analyze it to see what’s going on. Sometimes it’s glaringly obvious even at a high level, but not always. Most of the time the problem is clear upon a closer inspection, though.

If the answer is not obvious, then maybe you have a false positive and didn’t really catch the problem in action; or collect isn’t gathering the information needed to diagnose (pretty unlikely — it gathers a ton of stuff); or you don’t know how to interpret what you have found. Within Percona, the best course of action is often to ask a more senior person to help. I ask people such as Peter, Vadim, and Yasufumi for help all the time. If you’re not a Percona support/consulting staff member, then of course you can hire us.

If I get a few spare hours, I will put together some screencasts illustrating how to use these tools. I think that’ll be an easier and more fun way to learn them. I have some interesting cases that I can use as demos, after I scrub them for sensitive details. When I finish these, I’ll post them on Percona.TV, our MySQL video/screencast blog site.

Another interesting topic is how to examine a system that appears to be fine and determine whether it is experiencing any very short stalls that you are not aware of. There are many things that can cause a stall in MySQL, and they usually begin microscopically and get worse over time, sometimes abruptly worse. I’ll try to write more about this as time permits, too.

About Baron Schwartz

Baron is the lead author of High Performance MySQL.
He is a former Percona employee.

Comments

  1. I think applying some more math to MySQL (et al.) status data would be a really interesting field, to try to foresee the trend of a system. It’s quite complex, but worth at least thinking about.
    I hope you find some spare time!

  2. Bill G. says:

    We had MySQL lockups on a server with 96GB of RAM. We had allocated an 8GB query cache, and that was a problem. Every two or three hours MySQL would lock up for 20 minutes. Our server had a high transaction volume, and this affected all tables, InnoDB and MyISAM. Disabling the query cache eliminated the problem. This was with the stock MySQL version for CentOS 5.5, and with MariaDB 5.2.

    I hope this helps others with MySQL lockups: try disabling your query cache and see if the problem goes away!

  3. I have a very interesting story about MySQL deadlocks.
    I will share here only the results of a long, long investigation.
    The ‘deadly’ combination: the MyISAM mutex on the key cache, the FreeBSD scheduler, and multiple cores (on different CPUs!) (AMD).
    MySQL would go into a deadlock (once a week), collapsing the whole application cluster:
    two Apache webservers (mod_php) and itself.
    At the moment of the deadlock, the most evident OS parameter was CPU context switches,
    which went from 10-20k to 300-400k, with hundreds of processes waiting in the queue (load reported in the hundreds).
    After many tests, the temporary solution in the end was to shut down 14 out of 16 cores on the system.
    The server with only 2 cores active was doing much better than with all 16 cores.
    IMPORTANT: note that the best combination of cores turned out to be 2 cores in the SAME CPU. I imagined the best would be 4 cores, 1 per CPU. WRONG: using cores on different CPUs is BAD, the context switches are much higher.
    Also, creating one key cache for each high-concurrency table and preloading the indexes helped lower the contention.
    Moving to Linux (CentOS) allowed us to go back to 16 cores.

  4. Bill G, you advised people to use trial-and-error after reading an article that shows an easy way to measure what’s wrong. What if disabling the query cache causes a huge performance problem? There are many other equally likely causes; tweaking the query cache at random is a terrible idea. Measure, don’t guess.

    Claudio, that is very interesting — it sounds like a bug, not just an expected performance problem.

  5. Baron, then it is a FreeBSD bug! All MySQL processes were stuck in ‘umtxn’ status. I did many tests, enabling different combinations of cores in the kernel and rebooting the server. With FreeBSD 7.0, the only way to avoid the deadlock that had been affecting the company for more than a year was to keep 2 cores enabled, in the same CPU. Afterwards, migrating to CentOS, we could use all 16 cores. I also asked for a FreeBSD 8 machine to test its new scheduler, but it did not really help (minimal improvement). I really believe the combination of the FreeBSD scheduler + MySQL mutexes + multiple cores (+ crappy PHP code := loop&query instead of WHERE IN) was explosive; I lost many nights of sleep because of it.
    Also, splitting the key cache was a great thing: thanks to your (bless you) mk-query-digest, I took the most used/concurrent tables and assigned each a dedicated key cache, preloading it at startup. (Benchmarks I did: http://caligula.mysql.name/core_tests.txt)
    As for the CPU question, the cost of switching between CPUs is probably higher than switching between cores in the same CPU.

  6. Definitely sounds like a FreeBSD bug. I’ll remember this for the future, but I haven’t seen it yet. The “umtxn” state seems like a frequent problem; it seems that FreeBSD SMP support still isn’t quite mature yet.
