Among the many things that can cause a “server stall” is a long-running transaction. If a transaction remains open for a very long time without committing, and has modified data, then other transactions could block and fail with a lock wait timeout. The problem is, it can be very difficult to find the offending code so that it can be fixed. I see this much too often, and have developed a favorite technique for tracking down what that long-running transaction is doing.

Of course, in some cases it’s actually easy to figure out what the long-running transaction is doing. The most obvious is if it’s a long-running query. If that’s the case, then you’ll see the query in the processlist, and you can track down where it’s coming from in the source code. The problem comes when the transaction remains open, but either it isn’t running queries anymore, or it runs such fast queries that you can’t capture them in the processlist.

The steps I use to debug this problem are as follows. First, I identify which transaction is blocking others. Next, I find out the TCP port for that connection. I then sniff TCP traffic for a while, use mk-query-digest to convert this into a query log, and examine the log.

In MySQL 5.1 and newer with the InnoDB plugin, it’s easy to figure out which transaction is blocking others. There are some INFORMATION_SCHEMA tables that make this simple. Just query the tables, and if something is blocked, find its thread ID. Match this up with the ID in SHOW PROCESSLIST, and look in the Host column; there you’ll see a hostname followed by a port number. That is the port number of the TCP connection that’s to blame for causing the trouble.

If you’re in pre-5.1 or non-plugin InnoDB versions, you can’t reliably figure out what connection is to blame for blocking others. You can guess, and sometimes that’s good enough. Just pull the oldest transaction that holds any locks from SHOW INNODB STATUS and assume that it’s the culprit. This doesn’t always work, and I hate things that sometimes work. They teach you to assume that it’ll always work, which bites you at the worst times. But I digress.

Here is a small script that I use quite a bit. I simply start a screen session and run it in a loop with a 30-second sleep in between. Notice that it self-purges its own data, and it won’t run if the disk is getting full. I’ve learned the hard way a long time ago never to run a script that will let data accumulate or fill up the disk.

If this captures anything, then you can simply run “mk-query-digest –type=tcpdump –no-report –print” on the resulting tcpdump file (after inspecting the captured SHOW INNODB STATUS and SHOW PROCESSLIST to make sure you haven’t caught a false-positive). I also like to use the –timeline feature in mk-query-digest, which prints out a sequence of queries in an easy-to-view format, rather than aggregating the worst queries together.

This has worked for me a few times. Alas, there are some cases where it won’t work. If the offending transaction is idle, and not continuing to execute more queries, then it won’t work. In addition, I will admit that I ran into a case I couldn’t solve even with full server logging. There was a transaction that was blocking others and had locks and undo log records, yet had only run SELECT queries since its inception. That case is still open and I am still stumped as to how a SELECT-only transaction can possibly have undo log records!

The greater moral of this story is — InnoDB is designed for lots of short, small transactions. Don’t design batch jobs that run in loops and update data, unless you make them commit between iterations. And watch your server to find idle-in-transaction connections that are just sitting there doing nothing, and kill them if you can’t make the developers fix them 🙂 I’ve seen dozens of instances of those two types of problems bringing everything on the system to a grinding “lock wait timeout” halt.

6 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Peter Boros

I am using almost the same thing, but in a stored procedure, and I save the results to a table. If it would be excessive, most likely the lock is easy to spot anyway:). This approach misses one feature: you can’t tell what sql executed on a given thread id last time. If you could, the “many weird sleeping threads” problems would be so much easier to diagnose.

Pavel Shevaev

Baron, as for InnoDB being designed for lots of short, small transactions…. what about save points? Do they help in this case? I mean if I have a big fat transaction and if I use save points inside of it periodically, will it make InnoDB happier?

Aaron Brown

As always, your timing is impeccable, Baron :)

Markus Hall

Could also be worth checking your disk space. You get a server stall if the file system holding your database is full, which I don’t think (though not sure) you can tell by looking at the info tables alone.