The Percona Managed Services team recently faced a somewhat peculiar client issue. We’d receive pages about their MySQL service being unreachable. However, studying the logs showed nothing out of the ordinary…. for the most part it appeared to be a normal shutdown and there was nothing in anyone’s command history nor a cron task to speak of that was suspicious.

This is one of those obscure and peculiar (read: unique) issues that triggered an old memory; I’d seen this behavior before and I had just the tool to catch the culprit in the act.

Systemtap made diagnostics of this issue possible and I can’t state enough how much of a powerful and often under-utilized tool set systemtap really is.

cat > signals.stp << EOF
probe signal.send {
if (sig_name == “SIGKILL” || sig_name == “SIGTERM”)
printf(“[%s] %s was sent to %s (pid:%d) by %s uid:%dn”,
ctime(gettimeofday_s()), sig_name, pid_name, sig_pid, execname(), uid())
}
EOF

sudo stap ./signals.stp > signals.log 2>signals.err

grep mysqld signals.log
[Wed Jun 11 19:03:23 2014] SIGKILL was sent to mysqld (pid:8707) by cfagent uid:0
[Fri Jun 13 21:37:27 2014] SIGKILL was sent to mysqld (pid:6583) by cfagent uid:0
[Sun Jun 15 05:05:34 2014] SIGKILL was sent to mysqld (pid:19818) by cfagent uid:0
[Wed Jul 9 07:03:47 2014] SIGKILL was sent to mysqld (pid:4802) by cfagent uid:0

Addendum: It had been so long since I had used this tooling that I could not remember the original source from which I derived the module above; some cursory searching to rectify this issue for this blog post found this original source by Eugene Teo of Red Hat made available under GPLv2.

From this we were able to show that cfagent was killing the mysqld process presumably via a misconfigured job; this information was returned to the client and this has continued to be run in production for two months now at the client’s request with no issues to speak of.

This is by no means the limit to what systemtap can be used to achieve; you can hook into functions though whilst you may need to install the debug packages to find what functions are available run for example:

sudo stap -L 'process("/usr/sbin/mysqld").function("*")' > /tmp/mysql_stapfunc
...
head /tmp/mysql_stapfunc
process("/usr/sbin/mysqld").function("TIME_from_longlong_date_packed")
process("/usr/sbin/mysqld").function("TIME_from_longlong_datetime_packed")
process("/usr/sbin/mysqld").function("TIME_from_longlong_time_packed")
process("/usr/sbin/mysqld").function("TIME_set_hhmmss")
process("/usr/sbin/mysqld").function("TIME_set_yymmdd")
process("/usr/sbin/mysqld").function("TIME_to_longlong_date_packed")
process("/usr/sbin/mysqld").function("TIME_to_longlong_datetime_packed")
process("/usr/sbin/mysqld").function("TIME_to_longlong_packed")
process("/usr/sbin/mysqld").function("TIME_to_longlong_time_packed")
process("/usr/sbin/mysqld").function("TIME_to_ulonglong")
...

This is also true of the kernel using sudo stap -L 'kernel.function("*")' > /tmp/kernel_stapfunc however you must be booted into a debug kernel for this to function.

Systemtap is more than a worthy tool to have at your disposal with plenty of examples available.

Finally I invite you to join me July 23 at 10 a.m. Pacific time for my webinar, “

2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
nate

Last time I used cfengine (I miss it, currently using chef and am not happy with it), it would email me if the output of a given run differed from the previous run. It would of been easy as hell to spot this specific issue the first time it happened, the logs above point to this not happening every time cfagent ran, usually runs once an hour or so.

Chef by contrast just sort of likes to fail silently.

Daniël van Eeden

Systemtap is great, but I’ve once had an issue where it blocked all systemcalls which MySQL tried to make. That was a kernel bug. So use it with caution.