In this post, I’ll cover the new MySQL monitoring plugins we created for Nagios, and explain their features and intended purpose.

I want to add a little context. What problem were we trying to solve with these plugins? Why yet another set of MySQL monitoring plugins?

The typical problem with Nagios monitoring (and indeed with most other monitoring systems) is the alert spam. I’ve spoken to a lot of users with monitoring systems over the years, and everyone says that they get a lot of false alarms, and that they have email filters set up for alerts. Recently I spoke to one person who gets about thirty thousand alerts per week (!!!) in his email. I’m sure you can see that this makes the monitoring system much less effective. The usual cause of such noisy monitoring is when you alert on things that don’t really indicate problems.

To avoid this problem as much as possible, I suggest the following:

  1. Alert only on things that reliably indicate a real problem. Do not alert on a threshold that merely indicates something is abnormal; abnormal things happen constantly. It’s useful to spend time investigating abnormalities, but in general, you should try to make sure that doesn’t creep into your fault-detection alerts.
  2. Alert only on things that are actionable. If you get an alert and you can’t do anything about it, that’s a certain recipe for a /dev/null email filter.
  3. Avoid ratios and thresholds as much as possible, because they are rarely actionable or reliable indicators of problems, and it’s difficult or impossible to find the right threshold for all situations. This suggestion is a consequence of the previous two, but it’s worth calling out separately in my opinion. I’ll highlight a couple of threshold-based alerts later in this blog post, and explain the need to be careful with them.

Although we’re not finished with the plugins, we have a usable starting point, and we’ll build on it as time passes. The plugins we created are based on our analysis of several hundred customer downtime incidents, and their causes and preventions. You can read more about that research in our white papers, here and here.

If you read those white papers, you’ll see a pretty extensive list of suggestions for what to monitor in MySQL. Those are only suggestions; any given deployment might want to monitor only some of them, as appropriate, and might need to include others as well. Here’s how you can use our new monitoring plugins to check for some of the conditions that caused downtime for our customers:

  • Failed or full LVM snapshots. LVM backup strategies can fail to delete snapshots after finishing, sometimes due to the backup script failing before completion. Additionally, LVM snapshots that don’t have enough copy-on-write space can fail before the backup completes. The pmp-check-lvm-snapshots plugin will alert when a snapshot is failed or overly full.
  • Too many deadlocks. Deadlocks (and lock wait timeouts, covered in another plugin) are frequent causes of serious errors in applications because the applications often don’t handle deadlocks correctly, and in many cases also don’t log the resulting errors. The best place to fix these problems is in the application, but the DBA really needs to be monitoring it as well, because an application error can creep in easily and be deployed, and you will not find it anywhere but in production. The pmp-check-mysql-deadlocks plugin can help you detect such problems, in conjunction with Percona Toolkit’s pt-deadlock-logger tool.
  • Open but deleted filehandles. It’s quite easy for a bad logrotate script (such as the one that ships by default with many server installations) or a system administration mistake to delete one of MySQL’s log or data files from disk. The database server will continue to function, however, because the file isn’t really deleted. The MySQL server keeps an open filehandle to the file, and it won’t be deleted until the server restarts, so the problem remains undetected while it becomes more and more severe. This can cause two serious types of problems: the disk can fill up with no apparent cause or solution, or you can restart the server and lose your data permanently. The pmp-check-mysql-deleted-files plugin will alert when files have been deleted but remain open via a filehandle.
  • Wrong filesystem privileges. It’s very easy for a file to creep into MySQL’s data directory with wrong permissions, without causing problems until a later date. For example, if a database isn’t owned by MySQL, you won’t have any problems until you try to create or drop a table, or the database itself. Likewise, if a system administrator leaves a temporary file in the database directory and then you try to drop the database, it will fail. These things tend to happen at inconvenient times. The pmp-check-mysql-file-privs plugin checks whether MySQL owns every file and directory within its data directory.
  • InnoDB is getting stuck, has long-running transactions, or other problems. The pmp-check-mysql-innodb plugin checks for several indicators of severe internal problems within InnoDB, such as contention that crosses a tipping point and cripples the server. Sometimes such conditions are like credit card debt: easy to start, impossible to dig out afterwards. Another very common cause of serious incidents is application misbehavior, such as opening a transaction but never committing it, which creates a lot of bloat or locking in the database server. This plugin can alert on that and a couple other types of common problems, too.
  • Missing PID file. A variety of problems can delete MySQL’s PID file, such as a buggy init script that attempts to start the server when it’s already running, or executing the server by hand instead of through the init script. Overly zealous cron jobs to delete old files in /tmp can do the same thing. If the PID file is missing, then many init scripts will behave very badly. Sample problems include starting two instances of MySQL on the same files in the data directory, and refusing to shut down MySQL so that it is killed forcibly (i.e. crashed) instead of gracefully shutting down for a system reboot. The pmp-check-mysql-pidfile plugin verifies that MySQL’s PID file exists.
  • Evidence of contention in the processlist. The pmp-check-mysql-processlist plugin checks SHOW FULL PROCESSLIST for patterns that indicate similar problems to those discussed within InnoDB, but it can find evidence of problems in the server as well as the storage engine. For example, a pile-up of connections all trying to open or close tables can indicate LOCK_open contention.
  • Replication is delayed or stopped. Our plugins are structured to monitor as few things as possible, but as many as needed. Replication is one thing that needs to be monitored in two dimensions, which shouldn’t be combined: delay (Seconds_behind_master or the delay indicated by the pt-heartbeat tool, if you choose), and actual failure. The pmp-check-mysql-replication-delay plugin checks for excessive delay, and pmp-check-mysql-replication-running checks that replication is functioning correctly.
  • Replicas have different data than their master. In addition to delay or failure, replication can fail silently and data can diverge or drift on a replica. The pt-table-checksum tool from Percona Toolkit can generate table checksums, and the pmp-check-pt-table-checksum plugin can alert when those checksums indicate that the replica is not logically identical to its master.
  • The server is allocating too much memory. A runaway server process, or poorly optimized queries, can consume too much memory and cause the server to swap until it is unresponsive and has to be power cycled. In addition, other processes can be runaways, too, such as a poorly written script or daemon; it is not unheard of for a monitoring agent or RAID utility software to do this. The pmp-check-unix-memory plugin checks for a single process consuming too much memory and/or for too little free memory in general. This plugin requires careful configuration because it uses a threshold for which there is no good one-size-fits-all value. This plugin, perhaps more than any other, should only be used judiciously.
  • The server has been restarted recently. It is surprising how often a server crashes, perhaps even cycling through crashes, and is not detected until sometime later when the problem has grown much more serious. You can use the pmp-check-mysql-status plugin to alert when the server’s Uptime variable is too small.
  • The server is approaching max_connections. Overflowing the max_connections setting is a bit like a lack of disk space: it is possible for it to be slowly approached over time, and then to begin causing problems when connections are rejected. Similarly to deadlocks and timeouts, applications often don’t handle or log this error appropriately. The pmp-check-mysql-status plugin can be configured to alert when Threads_connected approaches max_connections too closely, which can give the system administrator time to address the situation before it becomes a problem. This check requires careful configuration, or it too can become a noise generator. However, if your application environment makes it difficult to handle max_connections problems (the use of fixed-size connection pooling, for example, which can sometimes require application downtime to resolve), then it might be worth considering to let you deal with the problem when it’s convenient instead of when it’s urgent.

In addition to the checks in that list, there are several others that can be performed with existing plugins, so there was no need to write new ones. For example, the plugins that ship with Nagios are suitable for checking that the server is alive and responds to a query.

The pmp-check-mysql-status is also very flexible and can be used to trigger an alert on almost anything from SHOW VARIABLES and SHOW STATUS that you want. If you want to monitor ratios and rates, for example, you can certainly do so; it also supports an incremental mode of operation, so you can (for example) get the incremental increase in a variable such as Queries and alert on a queries-per-second threshold. (I don’t suggest doing this; I’m only listing an example of what you can do. The documentation has several other examples.)

In future releases we will include more plugins to monitor more types of problems, but the current plugins already cover some of the most common and severe conditions that we’ve observed to cause actual production downtime. Please download and try out the plugins, and help us improve them. If you are a customer and you need help configuring any of the plugins, we are happy to assist. In addition, please ask our sales staff to include a free monitoring setup in a new support contract. I believe they will be happy to negotiate this with you.

12 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Edmar

Absolutely great, plugins and explanation and reasoning. Thanks!

Sheeri Cabral

The explanation is great, and I am glad you clarified what the plugins do (although it is pretty clear if folks looked at the plugins set), because they do things very differently from Shlomi’s mycheckpoint and the PalominoDB monitoring tool.

However, I do have to say that your experience greatly differs from mine, as you say “Avoid ratios and thresholds as much as possible, because they are rarely actionable or reliable indicators of problems.” Percona does consulting for particular types of clients, and that experience – while EXTREMELY valuable – does not account for the majority of problems.

It’s a subtle difference, but saying this would be much better: “Be careful of ratios and thresholds, if you do alert on them make sure they are actionable and reliable indicators of problems.” Otherwise people end up turning your sentence into “Avoid the ratios and thresholds” and they don’t actually think critically about what they are doing.

I have, in fact, found that some ratios and thresholds are absolutely reliable indicators of problems, and many are actionable. Sometimes they are indicative of interim or recurring problems, and while temporary in nature, it is useful to know that issues are cropping up again.

In addition, threshold/ratio checks can be used for data gathering – just because you can alert with Nagios does not mean you have to. Many many times I have put a Nagios check in place for a ratio or threshold (say, long-running query) just to get some data (for example, how many times per day a query runs for more than 10 minutes on a DW server). The checks themselves might be at a different interval than the standard 5-minute check. I love the PnP4Nagios plugin, it makes cacti-style graphs out of Nagios performance data.

In many scenarios, this kind of data helps build a case for getting developers to do something at a higher priority, or getting more budget. So I think more emphasis should be on “if you do choose to monitor ratios/thresholds, think about what purpose the monitoring serves” – and that goes for whether the monitoring is alerts or graphing.

Rodalpho

Generally speaking, when most people talk about monitoring, they’re referring to alerts coming in for something going wrong. What Sheeri is talking about is usually referred to as reporting. Nagios data can be used for reporting, but it’s best suited for monitoring.

MySQL reporting is fairly limited, because enabling the performance schema incurs such a significant performance impact. Hopefully that will improve over time. I would absolutely LOVE to leave the performance schema enabled everywhere, but I can’t justify that until the impact is under 5%.

MySQL alerting isn’t great either compared to something like Oracle with interval views for /everything/, but capabilities improved immensely as the information schema expanded over the years.

I work at a MSP doing this sort of thing for a living and Baron’s post here is dead-on. “All alerts must be actionable” should be inscribed in stone tablets on a mountaintop. Ignore at your peril.

Claudio Nanni

Another milestone in MySQL history!
And we actually were building our Nagios checks, thanks Baron!

Claudio Nanni

I will be more than happy to share my ideas, experience and scripts

pickup.lichun

Can I get alert when there are 20 threads which running over 5 second using percona monitor plugin. We have to get alert if there are concurrency problem.

mandm

HI baron, Thanks for a great post,
i was trying to implement a pmp-check-mysql-status plugin for server uptime is small, as it will notify us when the server crashes

and i try these parameters
-x Uptime -c 5
and my output is
CRIT Uptime = 4951 | Uptime=4951;;5;0;

so how can i change it to alert only when the output is small as you mentioned in the post?

mandm

ok nevermind got my answer

-x Uptime -C ‘<' -c 10

Eric

Hi Baron, I was hoping you could comment on Rodalpho’s note re performance impact. Reading the docs currently the claim is that it is “small”, but some more detailed measurement would be really helpful.