The Percona Monitoring Plugins (PMP) provide some free tools to make it easier to monitor PXC/Galera nodes.  Monitoring broadly falls into two categories: alerting and historical graphing, and the plugins support Nagios and Cacti, respectively, for those purposes.

Graphing

An update to the PMP this summer (thanks to our Remote DBA team for supporting this!) added a Galera-specific Cacti host template that graphs a variety of Galera-related stats, including:

  • Replication traffic, transaction counts, and average transaction size
  • Inbound and outbound (Send and Recv) queue sizes
  • Parallelization efficiency
  • Write conflicts (Local Cert Failures and Brute Force Aborts)
  • Cluster size
  • Flow control

You can see examples and descriptions of all the graphs in the manual.

Alerting

There is no Galera-specific Nagios plugin in the PMP yet, but there is a check called pmp-check-mysql-status that can check pretty much any status variable you like.  We can easily adapt it to check some key, action-worthy Galera stats, but I hadn’t worked out the details until a customer requested it recently.

Checking for a Primary Cluster

Technically, this is the state of the cluster (or cluster partition) that the queried node belongs to.  However, any single node could be disconnected from the rest of the cluster, so checking this on each node should be fine.  We can verify it with this check:
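(Sketched with pmp-check-mysql-status in its string-comparison mode; connection options are omitted here and in the checks that follow.)

pmp-check-mysql-status -x wsrep_cluster_status -C '==' -T str -c non-Primary

This goes critical if wsrep_cluster_status reports non-Primary, i.e., if the node is not part of a Primary cluster component.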

Local Node State

We also want to verify the given node is ‘Synced’ into the cluster and not in some other state:
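(Again a string comparison, this time against wsrep_local_state_comment.)

pmp-check-mysql-status -x wsrep_local_state_comment -C '!=' -T str -w Synced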

Note that we only warn when the state is not Synced, because it is perfectly valid for a node to be in the Donor/Desynced state.  This warning can alert us to a node in a less-than-ideal state without screaming about it, but you could certainly go critical instead.

Verify the Cluster Size

This is a bit of a sanity check: we want to know how many nodes are in the cluster, and either warn if we’re down a single node or go critical if we’re down more.  For a three-node cluster, your check might look like this:
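(One possible form: a numeric check on wsrep_cluster_size with a '<=' comparison, so the -w and -c values act as floors rather than ceilings.)

pmp-check-mysql-status -x wsrep_cluster_size -C '<=' -w 2 -c 1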

This is OK when we have 3 nodes, warns at 2 nodes, and goes critical at 1 node (when we have no redundancy left).  You could certainly adjust the thresholds depending on your normal cluster size.  This check is likely meaningless unless we’re also in a Primary cluster, so you could set a service dependency on the Primary Cluster check here.

Check for Flow Control

Flow control is really something to keep an eye on in your cluster. We can monitor the recent state of flow control like this:
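(Assuming we key on wsrep_flow_control_paused, the fraction of time since the last FLUSH STATUS that the node has spent paused by flow control.)

pmp-check-mysql-status -x wsrep_flow_control_paused -w 0.1 -c 0.9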

This warns when flow control exceeds 10% and goes critical above 90%.  These thresholds may need some fine-tuning, but as a general principle, a small amount of flow control can be normal; what you want to know is when it starts to become excessive.

Conclusion

Alerting with Nagios and graphing with Cacti tend to work best with per-host checks and graphs, but there are aspects of a PXC cluster that you may want to monitor from a cluster-wide perspective.  However, most of the things that can “go wrong” are easily detectable with per-host checks, so you can get by without a custom Galera-aware script.

I’d also always recommend what I call a “service check” that connects through your VIP or load balancer to ensure that MySQL is available (regardless of underlying cluster state) and can do a query.  As long as that works (proving there is at least 1 Primary cluster node), you can likely sleep through any other cluster event.  🙂
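A minimal version of that service check, using the stock Nagios check_mysql_query plugin rather than anything from the PMP (the host, user, and password below are placeholders), could be:

check_mysql_query -H cluster-vip.example.com -u nagios -p 'secret' -q 'SELECT 1'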

Comments
AdriannaY

Hello,

I have a problem with the -T option when I run this command:

[root@mypriv-bd3 ~]# /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_local_state_status -C '!=' -T str -w Synced

I get this error: you specified -T but not -y. Try --help.

The same thing happens when I run: /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary

Is there a configuration setting or a particular change I need to make?

Thanks in advance.

AdriannaY

Hello Jay,

Thanks for your answer. I had percona-nagios-plugins-1.0.3-1.noarch.rpm on my machine; I installed the specified package and now everything works fine.

Adrianna

Rares Dumitrescu

Hi, this is a super necro, but:

/usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_local_state_comment -C '!=' -T str -w Synced
ERROR 1682 (HY000) at line 1: Native table 'performance_schema'.'global_variables' has the wrong structure
UNK could not get MySQL status/variables.

This is happening on a Percona 5.7 installation. mysql_upgrade has been run. I’ve got nothing there.