The Percona Monitoring Plugins (PMP) provide some free tools to make it easier to monitor PXC/Galera nodes. Monitoring broadly falls into two categories: alerting and historical graphing, and the plugins support Nagios and Cacti, respectively, for those purposes.
On the graphing side, the Cacti graphs cover:
- Replication traffic and transaction counts and average trx size
- Inbound and outbound (Send and Recv) queue sizes
- Parallelization efficiency
- Write conflicts (Local Cert Failures and Brute Force Aborts)
- Cluster size
- Flow control
You can see examples and descriptions of all the graphs in the manual.
The PMP does not yet include a Galera-specific Nagios plugin, but it does include pmp-check-mysql-status, a check that can test nearly any status variable you like. It adapts easily to a few key, action-worthy Galera stats, though I hadn’t worked out the details until a customer requested it recently.
Checking for a Primary Cluster
Technically, wsrep_cluster_status describes the cluster, or the cluster partition, that the queried node belongs to. However, any single node could be disconnected from the rest of the cluster, so it makes sense to run this check on every node:
$ /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C '==' -T str -c non-Primary
OK wsrep_cluster_status (str) = Primary | wsrep_cluster_status=Primary;;non-Primary;0;
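The check above only goes critical on the literal value non-Primary. As far as I know, a node can report other unhealthy values too (Disconnected, for one), so a stricter rule is to treat anything other than Primary as critical. The sketch below mirrors that rule in plain shell so the logic is easy to see. Note that cluster_status_to_nagios is a hypothetical helper, not part of the PMP, and the exit codes assume the usual Nagios convention (0 for OK, 2 for CRITICAL).

```shell
# Hypothetical helper (not part of the PMP): map a wsrep_cluster_status
# value to a Nagios-style message and exit code.
# Assumption: anything other than "Primary" deserves a CRITICAL.
cluster_status_to_nagios() {
    case "$1" in
        Primary)
            echo "OK wsrep_cluster_status = $1"
            return 0
            ;;
        *)
            echo "CRITICAL wsrep_cluster_status = $1"
            return 2
            ;;
    esac
}
```

With the plugin itself, the same rule should be expressible using only the flags already shown: -C '!=' -T str -c Primary.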
Local Node State
We also want to verify the given node is ‘Synced’ into the cluster and not in some other state:
$ /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_local_state_comment -C '!=' -T str -w Synced
OK wsrep_local_state_comment (str) = Synced | wsrep_local_state_comment=Synced;;Synced;0;
Note that we only warn when the state is not Synced, because it is perfectly valid for a node to be in the Donor/Desynced state. This warning can alert us to a node in a less-than-ideal state without screaming about it, but you could certainly go critical instead.
Verify the Cluster Size
This is a bit of a sanity check, but we want to know how many nodes are in the cluster, and to warn if we’re down a single node or go critical if we’re down more. For a three-node cluster, your check might look like this:
$ /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_size -C '<=' -w 2 -c 1
OK wsrep_cluster_size = 3 | wsrep_cluster_size=3;2;1;0;
This is OK when we have 3 nodes, warns at 2 nodes and goes critical at 1 node (when we have no redundancy left). You could certainly adjust the thresholds depending on your normal cluster size. This check is likely meaningless unless we’re also in a Primary cluster, so you could set a service dependency on the Primary Cluster check here.
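If you run clusters of different sizes, the thresholds can be derived from the expected node count rather than hard-coded. Here is a minimal sketch, assuming the rule "warn when one node is missing, go critical when two or more are missing" and a cluster of at least three nodes. The helper name cluster_size_thresholds is hypothetical.

```shell
# Hypothetical helper: print -w/-c thresholds for a given expected cluster size.
# Assumptions: warn at (expected - 1) nodes, go critical at (expected - 2),
# and never let the critical threshold drop below 1 node.
cluster_size_thresholds() {
    local expected="$1"
    local warn=$((expected - 1))
    local crit=$((expected - 2))
    if [ "$crit" -lt 1 ]; then
        crit=1
    fi
    echo "-w $warn -c $crit"
}

# Prints the thresholds used in the three-node example: -w 2 -c 1
cluster_size_thresholds 3
```

The output can be spliced into the check, for example by appending $(cluster_size_thresholds 3) after -C '<=' on the command line.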
Check for Flow Control
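Flow control is really something to keep an eye on in your cluster. The wsrep_flow_control_paused variable reports the fraction of time, since the last FLUSH STATUS, that replication was paused by flow control, and we can monitor it like this:
$ /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_flow_control_paused -w 0.1 -c 0.9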
OK wsrep_flow_control_paused = 0.000000 | wsrep_flow_control_paused=0.000000;0.1;0.9;0;
This warns when the node has spent more than 10% of its time paused by flow control and goes critical above 90%. This may need some fine tuning, but as a general principle, a small amount of FC might be normal, and you want to know when it starts to become excessive.
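To make the threshold behaviour concrete, here is a small sketch that classifies sample values the way I understand the check to work. It is not the plugin's code: classify_flow_control is a hypothetical helper, and it assumes a value at or above a threshold trips it. awk is used because plain shell arithmetic cannot compare fractions.

```shell
# Hypothetical helper: classify a wsrep_flow_control_paused sample.
# Arguments: value, warning threshold (default 0.1), critical threshold (default 0.9).
# Assumption: a value at or above a threshold trips that threshold.
classify_flow_control() {
    local paused="$1"
    local warn="${2:-0.1}"
    local crit="${3:-0.9}"
    # "+ 0" forces numeric comparison; the awk exit code becomes the
    # function's return value (0 OK, 1 WARNING, 2 CRITICAL).
    awk -v p="$paused" -v w="$warn" -v c="$crit" 'BEGIN {
        if (p + 0 >= c + 0)      { print "CRITICAL"; exit 2 }
        else if (p + 0 >= w + 0) { print "WARNING";  exit 1 }
        else                     { print "OK";       exit 0 }
    }'
}
```

For example, classify_flow_control 0.25 reports WARNING with the default thresholds, while classify_flow_control 0.25 0.3 0.9 reports OK.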
Alerting with Nagios and graphing with Cacti tend to work best with per-host checks and graphs, even though there are aspects of a PXC cluster that you may want to monitor from a cluster-wide perspective. Fortunately, most of the things that can “go wrong” are easily detectable with per-host checks, so you can get by without a custom Galera-aware script.
I’d also always recommend what I call a “service check” that connects through your VIP or load balancer to ensure that MySQL is available (regardless of underlying cluster state) and can run a query. As long as that works (proving there is at least one Primary cluster node), you can likely sleep through any other cluster event. :)
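A service check along those lines could be as simple as the sketch below. Treat it as one possible shape rather than a finished plugin: check_mysql_service and the MYSQL_BIN override are hypothetical names, the 5-second timeout is arbitrary, and credentials are assumed to come from the monitoring user's client configuration (for example ~/.my.cnf).

```shell
# Hypothetical service check: run a trivial query through the VIP or load balancer.
# MYSQL_BIN is an assumption that lets the client be swapped out (e.g. for testing).
check_mysql_service() {
    local host="$1"
    local out
    # Succeed only if the client connects AND the query returns the expected row.
    if out=$("${MYSQL_BIN:-mysql}" --host="$host" --connect-timeout=5 \
                 --batch --skip-column-names -e 'SELECT 1' 2>/dev/null) \
        && [ "$out" = "1" ]; then
        echo "OK query succeeded through $host"
        return 0
    fi
    echo "CRITICAL no query result through $host"
    return 2
}
```

Called as check_mysql_service followed by the address of your VIP, it returns 0 or 2 in the usual Nagios style, so it can be wired in as an ordinary check command.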