Monitoring PXC and Galera with Percona Monitoring Plugins

The Percona Monitoring Plugins (PMP) are free tools that make it easier to monitor PXC/Galera nodes. Monitoring broadly falls into two categories, alerting and historical graphing, and the plugins support Nagios and Cacti, respectively, for those purposes.

Graphing

An update to the PMP this summer (thanks to our Remote DBA team for supporting this!) added a Galera-specific host template that includes a variety of Galera-related stats, including:

Replication traffic, transaction counts, and average transaction size
Inbound and outbound (Send and Recv) queue sizes
Parallelization efficiency
Write conflicts (Local Cert Failures and Brute Force Aborts)
Cluster size
Flow control

You can see examples and descriptions of all the graphs in the manual.
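
To illustrate where a derived graph such as average transaction size can come from, the sketch below divides a byte counter by a write-set counter. It assumes the inputs are samples of the wsrep_replicated_bytes and wsrep_replicated status counters; that pairing is an assumption made for this example, not something the template description above confirms, and avg_trx_size is a made-up name.

```shell
# avg_trx_size BYTES COUNT
# Hypothetical helper: average replicated write-set size in bytes.
# BYTES and COUNT are assumed to be samples of wsrep_replicated_bytes
# and wsrep_replicated taken at the same moment.
avg_trx_size() {
  awk -v bytes="$1" -v count="$2" 'BEGIN {
    if (count == 0) { print 0; exit }   # avoid dividing by zero on an idle node
    printf "%.1f\n", bytes / count
  }'
}

avg_trx_size 1048576 512
```

A graphing tool would normally apply the same division to the change in each counter between two polls, so the result reflects recent traffic instead of the average since server start.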

Alerting

There is no Galera-specific Nagios plugin in the PMP yet, but there is a check called pmp-check-mysql-status that can test almost any status variable you like. It is easy to adapt to some key action-worthy Galera stats, but I hadn’t worked out the details until a customer requested it recently.

Checking for a Primary Cluster

Technically, wsrep_cluster_status reports the state of the cluster, or of whichever partition of the cluster the queried node belongs to. Because any single node can become disconnected from the rest of the cluster, it makes sense to run this check on every node:

$ /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
OK wsrep_cluster_status (str) = Primary | wsrep_cluster_status=Primary;;non-Primary;0;
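
The output above follows the usual Nagios plugin convention: human-readable status text, a pipe, then performance data in the form label=value;warn;crit;min;max. As a minimal sketch, the hypothetical perfdata_field function below pulls one field out of such a line, which makes it easier to see which threshold slot a given -w or -c option filled.

```shell
# perfdata_field LINE N
# Hypothetical helper: print the Nth semicolon-separated field (1 = value,
# 2 = warn, 3 = crit, 4 = min) from the performance data of a plugin line.
perfdata_field() {
  perf=${1#*"|"}        # keep everything after the first pipe
  perf=${perf#*=}       # drop the "label=" prefix
  printf '%s\n' "$perf" | cut -d';' -f"$2"
}

line='OK wsrep_cluster_status (str) = Primary | wsrep_cluster_status=Primary;;non-Primary;0;'
perfdata_field "$line" 1   # current value
perfdata_field "$line" 3   # critical threshold
```

Here the warning slot (field 2) is empty because the check was run with -c only.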

Local node state

We also want to verify the given node is ‘Synced’ into the cluster and not in some other state:

$ /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_local_state_comment -C '!=' -T str -w Synced
OK wsrep_local_state_comment (str) = Synced | wsrep_local_state_comment=Synced;Synced;;0;

Note that we only warn when the state is not Synced, because it is perfectly valid for a node to be in the Donor/Desynced state. The warning alerts us to a node in a less-than-ideal state without screaming about it, but you could certainly go critical instead.
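
If a single warning level feels too coarse, one possible middle ground is to warn on Donor/Desynced and go critical on anything else that is not Synced. The sketch below is a hypothetical policy, not something pmp-check-mysql-status does on its own; the function name and the choice of which states count as critical are assumptions for illustration. It follows the Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
# node_state_check STATE
# Hypothetical policy: classify a wsrep_local_state_comment value.
node_state_check() {
  case "$1" in
    Synced)         echo "OK wsrep_local_state_comment = $1";       return 0 ;;
    Donor/Desynced) echo "WARNING wsrep_local_state_comment = $1";  return 1 ;;
    *)              echo "CRITICAL wsrep_local_state_comment = $1"; return 2 ;;
  esac
}

# Example: feed it the value read from the server.
# The mysql invocation assumes credentials are available, e.g. in ~/.my.cnf.
# state=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" | cut -f2)
# node_state_check "$state"
```

Whether transitional states should really page someone depends on how long a state transfer normally takes in your environment, so treat the catch-all branch as a starting point.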

Verify the Cluster Size

This is a bit of a sanity check, but we want to know how many nodes are in the cluster and either warn if we’re down a single node or go critical if we’re down more. For a three-node cluster, your check might look like this:

$ /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_size -C '<=' -w 2 -c 1
OK wsrep_cluster_size = 3 | wsrep_cluster_size=3;2;1;0;

This is OK when we have 3 nodes, warns at 2 nodes, and goes critical at 1 node (when we have no redundancy left). You could certainly adjust the thresholds to suit your normal cluster size. This check is likely meaningless unless we’re also in a Primary cluster, so you could set a service dependency on the Primary Cluster check here.
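
A service dependency of that kind might look like the following Nagios object definition. The host name node1 and the two service descriptions are placeholders; they need to match whatever names your own service definitions use.

```cfg
# Placeholder names: adjust host_name and the service descriptions
# to match your own Nagios configuration.
define servicedependency {
    host_name                       node1
    service_description             PXC Primary Cluster
    dependent_host_name             node1
    dependent_service_description   PXC Cluster Size
    execution_failure_criteria      c    ; skip the size check while the primary check is critical
    notification_failure_criteria   c    ; and suppress its notifications
}
```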

Check for Flow Control

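Flow control is really something to keep an eye on in your cluster. The wsrep_flow_control_paused status variable reports the fraction of time, since the last FLUSH STATUS, that replication was paused by flow control, so we can monitor its recent state like this: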

$ /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_flow_control_paused -w 0.1 -c 0.9
OK wsrep_flow_control_paused = 0.000000 | wsrep_flow_control_paused=0.000000;0.1;0.9;0;

This warns when FC reaches 10% and goes critical at 90%. It may need some fine-tuning, but I believe the general principle holds: a small amount of FC can be normal, and you want to know when it becomes excessive.
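
To make the threshold behaviour concrete, here is a small sketch of the comparison in plain shell. It assumes a value at or above a threshold counts as a breach and that the critical threshold is tested first; fc_severity is a made-up name, and awk is used only because the shell itself cannot compare fractions.

```shell
# fc_severity VALUE WARN CRIT
# Hypothetical helper: classify a wsrep_flow_control_paused sample.
fc_severity() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v >= c)      print "CRITICAL"   # critical wins when both thresholds are crossed
    else if (v >= w) print "WARNING"
    else             print "OK"
  }'
}

fc_severity 0.000000 0.1 0.9
fc_severity 0.250000 0.1 0.9
fc_severity 0.950000 0.1 0.9
```

Trying a few sample values this way is a cheap way to sanity-check new thresholds before putting them into the Nagios configuration.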

Conclusion

Alerting with Nagios and graphing with Cacti tend to work best with per-host checks and graphs, even though there are aspects of a PXC cluster that you may want to monitor from a cluster-wide perspective. Fortunately, most of the things that can “go wrong” are easily detectable with per-host checks, so you can get by without a custom, Galera-aware script.

I’d also always recommend what I call a “service check” that connects through your VIP or load balancer to ensure that MySQL is available (regardless of underlying cluster state) and can do a query. As long as that works (proving there is at least 1 Primary cluster node), you can likely sleep through any other cluster event. 🙂
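
Such a service check can be very small. The sketch below is one way to write it as a Nagios-style shell function; the function name, the MYSQL_BIN override, and the host name vip.example.com are all placeholders, and it assumes the mysql client can find credentials on its own (for example in ~/.my.cnf).

```shell
# service_check HOST
# Hypothetical check: succeed only if a trivial query works through HOST
# (your VIP or load balancer). MYSQL_BIN lets you point at a different client.
service_check() {
  if out=$("${MYSQL_BIN:-mysql}" --host="$1" --connect-timeout=5 -N -B -e 'SELECT 1' 2>&1) \
     && [ "$out" = "1" ]; then
    echo "OK MySQL answered a query via $1"
    return 0
  fi
  echo "CRITICAL MySQL query via $1 failed: $out"
  return 2
}

# Usage (placeholder host name):
# service_check vip.example.com
```

Because the check only cares whether a query succeeds end to end, it stays valid no matter which node the load balancer happens to route to.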

4 Comments

AdriannaY

10 years ago

Hello,

I have some problems with “T” options when I run the command :

[root@mypriv-bd3 ~]# /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_local_state_status -C '!=' -T str -w Synced

I have this error: you specified -T but not -y. Try --help.

ditto when I run : /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary

Is there a configuration setting or a particular change to do?

Thanks in advance.

Jay Janssen

Author

10 years ago

AdriannaY:

It works for me:

[root@node1 ~]# /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_local_state_comment -C '!=' -T str -w Synced
OK wsrep_local_state_comment (str) = Synced | wsrep_local_state_comment=Synced;Synced;;0;

Make sure you are running the latest version of the plugins; it’s possible that this flag was modified in recent releases:

[root@node1 ~]# rpm -qa | grep nagios
percona-nagios-plugins-1.0.5-1.noarch

AdriannaY

10 years ago

Hello Jay;

Thanks for your answer. I had percona-nagios-plugins-1.0.3-1.noarch.rpm on my machine; I have now installed the specified package and everything works.

Adrianna

Rares Dumitrescu

6 years ago

hi. this is a super necro but :

/usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_local_state_comment -C '!=' -T str -w Synced
ERROR 1682 (HY000) at line 1: Native table 'performance_schema'.'global_variables' has the wrong structure
UNK could not get MySQL status/variables.

this is happening on a percona 5.7 installation. mysql_upgrade has been run. i got nothing there.
