Monitoring a Galera Cluster

Length: 3,100 words; Writer: Russell J.T. Dyer; Date: July 17, 2019; Topic: Administration; Level: Intermediate

Galera Cluster is a reliable, stable database replication and clustering system. Both MySQL and MariaDB, using the InnoDB storage engine, rely on Galera for communication between nodes running Linux. Every component of such an arrangement is equally dependable for maintaining the availability of data. It’s truly a high-end professional package.

Nevertheless, you should monitor your cluster as an added level of assurance, to maintain a high availability standard and to resolve problems quickly and without loss of data. You should check the status of your cluster occasionally by hand and continuously by automated means. Additionally, you should check and monitor the state of each node to guard against problems (e.g., replication lag, network connectivity issues, etc.).

There are three methods available to monitor cluster activity and replication health: you can regularly query MySQL’s status variables; use customized scripts, which basically react to changes in status variables; or use a third-party monitoring application, which also relies on status variables. In essence, you can either check the status variables yourself, or you can automate and record the process by employing a script or monitoring software to check the status variables and alert you when there’s a problem.

In this article, we’ll look closely at the essential status variables for you to consider and ways to log cluster and node status.

Using Status Variables

In addition to the standard status variables in MySQL you may already monitor, Galera Cluster also provides a set of status variables. They will allow you to check node and cluster states, as well as replication health.

Galera Cluster variables are related to write-set replication and are therefore prefixed with wsrep_. To retrieve a list of all of these status variables, enter the following SQL statement on each node, using a simple database client such as mysql:

SHOW GLOBAL STATUS LIKE 'wsrep_%';

+------------------------+--------------------------------------+
| Variable_name          | Value                                |
+------------------------+--------------------------------------+
| wsrep_local_state_uuid | bd5fe1c3-7d80-11e9-8913-4f209d688a15 |
| wsrep_protocol_version | 10                                   |
| ...                    | ...                                  |
| wsrep_thread_count     | 6                                    |
+------------------------+--------------------------------------+

If you execute this SQL statement on one of your nodes, you’ll see that there are over sixty status variables. Some of them may be of no interest to you – perhaps most – but there are some you should check regularly. You can group them into three basic categories: cluster integrity, node status, and replication health.

A Cluster’s Integrity

A cluster is said to have integrity when every node in the cluster receives and replicates write-sets from all of the other nodes. The cluster begins to lose integrity when this situation falters, which can happen when the cluster goes down, becomes partitioned, or encounters a split-brain situation.

The status variables that will reveal whether there is a loss of cluster integrity are wsrep_cluster_state_uuid, wsrep_cluster_conf_id, wsrep_cluster_size, and wsrep_cluster_status. Let’s consider each and how it may indicate a problem.

Compare UUIDs

When all nodes are synchronized with each other, they will have executed all of the same transactions. Each transaction is identified by the cluster state UUID combined with a sequence number. Therefore, the cluster state UUID reported on each node should be the same.

To confirm this, execute the following SQL statement on each node to see if the results are the same:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_state_uuid' \G

*************************** 1. row ***************************
Variable_name: wsrep_cluster_state_uuid
        Value: bd5fe1c3-7d80-11e9-8913-4f209d688a15

If the last node has a different result from the others, it may be that a transaction came through while you were executing the SQL statement on each node. So check again, perhaps in a different order. But if one or more nodes clearly have different UUIDs than the others, the cluster has lost integrity. This means more than one cluster has been formed, and the nodes are not all communicating with each other.

Take Attendance

If there may be a problem with network connectivity, or if you think the cluster may have split into separate clusters, check wsrep_cluster_size on each node to see that they agree. If you have five nodes and some of them report a cluster size of three while others report two, you have a problem. Any value that doesn’t match the number of nodes you have running suggests there’s a network connectivity problem, or perhaps MySQL is down on one node.

However, if only one node is out of sync, you might solve the problem by taking it down, fixing whatever network problem it’s having, and then starting it again. When it properly rejoins the cluster, it will undergo a State Snapshot Transfer (SST), a full replacement of its databases.

Take a Tally

Another approach to checking cluster integrity is to compare the values of the wsrep_cluster_conf_id status variable on all nodes. This shows the total number of cluster membership changes that have occurred, or at least the changes of which the node on which it’s executed is aware. Basically, comparing this variable will determine whether a node is part of the Primary Component.

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_conf_id';

+-----------------------+--------+
| Variable_name         | Value  |
+-----------------------+--------+
| wsrep_cluster_conf_id | 82     |
+-----------------------+--------+

Each node in the cluster should provide the same value. Otherwise, it indicates that the cluster is partitioned. This is not good. If this value is some outrageously high number (e.g., in excess of a trillion), it may indicate that the nodes are dropping and restarting themselves over and over.
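Checks like these across several nodes are easy to script. Below is a minimal sketch that queries each node for the three variables just discussed and prints the values side by side; the node host names (borrowed from the example cluster later in this article) and the monitoring credentials are placeholders you would replace with your own:

#!/bin/bash

# Hypothetical node addresses and monitoring credentials; adjust for your cluster.
nodes="galera1 galera2 galera3"
user="monitor"
password="secret"

for variable in wsrep_cluster_state_uuid wsrep_cluster_size wsrep_cluster_conf_id
do
   echo "$variable:"
   for node in $nodes
   do
      # -N suppresses the column headers so only the variable and value are returned.
      value=$(mysql -h "$node" -u "$user" -p"$password" -N \
              -e "SHOW GLOBAL STATUS LIKE '$variable'" | awk '{print $2}')
      echo "   $node: $value"
   done
done

Each variable’s values should agree across all of the nodes; a mismatch tells you which of the integrity problems above to investigate.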

Each Node’s Status

In addition to checking cluster integrity, you should also monitor the status of individual nodes, meaning each node’s own state rather than its relation to the cluster as a whole.

Basically, you would look to see whether a node is receiving and processing write-sets from the cluster. There are a few status variables that will give such insights: wsrep_ready, wsrep_connected, and wsrep_local_state_comment.

Ready & Connected

The first two status variables are pretty straightforward: they’re either ON or OFF. If wsrep_ready returns OFF, the node isn’t ready and almost all queries will fail. You’ll receive error messages like this one:

ERROR 1047 (08S01) Unknown Command

When wsrep_connected returns OFF, the node doesn’t have a connection to any other nodes or cluster components. The reason for the lost connection could be physical (i.e., the network is down, a cable is disconnected, etc.), or it could be that the node’s configuration file is incorrect or inconsistent with the other nodes.

For instance, the values of the wsrep_cluster_address and wsrep_cluster_name parameters may be entered incorrectly in the MySQL configuration file. The error log should provide details to help troubleshoot the problem. This is usually /var/log/mysqld.log, or whatever the value of the log_error variable is.
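These two checks are easy to script on the node itself. Below is a minimal sketch; it assumes the mysql client can authenticate without prompting (for example, via a local option file), and the error log path is the typical one mentioned above, so substitute your own log_error value:

#!/bin/bash

# Minimal local check of wsrep_ready and wsrep_connected, run on the node itself.
# The error log path is an assumption; substitute the value of your log_error variable.
error_log='/var/log/mysqld.log'

for variable in wsrep_ready wsrep_connected
do
   value=$(mysql -N -e "SHOW GLOBAL STATUS LIKE '$variable'" | awk '{print $2}')
   echo "$variable: $value"
   if [ "$value" != "ON" ]
   then
      echo "$variable is $value; recent error log entries follow:"
      tail -n 20 "$error_log"
   fi
done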

Easily Understood

To make the node status much clearer, you can check the value of the wsrep_local_state_comment status variable. Its value will be easy to understand.

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';

+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+

That’s pretty clear—Synced—and reassuring.

When a node is part of the Primary Component, it will return Joining, Waiting on SST, Joined, Synced, or Donor. If you don’t like the result you get, try again: it changes quickly and generally won’t take long to reach Synced. If a node is part of a non-operational component, though, it will return Initialized. If it stays that way, there might be a problem.
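If you’d rather watch the state change than re-type the statement, you can poll it. Here’s a minimal sketch, assuming the mysql client can authenticate without prompting:

#!/bin/bash

# Poll wsrep_local_state_comment every five seconds until the node reports Synced.
while true
do
   state=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" | cut -f 2)
   echo "$(date '+%H:%M:%S') $state"
   [ "$state" = "Synced" ] && break
   sleep 5
done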

Replication Health

Monitoring cluster integrity and node status can reveal issues that may prevent or otherwise block replication. The status variables covered next will help you identify performance issues and problem areas so that you can get the most from your cluster.

So that things don’t get too hectic for a node, Galera uses a feedback mechanism called Flow Control to manage the replication process. When too many write-sets accumulate in a node’s receive queue, the node engages Flow Control to pause replication until it can catch up.

The status variables you’d check for this are wsrep_local_recv_queue_avg, wsrep_flow_control_paused, and wsrep_cert_deps_distance. Unlike the previously mentioned status variables, these are reset when the server is restarted or the FLUSH STATUS statement is executed.

Bunching of Writes

The wsrep_local_recv_queue_avg variable shows the average size of the local received queue since the last status query. When this is greater than 0, it indicates that the node can’t apply write-sets as fast as it’s receiving them. If you’re detecting a problem here, you might also check wsrep_local_recv_queue_min and wsrep_local_recv_queue_max to get a range of values, rather than just the average.
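If you want all of the receive-queue statistics at once, a single pattern match returns them together. A quick example, again assuming the mysql client can authenticate without prompting:

# Fetch the current, minimum, maximum, and average receive queue lengths in one statement.
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%'"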

In addition to checking the node’s status related to incoming write-sets, you can check how outgoing connectivity is looking. Mainly, you would check the wsrep_local_send_queue_avg variable to get the average send queue length since the last time the status variables were flushed. However, sending is rarely a bottleneck.

SHOW STATUS LIKE 'wsrep_local_send_queue_avg';

+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| wsrep_local_send_queue_avg | 0.145000 |
+----------------------------+----------+

A value greater than 0 indicates replication throttling or network throughput issues. It could be the physical network cards and cables, or the operating system’s configuration. Similar to the received queue above, you can check the wsrep_local_send_queue_min and wsrep_local_send_queue_max status parameters to see the range, and not just the average.

Flow Control Paused

If you sense a node is getting overwhelmed, you might execute FLUSH STATUS on it, wait a bit for a better sample, and then check the value of the wsrep_flow_control_paused variable. It returns the fraction of time since the flush that the node was paused because of Flow Control.

SHOW STATUS LIKE 'wsrep_flow_control_paused';

+---------------------------+----------+
| Variable_name             | Value    |
+---------------------------+----------+
| wsrep_flow_control_paused | 0.184353 |
+---------------------------+----------+

In the results here, replication was paused for a little more than 18 percent of the elapsed time. A value of 1 would indicate that the node was paused 100% of the time. Anything greater than 0 indicates the node’s replication health may be weak. You should monitor it closely, flushing occasionally, until you start seeing values of 0. If the situation doesn’t resolve itself, you might increase the number of replica threads (i.e., wsrep_slave_threads).

Note

If you use MySQL-wsrep 8.0.26 or newer, use wsrep_applier_threads instead of wsrep_slave_threads.
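The sampling routine described above is easy to script. Below is a minimal sketch that flushes the status, waits for a sample window, and then reports the value; the 60-second window is arbitrary, and the script assumes the mysql client can authenticate without prompting:

#!/bin/bash

# Reset the interval counters, wait for a sample window, then report the
# fraction of that window the node spent paused by Flow Control.
sample_seconds=60

mysql -e "FLUSH STATUS"
sleep $sample_seconds
mysql -e "SHOW STATUS LIKE 'wsrep_flow_control_paused'"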

Sequentially in Parallel

Last, you might monitor wsrep_cert_deps_distance. It tells you the average distance between the lowest and highest sequence numbers that a node can potentially apply in parallel.

Basically, this is the optimal value for wsrep_slave_threads or wsrep_applier_threads, since it’s pointless to assign more replica threads than the number of transactions that can be applied in parallel.
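As a rough illustration of that reasoning, the sketch below reads wsrep_cert_deps_distance and compares it with the configured replica thread count. It only prints what it finds; on MySQL-wsrep 8.0.26 or newer you would read wsrep_applier_threads instead, and the script assumes the mysql client can authenticate without prompting:

#!/bin/bash

# Compare the measured parallel-apply distance with the configured replica threads.
distance=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance'" | awk '{print int($2)}')
threads=$(mysql -N -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads'" | awk '{print $2}')

echo "wsrep_cert_deps_distance: $distance"
echo "wsrep_slave_threads:      $threads"

if [ "$threads" -gt "$distance" ]
then
   echo "More replica threads are configured than can typically be applied in parallel."
fi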

Utilizing Server Logs to Troubleshoot

As you can see, the status variables provide you with plenty of information for detecting problems. However, they don’t generally indicate a pattern; they mostly show the current state when you happen to look. Historical information, though, can make it easier to see a problem developing. Additionally, the status variables do little to help you determine the cause of problems, or to provide you with recommendations on how to solve them.

To see a pattern, you’ll have to query the status variables at regular intervals and record the results in a database or a log for later review. For consistency of intervals, this should be automated. You could either write your own scripts to do this, or use one of the many database monitoring programs (e.g., Monyog).
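If you decide to write your own, a minimal sketch might append a timestamped selection of wsrep_ status variables to a file; the file path and the variables chosen below are only examples, and the script assumes the mysql client can authenticate without prompting:

#!/bin/bash

# Append a timestamped snapshot of a few wsrep_ status variables to a history file.
log_file='/var/log/galera-status-history.log'

{
   date '+%Y-%m-%d %H:%M:%S'
   mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
      ('wsrep_cluster_size', 'wsrep_cluster_status',
       'wsrep_local_recv_queue_avg', 'wsrep_flow_control_paused')"
   echo "----------------------"
} >> $log_file

A cron entry such as the following (the script path here is hypothetical) would collect a sample every minute:

* * * * * /usr/local/bin/galera-status-history.sh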

Enabling the Error Log & Special Logging

To determine the cause of a problem, the server logs are generally the most helpful. Use SHOW VARIABLES to check the value of the log_error variable and determine the path and name of the log file. If it returns nothing, you’ll need to enable the error log by adding log-error to the MySQL configuration file; the server will then set the path and file name on its own.
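For example, the check might look like this; the path shown is only illustrative of what you might see on a typical installation:

mysql -e "SHOW VARIABLES LIKE 'log_error'"

+---------------+---------------------+
| Variable_name | Value               |
+---------------+---------------------+
| log_error     | /var/log/mysqld.log |
+---------------+---------------------+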

In addition to the information recorded in the error log, there are parameters and options you can use to enable error logging on events specific to replication: wsrep_log_conflicts, cert.log_conflicts, and wsrep_debug. Setting these will cause MySQL to record information about conflicts in the replication process.

The wsrep_log_conflicts parameter enables conflict logging in the error log. For instance, it will record when two nodes attempt to write to the same row of the same table at the same time. It will do this even if the conflict is resolved before it can be committed. Without logging this information, you would be unaware that there was temporarily a conflict.

The cert.log_conflicts is a wsrep Provider option that enables logging of certification failures during replication.

The wsrep_debug parameter enables debugging information, providing much more verbose entries in the log files. However, this parameter can also cause the database server to record passwords and similar authentication data to the error logs. Don’t enable it in production environments since it’s a security vulnerability.

Below is how these entries would look in the MySQL configuration file:

wsrep_log_conflicts=ON
wsrep_provider_options="cert.log_conflicts=ON"
wsrep_debug=ON

There is one more type of log you should check. When a node is unable to apply a transaction or some other event, the database server will create a special binary log file with details of that failure. This file is placed in the data directory and is named something like GRA_*.log. You should periodically check whether these log files have been generated. When they are, review them right away.
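A simple way to notice them is to look for recent files periodically; in the sketch below, the data directory path is an assumption (check your server’s datadir variable):

# List any GRA_ files created in the data directory during the last day.
find /var/lib/mysql -maxdepth 1 -name 'GRA_*.log' -mtime -1 -ls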

Notification Command

Although checking status variables and logs will provide the information you need, retrieving and reviewing that information is a manual process. Plus, you may have to examine status variables and logs on each node to determine and resolve a problem. This is one of the appealing aspects of third-party monitoring software.

To assist you in monitoring a cluster and its nodes, Galera includes a mechanism for alerting you to a problem. To make use of it, you’ll need to create a script, or copy someone else’s, that processes values passed to it from Galera. Then set the wsrep_notify_cmd parameter to the path and name of the script in the MySQL configuration file.

Galera will call the script and pass a set of values to it whenever a node joins or leaves the cluster, and whenever the cluster’s or a node’s status changes. Your script can then send you an alert, log the data it receives in a table or a log file (a way to accumulate the data for spotting patterns, as mentioned earlier), or adjust traffic flow through a load balancer.

Notification Script Example

When a change occurs in a node or the cluster and triggers the notification script or command, it will pass certain parameters to the script. Of particular interest are the --status and --members parameters. The status will be that of the node on which the script is running. It will indicate, among other things, if the node is synchronized or not. See the [Documentation on Notification Status](https://galeracluster.com/library/documentation/notification-cmd.html#node-status) for a list of all values.

Below is a very simple bash script that will serve as a notification command. It collects only some of the information available and records it to a log file, with labels.

#!/bin/bash

log_file='/var/log/galera-node-monitor.log'

# Parse the parameters that Galera passes to the notification command.
# Only --status and --members are used here; everything else is ignored.
while [ $# -gt 0 ]
do
   case $1 in
      --status)
         node_status=$2
         shift
         ;;
      --members)
         members=$2
         shift
         ;;
   esac
   shift
done

declare idx=0
declare -a node_names

# The --members value is a comma-separated list of entries in the
# form uuid/node-name/incoming-address. Extract each node name.
for node in $(echo "$members" | tr ',' ' ')
do
   node_names+=( "$(echo "$node" | cut -d'/' -f 2)" )
   idx=$(( idx + 1 ))
done

# Sort the node names so the log entries are consistent.
node_names=( $(printf "%s\n" "${node_names[@]}" | sort) )

# Read this node's own name from the MySQL configuration file.
node_name=$(grep wsrep_node_name /etc/my.cnf | cut -d'=' -f 2 | tr -d " '\"")

echo "Cluster Size: $idx nodes"          >> $log_file
echo "Cluster Members: ${node_names[*]}" >> $log_file
echo "Node Name: $node_name"             >> $log_file
echo "Node Status: $node_status"         >> $log_file
echo "----------------------"            >> $log_file

exit 0

To keep this script simple, it parses only two of the parameters that are passed to it. It manipulates that data a little bit, and then appends it to a log file. Before Galera can use the script, you’ll have to create that log file with touch and change its ownership to mysql so the database server can write to it.

A more useful version of this script would include code that sends you an email message if the node’s status is disconnected. Again, we wanted to keep this script simple as an example. Its output would look like this excerpt from the log file, which is appended as events happen:

----------------------
Cluster Size: 3 nodes
Cluster Members: galera1 galera2 galera3
Node Name: galera1
Node Status: synced
----------------------

This entry shows three nodes are running and lists their names—there are in fact only three nodes in this cluster. It also shows that the notification script was run on the galera1 node and that node is synchronized.

The next two entries show that mysqld was shut down on this node.

----------------------
Cluster Size: 1 nodes
Cluster Members: galera1
Node Name: galera1
Node Status: disconnecting
----------------------
Cluster Size: 0 nodes
Cluster Members:
Node Name: galera1
Node Status: disconnected
----------------------

Notice that the reported cluster size drops to 1 and then 0, even though the other two nodes are operating fine and maintaining the cluster. The count drops because this node is no longer in communication with the cluster.

The next set of entries below reflects mysqld starting again. Notice that after being connected, the node becomes a joiner and then passes through several more states before becoming synchronized.

----------------------
Cluster Size: 0 nodes
Cluster Members:
Node Name: galera1
Node Status: connected
----------------------
Cluster Size: 0 nodes
Cluster Members:
Node Name: galera1
Node Status: joiner
----------------------
Cluster Size: 0 nodes
Cluster Members:
Node Name: galera1
Node Status: initializing
----------------------
Cluster Size: 0 nodes
Cluster Members:
Node Name: galera1
Node Status: initialized
----------------------
Cluster Size: 0 nodes
Cluster Members:
Node Name: galera1
Node Status: joined
----------------------
Cluster Size: 3 nodes
Cluster Members: galera1 galera2 galera3
Node Name: galera1
Node Status: synced
----------------------

You would have to copy this script to each node and set it to run with the wsrep_notify_cmd parameter on each. The problem with this approach is that the data will be in separate logs.

A better solution would be to have the script connect to the database and insert these log entries into a table. Remember, entries written to a table on one node are replicated to all of the others and thereby joined together as part of the replication process. However, Galera seems to trip over itself when the notification command tries to replicate its own writes; the nodes end up non-operational and out of sync. An alternative would be to create a table on each node that doesn’t use the InnoDB storage engine (e.g., a MyISAM table). These tables would be unique to each node and not replicated, but they wouldn’t choke Galera. You could write another script, activated instead by cron, that queries the table on each node to produce reports and alerts. You could be alerted by email or some other method. It’s a little cumbersome, but it works.
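As a sketch of that idea, the statements below create a node-local MyISAM table and show the kind of INSERT the notification script could issue. The database name, table name, and columns are all hypothetical; the reporting script run by cron would then SELECT from this table on each node:

# Run once on each node (the database and table names here are hypothetical).
mysql -e "CREATE DATABASE IF NOT EXISTS monitoring"
mysql -e "CREATE TABLE IF NOT EXISTS monitoring.galera_notifications (
             noted       TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
             node_name   VARCHAR(64),
             node_status VARCHAR(32),
             members     TEXT
          ) ENGINE=MyISAM"

# Inside the notification script, each event could then be recorded like this,
# reusing the node_name, node_status, and members values parsed earlier:
mysql -e "INSERT INTO monitoring.galera_notifications (node_name, node_status, members)
          VALUES ('$node_name', '$node_status', '$members')"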

Reading GRA_*.log Files

Your data directory may contain log files whose names start with GRA_. These files are related to replication failures: whenever a node fails to apply an event it has received, the database server creates a special binary log file of the event in the data directory. For each GRA_ file, there is a corresponding warning or error message in the MySQL error log.

To view the contents of these files, you can use the strings utility, or view them with mysqlbinlog. See below for example output:

[root@galerasf mysql]# strings GRA_2_123440_v2.log
binM
8.0.26
CREATE UNDO TABLESPACE undo_003 ADD DATAFILE 'undo_003.ibu'

[root@galerasf mysql]# mysqlbinlog GRA_2_123440_v2.log
# The proper term is pseudo_replica_mode, but we use this compatibility alias
# to make the statement usable on server versions 8.0.24 and older.
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#230713  6:35:57 server id 1  end_log_pos 0  Start: binlog v 4, server v 8.0.26 created 230713  6:35:57 at startup
ROLLBACK/*!*/;
BINLOG '
TZuvZA8BAAAAeQAAAAAAAAAAAAQAOC4wLjI2AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAABNm69kEwANAAgAAAAABAAEAAAAYQAEGggAAAAICAgCAAAACgoKKioAEjQA
CigAbVYcMQ==
'/*!*/;
# at 125
#230713  6:35:57 server id 1  end_log_pos 0  Query   thread_id=2256014       exec_time=0     error_code=0
SET TIMESTAMP=1689230157/*!*/;
SET @@session.pseudo_thread_id=2256014/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1168113696/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8mb4 *//*!*/;
SET @@session.character_set_client=255,@@session.collation_connection=255,@@session.collation_server=255/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
/*!80011 SET @@session.default_collation_for_utf8mb4=255*//*!*/;
CREATE UNDO TABLESPACE undo_003 ADD DATAFILE 'undo_003.ibu'
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

GRA_ files are for troubleshooting purposes only, and are not automatically cleaned up. Once you have determined whether they represent a problem, you can manually delete them.

Conclusion

With busy and large databases, keeping them running smoothly and consistently can be a little intimidating. However, Galera provides plenty of information for you to be able to monitor the status of each node and the cluster. You need only develop a habit of checking, or a system to check automatically and with regularity. Plus, it provides a method of reacting to changes in node and cluster status.

Yes, you’ll need to know how to read the warning signs and know what to do to resolve problems before they affect the entire cluster, but the sooner you are made aware of a situation developing, the better and less stressful it will be for you.