Node Failure & Recovery

Individual nodes fail to operate when they lose touch with the cluster. This can occur for various reasons: a hardware failure, a software crash, the loss of network connectivity, or the failure of a state transfer. Anything that prevents a node from communicating with the cluster is generalized under the concept of node failure. Understanding how nodes fail will help in planning for their recovery.

Detecting Single Node Failures

When a node fails, the only sign is the loss of connection to its processes as seen by the other nodes. Nodes are therefore considered failed when they lose membership in the cluster’s Primary Component. That is, from the perspective of the cluster, a node has failed when the nodes that form the Primary Component can no longer see it. From the perspective of the failed node itself, assuming that it has not crashed, it has lost its connection with the Primary Component.
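
You can check which side of a failure a given node is on from any client connection. The query below uses the standard wsrep_cluster_status status variable; the values described in the comments are illustrative of typical output:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
-- A node that belongs to the Primary Component reports the value 'Primary'.
-- A node that has lost its connection to the Primary Component reports
-- 'non-Primary' or 'Disconnected' instead.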

Third-party tools for monitoring nodes, such as ping, Heartbeat, and Pacemaker, can be grossly inaccurate in their estimates of node failure. These utilities do not participate in the Galera Cluster group communication and remain unaware of the Primary Component.

If you want to monitor the status of a Galera Cluster node, poll the wsrep_local_state status variable or use the Notification Command.
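
For example, you can poll the node state from any MySQL client. The statement below uses the standard wsrep_local_state and wsrep_local_state_comment status variables; the output shown in the comments is illustrative:

SHOW GLOBAL STATUS LIKE 'wsrep_local_state%';
-- wsrep_local_state          4
-- wsrep_local_state_comment  Synced
-- A value of 4, reported as 'Synced', means the node is connected to the
-- Primary Component and fully operational.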

For more information on monitoring the state of cluster nodes, see the chapter on Monitoring the Cluster.

The cluster determines node connectivity from the last time it received a network packet from the node. You can configure how often the cluster checks this using the evs.inactive_check_period parameter. If, during a check, the cluster finds that the time since the last network packet from the node is greater than the value of the evs.keepalive_period parameter, it begins to emit heartbeat beacons. If the cluster still receives no network packets from the node for the period set by the evs.suspect_timeout parameter, the node is declared suspect. Once all members of the Primary Component see the node as suspect, it is declared inactive (that is, failed).

If no messages are received from the node for a period greater than evs.inactive_timeout, the node is declared failed regardless of consensus. The failed node remains non-operational until all members agree on its membership. If the members cannot reach consensus on the liveness of a node, the network is too unstable for cluster operations.

The relationship between these option values is:

evs.keepalive_period <= evs.inactive_check_period
evs.inactive_check_period <= evs.suspect_timeout
evs.suspect_timeout <= evs.inactive_timeout
evs.inactive_timeout <= evs.consensus_timeout
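
For instance, a wsrep_provider_options setting in the my.cnf configuration file that satisfies these relationships might look as follows. The durations, given in ISO 8601 format, are illustrative values rather than recommendations:

[mysqld]
wsrep_provider_options="evs.keepalive_period=PT1S; evs.inactive_check_period=PT1S; evs.suspect_timeout=PT5S; evs.inactive_timeout=PT15S; evs.consensus_timeout=PT30S"

With this ordering, heartbeats are sent every second, a silent node becomes suspect after five seconds, and it is declared failed after fifteen seconds even without consensus.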

Note

Unresponsive nodes that fail to send messages or heartbeat beacons on time—for instance, in the event of heavy swapping—may also be pronounced failed. This prevents them from locking up the operations of the rest of the cluster. If you find this behavior undesirable, increase the timeout parameters.

Cluster Availability vs. Partition Tolerance

Within the CAP theorem, Galera Cluster emphasizes data safety and consistency. This leads to a trade-off between cluster availability and partition tolerance. That is, on unstable networks such as a WAN, low evs.suspect_timeout and evs.inactive_timeout values may result in false node failure detections, while higher values for these parameters may result in longer availability outages in the event of an actual node failure.

Essentially, the evs.suspect_timeout parameter defines the minimum time needed to detect a failed node. During this period, the cluster is unavailable due to the consistency constraint.
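
On a WAN, for example, you might raise these timeouts to ride out transient network problems, accepting slower failure detection in exchange. The values below are purely illustrative and keep the relationships listed above intact:

[mysqld]
wsrep_provider_options="evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.consensus_timeout=PT1M"

With these settings, a genuinely failed node leaves the cluster unavailable for at least thirty seconds (the evs.suspect_timeout) before the remaining members can declare it failed and resume processing writes.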

Recovering from Single Node Failures

If one node in the cluster fails, the other nodes continue to operate as usual. When the failed node comes back online, it automatically synchronizes with the other nodes before it is allowed back into the cluster.

No data is lost in single node failures.
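
You can watch the rejoining node catch up by polling its state. The statement below uses the standard wsrep_local_state_comment status variable; the values in the comments are the ones usually reported during recovery:

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
-- While the node synchronizes it typically reports 'Joining' or 'Joined';
-- once it reports 'Synced', the node has rejoined the cluster and is
-- serving traffic again.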

State Transfer Failure

Single node failures can also occur when a state snapshot transfer fails. This renders the receiving node unusable, as it aborts when it detects the state transfer failure.

When the node fails while using the mysqldump state transfer method, restarting it may require you to manually restore the administrative tables. This is not an issue for the rsync method, which does not require the database server to be in an operational state to work.
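
If you want to avoid this particular failure mode, you can select rsync as the state transfer method in the my.cnf configuration file. wsrep_sst_method is the standard option name; the snippet below shows only that single setting:

[mysqld]
wsrep_sst_method=rsync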