Node Crashes during SST

Length: 463 words; Published: April 1, 2014; Updated: November 4, 2019; Category: State Transfers; Type: Troubleshooting

When a new node joins a cluster, it requests data from the cluster. One node, known as the donor, uses a State Snapshot Transfer (SST) method to provide a full copy of the data to the new node, known as the joiner. Depending on how the nodes are configured, they typically use either mysqldump or rsync to transfer the data. This usually works well, but not always. This KB article considers a common scenario in which problems may occur.

Scenario

You can set the wsrep_sst_method option to whichever tool you want to use for state transfers. DBAs typically set this to mysqldump or rsync. When a new node joins, a State Snapshot Transfer begins, and operating system processes are started for the tool used (for example, rsync). Below are the results of running ps on the joining node during a state transfer:

ps -e | grep rsync

14718 ? 00:00:00 wsrep_sst_rsync
14766 ? 00:00:00 rsync
14799 ? 00:00:00 rsync
14800 ? 00:00:00 rsync

If the node crashes before the state transfer is complete, the process or processes running rsync, or whatever tool you are using, may stall. A stalled process keeps occupying the port and prevents you from restarting the node. When this happens, the error log for the database server (for example, /var/log/mysqld.log) will report that the port is in use, even though no state transfer is actually in progress. You will need to clear the stalled processes before the node can start again.
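Before killing anything, it can help to confirm which process is holding the port. The sketch below filters the output of ss -ltnp for a given port. The helper name listeners_on_port is hypothetical, and port 4444 is an assumption based on the usual default SST port in Galera Cluster; check your wsrep_sst_receive_address setting if your nodes use a different one.

```shell
# Hypothetical helper: read `ss -ltnp` output on stdin and print only the
# listening sockets whose local port matches the first argument.
listeners_on_port() {
    awk -v port="$1" '{
        # Field 4 is "Local Address:Port"; the port follows the last colon.
        n = split($4, parts, ":")
        if (parts[n] == port) print
    }'
}

# Usage (run as root so that ss can show the owning process):
#   ss -ltnp | listeners_on_port 4444
```

If a line comes back and names rsync as the owner, the stalled transfer is the likely reason the node cannot restart.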

Solution

There are a few ways to resolve this situation. The simplest is to kill the stalled processes, for which you will need their process identification numbers (PIDs). First, though, you may want to stop mysqld on the joining node so that you start fresh. You could enter something like the following from the command line on the stalled node:

systemctl stop mysqld

ps -e | grep rsync

14800 ? 00:06:05 rsync

In the example here, the results show that the process identification number is 14800. Using this information, you might enter the following from the command line:

kill -9 14800

If there are multiple processes running, which can be the case with rsync, you will have to kill all of them. Sometimes the killall command will suffice:

killall rsync

However, this usually does not work. In that case, you will have to use the kill command on each process individually. It is tedious, but necessary. Once you have killed the orphaned processes, the relevant ports are freed and you can start mysqld on the new node.
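The per-process kill described above can be scripted. The following is a minimal sketch, not a definitive procedure: the helper names pids_for_cmd and kill_each are hypothetical, and it assumes ps -e prints the PID in the first column and the command name in the last, as in the output shown earlier.

```shell
# Hypothetical helper: read `ps -e` output on stdin and print the PID of
# every line whose command name (last column) is exactly the given name.
# An exact match means wsrep_sst_rsync is NOT selected when asking for rsync.
pids_for_cmd() {
    awk -v name="$1" '$NF == name { print $1 }'
}

# Hypothetical helper: send a signal (default TERM) to each PID on stdin.
kill_each() {
    sig="${1:-TERM}"
    while read -r pid; do
        if [ -n "$pid" ]; then
            kill -s "$sig" "$pid"
        fi
    done
}

# Usage: try a normal termination first, then force it if processes remain.
#   ps -e | pids_for_cmd rsync | kill_each TERM
#   ps -e | pids_for_cmd rsync | kill_each KILL
```

Run ps -e | grep rsync again afterwards to confirm that nothing is left before starting mysqld.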

After restarting the node, if the processes handling the state transfer stall again, it was not a fluke: there is a persistent problem with the network, the security settings, or something else. Check the system logs and the database logs on the server to determine the cause.
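As a starting point for that check, the sketch below pulls the most recent lines mentioning SST or rsync from the database error log. The helper name recent_sst_lines is hypothetical, the search terms are only a guess at what is relevant, and the log path should be replaced with the actual error log of your server.

```shell
# Hypothetical helper: print the last N lines (default 20) from stdin that
# mention "sst" or "rsync", ignoring case.
recent_sst_lines() {
    grep -i -E 'sst|rsync' | tail -n "${1:-20}"
}

# Usage:
#   recent_sst_lines 50 < /var/log/mysqld.log
```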