Requested State Transfer Failed
Length: 649 words; Published: April 1, 2014; Updated: November 6, 2019; Category: State Transfers; Type: Troubleshooting
When a new node joins a cluster, it will try to synchronize with the cluster by getting a full copy of the databases from one of the other nodes. This is known as a State Transfer. It will use a tool like rsync or mysqldump, depending on how the wsrep_sst_method option was set. Although this usually works well, sometimes it will fail. This KB article discusses such a situation.
Scenario
Suppose a new node, known as a joiner node, joins a cluster. Assume that the node has in fact joined the cluster but hasn’t been able to synchronize its data with the other nodes. When it joins the cluster, it looks for another node, known as a donor, to give it a copy of the databases by the State Snapshot Transfer (SST) method. Normally, this starts almost immediately and completes fairly quickly, depending on the size of the databases and how busy the nodes are.
Suppose further that an excessive amount of time passes without the SST starting. This can be disconcerting. To see what’s going on, check the database server’s error log on the joiner node. It may contain messages like these:
Node 0 (XXX) requested state transfer from '*any*'.
Selected 1 (XXX) as donor.
The '*any*' in the first message indicates that no node was explicitly designated as the donor. You may also see this message on the joiner:
Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s).
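If you need to check several joiners, or a long log, a short script can pull these messages out. The following is a minimal Python sketch, assuming the log can be read as plain text lines and that the messages use the wording quoted above (the exact wording may differ between Galera and MariaDB versions). The function name scan_joiner_log is hypothetical, not part of any MariaDB tool.

```python
import re
from typing import Iterable

# Patterns follow the two joiner-side messages quoted in this article;
# the wording is an assumption and may vary between versions.
REQUEST_RE = re.compile(r"requested state transfer from '([^']*)'")
FAILED_RE = re.compile(
    r"Requesting state transfer failed: (-?\d+)\(([^)]*)\)\. "
    r"Will keep retrying every (\d+) second"
)


def scan_joiner_log(lines: Iterable[str]) -> dict:
    """Summarize SST-related messages found in a joiner's error log (hypothetical helper)."""
    summary = {"requested_from": None, "failures": 0,
               "last_error": None, "retry_seconds": None}
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            # '*any*' means no donor was explicitly designated.
            summary["requested_from"] = m.group(1)
        m = FAILED_RE.search(line)
        if m:
            summary["failures"] += 1
            summary["last_error"] = (int(m.group(1)), m.group(2))
            summary["retry_seconds"] = int(m.group(3))
    return summary
```

For example, passing in the lines of the joiner’s error log (its location depends on your log_error setting) returns the donor that was requested and how many failed attempts were logged.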
As for the node’s status, if you execute the SHOW STATUS statement for the wsrep_local_state_comment variable, you won’t see the desired Synced status:
SHOW STATUS LIKE 'wsrep_local_state_comment';
+---------------------------+----------------+
| Variable_name             | Value          |
+---------------------------+----------------+
| wsrep_local_state_comment | Waiting on SST |
+---------------------------+----------------+
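To watch a joiner recover without re-running the query by hand, you could poll this status variable. The sketch below is a minimal, driver-agnostic Python example. get_state is a hypothetical callable that you supply: it would run the SHOW STATUS statement above on the joiner and return the Value column. The timeout and interval defaults are arbitrary choices.

```python
import time
from typing import Callable


def wait_for_synced(get_state: Callable[[], str],
                    timeout: float = 600.0,
                    interval: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Poll wsrep_local_state_comment until it reads 'Synced' or time runs out.

    Returns the last state seen: 'Synced' on success, otherwise whatever the
    node reported when the timeout expired (e.g. 'Waiting on SST').
    """
    deadline = clock() + timeout
    state = get_state()
    while state != "Synced" and clock() < deadline:
        sleep(interval)
        state = get_state()
    return state
```

The sleep and clock parameters are injectable only so that the loop can be exercised without real waiting. In normal use, you would pass just get_state.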
The joiner node will do its duty and continue to retry the state transfer request. However, you may need to intervene to resolve the problem and get the node synchronized with the cluster.
Solution
Behind the scenes, the Group Communication module selects potential donors based on what it knows about the status of each node. These nodes have to be in the SYNCED state. Nodes that have the same gmcast.segment setting (a wsrep provider option) as the joiner are preferred. Otherwise, the joiner selects the first node in the list of available synced nodes. If the joiner can’t find a free node that shows as SYNCED, though, state transfer will not occur.
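The selection rules just described can be modeled in a few lines. This is a simplified Python sketch of those rules only, not Galera’s actual implementation. The NodeInfo type and pick_donor function are hypothetical, and the real Group Communication module takes more information into account than is shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeInfo:
    name: str      # the node's wsrep_node_name
    state: str     # e.g. "SYNCED", "DONOR", "JOINER"
    segment: int   # gmcast.segment from wsrep_provider_options


def pick_donor(joiner_segment: int, nodes: List[NodeInfo]) -> Optional[str]:
    """Simplified model of donor selection (hypothetical, for illustration).

    Only SYNCED nodes are eligible. A node in the joiner's own segment is
    preferred; otherwise the first eligible node in list order is used.
    Returns None when no node qualifies, i.e. state transfer cannot start.
    """
    synced = [n for n in nodes if n.state == "SYNCED"]
    for n in synced:
        if n.segment == joiner_segment:
            return n.name
    return synced[0].name if synced else None
```

The None case corresponds to the situation in this article: the joiner keeps retrying because no node qualifies as a donor.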
The first step to resolving this problem is to determine whether any of the other nodes are in fact synchronized. One way to check is to execute the following SQL statement on each node:
SHOW STATUS LIKE 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
When you find at least one synchronized node, get its name by executing SHOW VARIABLES for wsrep_node_name on each synchronized node, like so:
SHOW VARIABLES LIKE 'wsrep_node_name';
+-----------------+----------+
| Variable_name   | Value    |
+-----------------+----------+
| wsrep_node_name | galera-2 |
+-----------------+----------+
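If you gather these two values from every node, for example with a script that connects to each one in turn, filtering out the synchronized nodes is straightforward. In this minimal Python sketch, the results dictionary and its host labels are assumptions about how you collected the data. Only the variable names come from the statements above.

```python
from typing import Dict, List


def synced_node_names(results: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the wsrep_node_name of every node whose state comment is 'Synced'.

    results maps a host label (any string you use to reach the node) to the
    Variable_name -> Value pairs collected from that node with SHOW STATUS
    and SHOW VARIABLES. Hosts are visited in sorted order for stable output.
    """
    names = []
    for host in sorted(results):
        values = results[host]
        if values.get("wsrep_local_state_comment") == "Synced":
            names.append(values["wsrep_node_name"])
    return names
```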
Use those node names to set the donor on the joiner node. You can designate more than one. To do this, use the SET statement to set the wsrep_sst_donor variable to a comma-separated list of the synchronized nodes’ names, in order of preference. Here’s an example of how you might do that:
SET GLOBAL wsrep_sst_donor = 'galera-2,galera-5';
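This tells the joiner that one of the named nodes (i.e., galera-2 or galera-5) should be used as the donor. Because the joiner is the node that requests the state transfer, execute the statement on the joiner. A SET GLOBAL change applies only to the node on which it is executed and is not replicated to the other nodes. Incidentally, wsrep_sst_donor can also be set in the joiner’s configuration file, but that may not be necessary, since the state transfer failure might be a temporary problem. You can confirm the setting with SHOW VARIABLES: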
SHOW VARIABLES LIKE 'wsrep_sst_donor';
+-----------------+-------------------+
| Variable_name   | Value             |
+-----------------+-------------------+
| wsrep_sst_donor | galera-2,galera-5 |
+-----------------+-------------------+
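When the donor list comes from a script, it can help to build the SET statement programmatically. The following Python sketch is a hypothetical helper (donor_statement), not a MariaDB API. It rejects names that contain commas, quotes, or backslashes instead of trying to escape them, because a comma would split the name into two list entries anyway.

```python
from typing import Iterable


def donor_statement(names: Iterable[str]) -> str:
    """Build a SET GLOBAL wsrep_sst_donor statement from node names (hypothetical helper)."""
    # Drop surrounding whitespace and empty entries.
    cleaned = [n.strip() for n in names if n.strip()]
    if not cleaned:
        raise ValueError("at least one donor name is required")
    for n in cleaned:
        # Commas separate list entries; quotes and backslashes would break the literal.
        if any(ch in n for ch in ",'\"\\"):
            raise ValueError(f"unsupported character in node name: {n!r}")
    return "SET GLOBAL wsrep_sst_donor = '{}';".format(",".join(cleaned))
```

The output of the helper is meant to be executed on the joiner, the same as the hand-written statement.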
Once you’ve nominated nodes to be donors, and assuming the joiner has in fact joined the cluster, the state transfer should start immediately and complete without further problems. If it doesn’t, check for problems with your network connection. Also confirm that the required ports aren’t being blocked by SELinux or a firewall. In particular, make sure that the default Galera ports are open: 4567 for cluster replication traffic, 4568 for Incremental State Transfers (IST), and 4444 for State Snapshot Transfers (SST).
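A quick TCP reachability test from the joiner toward a donor, and in the opposite direction, can help narrow down a firewall problem. The following Python sketch assumes the default ports named above. check_ports is a hypothetical helper. A refused connection may simply mean that nothing is listening on that port at the moment, since the SST and IST listeners generally run only during a transfer. Treat a False result as a reason to investigate further, not as proof of a block.

```python
import socket
from typing import Dict, Iterable

# Default Galera ports: 4567 replication traffic, 4568 IST, 4444 SST.
DEFAULT_PORTS = (4567, 4568, 4444)


def check_ports(host: str, ports: Iterable[int] = DEFAULT_PORTS,
                timeout: float = 3.0) -> Dict[int, bool]:
    """Try a TCP connection to each port; True means something accepted it."""
    status = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[port] = True
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            status[port] = False
    return status
```

A connection attempt that times out, rather than being refused outright, is more suggestive of a firewall silently dropping packets.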
Related Documents