Order of Business

The order in which Galera cluster nodes must be started has long been an unjustly controversial topic. From time to time people ask on our mailing list how to restart the cluster, and from time to time they make mistakes, like the one described here.

It so happens that people tend to believe in miracles. Worse, they tend to expect miracles to actually happen. They tend to expect, for example, that a Galera cluster should automatically resync itself on startup to the "correct" position regardless of the order in which the nodes are started, and they are very disappointed to learn that this is not the case and that one needs to choose the first node of the cluster with extreme care. After all, joining a new node is completely automatic, no?

<intermission>
Considering that Galera cluster was designed to be a "highly available" solution, it is surprising how many people want to shut down the whole cluster. Automatic cluster startup was never a design goal here, because the cluster is never supposed to be shut down in the first place.
</intermission>

Let's see why such expectations are completely groundless. For simplicity let's consider a 2-node cluster consisting of Node1 and Node2. Under load. You shut down Node1 at seqno=5. Then you shut down Node2 - it happens at seqno=10. The cluster is no more. Node1 is missing transactions 6..10, which are present on Node2.
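On a clean shutdown each node records its position in the grastate.dat file in its data directory, so in this scenario Node1 is left at seqno 5 and Node2 at seqno 10. A minimal sketch of what Node1's grastate.dat might look like (the version field depends on the release; the UUID is the cluster state UUID that also appears in the log excerpts below):

# GALERA saved state
version: 2.1
uuid:    d5dbd473-302b-11e3-9298-37194a8fb0de
seqno:   5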

Then you want to start it again. You have two options: first start Node1 and then Node2, OR first start Node2 and then Node1. Let's start with the latter. If you do so, you will see something like this in the Node1 error log:

Case 1

131009 21:02:00 [Note] WSREP: State transfer required: Group state: d5dbd473-302b-11e3-9298-37194a8fb0de:13 Local state: d5dbd473-302b-11e3-9298-37194a8fb0de:5

"Group state" is the state (identified by global transaction ID) of the group this Node1 has joined, which up to now consisted of Node2. Note seqno 13 - Node2 has already processed some transactions, so Node1 is now missing transactions 6..13 from Node2, while transactions 1..5 are identical to those on Node2. So all we need is to update Node1 with the data it misses. State transfer will happen and Node1 will synchronize with Node2. You did the right thing.

Suppose you did the opposite and started Node1 first. Then, if you're lucky and the load on the cluster is not too heavy, you will see this in the Node2 error log:

Case 2

131009 21:22:53 [ERROR] WSREP: Local state seqno (10) is greater than group seqno (7): states diverged. Aborting to avoid potential data loss. Remove '/dev/shm/galera0/mysql/var//grastate.dat' file and restart if you wish to continue. (FATAL)

Notice that Node1 has already advanced to seqno 7. But transactions 6 and 7 on Node1 ARE NOT THE SAME as transactions 6 and 7 on Node2. Node1 now has some data that Node2 does not, and vice versa. The node states have diverged, the same way that source code trees can diverge in an SCM. There is no way to merge them automatically now. And just like any good SCM, instead of unconditionally overwriting either Node1's or Node2's state, Galera gives you a chance to reflect on the actions that led to this situation and decide whether you want to recover any transactions from Node2 or not (why I said "Node2" here remains an exercise for the curious reader). Just don't forget that while you're reflecting, Node1 keeps diverging from Node2 more and more. Your "luck" here is that the divergence was detected in time, no data was irrecoverably lost, and no downtime was caused.
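If, after reflection, you decide that the transactions present only on Node2 can be discarded, the recovery is essentially what the error message suggests: remove Node2's saved state and let it take a full state snapshot from the component formed by Node1. A rough sketch, with the data directory path as a placeholder:

# on Node2, only after deciding that its local transactions can be discarded
rm /path/to/datadir/grastate.dat

# restart Node2 pointing at Node1; with no saved state it will request a full SST
mysqld --wsrep-cluster-address="gcomm://node1" &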

Case 3

If you're not so lucky and Node1 manages to advance beyond Node2's seqno 10, there is no way to detect by seqno that the states have diverged. From Galera's point of view it is Case 1. (And this is exactly what happened here.) Now Node1 will try to deliver IST to Node2. Node2 will fail to apply some transaction because its state has diverged and will abort, and in some cases (depending on the complexity of the situation) Node1 may be left in a non-primary configuration - just like it happened in the case study above.
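If you do end up there, the status variables will at least tell you about it. A hedged sketch of how one might check for, and recover from, the non-primary configuration (pc.bootstrap is a provider option that forces a new Primary Component; use it only on the node you have decided to keep):

# check whether the node considers itself part of a Primary Component
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'"

# if it reports non-Primary and this is the node you want to keep:
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=yes'"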

The point being: it does not even matter much whether you detect the state divergence early or late - in either case, as soon as you start a cluster under load with the wrong node, you are facing diverged states and either data loss or a very time-consuming data recovery. And there is no power on earth or in heaven that can automatically "resynchronize" the nodes in this case.

And of course, if you shut down an idle cluster, then it does not matter in which order you start the nodes, since they all have the same position - and are therefore indistinguishable, and therefore cannot be ordered. So you can start them in any order, even if there is load during startup!

Now some wisenheimer optimists may come out and point out: hey, what if we shut down the cluster under load, but start it up without load? Then there will be no divergence and, hence, it should be possible to resynchronize the nodes to the most updated one when it connects. To them I shall say that it is a narrow corner case, too expensive to support correctly: the very fact that you're joining a more updated node to a less updated one already indicates that you're doing something wrong, or at best that there WAS load during cluster shutdown - and then what are the chances that there is no load during startup? In any case, while there is no load you always have an opportunity to restart the cluster with the right node.

And which node might that be, one may ask? Besides the long and exhaustive explanation here, there is a simple rule of thumb: the one that was the last to remain in the Primary Component. In the case study mentioned above this is of course NodeA - the last representative of the initial cluster after NodeB and NodeC went south.
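If you're not sure which node that was, comparing the recorded positions is a reasonable sanity check: the last node standing in the Primary Component will normally hold the highest seqno. A minimal sketch, with the data directory path as a placeholder:

# on each node: look at the saved position
cat /path/to/datadir/grastate.dat

# if seqno is -1 (e.g. after a crash), recover the position from InnoDB;
# it will be printed to the error log
mysqld --wsrep-recover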