Galera 3.19 comes with improvements to better handle the situation where the entire cluster needs to be restarted. In a previous blog post, we discussed one such feature: Safe-to-Bootstrap.
In this post, we will discuss gcache recovery, which allows the cluster to be restarted quickly after a shutdown, thus reducing the overall duration of planned or unplanned downtime.
The Gcache and Whole-Cluster Restarts
In previous versions of Galera, when a node starts up, its gcache is wiped clean. This means that the node cannot serve as an IST donor to other nodes that are also starting up at the same time. This is especially important when the entire cluster is being restarted. When the second node starts, it hopes to quickly rejoin the cluster via IST from the first node:
WSREP: State transfer required: Group state: 6c645ece-aff5-11e6-933a-ba70cd73bcf7:1234 Local state: 6c645ece-aff5-11e6-933a-ba70cd73bcf7:1230
WSREP: Gap in state sequence. Need state transfer.
Unfortunately, the first node is unable to satisfy the IST request because it has nothing in its gcache:
WSREP: IST first seqno 1 not found from cache, falling back to SST
WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '127.0.0.2:13007/rsync_sst' --socket '/home/philips/git/mysql-wsrep-bugs-5.6/mysql-test/var/tmp/mysqld.1.sock' --datadir '/home/philips/git/mysql-wsrep-bugs-5.6/mysql-test/var/mysqld.1/data/' --defaults-file '/home/philips/git/mysql-wsrep-bugs-5.6/mysql-test/var/my.cnf' --defaults-group-suffix '.1' '' --gtid '6c645ece-aff5-11e6-933a-ba70cd73bcf7:1''
So, a full-blown SST will happen for the second and every subsequent node, causing unnecessary delays when bringing up the entire cluster. With Galera 3.19, we can do much better.
If the gcache.recover option in wsrep_provider_options is set to yes, Galera will attempt to recover the gcache file to a usable state on startup rather than delete it, thus preserving the node's ability to serve as an IST donor. So, when starting the first node of the cluster, we can observe the following output in the log:
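As a minimal sketch, the option can be enabled in my.cnf (the gcache.size value here is illustrative, matching the 128 MB ring buffer shown in the log below, and is not required for recovery):

```ini
# my.cnf fragment (sketch): preserve and recover the gcache across restarts.
[mysqld]
wsrep_provider_options="gcache.recover=yes;gcache.size=128M"
```

Note that wsrep_provider_options is a single semicolon-separated string, so gcache.recover must be appended to any provider options you already set.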
[Note] WSREP: Recovering GCache ring buffer: version: 1, UUID: 6c645ece-aff5-11e6-933a-ba70cd73bcf7, offset: 1280
[Note] WSREP: GCache::RingBuffer initial scan (134217768 bytes)... 0.0% (0 bytes) complete.
[Note] WSREP: GCache::RingBuffer initial scan (134217768 bytes)... 100.0% (134217768 bytes) complete.
WSREP: Recovering GCache ring buffer: found gapless sequence 2-2
WSREP: GCache::RingBuffer unused buffers scan (272 bytes)... 0.0% (0 bytes) complete.
WSREP: GCache::RingBuffer unused buffers scan (272 bytes)... 100.0% (272 bytes) complete.
WSREP: GCache DEBUG: RingBuffer::recover(): found 0/1 locked buffers
WSREP: GCache DEBUG: RingBuffer::recover(): used space: 272/134217728
This node can now provide IST to the second node in the cluster when the second node is started:
WSREP: IST request: 6c645ece-aff5-11e6-933a-ba70cd73bcf7:1-2|tcp://127.0.0.1:13006
WSREP: async IST sender starting to serve tcp://127.0.0.1:13006 sending 2-2
WSREP_SST: [INFO] Bypassing state dump. (20161121 06:31:06.866)
All the remaining nodes in the cluster can also join via IST, drastically reducing the total time needed to bring up the entire cluster. Furthermore, the first node never has to become an SST donor, so it is never blocked or burdened by performing an SST for each of the other nodes in the cluster.
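To confirm that joiners actually took the IST path, you can look for the "Bypassing state dump" line from the excerpt above in the joiner's error log. A quick sketch (the log path and sample lines here are illustrative; point LOG at your node's actual error log):

```shell
#!/bin/sh
# Sketch: distinguish IST from SST in a Galera error log.
# "Bypassing state dump" (seen in the excerpt above) indicates IST was used.
# For demonstration, we write sample log lines to a temp file; in practice,
# set LOG to the node's error log, e.g. /var/log/mysqld.log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
WSREP: async IST sender starting to serve tcp://127.0.0.1:13006 sending 2-2
WSREP_SST: [INFO] Bypassing state dump.
EOF
if grep -q "Bypassing state dump" "$LOG"; then
  echo "joined via IST"
else
  echo "fell back to SST"
fi
rm -f "$LOG"
```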
Galera needs to read the entire gcache file twice during recovery. For the typical gcache sizes we see in production environments, this operation does not take nearly as much time as the remaining steps involved in restarting the cluster, and the overall benefit is substantial.
In Galera 3.19, gcache recovery is a best-effort operation, which will complete successfully in cases where the nodes were gracefully shut down prior to restarting. Even in the event of a power outage, a monitoring system can be used to shut down the nodes gracefully before power is lost.
In the case of certain hard crashes, such as a sudden power loss affecting the entire cluster, it is possible that Galera will not be able to recover the gcache sufficiently for it to serve IST. In that case, node startup proceeds as usual, and SST will happen when the second and subsequent nodes join the cluster.