Optimized State Snapshot Transfers in a WAN Environment

Introduction

Galera Cluster is a robust product that allows you to create geo-distributed database clusters, with nodes located in geographically separate data centers. WAN links tend to be slow while database sizes tend to grow, and the disparity can become painfully obvious when the entire dataset needs to be shipped over the network.

If a node joins the cluster either for the first time or after a period of prolonged downtime, it may need to obtain a complete snapshot of the database from some other node. This operation is called State Snapshot Transfer or SST, and is often reasonably quick in a LAN environment.

In a geo-distributed cluster, however, the dataset may need to travel over a slow WAN link. A transfer that takes seconds over a 10Gb network can take hours over a cable modem.

SST does not happen during the normal operation of the cluster, but it may be needed during an outage, which is already a stressful time for the DevOps team. During the SST, the joining node is not available to serve queries, and the donor node may be in a read-only state or have degraded performance.

Therefore, it is important to make sure the operation is as fast as possible or avoided altogether. In this post, we will describe how to achieve that.

A Note on Dataset Size

Even if you think your logical database size is manageable, you may be surprised to find that the MySQL data directory has grown way beyond your expectations. This affects the performance of SST methods that operate on physical files, such as rsync and xtrabackup-v2, so it is important to monitor the size of the directory as a matter of routine.
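
One way to keep an eye on this is to compare the on-disk size of the data directory with the logical size InnoDB reports for all tables. The following is a minimal sketch; /var/lib/mysql is an assumption, so substitute your actual datadir:

# On-disk size of the MySQL data directory
du -sh /var/lib/mysql

# Logical size of all tables as reported by InnoDB, in GB
mysql -e "SELECT ROUND(SUM(data_length + index_length)/1024/1024/1024, 2) AS logical_gb
          FROM information_schema.tables"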

Reasons for having a large physical data directory include:

  • Running certain DDLs on large tables or making table copies via CREATE ... SELECT, INSERT ... SELECT and similar statements. Unless innodb_file_per_table is in effect, such operations will cause the ibdata file to grow, and the file will not shrink even after the operation completes or the copied table is dropped.
  • Long-running InnoDB transactions. If there are open long-running (on the order of hours or days) transactions, InnoDB is forced to keep a snapshot of the data as it existed at the time each transaction was started, which can cause the shared ibdata tablespace files to grow even when innodb_file_per_table is in effect (a query for spotting such transactions is sketched after this list).
  • Large BLOBs stored in the database. Surprisingly often, applications are tempted to store binary data in the database (sometimes without the DBA expecting it).
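
To spot the long-running transactions mentioned above, the information_schema.innodb_trx table can be queried; a minimal sketch, with the one-hour threshold chosen purely as an example:

# List InnoDB transactions that have been open for more than one hour
mysql -e "SELECT trx_mysql_thread_id, trx_started, trx_query
          FROM information_schema.innodb_trx
          WHERE trx_started < NOW() - INTERVAL 1 HOUR"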

If your physical database size is consistently much larger than a logical dump produced by the mysqldump utility, you may wish to measure the performance of the mysqldump SST method against your particular dataset and give it a try.
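
If you decide to experiment, the method is selected in my.cnf; the following is a sketch, with sst_user and sst_password being placeholders for a privileged account that the donor node can use:

[mysqld]
wsrep_sst_method=mysqldump
wsrep_sst_auth=sst_user:sst_password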

Avoiding WAN Traffic Altogether

The best way to avoid having the entire dataset shipped over the WAN is to avoid SST happening in the first place by giving Galera a sufficient gcache.size. For example, a gcache size of 1GB means that a node will be able to rejoin the cluster without an SST as long as no more than 1GB worth of updates have happened on the other nodes while it was away. In this case, Galera substitutes Incremental State Transfer (IST) for SST and only the recent database updates are shipped over the network. The binlog_row_image option can also be set to minimal so that write-sets are smaller and more of them fit within the same gcache size.
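
For example, the following my.cnf fragment reserves roughly 1GB for the gcache and switches to minimal row images; the values are illustrative and should be sized against your actual write rate:

[mysqld]
wsrep_provider_options="gcache.size=1G"
binlog_row_image=minimal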

The second-best way is to have more than one node at each physical location. If another local node is available, it can act as the preferred donor if an SST is needed. By setting appropriate values for gmcast.segment, you communicate the physical layout of your cluster to Galera so that it can make smarter choices. Having multiple nodes at each physical location also provides other operational advantages and should therefore always be considered when planning a database cluster.
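
A sketch of the segment assignment, assuming two data centers; the segment numbers are arbitrary as long as all nodes in the same location share the same value:

# my.cnf on every node in the first data center
wsrep_provider_options="gmcast.segment=1"

# my.cnf on every node in the second data center
wsrep_provider_options="gmcast.segment=2"

If you also set gcache.size, combine both options in a single wsrep_provider_options string, separated by a semicolon.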

Use Compression

Compression is a very effective way to reduce the amount of data that needs to be transferred and has a considerable positive impact on the total time it takes to perform the SST.

Compression is enabled differently based on the SST method:

mysqldump

To get compression with mysqldump SST, add the following line to the [client] section of my.cnf:


[client]
compress=TRUE

rsync

rsync compression does not appear to be configurable via a configuration file, so a small patch to the default rsync SST script is in order so that the --compress option is passed to each rsync invocation:


diff --git a/scripts/wsrep_sst_rsync.sh b/scripts/wsrep_sst_rsync.sh
index d6dd04f..38ec8bd 100644
--- a/scripts/wsrep_sst_rsync.sh
+++ b/scripts/wsrep_sst_rsync.sh
@@ -158,7 +158,7 @@ then

         # first, the normal directories, so that we can detect incompatible protocol
         RC=0
-        rsync --owner --group --perms --links --specials \
+        rsync --compress --owner --group --perms --links --specials \
               --ignore-times --inplace --dirs --delete --quiet \
               $WHOLE_FILE_OPT "${FILTER[@]}" "$WSREP_SST_OPT_DATA/" \
               rsync://$WSREP_SST_OPT_ADDR >&2 || RC=$?
@@ -181,7 +181,7 @@ then
         fi

         # second, we transfer InnoDB log files
-        rsync --owner --group --perms --links --specials \
+        rsync --compress --owner --group --perms --links --specials \
               --ignore-times --inplace --dirs --delete --quiet \
               $WHOLE_FILE_OPT -f '+ /ib_logfile[0-9]*' -f '- **' "$WSREP_LOG_DIR/" \
               rsync://$WSREP_SST_OPT_ADDR-log_dir >&2 || RC=$?
@@ -199,7 +199,7 @@ then
         [ "$OS" == "Darwin" -o "$OS" == "FreeBSD" ] && count=$(sysctl -n hw.ncpu)

         find . -maxdepth 1 -mindepth 1 -type d -print0 | xargs -I{} -0 -P $count \
-             rsync --owner --group --perms --links --specials \
+             rsync --compress --owner --group --perms --links --specials \
              --ignore-times --inplace --recursive --delete --quiet \
              $WHOLE_FILE_OPT --exclude '*/ib_logfile*' "$WSREP_SST_OPT_DATA"/{}/ \
              rsync://$WSREP_SST_OPT_ADDR/{} >&2 || RC=$?

You can even copy the modified script under a different name, such as wsrep_sst_rsync_wan, make it executable, and set wsrep_sst_method to rsync_wan so that Galera can find it. If you name the script precisely wsrep_sst_rsync_wan, it will also disable rsync's --whole-file option, so that rsync only transfers the modified portions of files, further improving bandwidth efficiency.
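
A sketch of those steps, assuming the stock script is installed in /usr/bin (the path varies between distributions and packages):

cp /usr/bin/wsrep_sst_rsync /usr/bin/wsrep_sst_rsync_wan
# ...apply the --compress patch above to the copy...
chmod +x /usr/bin/wsrep_sst_rsync_wan

Then set wsrep_sst_method=rsync_wan in the [mysqld] section of my.cnf on all nodes.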

xtrabackup

The xtrabackup SST method can use third-party stream compression utilities, such as gzip, if you put the appropriate lines in my.cnf:


[SST]
compressor=gzip
decompressor='gzip -dc'
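
Note that the chosen compressor must be installed on both the donor and the joiner. If single-threaded gzip turns out to be the CPU bottleneck on a fast link, a parallel compressor can be substituted in the same way; the following assumes pigz is available on both nodes:

[SST]
compressor='pigz'
decompressor='pigz -dc'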

Conclusion

Sending a terabyte of data over a 1 megabit connection will never be fun, but with a proper understanding of SST and how to configure it, it is possible to have a smoothly running geo-distributed cluster even if the dataset is large and the network is slow.
