The Codership Knowledge Base

Article Counts: 10 Troubleshooting, 13 Best Practices; Recent Changes: 6 Revised, 2 New Articles

The Codership Documentation explains in detail how to deploy and administer a Galera Cluster. This section, the Codership Knowledge Base (KB) contains information on resolving problems and improving use of Galera Cluster. Here you’ll find troubleshooting articles and best practices articles.

In essence, the Documentation is how to use Galera assuming everything goes according to plan; the KB is for when things don’t go as expected or when they could be better.

In addition to this KB, you can also post questions on the Codership Forum. The community, as well as our staff monitor and respond to posts made there. If you need more immediate and personalized assistance, you can get a Support contract with us at Codership. For a quote on the cost of support, write us at info@codership.com or use our on-line form to send us a message.

Troubleshooting Articles

This is the Troubleshooting section of the Galera Knowledge Base (KB). It contains information on resolving problems you might experience with Galera Cluster. It includes articles on how to diagnose and address various performance and replication trouble. For articles related to performance and other ways to improve usage of Galera Cluster, see the next section, Best Practices.

Node Crashes during SST

Length: 463 words; Pub: Apr 2014; Revised: Nov 2019

If a Donor Node crashes while using rsync for a state transfers, the Joiner Node may stall with incomplete databases, and be inaccessibe. This article discusses how to resolve this.

SST Fails due to SQL Syntax Errors

Length: 789 words; Pub: Apr 2014; Revised: Nov 2019

When using mysqldump for state transfers for a new node, it may fail. In the database error log, there may be entries that says there were SQL Syntax errors.

Requested State Transfer Failed

Length: 649 words; Pub: Apr 2014; Revised: Nov 2019

When a new node joins a cluster, it will try to get a full copy of the databases from one of the other nodes. Sometimes its request will be ignored and no node is selected to be the donor.

Cluster Stalls on ALTER TABLE

Length: 519 words; Pub: Apr 2014; Revised: Oct 2019:

A cluster will sometimes stall when executing ALTER TABLE on a table with several columns and indexes. Depending on the number of rows, it can be a major drain on performance.

User Changes not Replicating

Length: 518 words; Pub: Apr 2014; Revised: Sep 2019

Changes to the mysql database (e.g., user name, host address) are not replicated on other nodes. This can cause problems for users, as well as frustrate the DBA.

AUTO_INCREMENT Increasing by Multiples

Length: 995 words; Pub: Oct 2019

Key columns using AUTO_INCREMENT will increase values by more than one with Galera. This can be confusing and worrisome, but this article explains why this is a good method.

Deadlock Found during a Transaction

Length: 887 words; Pub: Nov 2019

After starting a transaction involving an SQL statement that changes data, a deadlock error message will be returned indicating another node has already locked the same rows.

Commit Failed for Reason 3

Length: 326 words; Pub: Apr 2014; Revised: Nov 2019

When you have wsrep_debug turned ON, you may occasionally see a message noting that a commit has failed due to reason 3.

Multi-Master Conflicts

Length: 751 words; Pub: Apr 2014

Multi-master clusters have certain types of conflicts and can involve data inconsistencies among nodes. This article explains the nuances of Galera and how to prevent them.

Unknown Command Errors

Length: 971 words; Pub: Apr 2014; Revised: Nov 2019

Instead of receiving results from a valid SQL statement, an error message is returned saying, “Unknown Command” on one node. This error is returned for all queries on the node.

Best Practices Articles

Whereas the Troubleshooting section relates to handling problems with a cluster, this section of the KB provide information and guidance on improving the performance of a cluster and optimizing configuration of the nodes.

Group Commit

Length: 322 words; Pub: May 2019; Revised: Oct 2019

When several transactions try to commit simultaneously, GROUP COMMIT flushes them to the disk with a single system call, rather than a call for each, greatly improving performance.

Large Transactions

Length: 443 words; Pub: Apr 2015

Large transactions can lead to diminished performance. One reason for this is that the table may be reindexing and rescanning after each row is deleted.

Parallel Applier Threads

Length: 366 words; Pub: Jun 2015

Parallel threads don’t ensure better performance, but they don’t impair performance and they may actually increase synchronization of new nodes joining a cluster.

Slow Nodes

Length: 297 words; Pub: Apr 2014

By design, cluster performance won’t be higher than the slowest node. Even with only one node, its performance can be considerably slower compared to stand-alone mode.

WAN Latency

Length: 221 words; Pub: Jun 2015

When using Galera over a WAN, links can have exceptionally high latency. Check this by measuring the Round-Trip Time among nodes, and correct it by adjusting temporal parameters.

WAN Replication

Length: 161 words; Pub: Jun 2015

When running a cluster over a WAN, there may be transient network connectivity failures. To prevent this from partitioning the cluster, try increasing the keep-alive timeouts.

Write-Set Cache Size

Length: 467 words; Pub: Jun 2015

If you have storage issues, there are some guidelines to adjust the gcache.size parameter, properly. You could also change your state snapshot method.

Write-Set Caching during State Transfers

Length: 156 words; Pub: Jun 2015

Galera nodes don’t use much more memory than a stand-alone. The certification index and uncommitted write-sets drain some. Write-set caching during state transfers is the exception.

Single Master Setup

Length: 81 words; Pub: Jun 2015

It’s possible to designate one node in a cluster to handle all writes, to be the master to the other nodes. To do this, there are certain configuratoin requirements.

Multi-Master Setup

Length: 55 words; Pub: Jun 2015

The more masters in a cluster, the higher the probability of certification conflicts. This can lead to undesirable rollbacks and performance degradation.

Two-Node Clusters

Length: 880 words; Pub: Jun 2015; Revised: Nov 2019

There are potential problems with two-node clusters: A split-brain situation may occur. When one node fails, the remaining node becomes non-operational.

Setting an SELinux Policy

Length: 345 words; Pub: Jun 2015; Revised: Oct 2019

When you first installing a node, SELinux will prohibit cluster activities. You will need a SELinux policy so it will recognize cluster activities as legitimate.

Synch a Transaction First

Length: 994 words; Pub: May 2019; Revised: Nov 2019

When entering a transaction, you may need to ensure a previous transaction has been committed on the current node. Synchronization functions can make this easier to do.

Read Master

Traditional MySQL master-slave topology, but with Galera all “slave” nodes are capable masters at all times, it is just the application who treats them as slaves. Galera replication can guarantee 0 slave lag for such installations and due to parallel slave applying, much better throughput for the cluster.

Disaster Recovery

Disaster recovery is a sub-class of WAN replication. Here one data center is passive and only receives replication events, but does not process any client transactions. Such a remote data center will be up to date at all times and no data loss can happen. During recovery, the spare site is just nominated as primary and application can continue as normal with a minimal fail over delay.

Write scalability

Distributing writes across the cluster will harness the CPU power in slave nodes for better use to process client write transactions. Due to the row based replication method, only changes made during a client transaction will be replicated and applying such a transaction in slave applier is much faster than the processing of the original transaction. Therefore the cluster can distribute the heavy client transaction processing across many master nodes and this yields in better write transaction throughput overall.

Latency Eraser

With WAN replication topology, cluster nodes can be located close to cilents.Therefore all read & write operations will be super fast with the local node connection. The RTT related delay will be experienced only at commit time, and even then it can be generally accepted by end user, usually the kill-joy for end user experiences is the slow browsing response time, and read operations are as fast as they possibly can be.

WAN Clustering

Synchronous replication works fine over the WAN network. There will be a delay, which is proportional to the network round trip time (RTT), but it only affects the commit operation.