The Codership Knowledge Base¶
Article Counts: 10 Troubleshooting, 13 Best Practices; Recent Changes: 6 Revised, 2 New Articles
The Codership Documentation explains in detail how to deploy and administer a Galera Cluster. This section, the Codership Knowledge Base (KB) contains information on resolving problems and improving use of Galera Cluster. Here you’ll find troubleshooting articles and best practices articles.
In essence, the Documentation is how to use Galera assuming everything goes according to plan; the KB is for when things don’t go as expected or when they could be better.
In addition to this KB, you can also post questions on the Codership Forum. The community, as well as our staff monitor and respond to posts made there. If you need more immediate and personalized assistance, you can get a Support contract with us at Codership. For a quote on the cost of support, write us at info@codership.com or use our on-line form to send us a message.
Troubleshooting Articles¶
This is the Troubleshooting section of the Galera Knowledge Base (KB). It contains information on resolving problems you might experience with Galera Cluster. It includes articles on how to diagnose and address various performance and replication trouble. For articles related to performance and other ways to improve usage of Galera Cluster, see the next section, Best Practices.
Length: 463 words; Pub: Apr 2014; Revised: Nov 2019
If a Donor Node crashes while using rsync
for a state transfers, the Joiner Node may stall with incomplete databases, and be inaccessibe. This article discusses how to resolve this.
SST Fails due to SQL Syntax Errors
Length: 789 words; Pub: Apr 2014; Revised: Nov 2019
When using mysqldump
for state transfers for a new node, it may fail. In the database error log, there may be entries that says there were SQL Syntax errors.
Requested State Transfer Failed
Length: 649 words; Pub: Apr 2014; Revised: Nov 2019
When a new node joins a cluster, it will try to get a full copy of the databases from one of the other nodes. Sometimes its request will be ignored and no node is selected to be the donor.
Length: 519 words; Pub: Apr 2014; Revised: Oct 2019:
A cluster will sometimes stall when executing ALTER TABLE
on a table with several columns and indexes. Depending on the number of rows, it can be a major drain on performance.
Length: 518 words; Pub: Apr 2014; Revised: Sep 2019
Changes to the mysql
database (e.g., user name, host address) are not replicated on other nodes. This can cause problems for users, as well as frustrate the DBA.
AUTO_INCREMENT Increasing by Multiples
Length: 995 words; Pub: Oct 2019
Key columns using AUTO_INCREMENT
will increase values by more than one with Galera. This can be confusing and worrisome, but this article explains why this is a good method.
Deadlock Found during a Transaction
Length: 887 words; Pub: Nov 2019
After starting a transaction involving an SQL statement that changes data, a deadlock error message will be returned indicating another node has already locked the same rows.
Length: 326 words; Pub: Apr 2014; Revised: Nov 2019
When you have wsrep_debug
turned ON
, you may occasionally see a message noting that a commit has failed due to reason 3
.
Length: 751 words; Pub: Apr 2014
Multi-primary clusters have certain types of conflicts and can involve data inconsistencies among nodes. This article explains the nuances of Galera and how to prevent them.
Length: 971 words; Pub: Apr 2014; Revised: Nov 2019
Instead of receiving results from a valid SQL statement, an error message is returned saying, “Unknown Command” on one node. This error is returned for all queries on the node.
Best Practices Articles¶
Whereas the Troubleshooting section relates to handling problems with a cluster, this section of the KB provide information and guidance on improving the performance of a cluster and optimizing configuration of the nodes.
Length: 322 words; Pub: May 2019; Revised: Oct 2019
When several transactions try to commit simultaneously, GROUP COMMIT
flushes them to the disk with a single system call, rather than a call for each, greatly improving performance.
Length: 443 words; Pub: Apr 2015
Large transactions can lead to diminished performance. One reason for this is that the table may be reindexing and rescanning after each row is deleted.
Length: 366 words; Pub: Jun 2015
Parallel threads don’t ensure better performance, but they don’t impair performance and they may actually increase synchronization of new nodes joining a cluster.
Length: 297 words; Pub: Apr 2014
By design, cluster performance won’t be higher than the slowest node. Even with only one node, its performance can be considerably slower compared to stand-alone mode.
Length: 221 words; Pub: Jun 2015
When using Galera over a WAN, links can have exceptionally high latency. Check this by measuring the Round-Trip Time among nodes, and correct it by adjusting temporal parameters.
Length: 161 words; Pub: Jun 2015
When running a cluster over a WAN, there may be transient network connectivity failures. To prevent this from partitioning the cluster, try increasing the keep-alive timeouts.
Length: 467 words; Pub: Jun 2015
If you have storage issues, there are some guidelines to adjust the gcache.size
parameter, properly. You could also change your state snapshot method.
Write-Set Caching during State Transfers
Length: 156 words; Pub: Jun 2015
Galera nodes don’t use much more memory than a stand-alone. The certification index and uncommitted write-sets drain some. Write-set caching during state transfers is the exception.
Length: 81 words; Pub: Jun 2015
It’s possible to designate one node in a cluster to handle all writes, to be the primary to the other nodes. To do this, there are certain configuration requirements.
Length: 55 words; Pub: Jun 2015
The more primaries in a cluster, the higher the probability of certification conflicts. This can lead to undesirable rollbacks and performance degradation.
Length: 880 words; Pub: Jun 2015; Revised: Nov 2019
There are potential problems with two-node clusters: A split-brain situation may occur. When one node fails, the remaining node becomes non-operational.
Length: 345 words; Pub: Jun 2015; Revised: Oct 2019
When you first installing a node, SELinux will prohibit cluster activities. You will need a SELinux policy so it will recognize cluster activities as legitimate.
Length: 994 words; Pub: May 2019; Revised: Nov 2019
When entering a transaction, you may need to ensure a previous transaction has been committed on the current node. Synchronization functions can make this easier to do.
Read Primary
Traditional MySQL primary-replica topology, but with Galera all replica nodes are capable primaries at all times, it is just the application who treats them as replicas. Galera replication can guarantee 0 replica lag for such installations and due to parallel replica applying, much better throughput for the cluster.
Disaster Recovery
Disaster recovery is a sub-class of WAN replication. Here one data center is passive and only receives replication events, but does not process any client transactions. Such a remote data center will be up to date at all times and no data loss can happen. During recovery, the spare site is just nominated as primary and application can continue as normal with a minimal fail over delay.
Write scalability
Distributing writes across the cluster will harness the CPU power in replica nodes for better use to process client write transactions. Due to the row based replication method, only changes made during a client transaction will be replicated and applying such a transaction in replica applier is much faster than the processing of the original transaction. Therefore the cluster can distribute the heavy client transaction processing across many primary nodes and this yields in better write transaction throughput overall.
Latency Eraser
With WAN replication topology, cluster nodes can be located close to cilents.Therefore all read & write operations will be super fast with the local node connection. The RTT related delay will be experienced only at commit time, and even then it can be generally accepted by end user, usually the kill-joy for end user experiences is the slow browsing response time, and read operations are as fast as they possibly can be.
WAN Clustering
Synchronous replication works fine over the WAN network. There will be a delay, which is proportional to the network round trip time (RTT), but it only affects the commit operation.