.. meta:: :title: Galera Cluster Knowledge Base :description: :language: en-US :keywords: :copyright: Codership Oy, 2014 - 2024. All Rights Reserved. .. container:: left-margin .. container:: left-margin-top :doc:`The Library <../index>` .. container:: left-margin-content - :doc:`Documentation <../documentation/index>` .. cssclass:: here - :doc:`Knowledge Base <./index>` - :doc:`Training <../training/index>` .. cssclass:: sub-links - :doc:`Training Courses <../training/courses/index>` - :doc:`Tutorial Articles <../training/tutorials/index>` - :doc:`Training Videos <../training/videos/index>` - :doc:`FAQ <../faq>` - :ref:`search` .. container:: top-links - `Home `_ - :doc:`Docs <../documentation/index>` .. cssclass:: here - :doc:`KB <./index>` .. cssclass:: nav-wider - :doc:`Training <../training/index>` - :doc:`FAQ <../faq>` .. cssclass:: library-index .. _`kb`: ============================= The Codership Knowledge Base ============================= .. rst-class:: list-stats Article Counts: 10 Troubleshooting, 13 Best Practices; Recent Changes: 6 Revised, 2 New Articles :doc:`The Codership Documentation <../documentation/index>` explains in detail how to deploy and administer a Galera Cluster. This section, the Codership Knowledge Base (KB) contains information on resolving problems and improving use of Galera Cluster. Here you will find troubleshooting articles and best practices articles. In essence, the Documentation is how to use Galera assuming everything goes according to plan; the KB is for when things do not go as expected or when they could be better. In addition to this KB, you can also post questions on the `Codership Forum `_. The community, as well as our staff monitor and respond to posts made there. If you need more immediate and personalized assistance, you can get a Support contract with us at Codership. For a quote on the cost of support, write us at info@codership.com or use `our on-line form `_ to send us a message. .. _`kb-trouble`: ------------------------- Troubleshooting Articles ------------------------- This is the Troubleshooting section of the Galera Knowledge Base (KB). It contains information on resolving problems you might experience with Galera Cluster. It includes articles on how to diagnose and address various performance and replication trouble. For articles related to performance and other ways to improve usage of Galera Cluster, see the next section, :ref:`Best Practices `. .. _`kb-trouble-state-transfers`: .. container:: banner .. rst-class:: section-heading .. rubric:: State Transfers .. container:: list-col1 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Node Crashes during SST ` .. rst-class:: list-stats Length: 463 words; Pub: Apr 2014; Revised: Nov 2019 .. rst-class:: list-abstract If a :term:`Donor Node` crashes while using ``rsync`` for a state transfers, the :term:`Joiner Node` may stall with incomplete databases, and be inaccessibe. This article discusses how to resolve this. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`SST Fails due to SQL Syntax Errors ` .. rst-class:: list-stats Length: 789 words; Pub: Apr 2014; Revised: Nov 2019 .. rst-class:: list-abstract When using ``mysqldump`` for state transfers for a new node, it may fail. In the database error log, there may be entries that says there were SQL Syntax errors. .. container:: list-col2 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Requested State Transfer Failed ` .. rst-class:: list-stats Length: 649 words; Pub: Apr 2014; Revised: Nov 2019 .. rst-class:: list-abstract When a new node joins a cluster, it will try to get a full copy of the databases from one of the other nodes. Sometimes its request will be ignored and no node is selected to be the donor. .. container:: list-col2 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Gaps in GTID Sequence Numbers ` .. rst-class:: list-stats Length: 421 words; Pub: Oct 2024; Revised: Oct 2024 .. rst-class:: list-abstract GTID gaps may occur if ``gtid_next`` is set to a specific value, and a transaction fails. .. _`kb-trouble-sql-syntax`: .. container:: banner .. rst-class:: section-heading .. rubric:: Schema & SQL .. container:: list-col1 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Cluster Stalls on ALTER TABLE ` .. rst-class:: list-stats Length: 519 words; Pub: Apr 2014; Revised: Oct 2019: .. rst-class:: list-abstract A cluster will sometimes stall when executing ``ALTER TABLE`` on a table with several columns and indexes. Depending on the number of rows, it can be a major drain on performance. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`User Changes not Replicating ` .. rst-class:: list-stats Length: 518 words; Pub: Apr 2014; Revised: Sep 2019 .. rst-class:: list-abstract Changes to the ``mysql`` database (for example, user name, host address) are not replicated on other nodes. This can cause problems for users, as well as frustrate the DBA. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`auto-increment-multiples` .. rst-class:: list-stats Length: 995 words; Pub: Oct 2019 .. rst-class:: list-abstract Key columns using ``AUTO_INCREMENT`` will increase values by more than one with Galera. This can be confusing and worrisome, but this article explains why this is a good method. .. container:: list-col2 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`deadlock-found` .. rst-class:: list-stats Length: 887 words; Pub: Nov 2019 .. rst-class:: list-abstract After starting a transaction involving an SQL statement that changes data, a deadlock error message will be returned indicating another node has already locked the same rows. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`resolve-commit-failure` .. rst-class:: list-stats Length: 326 words; Pub: Apr 2014; Revised: Nov 2019 .. rst-class:: list-abstract When you have ``wsrep_debug`` turned ``ON``, you may occasionally see a message noting that a commit has failed due to reason ``3``. .. _`kb-trouble-splits-topology`: .. container:: banner .. rst-class:: section-heading .. rubric:: Splits & Topology .. container:: list-col1 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Multi-Primary Conflicts ` .. rst-class:: list-stats Length: 751 words; Pub: Apr 2014 .. rst-class:: list-abstract Multi-primary clusters have certain types of conflicts and can involve data inconsistencies among nodes. This article explains the nuances of Galera and how to prevent them. .. container:: list-col2 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Unknown Command Errors ` .. rst-class:: list-stats Length: 971 words; Pub: Apr 2014; Revised: Nov 2019 .. rst-class:: list-abstract Instead of receiving results from a valid SQL statement, an error message is returned saying, "Unknown Command" on one node. This error is returned for all queries on the node. .. _`kb-best`: ------------------------ Best Practices Articles ------------------------ Whereas the :ref:`Troubleshooting ` section relates to handling problems with a cluster, this section of the KB provide information and guidance on improving the performance of a cluster and optimizing configuration of the nodes. .. _`kb-best-performance`: .. container:: banner .. rst-class:: section-heading .. rubric:: Performance .. container:: list-col1 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Group Commit ` .. rst-class:: list-stats Length: 322 words; Pub: May 2019; Revised: Oct 2019 .. rst-class:: list-abstract When several transactions try to commit simultaneously, ``GROUP COMMIT`` flushes them to the disk with a single system call, rather than a call for each, greatly improving performance. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Large Transactions ` .. rst-class:: list-stats Length: 443 words; Pub: Apr 2015 .. rst-class:: list-abstract Large transactions can lead to diminished performance. One reason for this is that the table may be reindexing and rescanning after each row is deleted. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Parallel Applier Threads ` .. rst-class:: list-stats Length: 366 words; Pub: Jun 2015 .. rst-class:: list-abstract Parallel threads do not ensure better performance, but they do not impair performance and they may actually increase synchronization of new nodes joining a cluster. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Slow Nodes ` .. rst-class:: list-stats Length: 297 words; Pub: Apr 2014 .. rst-class:: list-abstract By design, cluster performance won't be higher than the slowest node. Even with only one node, its performance can be considerably slower compared to stand-alone mode. .. container:: list-col2 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`WAN Latency ` .. rst-class:: list-stats Length: 221 words; Pub: Jun 2015 .. rst-class:: list-abstract When using Galera over a WAN, links can have exceptionally high latency. Check this by measuring the Round-Trip Time among nodes, and correct it by adjusting temporal parameters. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`WAN Replication ` .. rst-class:: list-stats Length: 161 words; Pub: Jun 2015 .. rst-class:: list-abstract When running a cluster over a WAN, there may be transient network connectivity failures. To prevent this from partitioning the cluster, try increasing the *keep-alive* timeouts. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Write-Set Cache Size ` .. rst-class:: list-stats Length: 467 words; Pub: Jun 2015 .. rst-class:: list-abstract If you have storage issues, there are some guidelines to adjust the ``gcache.size`` parameter, properly. You could also change your state snapshot method. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Write-Set Caching during State Transfers ` .. rst-class:: list-stats Length: 156 words; Pub: Jun 2015 .. rst-class:: list-abstract Galera nodes do not use much more memory than a stand-alone. The certification index and uncommitted write-sets drain some. Write-set caching during state transfers is the exception. .. _`kb-best-topology`: .. container:: banner .. rst-class:: section-heading .. rubric:: Topology .. container:: list-col1 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Single Primary Setup ` .. rst-class:: list-stats Length: 81 words; Pub: Jun 2015 .. rst-class:: list-abstract It is possible to designate one node in a cluster to handle all writes, to be the primary to the other nodes. To do this, there are certain configuration requirements. .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Multi-Primary Setup ` .. rst-class:: list-stats Length: 55 words; Pub: Jun 2015 .. rst-class:: list-abstract The more primaries in a cluster, the higher the probability of certification conflicts. This can lead to undesirable rollbacks and performance degradation. .. container:: list-col2 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Two-Node Clusters ` .. rst-class:: list-stats Length: 880 words; Pub: Jun 2015; Revised: Nov 2019 .. rst-class:: list-abstract There are potential problems with two-node clusters: A split-brain situation may occur. When one node fails, the remaining node becomes non-operational. .. _`kb-best-security`: .. container:: banner .. rst-class:: section-heading .. rubric:: Security & SQL .. container:: list-col1 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Setting an SELinux Policy ` .. rst-class:: list-stats Length: 345 words; Pub: Jun 2015; Revised: Oct 2019 .. rst-class:: list-abstract When you first installing a node, SELinux will prohibit cluster activities. You will need a SELinux policy so it will recognize cluster activities as legitimate. .. container:: list-col2 .. rst-class:: rubric-2 list-sub-header .. rubric:: :doc:`Synch a Transaction First ` .. rst-class:: list-stats Length: 994 words; Pub: May 2019; Revised: Nov 2019 .. rst-class:: list-abstract When entering a transaction, you may need to ensure a previous transaction has been committed on the current node. Synchronization functions can make this easier to do. .. _`kb-best-use-cases`: .. container:: banner .. rst-class:: section-heading .. rubric:: Use Cases .. container:: list-col1 .. rst-class:: rubric-2 list-sub-header .. rubric:: Read Primary .. rst-class:: list-abstract Traditional MySQL primary-replica topology, but with Galera all replica nodes are capable primaries at all times, it is just the application who treats them as replicas. Galera replication can guarantee 0 replica lag for such installations and due to parallel replica applying, much better throughput for the cluster. .. rst-class:: rubric-2 list-sub-header .. rubric:: Disaster Recovery .. rst-class:: list-abstract Disaster recovery is a sub-class of WAN replication. Here one data center is passive and only receives replication events, but does not process any client transactions. Such a remote data center will be up to date at all times and no data loss can happen. During recovery, the spare site is just nominated as primary and application can continue as normal with a minimal fail over delay. .. rst-class:: rubric-2 list-sub-header .. rubric:: Write scalability .. rst-class:: list-abstract Distributing writes across the cluster will harness the CPU power in replica nodes for better use to process client write transactions. Due to the row based replication method, only changes made during a client transaction will be replicated and applying such a transaction in replica applier is much faster than the processing of the original transaction. Therefore the cluster can distribute the heavy client transaction processing across many primary nodes and this yields in better write transaction throughput overall. .. container:: list-col2 .. rst-class:: rubric-2 list-sub-header .. rubric:: Latency Eraser .. rst-class:: list-abstract With WAN replication topology, cluster nodes can be located close to cilents.Therefore all read & write operations will be super fast with the local node connection. The RTT related delay will be experienced only at commit time, and even then it can be generally accepted by end user, usually the kill-joy for end user experiences is the slow browsing response time, and read operations are as fast as they possibly can be. .. rst-class:: rubric-2 list-sub-header .. rubric:: WAN Clustering .. rst-class:: list-abstract Synchronous replication works fine over the WAN network. There will be a delay, which is proportional to the network round trip time (RTT), but it only affects the commit operation. .. toctree:: :maxdepth: 2 :hidden: node-crash-during-sst sst-fails-sql-syntax-error requested-state-transfer-failed stall-on-alter-table user-changes auto-increment-multiples deadlock-found resolve-commit-failure multi-master-conflicts error-unknown-command group-commit large-transactions parallel-applier-threads detecting-slow-node wan-latency wan-replication customizing-gcache-size gcache-during-state-transfers single-master-setup multi-master-setup two-node-clusters setting-selinux-policy sync-transaction-before-another .. |---| unicode:: U+2014 .. EM DASH :trim: