What is split-brain and how do I recover?


Last modified date: 2025-05-28

Applicable Products

  • QuTS hero h5.3.0 or later
  • High Availability Manager 1.0 or later

Definition and Cause

In a high-availability (HA) cluster, split-brain occurs when the two nodes lose communication with each other but both remain operational, and each node assumes the active node role. Because each node may then attempt to take control of shared resources at the same time, this can cause data inconsistency or corrupt the shared storage.

Common causes of split-brain include:

  • Network disconnection between the nodes in the cluster
  • Failure of the heartbeat connection
  • Unstable or inconsistent network paths

Solution

  1. Fix the network connection between the nodes.
    First, check and restore the network connection between the two nodes (for example, the heartbeat connection, switches, and network settings).
    Only after the connection is restored can the system proceed to verify the cluster status.
  2. Let the system automatically detect the split-brain status.
    1. Once the nodes reestablish communication, the system exchanges status information between the two nodes. 
    2. If both nodes have assumed the active node role, the system identifies it as a split-brain condition.
    3. To prevent data corruption, the system stops most services (such as SMB and iSCSI) and displays an error message indicating that split-brain has occurred.
  3. Recover from split-brain via High Availability Manager.
    1. Open High Availability Manager.
    2. Click Recover from Split-Brain to launch the recovery wizard.
      In the wizard, you can choose one of the following recovery options:
      • Option 1: Preserve data on one node only
        Select the node to keep, and the other node will be wiped and reset as the passive node. The system will then resynchronize the HA cluster.
        This option is suitable when you clearly know which node has the correct data and want to restore the cluster quickly.
      • Option 2: Preserve data on both nodes
        If both nodes contain important data, the system allows one node to resume services first, while the other node is removed from the cluster.
        After verifying and reconciling the data, you can manually rejoin the removed node to the cluster.
  4. Optional: Reduce the risk of future split-brain by enabling a quorum server.
    If the nodes disconnect from each other but remain connected to the network, a quorum server can still monitor the individual nodes and relay their statuses to each other. This helps reduce the chance of split-brain.
    You can configure a quorum server by going to High Availability Manager > Settings > Failover Policy > Quorum Server.
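
The detection rule in step 2 and the role of the quorum server in step 4 can be illustrated with a minimal Python sketch. This is a conceptual model only, not the actual logic of High Availability Manager: the names Role, ClusterState, classify, and passive_should_take_over are hypothetical, and the takeover rule is an assumption about how a quorum server's relayed status could be used.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class ClusterState(Enum):
    HEALTHY = "healthy"
    SPLIT_BRAIN = "split-brain"
    NO_ACTIVE_NODE = "no-active-node"


def classify(role_a: Role, role_b: Role) -> ClusterState:
    """Classify the cluster from the roles both nodes report after reconnecting."""
    active_count = [role_a, role_b].count(Role.ACTIVE)
    if active_count == 2:
        # Both nodes assumed the active role: the split-brain condition.
        return ClusterState.SPLIT_BRAIN
    if active_count == 1:
        return ClusterState.HEALTHY
    return ClusterState.NO_ACTIVE_NODE


def passive_should_take_over(heartbeat_ok: bool,
                             quorum_sees_active: Optional[bool]) -> bool:
    """Decide whether the passive node should promote itself (assumed rule).

    quorum_sees_active is None when no quorum server is configured.
    """
    if heartbeat_ok:
        return False  # The active node is reachable; nothing to do.
    if quorum_sees_active is None:
        # No third party to ask: take over, accepting the split-brain risk.
        return True
    # Take over only if the quorum server also reports the active node as down.
    return not quorum_sees_active


if __name__ == "__main__":
    print(classify(Role.ACTIVE, Role.ACTIVE).value)
    print(passive_should_take_over(heartbeat_ok=False, quorum_sees_active=True))
```

In this model, a lost heartbeat alone is enough to trigger a takeover when no quorum server is configured, which is the situation that can leave both nodes active. With a quorum server relaying that the active node is still running, the passive node stays passive.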
