What is split-brain and how do I recover?

Last modified date: 2025-05-28

Applicable Products

QuTS hero h5.3.0 or later
High Availability Manager 1.0 or later

Definition and Cause

In a high-availability (HA) cluster, split-brain occurs when the two nodes lose communication with each other but both keep running independently, and each assumes the active node role. Because each node may then try to take control of shared resources at the same time, this can cause data inconsistency or corrupt the shared storage.

Common causes of split-brain include:

Network disconnection between the nodes in the cluster
Failure of the heartbeat connection
Unstable or inconsistent network paths

Solution

Fix the network connection between the nodes.
First check and restore the network connection between the two nodes (for example, the heartbeat connection, switches, network settings).
Only after the connection is restored can the system proceed to verify the cluster status.
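As a rough aid for the connectivity check above, the following Python sketch tests whether a TCP port on the peer node accepts connections from an administrator's workstation. It is not part of High Availability Manager; the function name can_reach and any host or port you pass to it are assumptions chosen for illustration, and a successful TCP connection does not by itself prove that the heartbeat link is healthy.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper: the host and port are supplied by the caller and
    are not taken from any QNAP documentation.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

For example, can_reach("192.0.2.10", 443) would probe a placeholder address on HTTPS; substitute the management address and port that apply to your own nodes.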
Let the system automatically detect the split-brain status.
1. Once the nodes reestablish communication, the system exchanges status information between the two nodes.
2. If both nodes have assumed the active node role, the system identifies it as a split-brain condition.
3. To prevent data corruption, the system stops most services (such as SMB, iSCSI) and displays an error message indicating that split-brain has occurred.
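The detection rule in the steps above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not the actual QuTS hero implementation; the names Role, NodeStatus, is_split_brain, and services_to_stop are hypothetical, and the service list contains only the two examples named in the article.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


@dataclass
class NodeStatus:
    """Status a node reports once communication is reestablished (assumed shape)."""
    name: str
    role: Role


def is_split_brain(a: NodeStatus, b: NodeStatus) -> bool:
    """Split-brain: both nodes have assumed the active role."""
    return a.role is Role.ACTIVE and b.role is Role.ACTIVE


def services_to_stop(a: NodeStatus, b: NodeStatus,
                     services=("SMB", "iSCSI")) -> list:
    """Return the services to halt; empty when the cluster state is consistent."""
    return list(services) if is_split_brain(a, b) else []
```

With two nodes that both report Role.ACTIVE, services_to_stop returns the full service list; if either node reports Role.PASSIVE, it returns an empty list and nothing is stopped.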
Recover from split-brain via High Availability Manager.
1. Open High Availability Manager.
2. Click Recover from Split-Brain to launch the recovery wizard.
  In the wizard, you can choose one of the following recovery options:
  - Option 1: Preserve data on one node only
    Select the node to keep, and the other node will be wiped and reset as the passive node. The system will then resynchronize the HA cluster.
    This option is suitable when you clearly know which node has the correct data and want to restore the cluster quickly.
  - Option 2: Preserve data on both nodes
    If both nodes contain important data, the system allows one node to resume services first, while the other node is removed from the cluster.
    After verifying and reconciling the data, you can manually rejoin the removed node to the cluster.
Optional: Reduce the risk of future split-brain by enabling a quorum server.
If the nodes lose their direct connection to each other but remain connected to the network, a quorum server can still monitor each node and relay the nodes' statuses to each other. This reduces the chance of split-brain.
To configure a quorum server, go to High Availability Manager > Settings > Failover Policy > Quorum Server.
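To illustrate why a quorum server helps, here is a minimal Python sketch of the decision a passive node might make when its heartbeat to the peer is lost. It is a conceptual model under stated assumptions, not QNAP's failover policy: the function name passive_should_take_over is hypothetical, and the choice to stay passive when the quorum server is unreachable is an assumed conservative policy.

```python
from typing import Optional


def passive_should_take_over(heartbeat_ok: bool,
                             peer_alive_via_quorum: Optional[bool]) -> bool:
    """Decide whether the passive node should assume the active role.

    heartbeat_ok: True if the direct heartbeat to the peer is working.
    peer_alive_via_quorum: what the quorum server reports about the peer,
        or None if the quorum server itself cannot be reached.
    """
    if heartbeat_ok:
        # The nodes can talk directly, so no failover is needed.
        return False
    if peer_alive_via_quorum is None:
        # Assumption: with no witness available, this node may be the
        # isolated one, so it stays passive rather than risk split-brain.
        return False
    # Take over only when the witness confirms that the peer is down.
    return not peer_alive_via_quorum
```

The key case is passive_should_take_over(False, True): the heartbeat is down but the quorum server still sees the peer, so the passive node does not promote itself and both nodes do not end up active.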
