Applicable Products
- QuTS hero h5.3.0 or later
- High Availability Manager 1.0 or later
Definition and Cause
In a high-availability (HA) cluster, split-brain occurs when the two nodes lose communication with each other but both remain operational, and each node assumes the active node role. This can cause data inconsistency or corruption of shared storage, because both nodes may attempt to take control of shared resources at the same time.
Common causes of split-brain include:
- Network disconnection between the nodes in the cluster
- Failure of the heartbeat connection
- Unstable or inconsistent network paths
Solution
- Fix the network connection between the nodes.
First check and restore the network connection between the two nodes (for example, the heartbeat connection, switches, network settings).
Only after the connection is restored can the system proceed to verify the cluster status.
- Let the system automatically detect the split-brain status.
- Once the nodes reestablish communication, the system exchanges status information between the two nodes.
- If both nodes have assumed the active node role, the system identifies it as a split-brain condition.
- To prevent data corruption, the system stops most services (such as SMB, iSCSI) and displays an error message indicating that split-brain has occurred.
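The detection step above can be pictured with a minimal Python sketch. It is illustrative only: the `Role` enum, the `detect_split_brain` and `services_to_stop` functions, and the `PROTECTED_SERVICES` tuple are hypothetical names, and the code does not reflect the internal implementation of High Availability Manager.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


# Hypothetical list: the article names SMB and iSCSI only as examples of the
# services that are stopped when split-brain is detected.
PROTECTED_SERVICES = ("SMB", "iSCSI")


def detect_split_brain(role_a: Role, role_b: Role) -> bool:
    """Return True when both nodes report the active role after reconnecting."""
    return role_a is Role.ACTIVE and role_b is Role.ACTIVE


def services_to_stop(role_a: Role, role_b: Role) -> tuple:
    """Return the services to stop; empty when the node roles are consistent."""
    if detect_split_brain(role_a, role_b):
        return PROTECTED_SERVICES
    return ()
```

In this sketch, a cluster with one active and one passive node returns an empty tuple, while two active nodes trigger the service stop that protects the shared data.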
- Recover from split-brain via High Availability Manager.
- Open High Availability Manager.
- Click Recover from Split-Brain to launch the recovery wizard.
In the wizard, you can choose one of the following recovery options:
- Option 1: Preserve data on one node only
Select the node to keep. The other node is wiped and reset as the passive node, and the system then resynchronizes the HA cluster.
This option is suitable when you clearly know which node has the correct data and want to restore the cluster quickly.
- Option 2: Preserve data on both nodes
If both nodes contain important data, the system allows one node to resume services first, while the other node is removed from the cluster.
After verifying and reconciling the data, you can manually rejoin the removed node to the cluster.
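As a rough way to compare the two recovery options, the outcomes described above can be summarized in a short Python sketch. The `RecoveryPlan` class and `plan_recovery` function are hypothetical and exist only for illustration. The actual recovery is performed through the wizard, not through code.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    serving_node: str       # node that resumes services
    other_node: str         # the remaining node
    other_node_action: str  # what happens to the remaining node
    follow_up: str          # what happens next


def plan_recovery(option: int, serving_node: str, other_node: str) -> RecoveryPlan:
    """Summarize the outcome of a recovery option as described in the article."""
    if option == 1:
        # Option 1: only the data on the selected node survives.
        return RecoveryPlan(
            serving_node, other_node,
            "wiped and reset as the passive node",
            "the system resynchronizes the HA cluster",
        )
    if option == 2:
        # Option 2: data on both nodes is kept; the cluster is rebuilt later.
        return RecoveryPlan(
            serving_node, other_node,
            "removed from the cluster with its data preserved",
            "verify and reconcile the data, then manually rejoin the node",
        )
    raise ValueError("option must be 1 or 2")
```

The key difference is the fate of the second node: option 1 discards its data in exchange for a quick resynchronization, while option 2 keeps its data and leaves reconciliation and rejoining to the administrator.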
- Optional: Reduce the risk of future split-brain by enabling a quorum server.
If the nodes lose their connection to each other but remain connected to the network, a quorum server can still monitor each node and relay its status to the other node. This helps reduce the chance of split-brain.
You can configure a quorum server by going to High Availability Manager > Settings > Failover Policy > Quorum Server.
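The following Python sketch shows, under stated assumptions, why a quorum server helps. The function `should_become_active` and its takeover policy are hypothetical and are not taken from High Availability Manager. They only illustrate the general idea that a passive node can ask a third party whether the peer is still alive before promoting itself.

```python
from typing import Optional


def should_become_active(heartbeat_ok: bool,
                         peer_alive_per_quorum: Optional[bool] = None) -> bool:
    """Decide whether a passive node promotes itself (hypothetical policy).

    peer_alive_per_quorum is None when no quorum server is configured.
    """
    if heartbeat_ok:
        # The peer is directly reachable, so no takeover is needed.
        return False
    if peer_alive_per_quorum is None:
        # No second opinion: heartbeat loss alone triggers a takeover,
        # which is the situation in which split-brain can arise.
        return True
    # Take over only if the quorum server also reports the peer as down.
    return not peer_alive_per_quorum
```

With a quorum server reporting that the peer is still running, the node in this sketch stays passive even though the heartbeat is lost, so only one node holds the active role.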