Preventing Split-brain
This section describes strategies for preventing split-brain.
Coordinated Cluster Membership (Membership Arbitration)
When cluster nodes lose heartbeat from another node, the surviving nodes can:
- Assume the departed node is down; this presents data integrity risks.
- Take positive steps to ensure that the remaining nodes are the only surviving cluster members. This is known as membership arbitration.
Membership arbitration ensures that on any change in cluster membership, the surviving members determine if they are still allowed to remain running. In many designs, this is implemented with a quorum architecture.
A cluster using the quorum architecture requires a strict majority (more than half) of the configured nodes to be alive. For example, in a four-node cluster, if one node separates from the cluster due to an interconnect fault, the separated node cannot survive: when it receives notification that the cluster membership has changed, it determines that it is no longer part of a membership containing a majority of configured systems, and shuts itself down, typically by panicking the kernel.
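The majority rule above can be sketched as a simple check. This is a hypothetical illustration, not any vendor's implementation; the function name and parameters are invented for clarity:

```python
# Hypothetical sketch of a quorum check: a partition may keep running only if
# it can see a strict majority (more than half) of the configured nodes.
def has_quorum(visible_nodes: int, configured_nodes: int) -> bool:
    """Return True if this partition holds a strict majority of nodes."""
    return visible_nodes > configured_nodes // 2

# Four-node cluster: a three-node partition survives; the isolated node
# determines it lacks quorum and shuts itself down.
print(has_quorum(3, 4))  # True  -> partition keeps running
print(has_quorum(1, 4))  # False -> node panics
```

Note that a strict majority also handles an even split: in a 2-2 partition of a four-node cluster, neither side has more than half the votes, so neither side can claim the cluster on node count alone.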
Quorum is usually implemented with more than just systems in the quorum count; using disk devices as additional quorum members lends greater design flexibility. During a cluster membership change, the remaining nodes attempt to gain exclusive control of any disk devices designated as quorum disks.
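Counting quorum disks as extra votes can be sketched as follows. This is an assumed model for illustration only (the function and vote scheme are invented, not a specific product's algorithm): a partition adds one vote per quorum disk it wins exclusive control of.

```python
# Hypothetical vote-counting model: total votes = configured node votes plus
# quorum-disk votes; a partition survives only with a strict majority.
def partition_survives(node_votes: int, disk_votes_won: int,
                       total_node_votes: int, total_disk_votes: int) -> bool:
    total = total_node_votes + total_disk_votes
    return (node_votes + disk_votes_won) > total // 2

# Two-node cluster with one quorum disk (3 votes total): in a 1-1 split,
# the node that wins exclusive control of the disk holds 2 of 3 votes.
print(partition_survives(1, 1, 2, 1))  # True  -> disk winner survives
print(partition_survives(1, 0, 2, 1))  # False -> disk loser shuts down
```

This shows why quorum disks add flexibility: a two-node cluster cannot form a node-count majority after a split, but the race for the quorum disk breaks the tie deterministically.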
Membership arbitration is designed to ensure that departed members are really down. However, a membership arbitration scheme by itself is inadequate for complete data protection:
- A node can hang and, on return to processing, perform a write before determining that it should no longer be part of the running cluster.
- The same situation can arise if a node is dropped to the system controller/PROM level and subsequently resumed. The other systems assume the node has departed and perform membership arbitration to ensure they form an exclusive cluster. When the node comes back, it may write before determining that the cluster membership has changed to exclude it.
In both cases, membership arbitration/quorum leaves a potential data corruption hole open: if a node writes to shared storage before determining that it should no longer be in the cluster, and then panics, the result is silent data corruption.
What is needed to augment any membership arbitration design is a complete data protection mechanism to block access to disks from any node that is not part of the active cluster.
Data Protection Mechanism
A data protection mechanism in a cluster is a method to block access to the disk for any node that should not currently be accessing the storage. Typically this is implemented with a SCSI reservation mechanism. In the past, many vendors implemented data protection using the SCSI-II Reserve/Release mechanism.
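The key property of reservation-based protection is that the write is rejected at the storage layer, regardless of what the initiating node believes about cluster membership. A minimal conceptual model (the class and method names here are invented for illustration, not a real SCSI API):

```python
# Hypothetical model of reservation-based data protection: the disk itself
# rejects writes from any node that does not hold the reservation, closing
# the window left open by membership arbitration alone.
class ReservedDisk:
    def __init__(self):
        self.reservation_holder = None  # node currently holding the reservation

    def reserve(self, node: str) -> None:
        """Grant the reservation to a node (winner of membership arbitration)."""
        self.reservation_holder = node

    def write(self, node: str, data: bytes) -> int:
        """Accept the write only from the reservation holder."""
        if node != self.reservation_holder:
            raise PermissionError(f"{node}: reservation conflict, write rejected")
        return len(data)

disk = ReservedDisk()
disk.reserve("node1")
disk.write("node1", b"ok")       # succeeds: node1 holds the reservation
# disk.write("node2", b"stale")  # would raise PermissionError
```

Even if a hung node resumes and issues a stale write, the write fails at the disk because the surviving cluster members took the reservation during arbitration.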
SCSI-II reservations have several limitations in a cluster environment where storage technology has evolved from SCSI-attached arrays to Fibre Channel SANs.
- SCSI-II reservations are designed to allow one active host to reserve a drive, thereby blocking access from any other initiator. This design was adequate when simple JBOD and early arrays had one path to disk and were shared by two hosts. SCSI-II cannot support multiple paths to a disk from one host (such as VERITAS Dynamic Multi-Pathing) or more than one host being active at a time while a reservation is in place.
- SCSI-II reservations can be cleared by a SCSI bus reset. Any device on the bus can issue a reset and clear the reservation; it is then the responsibility of the reserving host to reclaim it. Problems arise in more complex environments, such as SAN-attached configurations, where multiple systems can reset a reservation, opening a significant data corruption window in which an excluded system may write data.