Network Partitions and the UNIX Boot Monitor
Most UNIX systems provide a console-abort sequence that enables you to halt and continue the processor. Continuing operations after the processor has stopped may corrupt data and is therefore unsupported by VCS.
When a system is halted with the abort sequence, it stops producing heartbeats. The other systems in the cluster consider the system failed and take over its services. If the system is later enabled with another console sequence, it continues writing to shared storage as before, even though its applications have been restarted on other systems.
VERITAS recommends disabling the console-abort sequence or creating an alias to force the "go" command to perform a reboot on systems not running I/O fencing.
Reconnecting the Private Network
When the last private network connection is lost in a cluster not running I/O fencing, the systems on each side of the network partition segregate into sub-clusters.
Reconnecting a private network after a cluster has been segregated causes HAD to stop and restart on some systems. The following rules determine which systems are affected:
- In a two-node cluster, the system with the lower LLT host ID stays running, and the system with the higher LLT host ID stops and restarts HAD.
- In a multi-node cluster, the largest running group stays running. The smaller groups stop and restart HAD.
- In a multi-node cluster that splits into two equal-size sub-clusters, the sub-cluster containing the lowest node number stays running, and the other sub-cluster stops and restarts HAD.
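
The rules above can be illustrated with a short Python sketch. This is only a model of the documented behavior, not HAD's actual implementation: the function names `surviving_subcluster` and `restarting_nodes` are hypothetical, each sub-cluster is represented as a set of integer node IDs, and applying the lowest-node-number tie-break to more than two equal-size groups is an assumption that the rules do not explicitly cover.

```python
def surviving_subcluster(subclusters):
    """Return the sub-cluster (as a frozenset of node IDs) that keeps running.

    Illustrative model only: the largest group wins, and a tie in size is
    broken in favor of the group containing the lowest node ID.
    """
    groups = [frozenset(g) for g in subclusters if g]
    if not groups:
        raise ValueError("at least one non-empty sub-cluster is required")
    # Sort key: bigger groups first (negative size), then lowest node ID.
    return min(groups, key=lambda g: (-len(g), min(g)))


def restarting_nodes(subclusters):
    """Return the sorted node IDs on which HAD would stop and restart."""
    winner = surviving_subcluster(subclusters)
    return sorted(n for g in subclusters for n in g if n not in winner)


if __name__ == "__main__":
    # Two-node cluster: the node with the lower ID stays running.
    print(restarting_nodes([{0}, {1}]))
    # Unequal split: the larger group stays running.
    print(restarting_nodes([{0}, {1, 2, 3}]))
    # Equal split: the group containing the lowest node number stays running.
    print(restarting_nodes([{0, 3}, {1, 2}]))
```

The two-node case needs no special handling in this model, because two groups of one node each are simply an equal-size split.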