Product: Cluster Server Guides
Manual: Cluster Server 4.1 User's Guide
Troubleshooting and Recovery for Global Clusters

This section describes the concept of disaster declaration and provides troubleshooting tips for configurations using global clusters.

Disaster Declaration

When a cluster in a global cluster transitions to the FAULTED state because it can no longer be contacted, failover executions depend on whether the cause was a split-brain, a temporary outage, or a permanent disaster at the remote cluster. If you choose to take action on the failure of a cluster in a global cluster, VCS prompts you to declare the type of failure.
You can select the groups to be failed over to the local cluster, in which case VCS brings the selected groups online on a node based on the group's FailOverPolicy attribute. It also marks the groups as being offline in the other cluster. If you do not select any service groups to fail over, VCS takes no action except implicitly marking the service groups as offline on the downed cluster.

Lost Heartbeats and the Inquiry Mechanism

The loss of internal and all external heartbeats between any two clusters indicates that the remote cluster is faulted, or that all communication links between the two clusters are broken (a wide-area split-brain). VCS queries clusters to confirm that the remote cluster to which heartbeats have been lost is truly down. This mechanism is referred to as inquiry.

In a two-cluster configuration, if a connector loses all heartbeats to the other connector, it must consider the remote cluster faulted. If there are more than two clusters and a connector loses all heartbeats to a second cluster, it queries the remaining connectors before declaring the cluster faulted. If the other connectors view the cluster as running, the querying connector transitions the cluster to the UNKNOWN state, a process that minimizes false cluster faults. If all connectors report that the cluster is faulted, the querying connector also considers it faulted and transitions the remote cluster state to FAULTED.

VCS Alerts

VCS alerts are identified by the alert ID, which consists of the following elements:
Alerts are generated in the following format:

alert_type-cluster-system-object

For example:

GNOFAILA-Cluster1-oracle_grp

This is an alert of type GNOFAILA generated on cluster Cluster1 for the service group oracle_grp.

Types of Alerts

VCS generates the following types of alerts.
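As an illustration of the alert ID format alert_type-cluster-system-object, the following Python sketch splits an ID into its fields. It is not part of VCS: the AlertId class and the parse_alert_id function are hypothetical names. Two assumptions are made. First, no field contains a hyphen. Second, a three-field ID such as GNOFAILA-Cluster1-oracle_grp is read as type, cluster, and object with the system field omitted, which matches the explanation of that example in this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertId:
    """Fields of an alert ID (hypothetical helper, not a VCS API)."""
    alert_type: str
    cluster: str
    system: Optional[str]  # None when the ID carries only three fields
    obj: str


def parse_alert_id(alert_id: str) -> AlertId:
    """Split an alert ID on hyphens; assumes no field contains a hyphen."""
    parts = alert_id.split("-")
    if len(parts) == 4:
        alert_type, cluster, system, obj = parts
        return AlertId(alert_type, cluster, system, obj)
    if len(parts) == 3:
        # Assumption: the system field is omitted, as in the guide's example.
        alert_type, cluster, obj = parts
        return AlertId(alert_type, cluster, None, obj)
    raise ValueError(f"unrecognized alert ID: {alert_id!r}")


if __name__ == "__main__":
    print(parse_alert_id("GNOFAILA-Cluster1-oracle_grp"))
```

Cluster, system, or group names that themselves contain hyphens would defeat this simple split, so treat it as a reading aid for the format rather than a robust parser.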
Some reasons why a global group may not be able to fail over to a remote cluster:
Managing Alerts

Alerts require user intervention. You can respond to an alert in the following ways:
An administrative alert persists as long as none of the above actions is performed and the VCS engine (HAD) is running on at least one node in the cluster. If HAD is not running on any node in the cluster, the administrative alert is lost.

Actions Associated with Alerts

This section describes the actions you can perform from the Java and the Web consoles on the following types of alerts:
Negating Events

VCS deletes a CFAULT alert when the faulted cluster goes back to the running state. VCS deletes the GNOFAILA and GNOFAIL alerts in response to the following events:
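Returning to the inquiry mechanism described under Lost Heartbeats and the Inquiry Mechanism, the decision a connector makes after losing all heartbeats to a remote cluster can be sketched in Python. This is a minimal illustration of the rules stated in that section, not VCS code. The function name resolve_remote_state and the RUNNING constant are hypothetical; FAULTED and UNKNOWN are the state names used in this guide. The guide does not say what happens when the other connectors disagree, so treating any non-unanimous answer as UNKNOWN is an assumption.

```python
from typing import Iterable

# FAULTED and UNKNOWN appear in the guide; RUNNING is an assumed label
# for a connector that still views the remote cluster as running.
RUNNING = "RUNNING"
FAULTED = "FAULTED"
UNKNOWN = "UNKNOWN"


def resolve_remote_state(peer_views: Iterable[str]) -> str:
    """Decide the state of a remote cluster after all heartbeats are lost.

    peer_views holds the views reported by the remaining connectors
    (empty in a two-cluster configuration).
    """
    views = list(peer_views)
    if not views:
        # Two clusters: there is no one to query, so the remote
        # cluster must be considered faulted.
        return FAULTED
    if all(view == FAULTED for view in views):
        # Every queried connector agrees that the cluster is down.
        return FAULTED
    # At least one connector still sees the cluster (or, by assumption,
    # the answers are mixed): avoid declaring a false fault.
    return UNKNOWN


if __name__ == "__main__":
    print(resolve_remote_state([]))
    print(resolve_remote_state([RUNNING, FAULTED]))
    print(resolve_remote_state([FAULTED, FAULTED]))
```

The conservative default reflects the stated purpose of the inquiry, which is to minimize false cluster faults: a cluster is marked FAULTED only when no queried connector contradicts that conclusion.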
VERITAS Software Corporation
www.veritas.com