Understanding the ALOM Watchdog Timer
ALOM features a watchdog mechanism to detect and respond to a system hang,
should one ever occur.
Note: The ALOM watchdog feature is not supported on all platforms.
For more information about whether your host system is supported, refer to the
Release Notes for your version of the ALOM software.
The ALOM watchdog is a timer that a user application resets continually for
as long as the operating system and the application are running. If the system
hangs, the application can no longer reset the timer. The timer then expires,
and ALOM performs the action that the user has configured, eliminating the
need for operator intervention.
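The reset-or-expire behavior described above can be sketched in a few lines of Python. This is only a toy model for illustration, not ALOM code: the class name WatchdogTimer, its methods, and the injectable clock are all hypothetical, and the action string simply stands in for whatever action the user has configured.

```python
import time


class WatchdogTimer:
    """Toy model of a watchdog timer (hypothetical; not the ALOM implementation)."""

    def __init__(self, timeout_s=60, action="xir", clock=time.monotonic):
        self.timeout_s = timeout_s   # seconds allowed between resets
        self.action = action         # placeholder name for the configured action
        self._clock = clock          # injectable so the model can be tested
        self._last_reset = None      # None until the application enables the timer

    def enable(self):
        """Start the timer; done by the user application once the host is up."""
        self._last_reset = self._clock()

    def reset(self):
        """Restart the countdown; a healthy application calls this periodically."""
        if self._last_reset is None:
            raise RuntimeError("watchdog is not enabled")
        self._last_reset = self._clock()

    def poll(self):
        """Return the configured action if the timeout has elapsed, else None."""
        if self._last_reset is None:
            return None
        if self._clock() - self._last_reset >= self.timeout_s:
            return self.action
        return None
```

In this sketch, a hung application simply stops calling reset(), so a later poll() reports the configured action without any operator involvement.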
To fully understand the ALOM watchdog timer, it helps to know the terms
associated with its components and how those components interact.
- If the ALOM watchdog timer is enabled, it automatically begins monitoring
the host server and detects when the host or application hangs or stops
running. The default timeout period is 60 seconds; in other words, if the
host system does not reset the ALOM watchdog timer within that 60-second
window, ALOM automatically performs the action that you specify in the
sys_autorestart variable. You can change the timeout period through the
sys_wdttimeout variable.
- If you set XIR (externally initiated reset) as the action that ALOM performs
when the watchdog timeout period is reached, ALOM attempts to XIR the host
system. If the XIR does not complete within the number of seconds set through
the sys_xirtimeout variable, ALOM forces the server to perform a hard reset
instead.
- The ALOM watchdog should be enabled by the user application after the host
system has booted. ALOM starts a timer to detect host boot failures as soon
as the host is powered on or reset, and it considers the host fully booted
once the ALOM watchdog timer is started. If the host fails to boot within the
time set through the sys_boottimeout variable, ALOM takes the action that you
specify through the sys_bootrestart variable. To keep the system from going
through an endless cycle of reboots, you can set the maximum number of
attempted reboots through the sys_maxbootfail variable. If the system reaches
that number of reboots, ALOM performs the action that you specify through the
sys_bootfailrecovery variable.
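The interaction between these variables can be sketched as a small decision model in Python. This is a sketch under stated assumptions rather than ALOM's actual logic: the variable names come from the text above, but the function names, the action strings ("xir", "hard_reset", "reset", "poweroff"), and every default value other than the 60-second watchdog timeout are placeholders chosen for illustration. It also assumes that the recovery action fires when the count of failed boots reaches sys_maxbootfail.

```python
from dataclasses import dataclass


@dataclass
class WatchdogConfig:
    # Only the 60-second watchdog timeout comes from the text;
    # all other defaults are illustrative placeholders.
    sys_autorestart: str = "xir"
    sys_wdttimeout: int = 60
    sys_xirtimeout: int = 900
    sys_boottimeout: int = 120
    sys_bootrestart: str = "reset"
    sys_maxbootfail: int = 3
    sys_bootfailrecovery: str = "poweroff"


def on_watchdog_timeout(cfg, xir_seconds):
    """Steps taken after the watchdog expires.

    xir_seconds is how long the XIR took, or None if it never completed.
    """
    if cfg.sys_autorestart != "xir":
        return [cfg.sys_autorestart]
    if xir_seconds is not None and xir_seconds <= cfg.sys_xirtimeout:
        return ["xir"]
    # The XIR did not finish in time, so fall back to a hard reset.
    return ["xir", "hard_reset"]


def on_boot_attempts(cfg, boot_times):
    """Steps taken across successive boot attempts.

    Each entry in boot_times is the number of seconds until the watchdog
    was started on that attempt, or None if it was never started.
    """
    actions = []
    failures = 0
    for seconds in boot_times:
        if seconds is not None and seconds <= cfg.sys_boottimeout:
            return actions + ["booted"]
        failures += 1
        if failures >= cfg.sys_maxbootfail:
            # Too many failed boots: stop retrying and apply the recovery action.
            return actions + [cfg.sys_bootfailrecovery]
        actions.append(cfg.sys_bootrestart)
    return actions
```

For example, with the placeholder defaults, three consecutive failed boots yield two restart actions followed by the recovery action, which is how the model avoids an endless reboot cycle.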
Managed system interface variables
Sample ALOM watchdog program