Hardware Monitoring: Errors Triggering Alerts on the ErrorCount Parameter for Physical Disks on Sun Solaris Systems
KB1083 - Aug 01, 2011
Description: On Sun Solaris systems, the ErrorCount parameter shows the number of errors that occurred on each physical disk since the last counter reset (either a reboot or a manual acknowledge and reset of the parameter). This article details which conditions are taken into account by the ErrorCount parameter and trigger an alarm.
Additional Keywords: Alerts, dd, ErrorCount, iostat
On Sun Solaris systems, the ErrorCount parameter shows the number of errors that occurred on each physical disk since the last counter reset (either a reboot or a manual acknowledge and reset of the parameter). This article details which conditions are taken into account by the ErrorCount parameter and trigger an alarm.
On Sun Solaris systems, the health of the physical disks is assessed by using two commands:
- “dd” to verify that the disk is accessible for read operations (this will be represented with the Status parameter)
- “iostat -E” to report the number of errors that occurred on the disk (this will be represented with the ErrorCount parameter)
The ErrorCount parameter represents the total number of errors since the last counter reset (i.e. a reboot, a PATROL Agent restart or RSM restart, a manual acknowledge and reset of the parameter through the KM Command). As new errors occur, the counter is increased one by one over time. The default alert threshold triggers an alarm as soon as an error is reported by “iostat”. The ErrorCount parameter then stays in the ALARM state until it is cleared by the operator. This also means that no new ALARM event is generated for additional errors that occur on the disk until the ErrorCount parameter is reset to zero.
Hardware Sentry KM monitoring the physical disks of a Sun Blade T6300 machine under Solaris 10.
The “iostat” command actually returns a series of error counters for various types of events. The table below summarizes which types of errors are taken into account by Hardware Sentry KM and BMC Performance Manager Express for Hardware in the ErrorCount parameter.
|“iostat” Error||Description||Triggers an ALARM on ErrorCount||Comments|
|Device Not Ready||The drive returned the sense key 0x2 (Not ready).||Yes|
|Media Error||The drive returned the sense key 0x3 (Medium Error).||Yes|
|No Device||The drive returned the sense key 0x6 (Unit Attention) or in the case of a removable device it must have happened multiple times.||Yes||CD and DVD drives are excluded from the monitoring|
|Hard Errors||All the above conditions are counted as Hard Errors with the addition of the SCSI sense key 0x4 (Hardware Error).||Yes|
|Illegal Request||The drive returned the sense key 0x5 (Illegal Request). This is treated as a Soft Error and kstat is also incremented.||No|
|Recoverable||The drive returned the sense key 0x1 (Recovered Error) to indicate that the last command completed successfully but some recovery action had to be taken by the drive. This is treated as a Soft Error and kstat is also incremented.||Yes||While recovered, an actual error occurred and needs therefore to be reported|
|Predictive Failure Analysis||The drive returned sense key 0x6 (Unit Attention) with and ASC (Additional Sense Code) of 0x5D indicating that the drive has exceeded its predictive failure threshold. This is treated as a Soft Error.||Yes||Because of the nature of the counter, it has been preferred to report the problem with the ErrorCount parameter rather than PredictedFailure (which triggers a WARNING) to allow the operator to acknowledge and reset the counter.|
|Transport Error||This error occurs for a number of reasons all related to being unable to transport the command. The command could have been timed out or reset or the host bus adapter unable to put the command onto the SCSI bus. This is neither a Soft nor a Hard Error.||Yes|
|Soft Errors||Illegal Requests, Recoverable errors and Predictive Failure Analysis are all summed up in the “Soft Errors” group, including various driver-related problems.||No|