Monitoring Memory Modules


A computer's main memory is as critical as its processors, since almost all processor operations involve memory. A single memory fault can cause a severe crash and may also corrupt data. On servers, the memory modules (the devices where memory data is stored) often include error-correcting (ECC) features, and sometimes an even more protective RAID5-like memory configuration. These features and configurations allow the memory modules to report failure statistics, predict failures, support hot replacement of a failed module, and so on.

Depending on the information and features provided by the motherboard and the memory modules, some or all of the ErrorCount, ErrorStatus, PredictedFailure, and Status parameters are displayed for each discovered memory module:

The ErrorCount parameter reports the number of errors that the memory module detected and then corrected. A steadily growing value means that the memory module is unreliable and may eventually encounter an error that it cannot correct, which could lead to a system crash.
The ErrorStatus parameter covers the same kind of errors as the ErrorCount parameter, but more accurately. Instead of displaying the number of errors, it triggers an alert once a specified threshold is reached. You do not need to set this threshold because it is calculated and set by the hardware agent.
The PredictedFailure parameter, reported by the memory modules, predicts failure by analyzing the trend in the number of errors detected and corrected by ECC. If this parameter goes into alarm, replacing the faulty memory module is highly recommended.
The Status parameter represents the current status of the memory module. An alert is triggered if the memory module reports a failure (in a RAID5-like configuration) or if it is missing after a computer reboot.
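
To illustrate what a "steadily growing" ErrorCount value might look like in practice, the following is a minimal Python sketch. It is not part of the product: the function name, the list of polled values, and the min_increases setting are all hypothetical, and the rule used here (the counter rises over several consecutive polls) is only one possible reading of "steadily growing".

```python
from typing import Sequence


def is_steadily_growing(samples: Sequence[int], min_increases: int = 3) -> bool:
    """Return True if the corrected-error counter rose over at least
    `min_increases` consecutive polling intervals.

    `samples` is a hypothetical list of ErrorCount values, oldest first.
    `min_increases` is an arbitrary illustrative setting, not a product default.
    """
    consecutive = 0
    # Compare each sample with the one collected just before it.
    for previous, current in zip(samples, samples[1:]):
        if current > previous:
            consecutive += 1
            if consecutive >= min_increases:
                return True
        else:
            # A flat or reset counter breaks the run of increases.
            consecutive = 0
    return False


if __name__ == "__main__":
    # Hypothetical ErrorCount values collected over five polls.
    print(is_steadily_growing([0, 0, 1, 3, 7]))
    print(is_steadily_growing([5, 5, 5, 5]))
```

With these assumptions, a module whose counter keeps rising from poll to poll is flagged, while one that recorded a few corrected errors in the past and then stayed flat is not.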

See Also

Component Monitoring