This section describes common hardware monitoring use cases.
Checking Disk Health
Most manufacturers use the “mean time to failure” (MTTF) to indicate the operational reliability of their products. But an advertised MTTF of 1,000,000 hours (or even more) is misleading. Recent studies show that the average annual replacement rate for hard disks is typically between 3% and 15%. This means that an organization with just 100 servers and approximately 300 hard disks will experience between 9 and 45 disk failures every year. Even when these failures do not impact the availability of the system, they will degrade its overall performance. An organization with 1000 servers will experience almost one disk failure each day of the year. Given their relatively short lifetime and the amount of data they store, disks are among the most critical devices to monitor.
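The arithmetic behind these figures can be checked with a short sketch. The function name and the ratio of three disks per server are illustrative assumptions taken from the example above; they are not part of Hardware Sentry.

```python
def expected_annual_failures(disk_count: int, annual_rate_percent: float) -> float:
    """Return the expected number of disk replacements per year."""
    return disk_count * annual_rate_percent / 100


# 100 servers with about 3 disks each, as in the example above (assumed ratio).
disks = 100 * 3
low = expected_annual_failures(disks, 3)    # 3% annual replacement rate
high = expected_annual_failures(disks, 15)  # 15% annual replacement rate
print(f"{disks} disks: between {low:.0f} and {high:.0f} failures per year")
```

With the same ratio, 1000 servers (about 3000 disks) give between 90 and 450 failures per year, which is consistent with roughly one failure per day.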
Monitoring disk health consists of monitoring disk controllers, physical disks, and logical disks.
Monitoring Disk Controllers
A disk controller is a card inside a computer that connects one or several physical disk drives to the computer and usually includes a write cache. To keep the cached writes from being lost if power is interrupted, the card must be equipped with a battery. It is thus recommended to closely monitor this battery.
- In the PATROL Console, double-click the BatteryStatus parameter to verify that the disk controller battery is able to support the controller in the event of a power failure.
- You can also double-click the ControllerStatus parameter to make sure the controller is not degraded or has not failed.
Monitoring Physical Disks
Physical disks must be monitored to avoid loss of data, unavailability and performance degradation. Contrary to other solutions, Hardware Sentry monitors the actual physical disks behind the controller and not only the disks as seen by the operating system. The application class used is MS_HW_PHYSICALDISK.
- In the PATROL Console, double-click the Status parameter to verify that the physical disk has not failed and is not degraded.
- If available, double-click the PredictedFailure parameter to verify that the physical disk is not likely to fail.
If a replacement is required, right-click the Physical Disk icon > InfoBox in the PATROL Console. All the relevant information (vendor name, model, serial number, etc.) is displayed.
Monitoring Logical Disks
RAID or advanced disk controllers expose several physical disks as a single logical disk to the operating system. The information required by administrators is mainly the logical disk’s status, its RAID type and size. To get that information:
- Right-click the Logical Disk icon > Status. A graph is displayed in the graph pane indicating the status of the RAID array (OK, Degraded, Failed).
- Right-click the Logical Disk icon > InfoBox. A dialog box is displayed with all the relevant information about the selected logical disk.
Diagnosing Data Center Electrical Issues
Understanding the basics of the electrical distribution system can help IT administrators diagnose data center electrical issues. Power is delivered to a data center by the local utility company. Once inside the building, the utility power goes to the Automatic Transfer Switch and to the uninterruptible power supply (UPS) units. These units clean the incoming utility power before passing it to power distribution units (PDUs) for conversion. Power is finally distributed to electrical outlets and servers. During distribution, power loss or instability can occur. It can be caused by voltage or AC/DC conversion, hence the importance of monitoring voltage and power supplies.
Monitoring voltage will help you verify the quality of your power supplies. In fact, if the power supply is weak, the voltage level on the motherboard will not be steady, which could lead to random crashes or to errors at the processor or memory levels.
In the console, double-click the Voltage parameter of the MS_HW_VOLTAGE application class. A graph is displayed in the graph pane.
Verify that the voltage is not fluctuating.
Higher voltage and fewer fluctuations in voltage will always guarantee better efficiency. If you notice voltage fluctuations, verify your electrical connections and wiring.
Monitoring Power Supplies
After hard drives, the power supply is the device that is most likely to fail. The proper functioning of this device highly depends on the quality of the data center electrical distribution. Indeed, voltage fluctuations are detrimental to power supplies: they can shorten their life span or impair them.
- In the console, double-click the Status parameter of the MS_HW_POWERSUPPLY application class to verify that the power supply has not failed and is not degraded. Several power supply failures may reveal an issue with the data center electrical distribution.
- You can also double-click the UsedCapacity parameter to make sure the power supply’s maximum power output is not reached.
Locating a Device in a System
Several IDs and names are provided by Hardware Sentry to help you locate a device in a system. These IDs can be obtained through the Hardware Inventory, the Hardware Health Report, and the InfoBox. Here are the existing IDs:
(internal) PATROL Object ID: This ID is used to create the instance of the class. This ID is retrieved in the PATROL events that other event management systems gather (like BMC Enterprise Manager, etc.). It is a concatenation of the corresponding connector file name, the host name and the internal device ID.
(internal) Device ID: This ID is used to uniquely identify the device in the underlying instrumentation layer. Depending on the instrumentation layer, the internal Device ID can take very different forms:
- a simple number
- a dot-separated list of small numbers (e.g. 0.0 or 1.1.2)
- an alphanumeric string (e.g. VCC_+12V)
- a complex WBEM or PnP ID (e.g. IDE\DiskHitachi_HTS721080G9SA00_________MC4OC10H\5&1f698b3f&0&0.0.0_0).
The internal Device ID is often a good way to locate the corresponding device. When it is not too long, it is displayed in the label of the icon representing that object in the PATROL Console. Otherwise, the internal Device ID is replaced by an arbitrary number in the icon label.
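As an illustration of how different these forms are, the sketch below sorts a Device ID into one of the four categories listed above. The function name, the category labels and the regular expressions are assumptions made for this example; Hardware Sentry does not necessarily classify IDs this way.

```python
import re


def classify_device_id(device_id: str) -> str:
    """Guess which of the documented forms an internal Device ID takes."""
    if re.fullmatch(r"\d+", device_id):
        return "simple number"
    if re.fullmatch(r"\d+(\.\d+)+", device_id):
        return "dot-separated numbers"
    # Letters, digits and a few separators such as "_", "+" and "-" (assumed set).
    if re.fullmatch(r"[A-Za-z0-9_+-]+", device_id):
        return "alphanumeric string"
    # Anything else is treated as a complex WBEM or PnP ID.
    return "complex ID"


for sample in ["3", "1.1.2", "VCC_+12V"]:
    print(sample, "->", classify_device_id(sample))
```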
Name of the icon: This ID represents the monitored component in the PATROL Console. It includes the aforementioned internal Device ID (when it is not too long) as well as some additional information such as the vendor and the model of the component. In some cases, the location of the device or sensor is clearly mentioned.
Additional “identifying information”: The availability of “additional identifying information” depends on the type of the monitored component and the underlying instrumentation layer. For example, when monitoring a Dell server, Hardware Sentry shows the DIMM slot location of each memory module and its part number. In case of a memory failure, this greatly helps administrators order a new memory module with the right part number and replace the faulty one with no risk of confusing it with a working module.
Managing Datacenter Heating and Cooling Issues
Even though datacenters and servers are cooled with air conditioning and fans, computing systems may still overheat. Because overheating leads to general instability, Hardware Sentry monitors the fans, when present, and all the temperature sensors. Automatic thresholds are set according to the manufacturers’ recommendations and the location of the temperature sensor. There is therefore no need to customize the alert thresholds for temperature sensors and fan speeds.
The temperature thresholds set by Hardware Sentry should not be customized or modified.
Managing datacenter heating and cooling issues consists of:
- Monitoring the datacenter temperature
- Monitoring the ambient temperature
- Verifying the temperature of all the computing systems.
Monitoring the Datacenter Temperature
You can prevent hardware overheating by properly cooling your server room. We recommend optimizing the room temperature to both prevent hardware overheating and avoid expensive energy bills.
In the console, double-click the DegreesBelowWarning parameter located under the Report icon (MS_HW_REPORT application class). A graph is displayed in the graph pane.
If the number of degrees before warning threshold is:
- high, your server room is colder than necessary. You can warm it up to reduce your energy bill without risking hardware overheating.
- close to zero, the temperature of your server room is probably too high.
- zero, the temperature of your server room is definitely too high. A warning will be triggered. It is then recommended to check that the air conditioning is working properly.
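The interpretation rules above can be sketched as a small function. The documentation does not define what counts as "high" or "close to zero", so the two margins used here (15 and 3 degrees) are purely hypothetical defaults that you would adapt to your own server room.

```python
def assess_room_temperature(degrees_below_warning: float,
                            high_margin: float = 15.0,  # hypothetical: "high"
                            low_margin: float = 3.0     # hypothetical: "close to zero"
                            ) -> str:
    """Interpret a DegreesBelowWarning value using assumed margins."""
    if degrees_below_warning <= 0:
        return "too hot"           # the warning threshold is reached
    if degrees_below_warning <= low_margin:
        return "probably too hot"  # very little headroom left
    if degrees_below_warning >= high_margin:
        return "too cold"          # cooling can be reduced to save energy
    return "acceptable"


for value in (0, 2, 8, 20):
    print(value, "->", assess_room_temperature(value))
```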
Monitoring the Cooling for the Servers
The temperature inside a server case is controlled with fans. To prevent the temperature from getting too high, verify that the fans are working properly:
- First verify that the fan is present and spinning by double-clicking the Present and Status parameters, respectively, of the MS_HW_FAN application class.
- Then double-click the Speed parameter to verify that the fan speed is neither too high nor too low.
- Finally, double-click the SpeedPercent parameter to make sure the fan has not reached its maximum speed.
Cooling might not be sufficient if the fan speed is too low or if maximum speed is reached.
A fan which is no longer spinning or is turning too slowly should be replaced immediately.
Verifying the Temperature of All Your Computing Systems
You can create a PATROL query to verify the temperature of all your computing systems. Once executed, this query displays a list of results. Depending on your configuration, this list contains either all the states or only the critical ones (e.g. Warning and Critical).
In the main menu bar of the PATROL Console, click Actions > New Query… to create a PATROL query
In the General tab:
- Enter the Query name and description
- In the Query Results Filter section, select Show Selected Objects and check the Parameters box
- In the Additional Filtering section, select the Enable Application Class Level Filtering and Enable Parameter Level Filtering options
Open the Application Class tab:
- In the Pattern matching section, select Like and type MS_HW_TEMPERATURE
Open the Parameter tab:
- If you only want to display the warning and critical results, choose Selected States and check the Warning and Critical boxes
- In the Pattern Matching section, select Like and type temperature
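To make the combined effect of these filters easier to picture, here is a minimal sketch that applies the same three criteria to an in-memory list. It does not use any PATROL API: the ParameterResult record, the sample hosts, and the case-insensitive substring behavior assumed for the Like operator are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class ParameterResult:
    host: str
    application_class: str
    parameter: str
    state: str  # "OK", "Warning" or "Critical"


def like(pattern: str, value: str) -> bool:
    """Assumed behavior of 'Like': case-insensitive substring match."""
    return pattern.lower() in value.lower()


def run_query(results: Iterable[ParameterResult],
              class_pattern: str,
              parameter_pattern: str,
              states: Optional[Set[str]] = None) -> List[ParameterResult]:
    """Keep results matching the class, the parameter and, optionally, the states."""
    return [
        r for r in results
        if like(class_pattern, r.application_class)
        and like(parameter_pattern, r.parameter)
        and (states is None or r.state in states)
    ]


# Hypothetical sample data.
sample = [
    ParameterResult("srv01", "MS_HW_TEMPERATURE", "Temperature", "OK"),
    ParameterResult("srv02", "MS_HW_TEMPERATURE", "Temperature", "Warning"),
    ParameterResult("srv02", "MS_HW_FAN", "Speed", "Critical"),
]
hot = run_query(sample, "MS_HW_TEMPERATURE", "temperature", {"Warning", "Critical"})
print([r.host for r in hot])
```

Passing no set of states corresponds to displaying all the states rather than only the Warning and Critical ones.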
Monitoring Network Traffic & Preventing Bottlenecks
Today, every application relies on the network, whose bandwidth and latency have a dramatic impact on the overall measured and perceived IT performance. Hardware Sentry monitors the connectivity and the quality of the connection. The incoming and outgoing traffic is also constantly measured against the available bandwidth to give system administrators short-term and long-term visibility into network capacity utilization.
Verifying the Connection
In the console, double-click the LinkStatus parameter under the Network Interfaces icon (MS_HW_NETWORK application class) to verify that the interface is actually plugged into the network. By default, a warning event is triggered when a connected network adapter is unplugged from the network. Unused adapters do not trigger unnecessary alarms.
The Status parameter only represents the status of the network adapter itself, not the connectivity.
- The LinkSpeed parameter reports the current speed of the connection. For Ethernet or fiber adapters, any change in this parameter means the quality of the connection is poor and needs to be fixed. By default, a warning event is triggered when the link speed drops from its current value to a lower one (from 1 Gb/s to 100 Mb/s, for example).
- The DuplexMode parameter reports whether the Ethernet connection is working in full-duplex or half-duplex mode.
- You can also double-click the ErrorPercent parameter to know the percentage of transmitted and received packets that were in error. A frequent high error percentage implies that there is a serious problem with the cable or the interface.
Monitoring the Transmission Rates
Transmission rates monitoring provides administrators with valuable information regarding the incoming and outgoing data managed by servers and switches.
In the console, double-click the parameters of the MS_HW_NETWORK application class:
- ReceivedBytesRate and TransmittedBytesRate to know the data flow exchanged
- ReceivedPacketsRate and TransmittedPacketsRate to know the number of received and/or transmitted packets
The graph displayed will help you identify the traffic demands and peak periods.
Understanding the Bandwidth Utilization
In the console, double-click the BandwidthUtilization parameter of the MS_HW_NETWORK application class to know the percentage of the available bandwidth in use. A graph is automatically displayed in the console’s graph pane.
The BandwidthUtilization parameter can ONLY be collected if LinkSpeed, DuplexMode, ReceivedBytesRate and TransmittedBytesRate are all properly collected.
Search for unexpected and random spikes in the network activity, which could signify either an attack or unauthorized transfer of data.
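The dependency of BandwidthUtilization on the four other parameters can be illustrated with a sketch. The formula below is an assumption, not the documented Hardware Sentry calculation: it supposes that a full-duplex link is limited by its busiest direction, while a half-duplex link shares its capacity between both directions. The function name and units are also chosen for this example only.

```python
from typing import Optional


def bandwidth_utilization(link_speed_mbps: Optional[float],
                          full_duplex: Optional[bool],
                          received_bytes_per_s: Optional[float],
                          transmitted_bytes_per_s: Optional[float]) -> Optional[float]:
    """Return the used bandwidth in percent, or None if an input is missing."""
    inputs = (link_speed_mbps, full_duplex, received_bytes_per_s, transmitted_bytes_per_s)
    if any(value is None for value in inputs) or link_speed_mbps <= 0:
        return None  # cannot be computed without all four inputs
    capacity_bits = link_speed_mbps * 1_000_000
    rx_bits = received_bytes_per_s * 8
    tx_bits = transmitted_bytes_per_s * 8
    # Assumed rule: busiest direction for full duplex, sum for half duplex.
    used_bits = max(rx_bits, tx_bits) if full_duplex else rx_bits + tx_bits
    return 100 * used_bits / capacity_bits


# 1 Gb/s full-duplex link receiving 62.5 MB/s and transmitting 12.5 MB/s.
print(bandwidth_utilization(1000, True, 62_500_000, 12_500_000))
```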
Predicting Hardware Failures
Even though end-users expect the IT environment they rely on to be flawless, it is common knowledge that hardware components are inherently prone to failure. In most cases, electronic components work as expected or fail completely and it is rare to be able to observe such components degrade slowly over time. That is the reason why Hardware Sentry only reports the overall status for most object classes as simply “OK” or “failed”.
However, some components are able to report their own degradation and warn the administrator of an imminent failure. Such components include:
- the processors (the more computation errors they detect and correct automatically, the more likely they will fail soon)
- the memory modules (an increasing number of fixed ECC errors means the module is nearing its end of life)
- the hard disks (many internal metrics are constantly analyzed by the disk itself to assess its own health and predict an imminent failure – this technology is standard and is called S.M.A.R.T.)
When such information is properly reported by the component and by the instrumentation layer of the system, Hardware Sentry triggers an event to warn the administrators that an imminent failure of a processor, a memory module or a physical disk is predicted.
In the console, double-click the PredictedFailure parameter of the processor, memory module or physical disk you are interested in. Their application classes are MS_HW_CPU, MS_HW_MEMORY and MS_HW_PHYSICALDISK, respectively.
A graph is displayed in the graph pane. If this parameter is equal to 1 and goes into alarm, the faulty hardware should be replaced.
Recognizing Instrumentation Failures
Hardware Sentry does not “talk” directly to the hardware to assess its health. An instrumentation layer bridges the physical components and the software solution. The instrumentation layer can be a driver, firmware, an API, an SNMP agent, a WBEM or WMI provider, or an out-of-band management card with its own IP address. In 99% of cases, the instrumentation layer is provided by the manufacturer along with the server.
Failures in the instrumentation layer itself do happen and it is important to discriminate these problems from actual hardware failures. System administrators will not take the same actions for an actual disk failure and for an SNMP agent crash, which in itself does not impact the availability and performance of the system.
The connection to each instrumentation layer is represented by an instance of the MS_HW_CONNECTOR class. The availability of the underlying technology is constantly verified and represented with the Status parameter. When an instrumentation failure occurs, an alarm is triggered by the MS_HW_CONNECTOR application class and all of the objects and parameters monitored through this instrumentation layer are taken OFFLINE. This alarm reveals a monitoring issue only, which must be handled by PATROL administrators.
The TestReport parameter gives further information, as it indicates whether the test completed successfully.
Example:
- Hardware Sentry is running on an HP ProLiant server with HP Insight Management Agent.
- Upon start up, Hardware Sentry detects the HP Insight Management Agent, and starts using the corresponding connector to discover the server hardware configuration and monitor the discovered devices.
- Additionally, Hardware Sentry creates an icon in the PATROL Console representing the HP Insight Management Agent - Server connector.
- Every 2 minutes, its Status parameter is updated.
- If, for some reason, the HP agent stops working, an alarm is raised on the Status parameter and the devices that were discovered through this connector are taken offline.
This connector monitoring mechanism helps PATROL administrators detect hardware agent failures. It also provides a high monitoring accuracy by not confusing errors that are actually encountered by devices with errors stemming from a monitoring-tool failure.
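The behavior described in this example can be modeled with a few lines of code. This is only a conceptual sketch: the Connector and Device classes are hypothetical and do not reflect how Hardware Sentry is implemented. Bringing devices back online once the agent responds again is also an assumption, since only the offline transition is described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Device:
    name: str
    online: bool = True


@dataclass
class Connector:
    name: str
    devices: List[Device] = field(default_factory=list)
    in_alarm: bool = False

    def update_status(self, agent_responding: bool) -> None:
        """Refresh the connector status, as done at each collection cycle."""
        self.in_alarm = not agent_responding
        for device in self.devices:
            # Devices discovered through a failed connector are taken offline,
            # so an agent crash is not mistaken for a hardware failure.
            device.online = agent_responding


connector = Connector(
    "HP Insight Management Agent - Server",
    [Device("Physical Disk 0"), Device("Fan 1")],
)
connector.update_status(agent_responding=False)
print(connector.in_alarm, [d.online for d in connector.devices])
```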
Connectors are grouped as follows:
- Connectors that are automatically detected by Hardware Sentry are grouped under an instance labeled Detected Connectors.
- Connectors that are manually selected (as for remotely monitored objects) are grouped under an instance labeled Selected Connectors.
To change the settings of connectors, right-click on Hardware > KM Commands > This System’s Settings > Connection, Credentials and Connectors…
Diagnosing Monitoring Problems
The Report instance of Hardware Sentry collects and reports information that provides an overview of the health and performance of a system. Some parameters can be set to trigger alerts when their value reaches an unacceptable or abnormal level, helping you quickly pinpoint a monitoring problem.
To quickly pinpoint a potential monitoring problem:
- Expand the Hardware Sentry KM instance > Report.
- Double-click a parameter to display the collected data in the graph pane.
Here are some examples of information you can get from the Hardware Sentry Report instance.
Hardware Discovery Status
A HardwareDiscoveryStatus parameter set to 1 (bottleneck) indicates that the agent is showing signs of load. When a bottleneck is reported over 2 or more collects in a row, the agent is overloaded.
Tip: 0 = Off and 2 = On; both are OK states.
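The "2 or more collects in a row" rule can be expressed as a short check over the history of collected values. The function and constant names are illustrative assumptions; only the status values (0, 1 and 2) and the rule itself come from the text above.

```python
from typing import Iterable

OFF, BOTTLENECK, ON = 0, 1, 2  # HardwareDiscoveryStatus values


def is_agent_overloaded(statuses: Iterable[int], min_consecutive: int = 2) -> bool:
    """Return True if a bottleneck was reported over consecutive collects."""
    run = 0
    for status in statuses:
        # Count consecutive bottlenecks; any other value resets the run.
        run = run + 1 if status == BOTTLENECK else 0
        if run >= min_consecutive:
            return True
    return False


print(is_agent_overloaded([ON, BOTTLENECK, ON, BOTTLENECK]))   # isolated bottlenecks
print(is_agent_overloaded([ON, BOTTLENECK, BOTTLENECK, OFF]))  # two in a row
```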
Counter-type parameters provide valuable information on the number of the components monitored on your system, such as CPUs, Enclosures, Blades, etc.
A Count-All parameter reporting fewer than 5 monitored components may indicate that the system’s monitoring is incomplete and requires investigation.
These parameters report the number of successful command executions performed on the monitored host, by protocol. Heavily solicited systems have the greatest potential for locking resources longer, and their capacity management may therefore require optimization.