Operate in TSPS

This section describes common hardware monitoring operations that can be performed with Hardware Sentry and its TSPS Component.

Monitoring Hardware Devices

Monitoring hardware devices is essential to system administrators, as it provides full visibility into the health of physical servers and triggers alarms when a problem occurs. Hardware Sentry discovers all the hardware devices (servers) within your monitored environment and leverages the capabilities of the TSPS Component to display a comprehensive list of your systems under the Hardware Devices page of the TSPS console. The TSPS Component extends the range of information reported within the TrueSight standard view with additional information, such as power consumption and server temperature data.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane.
  3. Browse the list to locate a specific device or search for it by hostname, vendor, model or serial number.
  4. Click the device you are interested in to display a detailed view of the device configuration, its associated monitors and events. Two additional tabs, Power and Cooling, provide respectively the power consumption and energy usage readings, and the heating margin and temperature levels for the selected device.

Since the monitors are fetched from the PATROL Agent, additional information collected by Hardware Sentry (Serial Number, Model, Manufacturer, etc.) is displayed under each monitor instance, along with the current status of each monitored component. When a Monitor is displayed in red (ALARM) or yellow (WARN), the corresponding device is currently failing or degraded (contrary to other views in TSPS, where the color represents the number of past events).

Checking Disks Health

Most manufacturers use the “mean time to failure” (MTTF) to indicate the operational reliability of their products. But an advertised MTTF of 1,000,000 hours (or even more!) is misleading. Recent studies show that the average annual replacement rate for hard disks is typically between 3% and 15%. This means that an organization with just 100 servers and approximately 300 hard disks will experience between 9 and 45 disk failures every year which, even if they do not impact the availability of the system, will degrade the overall performance. An organization with 1000 servers can experience almost one disk failure each day of the year. Given their relatively short lifetime and the amount of data they store, disks are among the most critical devices to monitor.
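
The arithmetic behind these figures can be checked with a short sketch. It assumes a constant failure rate (exponential model) to convert an MTTF into an annualized failure rate; the function names are illustrative and are not part of Hardware Sentry.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_from_mttf(mttf_hours: float) -> float:
    """Annualized failure rate implied by an MTTF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def expected_failures(disk_count: int, annual_rate: float) -> float:
    """Expected number of disk replacements per year for a given annual rate."""
    return disk_count * annual_rate


# Rate implied by an advertised MTTF of 1,000,000 hours (under 1% per year).
implied = afr_from_mttf(1_000_000)

# 100 servers with about 300 disks, at the observed 3% and 15% replacement rates.
low = expected_failures(300, 0.03)
high = expected_failures(300, 0.15)

print(f"implied by MTTF: {implied:.2%} per year")
print(f"observed range: {low:.0f} to {high:.0f} replacements per year")
```

Under this model, the advertised MTTF implies less than 1% of disks failing per year, well below the 3% to 15% replacement rates quoted above, which is why the MTTF figure alone is a poor planning input.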

Monitoring disk health consists of closely monitoring three types of components: disk controllers, physical disks and logical disks.

Monitoring Disk Controllers

A disk controller is a card inside a computer that connects one or several physical disk drives to the computer and provides a write cache. To keep cached writes from being lost if power is interrupted, the card must be equipped with a battery. It is thus recommended to closely monitor this battery.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, click a disk controller Monitor. Expand its sub-components to display the battery Monitors and verify that none of them is in a WARNING or ALARM state, and that they are therefore capable of supporting the controller in the event of a power failure.
  4. Click the Battery Monitor to be redirected to the Monitor detailed page where you can view the values of the Status and the Status Information parameters.
  5. Click the Status parameter or its value to display a graph showing the history of the battery status values.

It is also recommended to verify the controller health. Perform the same procedure on the Controller Monitor to make sure the controller is not degraded or has not failed.

Monitoring Physical Disks

Physical disks must be monitored to avoid loss of data, unavailability and performance degradation. Contrary to other solutions, Hardware Sentry monitors the actual physical disks (Hardware Physical Disk) behind the controller and not only the disks as seen by the operating system.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Physical Disks Monitor and verify that none of the disks has failed or is degraded. Errors are displayed in the Information column.
  4. Click a Physical Disk Monitor to be redirected to the Monitor detailed page where you can view the values for the Present, Status and Status Information parameters.

Monitoring Logical Disks

RAID or advanced disk controllers expose several physical disks as a single logical disk to the operating system. The information required by administrators is mainly the logical disk’s status, its RAID type and size. To get that information:

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, expand the Physical Disks Monitor and click a Logical Disk Monitor to be redirected to the Monitor detailed page.
  4. Click the Status parameter to display a graph showing the history of the status values for the logical disk.

Adopting a predictive approach to monitoring a datacenter includes closely monitoring the state and performance of key components such as processors and disk drives. The valuable indicators provided by Hardware Sentry help IT administrators implement and maintain a proactive monitoring strategy.

Diagnosing Datacenter Electrical Issues

Understanding the basics of the electrical distribution system can help IT administrators diagnose data center electrical issues. Power is delivered to a data center by the local utility company. Once inside the building, the utility power goes to the Automatic Transfer Switch and to the uninterruptible power supply (UPS) units. These units clean the incoming utility power before passing it to power distribution units (PDUs) for conversion. Power is finally distributed to electrical outlets and servers. During distribution, power loss or instability can occur, caused by voltage conversion or AC/DC conversion, hence the importance of monitoring voltage and power supplies.

To monitor voltage

Monitoring voltage helps verify the quality of power supplies. In fact, if the power supply is weak, the voltage level on the motherboard will not be steady, which could lead to random crashes or to errors at the processor or memory levels.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Voltages Monitor.
  4. Click the voltage value in the Information column to display a graph showing the voltage (mV) history. Use the arrow buttons located at the bottom of the graph to navigate through the time range.

Higher voltage and fewer voltage fluctuations generally result in better efficiency. If you notice voltage fluctuations, verify your electrical connections and wiring.
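
As a rough illustration of what "fluctuation" can mean in practice, the following sketch measures the peak-to-peak spread of a series of voltage readings (in mV) relative to their mean. The sample readings and the 5% tolerance are hypothetical values chosen for the example; they are not thresholds used by Hardware Sentry.

```python
from statistics import mean


def voltage_fluctuation_pct(readings_mv):
    """Peak-to-peak spread of the readings as a percentage of their mean."""
    avg = mean(readings_mv)
    return (max(readings_mv) - min(readings_mv)) / avg * 100.0


def is_unsteady(readings_mv, tolerance_pct=5.0):
    """True when the spread exceeds the (hypothetical) tolerance."""
    return voltage_fluctuation_pct(readings_mv) > tolerance_pct


# Hypothetical readings for a 12 V rail, in millivolts.
steady = [12010, 11990, 12000, 12005, 11995]
noisy = [12400, 11500, 12100, 11300, 12700]

for label, series in (("steady", steady), ("noisy", noisy)):
    print(label, f"{voltage_fluctuation_pct(series):.2f}%", is_unsteady(series))
```

A series flagged as unsteady is a prompt to check connections and wiring, as recommended above.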

To monitor power supplies

After hard drives, the power supply is the device that is most likely to fail. The proper functioning of this device highly depends on the quality of the data center electrical distribution. Indeed, voltage fluctuations are detrimental to power supplies: they can shorten their life span or lead to severe malfunction.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Power Supplies Monitor and verify that none of the power supplies has failed or is degraded.
  4. Click a Power Supply Monitor to be redirected to the Monitor detailed page where you can view the values of the Present, Status and Status Information parameters.
  5. Click the Present parameter to display a graph showing the power supply history.

Managing Datacenter Heating and Cooling Issues

Even though datacenters and servers are cooled down with air conditioning and fans, computing systems may overheat. Because overheating will lead to a general instability, Hardware Sentry monitors the fans, when present, and all the temperature sensors. Automatic thresholds are set according to the manufacturers’ recommendation and the location of the temperature sensor.

The temperature thresholds set by Hardware Sentry should not be customized or modified.

To monitor the datacenter temperature

  1. Log in to your TrueSight console.
  2. Select Green IT > Groups from the navigation pane and click a Group.
  3. The Ambient Temperature (°C) and the Heating Margin (°C) values are displayed at the top of the page.
  4. The current Heating Margin (Degrees Below Warning °C) per physical server is displayed in the table listing the Group’s devices.
  5. Click a physical server to be redirected to the device detailed page and click the Cooling tab to display the Degrees Below Warning graph.

Refer to the Group Details documentation to learn more about how the Green IT extension calculates each temperature indicator.
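
To make the "Degrees Below Warning" idea concrete, here is a minimal sketch. It assumes that the value for a sensor is the warning threshold minus the current temperature, and that a device's heating margin is the smallest such value across its sensors. This is an assumption made for illustration: the Group Details documentation remains the authoritative description of the calculation, and the sensor names and values are hypothetical.

```python
def degrees_below_warning(temperature_c, warning_threshold_c):
    """Distance (in °C) between a sensor reading and its warning threshold."""
    return warning_threshold_c - temperature_c


def heating_margin(sensors):
    """Smallest distance to warning across all sensors of a device.

    sensors: list of (name, temperature_c, warning_threshold_c) tuples.
    """
    return min(degrees_below_warning(t, w) for _, t, w in sensors)


# Hypothetical sensors: (name, current temperature, warning threshold).
sensors = [
    ("CPU 1", 62.0, 85.0),
    ("Ambient", 24.0, 40.0),
    ("Memory", 48.0, 70.0),
]

margin = heating_margin(sensors)
print(f"Heating margin: {margin:.1f} °C")
```

In this example, the ambient sensor is the closest to its threshold, so it determines the margin of the whole device.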

To monitor the fan performance of servers

The temperature inside a server case is controlled with fans. To prevent the internal temperature from getting too high, verify that the fans are operating properly.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Fans Monitor.
  4. Click a Fan Monitor to be redirected to the Monitor detailed page where you can view the values of the Present and Status parameters.
  5. Click the Speed parameter to display a graph showing the history of the speed values for the fan.

A fan which is no longer spinning or is turning too slowly should be replaced immediately.
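
The rule above can be expressed as a simple classification of fan speed readings. This is only a sketch: the 1000 RPM minimum is a hypothetical value, since the appropriate threshold depends on the fan model, and the fan names and readings are invented for the example.

```python
def fan_condition(speed_rpm, min_rpm=1000):
    """Classify a fan from its speed; min_rpm is a hypothetical threshold."""
    if speed_rpm <= 0:
        return "stopped"
    if speed_rpm < min_rpm:
        return "too slow"
    return "ok"


# Hypothetical speed readings, in RPM.
readings = {"Fan 1": 4200, "Fan 2": 650, "Fan 3": 0}

to_replace = [name for name, rpm in readings.items() if fan_condition(rpm) != "ok"]
print("Fans to replace:", to_replace)
```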

To monitor the temperature of devices

Monitoring temperature sensors helps identify which devices are operating properly and which are in poor or critical condition.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Temperatures Monitor and verify that no sensor is in an ALARM or WARNING state.
  4. Click the Temperature value (°C) in the Information column to display a graph showing the history of the temperature values.
  5. To get even more precise information about all the temperature sensors, click the Cooling tab of a physical server page to display:
    • a graph showing the Degrees Below Warning (°C) values for all the temperature sensors of the device
    • a graph showing the Temperature values collected for each sensor of the device.

Monitoring Network Traffic & Preventing Bottlenecks

Applications rely on the network, whose bandwidth and latency have a dramatic impact on the overall measured and perceived IT performance. Hardware Sentry monitors the connectivity and the quality of the network connections. The incoming and outgoing traffic is also constantly measured against the available bandwidth to give system administrators short-term and long-term visibility into network capacity utilization.

To verify a network connection

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Network Interfaces Monitor and verify that none of the network interfaces is in an ALARM or WARNING state.
  4. Click the Link Speed value of a network interface in the Information column to display a graph showing the history of the Link Speed values. Use the arrow buttons located at the bottom of the graph to navigate through the time range. For Ethernet or fiber adapters, any change in this parameter indicates that the quality of the connection is poor and needs to be improved. By default, a warning event is triggered when the link speed downgrades from its current value to a lower value (from 1Gb/s to 100Mb/s for example).
  5. Click a Network Interface Monitor to be redirected to the Monitor detailed page where all the network interface’s parameters are displayed.
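
The link speed downgrade condition described in step 4 can be illustrated with a short sketch that scans a history of Link Speed samples and reports each drop. The function name and the sample history are hypothetical; this does not reproduce the actual event logic of Hardware Sentry.

```python
def link_speed_downgrades(samples_mbps):
    """Return (index, previous, current) for every sample lower than the one before it."""
    pairs = zip(samples_mbps, samples_mbps[1:])
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(pairs, start=1)
        if cur < prev
    ]


# Hypothetical history in Mb/s: the link renegotiates from 1 Gb/s down to 100 Mb/s.
history = [1000, 1000, 100, 100, 1000]

for index, prev, cur in link_speed_downgrades(history):
    print(f"sample {index}: link speed dropped from {prev} to {cur} Mb/s")
```

Only the drop is reported; the later return to 1000 Mb/s is not a downgrade and produces no entry.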

To monitor the transmission rates

Monitoring transmission rates provides administrators with valuable information about the incoming and outgoing data managed by servers and switches, and helps identify traffic demands and peak periods.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Network Interfaces Monitor.
  4. Click a Network Interface to be redirected to the Monitor detailed page where all the network interface’s parameters are displayed.
  5. Click the Received Packets Rate, Transmitted Packets Rate, Received Bytes Rate and Transmitted Bytes Rate to view the transmission rates of your network interface.

To monitor the bandwidth utilization

Monitoring the bandwidth utilization of network interfaces can help identify unexpected and random peaks in the network activity, which could hide business critical issues, such as a network attack or unauthorized transfer of data.

The Bandwidth Utilization parameter can ONLY be collected if Link Speed, Duplex Mode, Received Bytes Rate and Transmitted Bytes Rate are all properly collected.

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Network Interfaces Monitor.
  4. Click a Network Interface to be redirected to the Monitor detailed page where all the network interface’s parameters are displayed.
  5. Click the Bandwidth Utilization parameter to display a graph showing the history of the bandwidth utilization values. Use the arrow buttons located at the bottom of the graph to navigate through the time range.
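
The dependency of Bandwidth Utilization on Link Speed, Duplex Mode and the two byte rates can be illustrated as follows. The formula is an assumption made for this sketch (busiest direction for full duplex, sum of both directions for half duplex) and is not taken from the Hardware Sentry implementation; the function and parameter names are hypothetical.

```python
from typing import Optional


def bandwidth_utilization_pct(
    link_speed_mbps: Optional[float],
    full_duplex: Optional[bool],
    rx_bytes_per_s: Optional[float],
    tx_bytes_per_s: Optional[float],
) -> Optional[float]:
    """Utilization in percent, or None when any required input is missing."""
    inputs = (link_speed_mbps, full_duplex, rx_bytes_per_s, tx_bytes_per_s)
    if any(value is None for value in inputs) or link_speed_mbps <= 0:
        return None

    capacity_bits = link_speed_mbps * 1_000_000
    rx_bits = rx_bytes_per_s * 8
    tx_bits = tx_bytes_per_s * 8

    # Assumption: full duplex carries each direction on its own channel,
    # while half duplex shares one channel between both directions.
    used_bits = max(rx_bits, tx_bits) if full_duplex else rx_bits + tx_bits
    return used_bits / capacity_bits * 100.0


# Hypothetical samples.
gigabit = bandwidth_utilization_pct(1000, True, 50_000_000, 25_000_000)
half = bandwidth_utilization_pct(100, False, 5_000_000, 2_500_000)
missing = bandwidth_utilization_pct(1000, None, 50_000_000, 25_000_000)

print(gigabit, half, missing)
```

The last call shows the situation described in the note above: when one of the four inputs (here, Duplex Mode) is not collected, no utilization value can be computed.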

Predicting Hardware Failures

Even though end-users expect the IT environment they rely on to be flawless, it is common knowledge that hardware components are inherently prone to failure. In most cases, electronic components work as expected or fail completely and it is rare to be able to observe such components degrade slowly over time. That is the reason why Hardware Sentry only reports the overall status for most object classes as simply “OK” or “Failed”.

However, some components are able to report their own degradation and warn the administrator of an imminent failure. Such components include:

  • the processors (the more computation errors they detect and correct automatically, the more likely they will fail soon).
  • the memory modules (an increasing number of fixed ECC errors means the module is nearing its end of life).
  • the hard disks (many internal metrics are constantly analyzed by the disk itself to assess its own health and predict an imminent failure – this technology is standard and is called S.M.A.R.T.).

When such information is properly reported by the component or the instrumentation layer of the system itself, Hardware Sentry will trigger an event to warn the administrators that an imminent failure of a processor, a memory module or a physical disk is likely to occur.
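
The idea that a growing number of corrected errors signals an approaching failure can be sketched as a simple trend check. This is purely illustrative: Hardware Sentry relies on what the component or the instrumentation layer reports, and the function name, the three-interval rule and the sample counts below are hypothetical.

```python
def errors_increasing(corrected_error_counts, min_increases=3):
    """True when the cumulative count rose in min_increases consecutive intervals."""
    streak = 0
    pairs = zip(corrected_error_counts, corrected_error_counts[1:])
    for prev, cur in pairs:
        # Extend the streak on a rise, reset it otherwise.
        streak = streak + 1 if cur > prev else 0
        if streak >= min_increases:
            return True
    return False


# Hypothetical cumulative corrected ECC error counts for two memory modules.
healthy = [2, 2, 2, 3, 3, 3]
failing = [2, 4, 9, 17, 30]

print("healthy module flagged:", errors_increasing(healthy))
print("failing module flagged:", errors_increasing(failing))
```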

To monitor potential hardware failures

  1. Log in to your TrueSight console.
  2. Select Monitoring > Hardware Devices from the navigation pane and click a physical server.
  3. From the Monitors list, locate the Processors Monitor.
  4. Click a Processor Monitor to be redirected to the Monitor detailed page where all the processor’s parameters are displayed.
  5. Click the Predicted Failure parameter to display a graph showing the history of the predicted failure values. If this parameter shows a value of 1 and goes into alarm, the faulty hardware should be replaced.