Hardware Sentry KM for PATROL

Release Notes for v1.8.00

What's New

New Supported Platforms

Platform

Monitored Components

Brocade Embedded SAN Switches in DELL Blade and HP BladeSystem

FC ports (status, link status, link speed, error percentage, traffic in MB/s), Ethernet ports (status, link status, link speed, error percentage, traffic in MB/s), temperatures, fans and power supplies.

EMC VNX and VNXe

Disks, storage processors, FC ports, Ethernet ports, fans, power supplies and batteries

Emulex HBA on Solaris

Status, link status, link speed and traffic (in packets/s and MB/s).

Note: The connector relies on the hbacmd command line utility.

HP ProLiant running VMware ESX or Linux (WBEM)

CPUs, memory, network cards, HBA, disks, temperature, fans, voltages and power supplies.

Note: A new connector adds support for the WBEM version of the HP Insight Management Agent, which runs on Linux and replaces the native hardware monitoring agent on VMware ESX.

HP StorageWorks MSA 2000 and P2000

Disks, logical disks, network cards, FC ports, temperatures, fans, voltages and power supplies

QLogic HBA on Solaris

Status, link status and link speed.

Note: The connector relies on the scli command line utility.

VMware ESX5i

CPUs, memory, network cards, HBA, disks, temperatures, fans, voltages and power supplies.

Application Classes and Parameters

New ErrorCount-type Parameters for the MS_HW_PHYSICALDISK Application Class: The following parameters have been added to this Application Class to provide a more detailed breakdown of the ErrorCount values:
DeviceNotReadyErrorCount (number of times the disk has been reported not ready)
MediaErrorCount (number of media errors reported by the disk)
NoDeviceErrorCount (number of no device errors reported on the bus)
HardErrorCount (number of hard errors reported by the driver)
IllegalRequestErrorCount (number of illegal requests reported by the driver; no alert is triggered by default)
RecoverableErrorCount (number of recoverable errors reported by the driver; no alert is triggered by default)
TransportErrorCount (number of transport errors reported by the bus)

Changes and Enhancements

Supported Platforms

Platform

Changes/Improvements

All Linux systems with SNMP enabled

Ethernet ports are better identified with their device name (e.g. eth0) instead of an SNMP-based number.

Brocade SAN Switches

Added the monitoring of the Control Processors, Switch Blades and CR Switching Blades to facilitate the monitoring of large Brocade SAN Switches with blade modules.

Ethernet ports are better identified with their device name (e.g. eth0) instead of an SNMP-based number.

Cisco MDS SAN Switches

Added the monitoring of the x-bar modules as a separate "Blade" with an overall status

Cisco UCS

Cisco UCS Blade servers in Cisco UCS chassis show their customized "user label" in the console.

The actual location of Cisco UCS blade servers in the chassis is now shown in the Infobox of the corresponding instance, instead of in the name of the instance itself. This information is also included in the event that is triggered upon a failure.

The objects representing Cisco UCS blade servers are now identified by their actual model names instead of a product code.

EMC Clariion, Symmetrix, V-Max, VNX

Components have more meaningful identifiers.

Components are now always properly attached to the appropriate enclosures.

Storage Processors, Batteries and Ethernet port monitoring added.

Queries and code have been optimized to allow more and larger disk arrays to be monitored from the same PATROL Agent.

IPMI-enabled Servers (including Cisco UCS, IBM xSeries, Hitachi ComputeBlade, Sun x86 and x64)

Object naming has been improved to better identify the monitored device.

McData SAN Switches

Added the monitoring of Fans, Backplanes, Control Processor Cards, Serial Crossbars, UPM Cards.

Port numbering in SAN switches monitored through the Fibre Alliance connector reflects accurately the numbering in their corresponding administration interface.

Ethernet ports are better identified with their device name (e.g. eth0) instead of an SNMP-based number.

Sun SPARC running Solaris (sun4u and sun4v)

Addition of a detection criterion to monitor the number of prtpicl processes. A large number of prtpicl processes indicates that the picld service has failed, which leads to missing collected values and missing alerts.

Deactivation of the ipmitool connector on sun4v servers. Monitoring for these types of servers is performed using prtpicl or the Sun snapshot utility.

Better handling of the overall instability of the picld service on Sun Solaris.

The model name of the server is now properly retrieved from the output of the prtpicl command on Sun Fire 480r and V490 systems, and old instances are properly handled to prevent the retention of missing components.

CPU faults not related to a specific core are now properly reported through the fmadm utility.

The total memory size is now added to the overall memory status parameter when the status of individual memory modules is not available from prtdiag.
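
The prtpicl detection criterion described above for Sun SPARC systems can be illustrated with a minimal sketch. It assumes the input is the text produced by a command such as `ps -e -o comm` (one command name per line). The function names and the limit of 5 processes are hypothetical; these release notes do not state the threshold the KM actually uses.

```python
# Hypothetical limit: the actual threshold used by the KM is not documented here.
MAX_PRTPICL_PROCESSES = 5


def count_prtpicl(ps_output: str) -> int:
    """Count the processes named prtpicl in `ps`-style output (one command per line)."""
    count = 0
    for line in ps_output.splitlines():
        # Keep only the executable name, in case ps reports a full path.
        command = line.strip().split("/")[-1]
        if command == "prtpicl":
            count += 1
    return count


def picld_looks_healthy(ps_output: str, limit: int = MAX_PRTPICL_PROCESSES) -> bool:
    """Return False when prtpicl processes pile up, which suggests picld has failed."""
    return count_prtpicl(ps_output) <= limit
```

A caller would skip the prtpicl-based collection, or raise an alert, whenever `picld_looks_healthy` returns False.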

Thresholds Management

In order to optimize the discovery process and improve the scalability of the solution, several important changes have been implemented in the way alert thresholds are managed by the KM.

Default thresholds are set in the agent’s configuration, only once, the first time the KM runs. In previous versions, the KM would set default alert thresholds during each discovery (every hour by default) according to its internal policy, thus overwriting any changes that were made manually by the administrators. Administrators can now customize the default thresholds using the Event Management KM or PCM (PATROL Configuration Manager) and no longer need to use the KM’s interface to make sure these customizations are not overwritten by the discovery of the KM.
Thresholds are set globally at the class level (or at least for a group of instances). In previous versions, the KM would set alert thresholds for each and every instance of each class, leading to extremely large configuration files (which could cause PCM to crash on some occasions). A limited set of parameters still require alert thresholds to be set at the instance level. This results in a much smaller agent configuration file and vastly improved performance.
The “Modify Thresholds” KM Command has been disabled for most classes, as the thresholds can now be managed by PATROL administrators in a standard way through the Event Management KM and PCM (or any other method used in their environment).
The “Alert After N Times” KM Command has been removed, as this setting can now be managed by PATROL administrators in a standard way through the Event Management KM and PCM.
To reset the alert thresholds to their default values (as when the KM first runs), the administrator can either use the “Reinitialize KM” KM Command with the “Reset alert thresholds” option enabled or use the Event Management KM or PCM to delete the corresponding configuration variables. The KM resets the threshold configuration variables to their default values when such variables do not exist.
The “Thresholds Mechanism Selection” KM Command configures the location of the default alert thresholds configuration variables that are set during the first initialization of the KM: either in the /AS or in the /___tuning___ configuration tree. It is recommended to leave this setting to “Automatic” to make sure the thresholds configuration variables are stored in the proper location and avoid instability in the threshold settings.

Note: During the upgrade from an earlier version, the KM will remove all of the instance-specific thresholds configuration variables. These numerous identical instance-specific thresholds are replaced with global thresholds at the class level. Thresholds that had been manually customized through the KM interface remain in place.
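
The "set once, recreate only when missing" behavior described in this section can be sketched as follows. This is an illustration, not the KM's implementation: the agent configuration is modeled as a plain dictionary, and the key layout and threshold values are hypothetical (only the class and parameter names appear in these release notes).

```python
# Hypothetical class-level defaults; the KM applies its own internal policy.
DEFAULT_THRESHOLDS = {
    "MS_HW_NETWORK/ErrorPercent": {"warning": 10, "alarm": 30},
    "MS_HW_PHYSICALDISK/MediaErrorCount": {"warning": 1, "alarm": 5},
}


def apply_default_thresholds(agent_config: dict, defaults: dict) -> list:
    """Create only the threshold entries that do not exist yet.

    Existing entries are never overwritten, so manual customizations survive
    every later discovery. Returns the list of keys that were created.
    """
    created = []
    for key, value in defaults.items():
        if key not in agent_config:
            agent_config[key] = dict(value)  # copy so the defaults stay untouched
            created.append(key)
    return created
```

Under this model, deleting an entry from the configuration is enough to have its default restored on the next run, which mirrors the reset procedure described above.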

Java Settings

Java Installation Package: The product can now be installed with an optional embedded Java Runtime Environment, thus preventing failures of the KM when a compatible version of Java could not be found.
Java Detection: The KM now properly avoids Java Runtime Environment instances that predate version 1.5.00.
Java Detection: The automatic detection of a suitable JRE has been modified to optimize its utilization throughout the KM.
Java Settings: A username and a password can now be specified to execute java instead of using the PATROL default account.

Miscellaneous

Improved Performance: Overall performance has been dramatically improved for discovery and collection. The KM is now able to handle the monitoring of several large storage systems or hundreds of servers with a total of thousands of monitored devices.
Units Standardization: Parameter units have been standardized to ensure a complete consistency throughout the KM.
IPMI Query Timeout: The IPMI query timeout is now set to 300 seconds (5 minutes) by default, instead of 120 seconds (2 minutes) in previous versions.
Localhost Monitoring Disabled: The MS_HW_DisableLocalHost.hdf connector has been removed from the KM as it is now possible to remove the monitoring of the local host entirely by right-clicking the Hardware on localhost icon > KM Commands > Remove this System.

Fixed Issues

Timeout errors and false alerts on Cisco SAN Switches: the KM now isolates the network card (show interface) command from the other collects, as it can take 10 minutes to complete, to allow other components to collect and to prevent no collects/timeouts for other classes.
Collection errors with deactivated parameters: An error message of type “No collect value available” was displayed when trying to collect information that was expected to be unavailable.
Unexpected disk instances and/or missing CPUs in IBM xSeries running Linux and Windows: The connector has been modified to prevent the creation of invalid disk instances, and CPU detection is now based on the CPU speed instead of whether or not the CPU has a description.
Errors in the SOW during the calculation of the DegreesBelowWarning parameter: Sensors where the actual temperature measure is not available are now properly discarded when calculating the DegreesBelowWarning value. This avoids the display of an error message in the System Output Window. The error message had no effect on the calculation of the DegreesBelowWarning parameter and the monitoring of the temperature sensors.
False Alerts on Network Cards on IBM AIX: All Driver Flag lines are now searched to determine status to prevent false alerts when additional flags are added, typically "Debug".
False Alerts on Link Status on Sun Solaris servers: The LinkStatus parameter that intermittently reported links to be plugged or unplugged was creating false alerts for links that were definitively down.
Telnet/SSH-based Collection: The monitoring of remote systems through Telnet or SSH didn't work from a Linux or UNIX system. The monitoring of the following systems was therefore impossible from a remote UNIX or Linux system:
Cisco MDS SAN switches
HP 9000 GSP and HP Integrity MP cards
Dell CMC and DRAC chassis
HP BladeSystem Onboard Administrator
HP DotHill
Sun ALOM and ILOM cards
Any remote Linux/UNIX system
Monitoring Dell EqualLogic PS Series Disk Arrays: No error was raised when the physical disk status was "Offline".
Reporting: Traffic Reports were not generated if at least one of the selected parameters was offline.
Monitoring HP EVA arrays using SSSU: The LinkSpeed parameter was not properly calculated for Network Interfaces.
Monitoring Sun Servers via their ILOM systems: Invalid OID error messages were displayed in the SOW because the connector did not handle situations where sensors had only a numerical value or only a discrete value. The issue did not occur when the sensor had both a numerical and a discrete value.
Monitoring servers configured with MPIO: The "WMI - Disks" connector's status was repeatedly going into alarm on servers configured with MPIO. The connector is now deactivated if only MPIO disks (LUN Multi-Path Disk Device) are found.
Cisco UCS blade chassis: Unneeded temperature sensors were created and "Invalid temperature value for instance..." error messages were displayed in the System Output Window of the consoles.
IBM xSeries systems: Some temperature sensors indicate a negative value which represents the number of degrees under the alert threshold. These sensors caused the KM to generate "Invalid Temperature value for instance [...]" error messages in the System Output Windows of the PATROL Agent every 2 minutes.
ipmitool: The KM failed to execute ipmitool on remote Linux or Solaris systems when 'sudo' was required to run the command.
Cisco UCS: Network interfaces can have "indeterminate" link speed in Cisco UCS Interconnect Switches. This specific status is now properly interpreted. The LinkSpeed parameter is not set to any value when the speed of the corresponding interface is indeterminate.
Sun Solaris: Sun Solaris picld daemon randomly failed to return the value of environmental sensors. The solution now properly handles such situations and no longer creates duplicate enclosure icons and false alarms ("Missing Device").
Cisco MDS SAN Switches: Some Cisco Fibre Switches do not report Power Supply status in the "show environment" command output. This caused the temperature sensor script to create fictitious sensors.
WBEM and Java 1.7: The Java-based WBEM client failed to connect to the ESX Host with the error message "CIM_ERR_FAILED (HTTP 501 - Not Implemented)" when using Java version 1.6.00_29 or higher.
NetApp Filers: The KM could not collect fan, temperature and power supply status from NetApp filers not equipped with individual sensors.
Missing Devices False Alerts: In certain cases, objects with an empty ID were created, leading to false "missing device" alerts.
Sun Blade Chassis monitoring failure: The monitoring of Sun Blade Chassis failed when setting up a remote monitoring. Some firmware versions of the CMM respond with a blank value for the first chassis sensor ID, and this caused the connector detection criteria to fail. (The affected versions of the firmware do not provide Power Supply Status information).
Remote command execution on UNIX servers: The syntax of the commands passed through the MS_HW_SSHRemoteUnixCommand.jar has been modified to enable the command to be executed as expected.
Hitachi servers: On Hitachi BladeSymphony servers, all IPMI sensors are now appropriately taken into account increasing the number of problems that can be accurately reported.
Brocade SAN Switches: Error messages were shown in the System Output Window of the PATROL Console when a large number of errors were encountered by a port with very low traffic. On such occasions, the ErrorPercent parameter of the MS_HW_NETWORK class was not collected.
Sun Solaris systems: The KM was trying to collect statistics that were not available on Veritas heartbeat ports. Also, the KM now properly discovers and monitors the qfe network cards.
Fujitsu PrimePower: The KM now properly discovers and monitors the fjgi network cards.
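
As an illustration of the MPIO fix listed above (the "WMI - Disks" connector is deactivated if only MPIO disks are found), the decision can be sketched as below. The function name and the list-of-model-strings input are hypothetical; the only detail taken from these release notes is the "Multi-Path Disk Device" label. The handling of an empty disk list is an assumption, since the notes do not cover that case.

```python
# Label that identifies MPIO pseudo-disks, as quoted in the release notes.
MPIO_MARKER = "Multi-Path Disk Device"


def disk_connector_should_run(disk_models: list) -> bool:
    """Return False when every discovered disk is an MPIO pseudo-device."""
    if not disk_models:
        # Assumption: with no disks at all, the decision is left to the
        # connector's other detection criteria, so nothing is deactivated here.
        return True
    return any(MPIO_MARKER not in model for model in disk_models)
```

With a mix of MPIO and regular disks the connector stays active; it is switched off only when nothing but MPIO devices is reported.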