Alerts

Table of Contents

Alert Content
Alert Rules
Customizing Alert Content
Receiving Alerts

An alert is a notification that a hardware problem has occurred, such as a critically low fan speed that causes the CPU temperature to rise.

Hardware Sentry defines a set of conditions that trigger alerts when failures are detected. These alerts are sent as OpenTelemetry logs from the Hardware Sentry Agent's internal OTLP Exporter to the OpenTelemetry Collector's internal OTLP Receiver.

Alert Content

The alerts report:

the host's Fully Qualified Domain Name
the resource's attributes
the faulty component with its identifying information (Serial Number, Model, Manufacturer, BIOS Version, Driver Version, Physical Address)
the parent dependency and its identifying information
the alert severity (WARN, ALARM)
the alert rule
the date and time at which the alert was triggered
the metric that triggered the alert
the status information of the component
the encountered problem, consequence and recommended action
a complete hardware health report on the faulty component

Here is an example of an alert triggered by an unplugged cable on a network interface. This alert log has been captured using the OpenTelemetry Logging Exporter:

2022-04-21T14:37:57.034+0200	DEBUG	loggingexporter/logging_exporter.go:81	ResourceLog #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource labels:
     -> agent.host.name: STRING(hws.internal.sentrysoftware.net)
     -> host.id: STRING(netapp9-san)
     -> host.name: STRING(netapp9-san.internal.sentrysoftware.net)
     -> host.type: STRING(storage)
     -> os.type: STRING(storage)
     -> site: STRING(data center 1)
ScopeLogs #0
ScopeLogs SchemaURL: 
InstrumentationScope netapp9-san 
LogRecord #0
Timestamp: 2022-04-21 12:37:47.201 +0000 UTC
Severity: WARN
Body: Hardware problem on netapp9-san.internal.sentrysoftware.net with 0c (FC Port).

Alert Severity    : WARN
Alert Rule        : hw.network.up == 0

Alert Details
=============
Problem           : The network link is down.
Consequence       : The network traffic (if any) that was processed by this adapter is no longer being handled, or is overloading another network adapter.
Recommended Action: Check that the network cable (if any) is not unplugged or broken/cut, and that it is properly plugged into the network card. Ensure that the network hub/switch/router is working properly.

Hardware Health Report (2022-04-21T14:37:47.201)
================================================

Monitor           : 0c (FC Port)
Type              : Network Card
On Host           : netapp9-san.internal.sentrysoftware.net
Monitor ID        : NetAppREST_networkcard_netapp9-san_netapp9-san-01.0c
Connector Used    : NetAppREST
Parent ID         : NetAppREST_enclosure_netapp9-san_netapp9-san-01
Physical Address  : 50:0a:09:83:80:72:2b:36

This object is attached to: Enclosure: netapp9-san-01 (NetApp FAS2650)
Type              : Enclosure
Manufacturer      : NetApp
Model             : FAS2650
Serial Number     : 651652000067

=================================================================
Metric: hw.network.up
-----------------------------------------------------------------
Current Value     : 0 (Unplugged)

=================================================================
Metric: hw.status{state="present", hw.type="network"}
-----------------------------------------------------------------
Current Value     : 1 (Present)

Attributes:
     -> agent.host.name: STRING(hws.internal.sentrysoftware.net)
     -> host.id: STRING(netapp9-san)
     -> host.name: STRING(netapp9-san.internal.sentrysoftware.net)
     -> host.type: STRING(storage)
     -> os.type: STRING(storage)
     -> site: STRING(data center 1)
Trace ID: 
Span ID: 
Flags: 0
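
If you need to process such an alert downstream, the body can be read as a series of "Key : Value" lines. The following Python sketch is only an illustration based on the layout of the example above. The function name `parse_alert_body` and the regular expression are assumptions, not part of Hardware Sentry, and the layout may differ if the alert content has been customized.

```python
import re

# Matches lines such as "Alert Severity    : WARN". The key/value layout is
# inferred from the example alert above, not from a documented format.
FIELD_PATTERN = re.compile(r"^(?P<key>[A-Za-z][A-Za-z ]*?)\s*:\s(?P<value>.*)$")


def parse_alert_body(body: str) -> dict[str, str]:
    """Collect the 'Key : Value' lines of an alert body into a dictionary.

    Only the first occurrence of each key is kept, so the faulty component's
    "Type" is not overwritten by the "Type" of its parent.
    """
    fields: dict[str, str] = {}
    for line in body.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if match:
            fields.setdefault(match.group("key"), match.group("value").strip())
    return fields
```

Applied to the example above, `parse_alert_body(body)["Alert Rule"]` would give the rule that fired, which can then be used for routing or filtering.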

Alert Rules

Alert rules are sets of conditions that determine whether an alert is triggered and with which severity. Hardware Sentry applies the following default alert rules:

Monitor	Metric Name	Severity	Default Alert Conditions	Attributes
Connector	hardware_sentry.connector.status	ALARM	hardware_sentry.connector.status == 1	state = `failed`
Host	hardware_sentry.host.up	ALARM	hardware_sentry.host.up == 0	protocol = `http`
Host	hardware_sentry.host.up	ALARM	hardware_sentry.host.up == 0	protocol = `ipmi`
Host	hardware_sentry.host.up	ALARM	hardware_sentry.host.up == 0	protocol = `snmp`
Host	hardware_sentry.host.up	ALARM	hardware_sentry.host.up == 0	protocol = `ssh`
Host	hardware_sentry.host.up	ALARM	hardware_sentry.host.up == 0	protocol = `wbem`
Host	hardware_sentry.host.up	ALARM	hardware_sentry.host.up == 0	protocol = `wmi`
Battery	hw.battery.charge	WARN	hw.battery.charge <= 0.5
Battery	hw.battery.charge	ALARM	hw.battery.charge <= 0.3
Battery	hw.status	ALARM	hw.status == 0	hw.type = `battery` state = `present`
Battery	hw.status	WARN	hw.status == 1	hw.type = `battery` state = `degraded`
Battery	hw.status	ALARM	hw.status == 1	hw.type = `battery` state = `failed`
Blade	hw.status	ALARM	hw.status == 0	hw.type = `blade` state = `present`
Blade	hw.status	WARN	hw.status == 1	hw.type = `blade` state = `degraded`
Blade	hw.status	ALARM	hw.status == 1	hw.type = `blade` state = `failed`
CPU	hw.errors	ALARM	hw.errors >= 1	hw.type = `cpu`
CPU	hw.status	WARN	hw.status == 1	hw.type = `cpu` state = `predicted_failure`
CPU	hw.status	ALARM	hw.status == 0	hw.type = `cpu` state = `present`
CPU	hw.status	WARN	hw.status == 1	hw.type = `cpu` state = `degraded`
CPU	hw.status	ALARM	hw.status == 1	hw.type = `cpu` state = `failed`
CPU Core	hw.status	ALARM	hw.status == 0	hw.type = `cpu_core` state = `present`
CPU Core	hw.status	WARN	hw.status == 1	hw.type = `cpu_core` state = `degraded`
CPU Core	hw.status	ALARM	hw.status == 1	hw.type = `cpu_core` state = `failed`
Disk Controller	hw.status	WARN	hw.status == 1	hw.type = `disk_controller` battery_state = `degraded`
Disk Controller	hw.status	ALARM	hw.status == 1	hw.type = `disk_controller` battery_state = `failed`
Disk Controller	hw.status	WARN	hw.status == 1	hw.type = `disk_controller` state = `degraded`
Disk Controller	hw.status	ALARM	hw.status == 1	hw.type = `disk_controller` state = `failed`
Disk Controller	hw.status	ALARM	hw.status == 0	hw.type = `disk_controller` state = `present`
Enclosure	hw.status	ALARM	hw.status == 1	hw.type = `enclosure` state = `open`
Enclosure	hw.status	ALARM	hw.status == 0	hw.type = `enclosure` state = `present`
Fan	hw.fan.speed	ALARM	hw.fan.speed == 0
Fan	hw.fan.speed	WARN	hw.fan.speed <= 500
Fan	hw.fan.speed_ratio	ALARM	hw.fan.speed_ratio == 0
Fan	hw.fan.speed_ratio	WARN	hw.fan.speed_ratio <= 0.05
Fan	hw.status	ALARM	hw.status == 0	hw.type = `fan` state = `present`
Fan	hw.status	WARN	hw.status == 1	hw.type = `fan` state = `degraded`
Fan	hw.status	ALARM	hw.status == 1	hw.type = `fan` state = `failed`
GPU	hw.errors	ALARM	hw.errors >= 1	hw.type = `gpu` type = `corrected`
GPU	hw.errors	ALARM	hw.errors >= 1	hw.type = `gpu` type = `all`
GPU	hw.gpu.memory.utilization	WARN	hw.gpu.memory.utilization >= 0.9
GPU	hw.gpu.memory.utilization	ALARM	hw.gpu.memory.utilization >= 0.95
GPU	hw.status	WARN	hw.status == 1	hw.type = `gpu` state = `predicted_failure`
GPU	hw.status	ALARM	hw.status == 0	hw.type = `gpu` state = `present`
GPU	hw.status	WARN	hw.status == 1	hw.type = `gpu` state = `degraded`
GPU	hw.status	ALARM	hw.status == 1	hw.type = `gpu` state = `failed`
LED	hw.status	WARN	hw.status == 1	hw.type = `led` state = `degraded`
LED	hw.status	ALARM	hw.status == 1	hw.type = `led` state = `failed`
Logical Disk	hw.errors	ALARM	hw.errors >= 1	hw.type = `logical_disk`
Logical Disk	hw.status	ALARM	hw.status == 0	hw.type = `logical_disk` state = `present`
Logical Disk	hw.status	WARN	hw.status == 1	hw.type = `logical_disk` state = `degraded`
Logical Disk	hw.status	ALARM	hw.status == 1	hw.type = `logical_disk` state = `failed`
LUN	hw.lun.paths	ALARM	hw.lun.paths < 1	type = `available`
LUN	hw.status	ALARM	hw.status == 0	hw.type = `lun` state = `present`
LUN	hw.status	WARN	hw.status == 1	hw.type = `lun` state = `degraded`
LUN	hw.status	ALARM	hw.status == 1	hw.type = `lun` state = `failed`
Memory Module	hw.errors	ALARM	hw.errors >= 1	hw.type = `memory`
Memory Module	hw.status	WARN	hw.status == 1	hw.type = `memory` state = `predicted_failure`
Memory Module	hw.status	ALARM	hw.status == 0	hw.type = `memory` state = `present`
Memory Module	hw.status	WARN	hw.status == 1	hw.type = `memory` state = `degraded`
Memory Module	hw.status	ALARM	hw.status == 1	hw.type = `memory` state = `failed`
Network Card	hw.network.bandwidth.utilization	WARN	hw.network.bandwidth.utilization >= 0.8
Network Card	hw.network.error_ratio	WARN	hw.network.error_ratio >= 0.2
Network Card	hw.network.error_ratio	ALARM	hw.network.error_ratio >= 0.3
Network Card	hw.network.up	WARN	hw.network.up == 0
Network Card	hw.status	ALARM	hw.status == 0	hw.type = `network` state = `present`
Network Card	hw.status	WARN	hw.status == 1	hw.type = `network` state = `degraded`
Network Card	hw.status	ALARM	hw.status == 1	hw.type = `network` state = `failed`
Other	hw.status	ALARM	hw.status == 0	hw.type = `other_device` state = `present`
Other	hw.status	WARN	hw.status == 1	hw.type = `other_device` state = `degraded`
Other	hw.status	ALARM	hw.status == 1	hw.type = `other_device` state = `failed`
Physical Disk	hw.physical_disk.endurance_utilization	WARN	hw.physical_disk.endurance_utilization <= 0.05	state = `remaining`
Physical Disk	hw.physical_disk.endurance_utilization	ALARM	hw.physical_disk.endurance_utilization <= 0.02	state = `remaining`
Physical Disk	hw.errors	ALARM	hw.errors >= 1	hw.type = `physical_disk`
Physical Disk	hw.status	WARN	hw.status == 1	hw.type = `physical_disk` state = `predicted_failure`
Physical Disk	hw.status	ALARM	hw.status == 0	hw.type = `physical_disk` state = `present`
Physical Disk	hw.status	WARN	hw.status == 1	hw.type = `physical_disk` state = `degraded`
Physical Disk	hw.status	ALARM	hw.status == 1	hw.type = `physical_disk` state = `failed`
Power Supply	hw.status	ALARM	hw.status == 0	hw.type = `power_supply` state = `present`
Power Supply	hw.status	WARN	hw.status == 1	hw.type = `power_supply` state = `degraded`
Power Supply	hw.status	ALARM	hw.status == 1	hw.type = `power_supply` state = `failed`
Power Supply	hw.power_supply.utilization	WARN	hw.power_supply.utilization >= 0.9
Power Supply	hw.power_supply.utilization	ALARM	hw.power_supply.utilization >= 0.99
Robotics	hw.status	ALARM	hw.status == 0	hw.type = `robotics` state = `present`
Robotics	hw.status	WARN	hw.status == 1	hw.type = `robotics` state = `degraded`
Robotics	hw.status	ALARM	hw.status == 1	hw.type = `robotics` state = `failed`
Tape Drive	hw.errors	ALARM	hw.errors >= 1	hw.type = `tape_drive`
Tape Drive	hw.status	WARN	hw.status == 1	hw.type = `tape_drive` state = `needs_cleaning`
Tape Drive	hw.status	ALARM	hw.status == 1	hw.type = `tape_drive` state = `needs_cleaning`
Tape Drive	hw.status	ALARM	hw.status == 0	hw.type = `tape_drive` state = `present`
Tape Drive	hw.status	WARN	hw.status == 1	hw.type = `tape_drive` state = `degraded`
Tape Drive	hw.status	ALARM	hw.status == 1	hw.type = `tape_drive` state = `failed`
Temperature	hw.status	ALARM	hw.status == 0	hw.type = `temperature` state = `present`
Temperature	hw.status	WARN	hw.status == 1	hw.type = `temperature` state = `degraded`
Temperature	hw.status	ALARM	hw.status == 1	hw.type = `temperature` state = `failed`
Virtual Machine	hw.status	ALARM	hw.status == 0	hw.type = `vm` state = `present`
Virtual Machine	hw.status	WARN	hw.status == 1	hw.type = `vm` state = `degraded`
Virtual Machine	hw.status	ALARM	hw.status == 1	hw.type = `vm` state = `failed`
Voltage	hw.status	ALARM	hw.status == 0	hw.type = `voltage` state = `present`
Voltage	hw.status	WARN	hw.status == 1	hw.type = `voltage` state = `degraded`
Voltage	hw.status	ALARM	hw.status == 1	hw.type = `voltage` state = `failed`
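
To make the table easier to read, the following Python sketch models a few of its rows and evaluates them against a metric value. It is not Hardware Sentry's implementation: the `AlertRule` class, the `triggered_severity` function and the choice to report the most severe matching rule when several rules match are assumptions made for this illustration.

```python
import operator
from dataclasses import dataclass, field

# Comparison operators used in the "Default Alert Conditions" column.
OPERATORS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
}

# Assumption: when several rules match, the most severe one is reported.
SEVERITY_RANK = {"WARN": 1, "ALARM": 2}


@dataclass
class AlertRule:
    metric: str
    severity: str                # "WARN" or "ALARM"
    op: str                      # one of the keys of OPERATORS
    threshold: float
    attributes: dict = field(default_factory=dict)

    def matches(self, metric: str, value: float, attributes: dict) -> bool:
        """True if the metric name, all rule attributes and the condition match."""
        if metric != self.metric:
            return False
        if any(attributes.get(k) != v for k, v in self.attributes.items()):
            return False
        return OPERATORS[self.op](value, self.threshold)


# A few rows copied from the table above.
RULES = [
    AlertRule("hw.fan.speed", "ALARM", "==", 0),
    AlertRule("hw.fan.speed", "WARN", "<=", 500),
    AlertRule("hw.status", "ALARM", "==", 1, {"hw.type": "fan", "state": "failed"}),
    AlertRule("hw.network.up", "WARN", "==", 0),
]


def triggered_severity(metric, value, attributes=None):
    """Return the most severe matching severity, or None if no rule matches."""
    attributes = attributes or {}
    severities = [r.severity for r in RULES if r.matches(metric, value, attributes)]
    return max(severities, key=SEVERITY_RANK.get, default=None)
```

For example, a fan speed of 300 matches only the WARN rule in this sketch, while a speed of 0 matches both fan speed rules and is reported as ALARM under the assumption stated above.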

Customizing Alert Content

You can customize the content of alerts by adding macros to the hardwareProblemTemplate parameter of the config/hws-config.yaml file. See the procedure detailed in the Hardware Problem Template section.

The default alert content template is:

Hardware problem on ${FQDN} with ${MONITOR_NAME}.${NEWLINE}${NEWLINE}${ALERT_DETAILS}${NEWLINE}${NEWLINE}${FULLREPORT}

The following macros can be used to obtain more details about the problem. They will be replaced at runtime.

Macro	Description
`${MONITOR_NAME}`	Name of the monitor that triggered the alert. Example: Fan: 1.1 (CPU1)
`${MONITOR_ID}`	Unique identifier of the monitor that triggered the alert.
`${MONITOR_TYPE}`	Type of the monitor that triggered the alert. Example: Physical Disk
`${PARENT_ID}`	Identifier of the parent that the faulty instance is attached to.
`${METRIC_NAME}`	Name of the metric that triggered the alert. Example: hw.status{state="failed", hw.type="battery"}
`${METRIC_VALUE}`	Value of the metric that triggered the alert. Example: 1 (Failed)
`${SEVERITY}`	Severity of the alert (ALARM, WARN)
`${ALERT_RULE}`	Alert conditions that triggered the alert. Example: hw.status{state="failed", hw.type="battery"} == 1
`${ALERT_DATE}`	ISO 8601 date and time at which the alert was triggered.
`${CONSEQUENCE}`	Description of the possible consequence of the detected problem. Example: The temperature of the chip, component or device that was cooled by this fan should grow quickly. This can lead to severe hardware damage and system crashes.
`${RECOMMENDED_ACTION}`	Recommended action to solve the problem. Example: Check if the fan is no longer cooling the system. If so, replace the fan.
`${PROBLEM}`	Description of the problem encountered by the monitor. Example: The speed of this fan is critically low (1503 rpm).
`${ALERT_DETAILS}`	Severity, alert rule, problem, consequence and recommended action.
`${FULLREPORT}`	Full hardware health report about the monitor that triggered the alert.
`${NEWLINE}`	Linefeed. This is useful to produce multi-line information.
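
The substitution performed at runtime can be pictured with the following Python sketch. It only illustrates the principle: the `expand_template` function, the regular expression and the decision to leave unknown macros untouched are assumptions, not the agent's actual behavior.

```python
import re

# Default template quoted from this page.
DEFAULT_TEMPLATE = (
    "Hardware problem on ${FQDN} with ${MONITOR_NAME}."
    "${NEWLINE}${NEWLINE}${ALERT_DETAILS}${NEWLINE}${NEWLINE}${FULLREPORT}"
)

# A macro is written ${NAME}, where NAME is made of uppercase letters and underscores.
MACRO_PATTERN = re.compile(r"\$\{([A-Z_]+)\}")


def expand_template(template: str, values: dict) -> str:
    """Replace each ${MACRO} with its value.

    ${NEWLINE} always becomes a linefeed. Macros without a value are left
    as-is (an assumption made for this sketch).
    """
    substitutions = {**values, "NEWLINE": "\n"}
    return MACRO_PATTERN.sub(
        lambda m: substitutions.get(m.group(1), m.group(0)), template
    )


message = expand_template(
    DEFAULT_TEMPLATE,
    {
        "FQDN": "netapp9-san.internal.sentrysoftware.net",
        "MONITOR_NAME": "0c (FC Port)",
        "ALERT_DETAILS": "Alert Severity    : WARN",
        "FULLREPORT": "Monitor           : 0c (FC Port)",
    },
)
print(message)
```

The first line of the printed message has the same form as the Body line of the alert example earlier on this page.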

Receiving Alerts

To receive Hardware Sentry's alerts, your Exporter must support the OpenTelemetry logs pipeline.

For troubleshooting purposes, you can add the logging exporter to the service:pipelines:logs:exporters section of the otel/otel-config.yaml file:


service:
  # ...
  pipelines:
    # ...
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resourcedetection]
      exporters: [logging] # List here the platform of your choice

Alerts will then be exported to the console.
