os_tsar_traffic_error

2023-08-22 02:46:05  Updated

Description

This alert is triggered when the number of network packet receive and transmit errors reported by the driver exceeds 5 within 30 consecutive seconds.

Principle

Parameter          Value
Metric             net_traffic_err_count
Source             This metric is a basic host monitoring metric collected by node_exporter.
Collected metrics  node_network_receive_errs_total, node_network_transmit_errs_total
Metric expression  sum(delta(node_network_receive_errs_total{@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(delta(node_network_transmit_errs_total{@LABELS}[@INTERVAL])) by (@GBLABELS)
Collection cycle   1 second
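
The metric expression adds the growth of the receive error counter to the growth of the transmit error counter over the interval. The following Python sketch illustrates that arithmetic under simplifying assumptions: it takes two snapshots of the cumulative counters, one at the start and one at the end of the window, and ignores the extrapolation that the PromQL delta() function applies. CounterSample and net_traffic_err_count are hypothetical names used only for illustration and are not part of OCP or node_exporter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Hypothetical snapshot of one NIC's cumulative error counters."""
    device: str
    receive_errs_total: int
    transmit_errs_total: int


def net_traffic_err_count(start, end):
    """Sum the growth of receive and transmit error counters across devices."""
    start_by_device = {sample.device: sample for sample in start}
    total = 0
    for sample in end:
        before = start_by_device.get(sample.device)
        if before is None:
            continue  # device appeared mid-window, so there is no baseline
        total += sample.receive_errs_total - before.receive_errs_total
        total += sample.transmit_errs_total - before.transmit_errs_total
    return total


# Illustrative values only, not real measurements.
start = [CounterSample("eth0", 100, 40), CounterSample("eth1", 7, 0)]
end = [CounterSample("eth0", 104, 41), CounterSample("eth1", 9, 0)]
count = net_traffic_err_count(start, end)
print(count, count > 5)  # compare against the default threshold of 5
```

With these sample values, eth0 contributes 5 errors and eth1 contributes 2, so the total of 7 would exceed the default threshold.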

Alert rule

Metric expression   net_traffic_err_count > 5
Metric description  The number of network packet receive and transmit errors per unit time.
Default threshold   5
Detection cycle     10 seconds
Elimination cycle   5 minutes

Alert information

Trigger method  Based on the expression of the metric
Alert level     Critical
Scope           Server

Alert templates

  • Overview
    • Template: ${alarm_target} ${alarm_name}
    • Example: svr_ip=xxx.xxx.xxx.xxx os_tsar_traffic_error
  • Details
    • Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, average error count ${value_shown} exceeds ${alarm_threshold}.
    • Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_traffic_error, average error count 10 exceeds 5.
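
The templates use ${name} placeholders that are filled in when the alert fires. The sketch below reproduces the two examples with Python's string.Template, which happens to use the same placeholder syntax. This only illustrates the substitution; how OCP renders its templates internally is not described here, and the values dictionary is an assumption built from the examples above.

```python
from string import Template

# Template strings copied from the Overview and Details entries.
OVERVIEW = Template("${alarm_target} ${alarm_name}")
DETAILS = Template(
    "Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, "
    "average error count ${value_shown} exceeds ${alarm_threshold}."
)

# Assumed variable values, taken from the documented examples.
values = {
    "alarm_target": "svr_ip=xxx.xxx.xxx.xxx",
    "alarm_name": "os_tsar_traffic_error",
    "ob_cluster_name": "obcluster",
    "svr_ip": "xxx.xxx.xxx.xxx",
    "value_shown": "10",
    "alarm_threshold": "5",
}

overview = OVERVIEW.substitute(values)  # unused keys in the mapping are ignored
details = DETAILS.substitute(values)
print(overview)
print(details)
```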

Impact on the system

This alert indicates that network packet receive or transmit errors are occurring on the host. If the errors occur on an OBServer node, the node is experiencing network errors and its availability may decrease.

Possible causes

  • The system load is too high.
  • Other network errors occur.

Suggested solutions

Check the other monitoring metrics of the host, such as network interface card (NIC) traffic, load, memory usage, and CPU utilization. If any of them is high, the network errors may be caused by system performance pressure.

  1. Check the system load. High CPU utilization, high memory usage, and high I/O load can each lead to network timeouts. When CPU utilization is high, too few CPU resources remain for processing packets. When memory usage is high, the application cannot process packets in a timely manner. When I/O load is high, the CPU is kept busy and cannot process network packets. Identify the cause of the high system load and resolve the resource shortage.

  2. Run ifconfig to view the NIC statistics. The errors fields in the output show the number of abnormal network packets received or transmitted by the NIC.

  3. Capture network packets to identify which process is affected by the network errors and what is causing them.
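
As a complement to reading the ifconfig output in step 2, the per-interface error counters on Linux can also be read from /proc/net/dev. The following Python sketch parses that file's layout and returns the receive and transmit error counts for each interface. parse_net_dev_errors is a hypothetical helper name, and SAMPLE contains made-up values for illustration, not output from a real host.

```python
def parse_net_dev_errors(text):
    """Return {interface: (rx_errs, tx_errs)} from /proc/net/dev content."""
    errors = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) < 11:
            continue  # unexpected layout, skip the line
        # Receive errs is the 3rd receive column, transmit errs the 3rd transmit column.
        errors[name.strip()] = (int(fields[2]), int(fields[10]))
    return errors


# Made-up sample in the /proc/net/dev layout.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:  1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0: 52000     400    3    0    0     0          0         0    31000     250    1    0    0     0       0          0
"""

result = parse_net_dev_errors(SAMPLE)
for interface, (rx_errs, tx_errs) in result.items():
    print(interface, rx_errs, tx_errs)
```

On a Linux host, the content of /proc/net/dev can be passed to the function in place of SAMPLE. Calling it twice a few seconds apart shows which interface's counters are still growing.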
