Description
This alert is triggered when the number of network packet receiving and transmission errors detected by the driver program exceeds 5 in 30 consecutive seconds.
Principle
| Parameter | Value |
|---|---|
| Metric | net_traffic_err_count |
| Source | This metric is a basic host monitoring metric collected by node_exporter. |
| Collected metric | node_network_receive_errs_total, node_network_transmit_errs_total |
| Metric expression | sum(delta(node_network_receive_errs_total{@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(delta(node_network_transmit_errs_total{@LABELS}[@INTERVAL])) by (@GBLABELS) |
| Collection cycle | 1 second |
Alert rule
| Metric expression | Metric description | Default threshold | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| net_traffic_err_count > 5 | The number of network packet receiving and transmission errors during the unit time. | 5 | 10 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Server |
Alert templates
- Overview
- Template: ${alarm_target} ${alarm_name}
- Example: svr_ip=xxx.xxx.xxx.xxx os_tsar_traffic_error
- Details
- Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, average error count ${value_shown} exceeds ${alarm_threshold}.
- Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_traffic_error, average error count 10 exceeds 5.
Impact on the system
This alert indicates that network packet receiving or transmission errors occur in the service on the host. If the errors come from an OBServer, a network error occurs on the OBServer and the availability of the OBServer may decrease.
Possible causes
- The system load is too high.
- Other network errors occur.
Suggested solutions
View other monitoring metrics of the host, such as the network interface card (NIC) traffic, load, memory usage, and CPU utilization. The network error may be related to the system performance stress if it is high.
View the system load. High CPU utilization, high memory usage, and high I/O load may all lead to network timeouts. For example, CPU utilization is high and no sufficient CPU resources are available for processing packets, thereby causing a network timeout; memory usage is high and the application cannot process packets in a timely manner; I/O load is high, which makes the CPU very busy and unable to process network packets, causing a network timeout. You need to check for the causes of the high system load and resolve the problem of insufficient resources.
Run
ifconfigto view the statistics. In the statistics, errors indicate the number of abnormal network packets that are received or transmitted by the NIC.Capture network packets to identify the process with the network error as well as the causes.