os_tsar_traffic_overload

2025-03-26 07:47:21  Updated

Description

This alert is triggered when the bandwidth usage of the NIC eth.* exceeds 80%.

Principle

Parameter Value
Metric eth_net_traffic_usage, eth_net_bandwidth_mbps
Source These metrics are basic host monitoring metrics collected by node_exporter.
Collected metric node_network_receive_bytes_total, node_network_transmit_bytes_total, node_net_bandwidth_bps
Metric expression
  • eth_net_traffic_usage: 800 * (sum(rate(node_network_receive_bytes_total{device=~"eth.*",@LABELS}[@INTERVAL])) by (@GBLABELS) + sum(rate(node_network_transmit_bytes_total{device=~"eth.*",@LABELS}[@INTERVAL])) by (@GBLABELS)) / sum(node_net_bandwidth_bps{device=~"eth.*",@LABELS}) by (@GBLABELS)
  • eth_net_bandwidth_mbps: sum(node_net_bandwidth_bps{device=~"eth.*",@LABELS}) by (@GBLABELS) / 1000000
  • Collection cycle 1 second

    Alert rule

    Metric expression Metric description Default threshold (unit: percentage) Detection cycle Elimination cycle
    eth_net_traffic_usage > 80 and eth_net_bandwidth_mbps > 0
  • eth_net_traffic_usage: the proportion of bandwidth used for data receiving and transmission to the total bandwidth of the NIC eth.*.
  • eth_net_bandwidth_mbps: the total bandwidth of the NIC eth.*.
  • 80% 10 seconds 5 minutes

    Alert information

    Trigger method Alert level Scope
    Based on the expression of the metric Critical Server

    Alert templates

    • Overview
      • Template: ${alarm_target} ${alarm_name}
      • Example: svr_ip=xxx.xxx.xxx.xxx os_tsar_traffic_overload
    • Details
      • Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, NIC:${device}, alert: ${alarm_name}, NIC bandwidth usage ${eth_net_traffic_usage_value_zh_cn} exceeds ${eth_net_traffic_usage_alarm_threshold}%, NIC bandwidth is ${eth_net_bandwidth_mbps_value_zh_cn}.
      • Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, NIC:eno1, alert: os_tsar_traffic_overload, NIC bandwidth usage 91.15% exceeds 80%, NIC bandwidth is 300 Mbit/s.

    Impact on the system

    When the NIC is exhausted, system resources such as CPU and I/O resources may be fully occupied, affecting system stability.

    Possible causes

    1. Throttling is not enabled for OBServers in the reconfirm phase, and this causes the gigabit NICs to work at full capacity.

      During a leader switchover, the leader needs to pull logs from the followers for reconfirmation. If throttling is not enabled and the followers use gigabit NICs, two followers will send logs to the leader concurrently and the NIC of the leader will work at full capacity.

    2. A large number of network requests exist in other network bandwidth-consuming scenarios.

    Suggested solutions

    1. View the network traffic statistics.

      ## Refresh the statistics every 5 seconds.
      sar -n DEV 5
      

      Find the device where the NIC shows the most frequent traffic changes. Frequent traffic changes indicate a high level of traffic.

    2. Run iftop on this device to find processes with high network traffic.

      Check whether the processes with high network traffic are as expected. If the observer process accounts for a high proportion of network traffic, contact Technical Support.

    Contact Us