ob_host_net_send_percent_over_threshold

2023-08-15 11:20:59  Updated

Description

This alert is triggered when the outbound bandwidth usage of the OBServer exceeds the threshold.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter | Value
Metric | ob_host_net_send_percent
Source |
  • bandwidth: The network bandwidth is collected by a Python script that calls the psutil.net_if_stats() function. The collected value is converted to bytes per second, where 1 Gbit/s = 1024 Mbit/s, 1 Mbit/s = 1024 Kbit/s, 1 Kbit/s = 1024 bit/s, and 1 byte = 8 bits.
  • node_network_transmit_bytes_total: The data is collected by the node_exporter process and can be viewed on the local host at http://localhost:63000/metrics.
Collected metrics | node_network_transmit_bytes_total and bandwidth
Metric expression | 100 * max(sum(rate(node_network_transmit_bytes_total{@LABELS}[@INTERVAL]) by (device,@GBLABELS)) by (device,@GBLABELS) / sum(bandwidth{@LABELS}) by (device,@GBLABELS)) by (@GBLABELS)
Collection cycle | 1 second
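
The unit conversion and the metric expression in the preceding table can be illustrated with a minimal Python sketch. It is not the OCP-Agent implementation. The function names, the sample rates, and the choice to skip devices that report no speed are assumptions made for illustration. The sketch relies only on the fact that psutil.net_if_stats() reports the NIC speed in Mbit/s.

```python
# Unit factors as stated in the Source row (1024-based, 8 bits per byte).
BITS_PER_BYTE = 8
KBIT = 1024          # bit/s in 1 Kbit/s
MBIT = 1024 * KBIT   # bit/s in 1 Mbit/s


def bandwidth_bytes_per_sec(speed_mbit: float) -> float:
    """Convert a NIC speed in Mbit/s (the unit psutil.net_if_stats() reports) to bytes/s."""
    return speed_mbit * MBIT / BITS_PER_BYTE


def net_send_percent(tx_rate_by_device: dict, speed_mbit_by_device: dict) -> float:
    """Return 100 * max over devices of (transmit rate / bandwidth)."""
    ratios = [
        tx_rate / bandwidth_bytes_per_sec(speed_mbit_by_device[device])
        for device, tx_rate in tx_rate_by_device.items()
        # Assumption: devices with an unknown or zero speed are skipped.
        if speed_mbit_by_device.get(device, 0) > 0
    ]
    return 100 * max(ratios, default=0.0)


# Hypothetical sample: two NICs, transmit rates in bytes/s.
rates = {"eth0": 65_536_000.0, "eth1": 13_107_200.0}
speeds = {"eth0": 1000, "eth1": 1000}
print(net_send_percent(rates, speeds))  # 50.0
```

With these sample values, eth0 transmits at half of its converted bandwidth, so the busiest device determines a usage of 50%.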

Note

This alert uses an unusual metric source: the OCP-Agent monitors the network bandwidth usage of the local host and collects the data by using both a Python script and the node_exporter process. For more information, see the Source row of the preceding table.
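
To inspect the counter by hand, the text exposed at the node_exporter metrics endpoint can be parsed per device. The following is a minimal sketch, assuming the standard Prometheus text exposition format. The sample text and the function name transmit_bytes_by_device are hypothetical, and the sketch parses a string rather than fetching the endpoint.

```python
import re

# Matches one sample line of the counter and captures the device label and the value.
_SAMPLE = re.compile(
    r'^node_network_transmit_bytes_total\{[^}]*\bdevice="([^"]*)"[^}]*\}\s+(\S+)'
)


def transmit_bytes_by_device(metrics_text: str) -> dict:
    """Extract node_network_transmit_bytes_total samples keyed by device."""
    result = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if match:  # comment lines and other metrics do not match
            result[match.group(1)] = float(match.group(2))
    return result


# Hypothetical excerpt of what the metrics endpoint might return.
sample_text = """\
# TYPE node_network_transmit_bytes_total counter
node_network_transmit_bytes_total{device="eth0"} 1.2345e+09
node_network_transmit_bytes_total{device="lo"} 4096
"""
counters = transmit_bytes_by_device(sample_text)
print(counters)
```

Because the metric is a cumulative counter, a rate in bytes per second is obtained by dividing the difference between two readings by the time between them.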

The value of the metric ob_host_net_send_percent indicates the outbound bandwidth usage of the OBServer. When this value exceeds the threshold, this alert is triggered. The default threshold is 80%.

Alert rule

Metric | Default threshold (unit: %) | Duration | Detection cycle | Time before clearance
ob_host_net_send_percent | 80 | 120 seconds | 60 seconds | 5 minutes
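
One way to read the rule above is sketched below. It assumes that Duration means the value must stay above the threshold for 120 consecutive seconds and that the value is checked once per 60-second detection cycle. This interpretation, the function name should_fire, and the sample data are assumptions, not confirmed OCP behavior. The time before clearance is not modeled.

```python
def should_fire(samples, threshold=80.0, duration_s=120):
    """Return True if the value stays above the threshold for at least duration_s.

    samples: list of (timestamp_in_seconds, value) pairs in ascending time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            if ts - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # a dip below the threshold resets the window
    return False


# Hypothetical readings taken once per 60-second detection cycle.
sustained = [(0, 85.0), (60, 90.0), (120, 82.0)]
interrupted = [(0, 85.0), (60, 70.0), (120, 85.0), (180, 90.0)]
print(should_fire(sustained))    # True
print(should_fire(interrupted))  # False
```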

Alert information

Trigger method | Alert level | Scope
Metric expression | Critical | Server

Alert templates

  • Overview: ${alarm_target} ${alarm_name}

  • Details: ${alarm_target} ${alarm_name}. The outbound bandwidth usage of the OBServer is ${value}%, exceeding the threshold of ${alarm_threshold}%.

  • Overview example: ob_cluster=C1-1000:svr_ip=192.168.0.1. The outbound bandwidth usage of the OBServer exceeds the threshold.

  • Details example: ob_cluster=C1-1000:svr_ip=192.168.0.1. The outbound bandwidth usage of the OBServer is 81.0%, exceeding the threshold of 80.0%.
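
The placeholder substitution in the Details template can be tried out with Python's string.Template, which happens to use the same ${name} syntax. This is only an illustration: the template engine that OCP actually uses is not stated here, and the alarm_name value below is a hypothetical stand-in.

```python
from string import Template

# The Details template from the list above, unchanged.
DETAILS = Template(
    "${alarm_target} ${alarm_name}. The outbound bandwidth usage of the "
    "OBServer is ${value}%, exceeding the threshold of ${alarm_threshold}%."
)

message = DETAILS.substitute(
    alarm_target="ob_cluster=C1-1000:svr_ip=192.168.0.1",
    alarm_name="ob_host_net_send_percent_over_threshold",  # hypothetical value
    value="81.0",
    alarm_threshold="80.0",
)
print(message)
```

substitute() raises a KeyError if a placeholder has no value, which makes a missing variable easy to spot.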

Impact on the system

When the network bandwidth is fully utilized, the network becomes a performance bottleneck for the OceanBase cluster. If business traffic keeps increasing, the cluster cannot meet the business requirements.

Possible causes

This problem typically occurs when the business traffic of the OceanBase cluster increases, for example, when SQL statements return a large amount of data.

Suggested solutions

In the OCP console, go to the Monitoring page of the host that triggered the alert, and then view the network throughput on the Host Performance tab.

Check the network throughput when the alert is triggered.

  • If the network throughput spikes, business traffic has likely increased sharply within a short time.

    Wait for the business traffic to return to normal, and then check whether the alert is automatically cleared five minutes later.

  • If the network throughput steadily increases or remains high after a sharp rise, this likely reflects normal business growth.

    Contact the network engineer to increase the network bandwidth of the host.
