ob_host_net_recv_percent_over_threshold

2023-08-15 02:30:56  Updated

Description

This alert is triggered when the inbound bandwidth usage of the OBServer exceeds the threshold.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter Value
Metric ob_host_net_send_percent
Source
  • bandwidth: The network bandwidth is collected by using the psutil.net_if_stats() function of the Python script. The collected data is converted to a value in bytes, where 1 Gbit/s = 1024 Mbit/s, 1 Mbit/s = 1024 Kbit/s, 1 Kbit/s = 1024 bit/s, and 1 byte = 8 bits.
  • node_network_receive_bytes_total: The total amount of inbound data. The data is collected by using the node_exporter process and can be viewed on the local host at http://localhost:63000/metrics.
  • Collected metrics node_network_receive_bytes_total and bandwidth
    Metric expression 100 * max(sum(rate(node_network_receive_bytes_total{@LABELS}[@INTERVAL]) by (device,@GBLABELS)) by (device,@GBLABELS) / sum(bandwidth{@LABELS}) by (device,@GBLABELS)) by (@GBLABELS)
    Collection cycle 1 second

    Note

    The metric source of this alert is special. The network bandwidth usage of the local host is monitored by the OCP-Agent and the data is collected by using the Python script and the exporter process. For more information, see the description in the Source row of the preceding table.

    The value of the metric ob_host_net_send_percent indicates the inbound bandwidth usage of the OBServer. When this value exceeds the threshold, this alert is triggered. The default threshold is 80%.

    Alert rule

    Metric Default threshold (unit: %) Duration Detection cycle Time before clearance
    ob_host_net_send_percent 80 120 seconds 60 seconds 5 minutes

    Alert information

    Trigger method Alert level Scope
    Metric expression Critical Server

    Alert templates

    • Overview: ${alarm_target} ${alarm_name}

    • Details: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}. The inbound bandwidth usage of the OBServer is ${value}%, exceeding the threshold of ${alarm_threshold}%.

    • Overview example: ob_cluster=C1-1000:svr_ip=192.168.0.1. The inbound bandwidth usage of the OBServer exceeds the threshold.

    • Details example: Cluster: obcluster-1, Host: 192.168.0.1, Alert: The inbound bandwidth usage of the OBServer exceeds the threshold. The inbound bandwidth usage of the OBServer is 81.0%, exceeding the threshold of 80.0%.

    Impact on the system

    When the network bandwidth is fully utilized, the performance of the OceanBase cluster is bottlenecked. When business traffic continuously increases, the performance of the OceanBase cluster cannot meet the business requirements.

    Possible causes

    This problem is commonly found in cases where the business traffic of the OceanBase cluster increases, for example, more and more SQL queries are received.

    Suggested solutions

    Go to the Monitoring page of the host that triggered the alert in the OCP console, and then view the network throughput on the Host Performance tab.

    Check the network throughput when the alert is triggered.

    • If the network throughput soars, it is likely that the business traffic sharply increases in a short time.

      Wait for the business traffic to restore to normal and check whether the alert is automatically cleared five minutes later.

    • If the network throughput steadily increases or remains high after a sharp rise, it reflects normal business growth.

      Contact the network engineer to increase the network bandwidth of the host.

    Contact Us