Description
This alert is triggered when the inbound bandwidth usage of the OBServer exceeds the threshold.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_host_net_send_percent |
| Source | http://localhost:63000/metrics |
| Collected metrics | node_network_receive_bytes_total and bandwidth |
| Metric expression | 100 * max(sum(rate(node_network_receive_bytes_total{@LABELS}[@INTERVAL]) by (device,@GBLABELS)) by (device,@GBLABELS) / sum(bandwidth{@LABELS}) by (device,@GBLABELS)) by (@GBLABELS) |
| Collection cycle | 1 second |
Note
The metric source of this alert is special: OCP-Agent monitors the network bandwidth usage of the local host, and the data is collected by a Python script and the exporter process. For details, see the Source row in the preceding table.
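As a rough illustration of reading the raw counter, the following sketch parses text in the Prometheus exposition format and extracts node_network_receive_bytes_total per device. It assumes that the endpoint in the Source row returns that text format. The function name and the sample payload are hypothetical and are not part of OCP-Agent.

```python
import re

# One sample line looks like: metric_name{label="value",...} 123.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def receive_bytes_by_device(text):
    """Return {device: cumulative received bytes} from exposition-format text."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != "node_network_receive_bytes_total":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        device = labels.get("device")
        if device is not None:
            totals[device] = float(match.group("value"))
    return totals


# Illustrative payload only; real output depends on the host.
SAMPLE_TEXT = """\
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="eth0"} 1.5e+09
node_network_receive_bytes_total{device="lo"} 2048
node_network_transmit_bytes_total{device="eth0"} 7
"""

if __name__ == "__main__":
    print(receive_bytes_by_device(SAMPLE_TEXT))
```

On a real host, the text would come from an HTTP GET against the Source URL instead of the inline sample.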
The value of the metric ob_host_net_send_percent indicates the inbound bandwidth usage of the OBServer. When this value exceeds the threshold, this alert is triggered. The default threshold is 80%.
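The metric expression can be read as: compute the receive rate for each network device, divide it by the bandwidth of that device, take the largest ratio on the host, and multiply it by 100. The following sketch reproduces that arithmetic from two counter samples. It assumes that bandwidth is expressed in bytes per second, which this document does not state, and it simplifies the handling of counter resets. The function and variable names are hypothetical.

```python
def inbound_usage_percent(prev, curr, interval_s, bandwidth):
    """Approximate 100 * max over devices of (receive rate / bandwidth).

    prev, curr: {device: cumulative received bytes} at two sample times.
    interval_s: seconds between the two samples.
    bandwidth:  {device: capacity}, assumed here to be in bytes per second.
    Returns None when no device yields a usable ratio.
    """
    ratios = []
    for device, capacity in bandwidth.items():
        if device not in prev or device not in curr or capacity <= 0:
            continue
        delta = curr[device] - prev[device]
        if delta < 0:
            # Counter went backwards (for example, after a reset); skip it.
            continue
        rate = delta / interval_s  # bytes per second
        ratios.append(rate / capacity)
    return 100 * max(ratios) if ratios else None


if __name__ == "__main__":
    prev = {"eth0": 0, "eth1": 0}
    curr = {"eth0": 1000, "eth1": 500}
    capacity = {"eth0": 1000, "eth1": 100}
    usage = inbound_usage_percent(prev, curr, 10, capacity)
    print(usage, usage > 80)
```

Taking the maximum across devices means that a single saturated network interface is enough to raise the reported usage of the host.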
Alert rule
| Metric | Default threshold (unit: %) | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_host_net_send_percent | 80 | 120 seconds | 60 seconds | 5 minutes |
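The timing columns of the rule can be illustrated with a small state machine. The semantics below are an assumption, not a description of how OCP evaluates rules: the alert fires once the value has stayed above the threshold for the full duration, and clears once no breach has been observed for the time before clearance. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    threshold: float = 80.0   # default threshold, in percent
    duration_s: int = 120     # breach must persist this long before firing
    clearance_s: int = 300    # quiet time required before clearing
    breach_start: Optional[int] = None
    last_breach: Optional[int] = None
    firing: bool = False

    def observe(self, now_s: int, value: float) -> bool:
        """Feed one detection-cycle sample; return whether the alert is firing."""
        if value > self.threshold:
            if self.breach_start is None:
                self.breach_start = now_s
            self.last_breach = now_s
            if now_s - self.breach_start >= self.duration_s:
                self.firing = True
        else:
            # A sample at or below the threshold ends the current breach streak.
            self.breach_start = None
            if (
                self.firing
                and self.last_breach is not None
                and now_s - self.last_breach >= self.clearance_s
            ):
                self.firing = False
        return self.firing


if __name__ == "__main__":
    state = AlertState()
    # Samples arrive once per 60-second detection cycle.
    for t, v in [(0, 85.0), (60, 85.0), (120, 85.0), (180, 50.0), (420, 50.0)]:
        print(t, v, state.observe(t, v))
```

With a 60-second detection cycle, this model needs three consecutive breaching samples (at 0, 60, and 120 seconds) before the alert fires.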
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Server |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}. The inbound bandwidth usage of the OBServer is ${value}%, exceeding the threshold of ${alarm_threshold}%.
Overview example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The inbound bandwidth usage of the OBServer exceeds the threshold.
Details example: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: The inbound bandwidth usage of the OBServer exceeds the threshold. The inbound bandwidth usage of the OBServer is 81.0%, exceeding the threshold of 80.0%.
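The ${name} placeholders in the Details template use the same syntax as Python's string.Template, so the substitution can be sketched as follows. This only illustrates how the variables map to the example message; it does not show how OCP renders alerts internally, and the function name is hypothetical.

```python
from string import Template

# Details template copied from this section.
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}. "
    "The inbound bandwidth usage of the OBServer is ${value}%, "
    "exceeding the threshold of ${alarm_threshold}%."
)


def render_details(fields):
    """Fill the template; raises KeyError if a placeholder has no value."""
    return DETAILS_TEMPLATE.substitute(fields)


if __name__ == "__main__":
    print(render_details({
        "ob_cluster_name": "obcluster-1",
        "host": "xxx.xxx.xxx.xxx",
        "alarm_name": "The inbound bandwidth usage of the OBServer exceeds the threshold",
        "value": "81.0",
        "alarm_threshold": "80.0",
    }))
```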
Impact on the system
When the network bandwidth is fully utilized, it becomes a performance bottleneck for the OceanBase cluster. If business traffic keeps increasing, the cluster can no longer meet business requirements.
Possible causes
This alert commonly occurs when the business traffic of the OceanBase cluster increases, for example, when the cluster receives a growing number of SQL queries.
Suggested solutions
Go to the Monitoring page of the host that triggered the alert in the OCP console, and then view the network throughput on the Host Performance tab.
Check the network throughput when the alert is triggered.
If the network throughput spiked, business traffic likely increased sharply within a short period.
Wait for the business traffic to return to normal, and then check whether the alert is automatically cleared five minutes later.
If the network throughput increases steadily or remains high after a sharp rise, this likely reflects sustained business growth.
Contact the network engineer to increase the network bandwidth of the host.