Alert description
This alert is triggered when the CPU usage of an OBServer node exceeds the threshold.
Alerting principle
The following table describes the key parameters of the monitoring logic behind this alert.
| Parameter | Value |
|---|---|
| Monitoring metric | ob_host_cpu_percent |
| Data source | Collected by the node_exporter process |
| Collected metric | node_cpu_seconds_total |
| Monitoring expression | `100 * (1 - sum(rate(node_cpu_seconds_total{mode="idle", @LABELS}[@INTERVAL]) by (@GBLABELS)) by (@GBLABELS) / sum(rate(node_cpu_seconds_total{@LABELS}[@INTERVAL]) by (@GBLABELS)) by (@GBLABELS))` The monitoring expression uses LABELS to differentiate data, and the following LABELS are included: |
| Collection interval | 1 second |
The value of the monitoring metric ob_host_cpu_percent indicates the CPU usage of the server where the OBServer node is located. An alert is triggered when the usage exceeds the threshold (100% by default).
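To make the monitoring expression concrete, the sketch below reproduces its arithmetic, 100 * (1 - idle rate / total rate), on two samples of per-mode CPU seconds. The function name `cpu_usage_percent` and the sample values are hypothetical and only illustrate the formula; they are not taken from node_exporter or OCP.

```python
from typing import Dict


def cpu_usage_percent(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Return CPU usage in percent from two samples of cumulative CPU seconds.

    Each dict maps a CPU mode ("idle", "user", ...) to the cumulative seconds
    reported for that mode, already summed over all cores of the host.
    """
    # rate() divides each increase by the window length. The idle rate and the
    # total rate share the same window, so the ratio of increases is enough.
    deltas = {mode: after[mode] - before[mode] for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between the two samples")
    return 100.0 * (1.0 - deltas["idle"] / total)


# Hypothetical samples taken 10 s apart on a 4-core host (40 CPU-seconds total).
before = {"idle": 1000.0, "user": 500.0, "system": 200.0, "iowait": 50.0}
after = {"idle": 1010.0, "user": 524.0, "system": 205.0, "iowait": 51.0}

print(round(cpu_usage_percent(before, after), 1))  # 75.0
```

In this example the host was idle for 10 of 40 CPU-seconds, so the usage is 75%.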
Rule Information
| Monitoring Metric | Default Threshold (Unit: %) | Duration | Detection Cycle | Elimination Cycle |
|---|---|---|---|---|
| ob_host_cpu_percent | 100 | 60 seconds | 60 seconds | 5 minutes |
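The source does not define the columns of this table. As a rough sketch only, the code below assumes that Duration means the metric must stay above the threshold for that long before the alert fires, and that the metric is checked once per detection cycle. The function `firing_times`, the sample values, and the threshold of 90 used in the example are all hypothetical.

```python
from typing import Iterable, List, Tuple


def firing_times(
    samples: Iterable[Tuple[int, float]],
    threshold: float = 100.0,
    duration_s: int = 60,
) -> List[int]:
    """Return the timestamps at which the alert would fire.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    Assumption: the alert fires once the value has stayed above the
    threshold for at least duration_s seconds without interruption.
    """
    fired = []
    breach_start = None  # timestamp of the first sample in the current breach
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= duration_s:
                fired.append(ts)
        else:
            breach_start = None  # the breach was interrupted, start over
    return fired


# Hypothetical readings, one per 60-second detection cycle.
samples = [(0, 85.0), (60, 95.0), (120, 97.0), (180, 80.0), (240, 96.0)]

print(firing_times(samples, threshold=90.0))  # [120]
```

Under these assumptions, a single reading above the threshold is not enough: the breach that starts at 60 s fires at 120 s, while the isolated reading at 240 s does not.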
Alert Information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Expression based on monitoring metrics | Critical | Server |
Alert Template
Alert Overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-1:svr_ip=xxx.xxx.xxx.xxx Server CPU usage exceeds the limit
Alert Details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: CPU usage ${value_shown} exceeds ${alarm_threshold} %.
- Example: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: Server CPU usage exceeds the limit, CPU usage 101.0 % exceeds 100.0 %.
Alert Recovery
- Template: Alert: ${alarm_name}, Server CPU usage: ${value_shown}
- Example: Alert: Server CPU usage exceeds the limit, Server CPU usage: 85 %
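The `${...}` placeholders in the templates above happen to match the syntax of Python's `string.Template`, so the substitution can be sketched as follows. This only illustrates how the placeholders map to the examples; it is not how OCP renders alerts, and the assumption that `value_shown` already carries the unit is inferred from the recovery example.

```python
from string import Template

# Templates copied from the overview and recovery entries above.
overview = Template("${alarm_target} ${alarm_name}")
recovery = Template("Alert: ${alarm_name}, Server CPU usage: ${value_shown}")

# Values taken from the examples above; value_shown is assumed to include " %".
values = {
    "alarm_target": "ob_cluster=obcluster-1:svr_ip=xxx.xxx.xxx.xxx",
    "alarm_name": "Server CPU usage exceeds the limit",
    "value_shown": "85 %",
}

# substitute() ignores unused keys and raises KeyError for a missing one.
overview_text = overview.substitute(values)
recovery_text = recovery.substitute(values)

print(overview_text)
print(recovery_text)
```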
Impact on the system
A brief spike in CPU usage has little impact on the system, but sustained high CPU usage can reduce system throughput and increase request latency.
Possible causes
This issue commonly occurs in the following scenarios:
- The OBServer node is executing complex SQL queries.
- Other programs running on the host are consuming excessive CPU resources.
Solution
Verify whether the high CPU usage is caused by the observer process.
Run the `top` command on the OBServer node that triggered the alert to identify the process consuming excessive CPU resources.
- If it is the observer process, the following alerts may also be triggered:
  - ob_cpu_percent_over_threshold: CPU usage exceeds the threshold in OB statistics
  - tenant_cpu_percent_over_threshold: CPU usage exceeds the threshold in OB tenants
  High CPU usage in the observer process can be caused by complex SQL queries executed on the OBServer node, which may trigger both this alert and the alerts listed above.
  First, refer to the documentation to resolve the alerts listed above, and then check whether this alert continues to be triggered.
  - If it is still triggered, proceed to the next step.
  - If it is no longer triggered, the issue has been resolved.
- If it is another process, proceed to the next step.
The high CPU usage may be caused by another process.
Contact the DBA or an O&M engineer. If there are processes that are not essential for business operations, they can be shut down.
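The process check in the first step can also be scripted. The sketch below is a minimal, non-interactive alternative to `top` that parses the output of `ps -eo pid,comm,pcpu`. The helper names `parse_ps` and `top_consumers` and the sample output are hypothetical. Note that `ps` reports CPU usage averaged over each process's lifetime, whereas `top` shows recent usage, so the two can disagree for short spikes.

```python
import subprocess
from typing import List, Tuple


def parse_ps(output: str) -> List[Tuple[int, str, float]]:
    """Parse `ps -eo pid,comm,pcpu` output into (pid, command, cpu_percent)
    rows, sorted by CPU usage in descending order."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split()
        if len(parts) < 3:
            continue
        pid, cpu = int(parts[0]), float(parts[-1])
        command = " ".join(parts[1:-1])  # the command name may contain spaces
        rows.append((pid, command, cpu))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def top_consumers(limit: int = 5) -> List[Tuple[int, str, float]]:
    """Run ps on the local host and return the heaviest CPU consumers."""
    result = subprocess.run(
        ["ps", "-eo", "pid,comm,pcpu"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout)[:limit]


# Hypothetical ps output used to illustrate the parsing step.
sample = """  PID COMMAND          %CPU
 2345 node_exporter      1.2
 1234 observer         350.0
 3456 backup_job        45.5
"""

ranked = parse_ps(sample)
print(ranked[0])  # (1234, 'observer', 350.0)
```

If the first row is the observer process, follow the observer branch of the first step; otherwise, the high CPU usage comes from another process.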