Description
This alert is triggered when the CPU utilization of the OBServer exceeds the threshold.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_host_cpu_percent |
| Source | The data is collected by using the node_exporter process. |
| Collected metric | node_cpu_seconds_total |
| Metric expression | 100 * (1 - sum(rate(node_cpu_seconds_total{mode="idle", @LABELS}[@INTERVAL]) by (@GBLABELS)) by (@GBLABELS) / sum(rate(node_cpu_seconds_total{@LABELS}[@INTERVAL]) by (@GBLABELS)) by (@GBLABELS)) Note: Data in the metric expression is identified based on labels. The following labels are used: |
| Collection cycle | 1 second |
The value of the metric ob_host_cpu_percent indicates the CPU utilization of the OBServer. When the CPU utilization exceeds the threshold, this alert is triggered. The default threshold is 100%.
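The metric expression can be read as "100 times one minus the idle share of CPU time". The following Python sketch reproduces that arithmetic from two snapshots of the node_cpu_seconds_total counters. The function name and the sample values are hypothetical and are not part of OCP or node_exporter. Because both rates in the expression are taken over the same interval, the interval length cancels out and only the counter deltas matter.

```python
def cpu_utilization_percent(prev, curr):
    """Return host CPU utilization (%) between two counter snapshots.

    prev and curr map a CPU mode (for example "idle", "user", "system")
    to the cumulative node_cpu_seconds_total value summed over all cores.
    """
    # Per-mode increase of the counters between the two snapshots.
    deltas = {mode: curr[mode] - prev[mode] for mode in curr}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    # 100 * (1 - idle time / total time), as in the metric expression.
    return 100.0 * (1.0 - deltas.get("idle", 0.0) / total)


# Hypothetical snapshots (seconds), for illustration only.
prev = {"idle": 1000.0, "user": 200.0, "system": 50.0}
curr = {"idle": 1002.5, "user": 206.5, "system": 51.0}

utilization = cpu_utilization_percent(prev, curr)
print(utilization)  # 75.0: 2.5 of the 10 elapsed CPU-seconds were idle
```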
Alert rule
| Metric | Default threshold (unit: %) | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_host_cpu_percent | 100 | 60 seconds | 60 seconds | 5 minutes |
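The interplay of the values in this table can be illustrated with a small simulation. This is only a sketch under stated assumptions: it assumes that "Duration" means the metric must stay above the threshold for that long before the alert fires, and that "Time before clearance" means the metric must stay at or below the threshold for that long before the alert clears. The actual OCP evaluation logic is not described in this topic, and the names below are hypothetical.

```python
THRESHOLD = 100.0   # percent, default threshold from the table
DURATION = 60       # seconds above the threshold before firing (assumed meaning)
CLEARANCE = 300     # seconds back to normal before clearing (assumed meaning)


def evaluate(samples):
    """Return [(timestamp, firing)] for samples of (timestamp_seconds, cpu_percent).

    One sample is expected per detection cycle.
    """
    firing = False
    breach_start = None   # when the current run above the threshold began
    normal_start = None   # when the current run at or below the threshold began
    states = []
    for ts, value in samples:
        if value > THRESHOLD:
            normal_start = None
            if breach_start is None:
                breach_start = ts
            if not firing and ts - breach_start >= DURATION:
                firing = True
        else:
            breach_start = None
            if normal_start is None:
                normal_start = ts
            if firing and ts - normal_start >= CLEARANCE:
                firing = False
        states.append((ts, firing))
    return states


# Hypothetical samples taken every 60 seconds (the detection cycle).
samples = [(0, 101.0), (60, 101.0)] + [(t, 90.0) for t in range(120, 481, 60)]
states = dict(evaluate(samples))
print(states[0], states[60], states[360], states[420])
```

Under these assumptions, the alert fires at the second breaching sample (60 seconds) and clears 5 minutes after the metric first returns to normal.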
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Server |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}. The CPU utilization is ${value}%, exceeding the threshold of ${alarm_threshold}%.
Overview example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The CPU utilization of the OBServer exceeds the threshold.
Details example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The CPU utilization of the OBServer is 101.0%, exceeding the threshold of 100.0%.
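The placeholder syntax in these templates happens to match that of Python's string.Template, so the substitution can be sketched as follows. This is an illustration only: the topic does not state how OCP renders templates, and the value used for alarm_name here is an assumption.

```python
from string import Template

# Details template copied from this topic.
DETAILS_TEMPLATE = Template(
    "${alarm_target} ${alarm_name}. The CPU utilization is ${value}%, "
    "exceeding the threshold of ${alarm_threshold}%."
)

# Sample values; alarm_name is an assumed value for illustration.
details = DETAILS_TEMPLATE.substitute(
    alarm_target="ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx",
    alarm_name="ob_host_cpu_percent_over_threshold",
    value="101.0",
    alarm_threshold="100.0",
)
print(details)
```

Template.substitute raises KeyError when a placeholder has no value, which makes a missing variable visible instead of leaving a raw ${...} in the message.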
Impact on the system
Transient high CPU utilization has little impact on the system. However, sustained high CPU utilization reduces the system throughput and increases the request latency.
Possible causes
This problem is commonly found in the following scenarios:
The OBServer executes complex SQL statements.
Other applications running on the host use too many CPU resources.
Suggested solutions
Check whether the alert is caused by high CPU utilization of the observer process.
Run the top command on the OBServer where the alert is triggered to check for processes that use too many CPU resources.
If the observer process uses too many CPU resources, the following alerts may be simultaneously triggered:
tenant_cpu_percent_over_threshold
If the OBServer executes complex SQL statements, the observer process may use a large number of CPU resources. As a result, the ob_host_cpu_percent_over_threshold alert and the preceding alerts may be triggered simultaneously.
Resolve the preceding alerts based on the suggested solutions provided in the respective topics, and then check whether the ob_host_cpu_percent_over_threshold alert persists.
If the alert persists, go to the next step.
If the alert is cleared, the issue is fixed.
If other processes use too many CPU resources, go to the next step.
Contact the database administrator or an O&M engineer to disable processes irrelevant to the application.
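The process check in the first step can also be scripted. The following is a minimal sketch that uses ps instead of top, because ps output is easier to parse; the function names and the sample output are hypothetical. Note that ps reports CPU usage averaged over the lifetime of each process, whereas top shows recent usage, so the numbers may differ from what top displays.

```python
import subprocess


def parse_ps(ps_output, limit=5):
    """Parse `ps -eo pid,pcpu,comm` output into (pid, cpu_percent, command) tuples.

    The result is sorted by CPU usage in descending order and truncated to limit.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        pid, pcpu, comm = line.split(None, 2)
        rows.append((int(pid), float(pcpu), comm.strip()))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def top_cpu_processes(limit=5):
    """Run ps on the local host and return the top CPU consumers."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout, limit)


# Hypothetical ps output, for illustration only.
sample = """\
  PID %CPU COMMAND
 1234 350.0 observer
   42  1.5 node_exporter
  777 12.0 java
"""
top_two = parse_ps(sample, limit=2)
print(top_two)  # [(1234, 350.0, 'observer'), (777, 12.0, 'java')]
```

If the observer process is at the top of the list, continue with the related alerts described above; otherwise, the remaining entries indicate which other processes to review.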