Description
This alert is triggered when the CPU utilization of observer processes exceeds the threshold.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_cpu_percent |
| Source | select /*+read_consistency(weak)*/ tenant_name, tenant_id, stat_id, value from v$sysstat, __all_tenant where stat_id IN (140005, 140006) and (con_id > 1000 or con_id = 1) and __all_tenant.tenant_id = v$sysstat.con_id;
Note |
| Collected metric | sysstat_value |
| Metric expression | 100 * sum(sysstat_value{metric_group="sysstat",stat_id="140006",@LABELS}) by (@GBLABELS) / sum(sysstat_value{metric_group="sysstat",stat_id="140005",@LABELS}) by (@GBLABELS) |
| Collection cycle | 1 second |
The value of the metric ob_cpu_percent indicates the CPU utilization of the observer processes in the cluster. When this value exceeds the threshold, this alert is triggered. The default threshold is 90%.
Note
- When the statistical event ID (stat_id) is 140005, the maximum number of available CPU cores of observer processes in the cluster is collected.
- When the stat_id is 140006, the number of CPU cores used by observer processes in the cluster is collected.
The ratio of the sum of the values of the value field when stat_id is 140006 to the sum of the values of the value field when stat_id is 140005 is used as the CPU utilization of the observer processes in the cluster.
Alert rule
| Metric | Default threshold (unit: %) | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_cpu_percent | 90 | 60 seconds | 60 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Server |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}. The CPU utilization of observer processes is ${value}%, exceeding the threshold of ${alarm_threshold}%.
Overview example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The CPU utilization of observer processes exceeds the threshold.
Details example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The CPU utilization of observer processes is 91.0%, exceeding the threshold of 90.0%.
Impact on the system
High CPU utilization reduces the system throughput and increases the request latency.
A transient increase in CPU utilization does not generate a great impact. If the CPU utilization remains high for a long time, it must be handled.
Possible cause
This problem is commonly found during complex SQL statement execution.
Suggested solutions
Check whether the tenant_cpu_percent_over_threshold alert is triggered.
If yes, troubleshoot and resolve the issue. For more information, see tenant_cpu_percent_over_threshold.
Otherwise, go to the next step.
Check whether the load of observer processes simultaneously increases in multiple tenants. The load of observer processes running on the same OBServer stacks. The alert is triggered when the total load exceeds the threshold.
To reduce high CPU utilization caused by simultaneous load increase in multiple tenants, perform the following steps:
Scale out the cluster.
Add OBServers to the OceanBase cluster. For more information, see Add an OBServer.
Apply throttling to the OceanBase cluster. For more information, see Apply throttling to an OceanBase cluster.