Description
This alert is triggered when the CPU utilization of an OceanBase tenant exceeds the threshold.
The CPU utilization of an OceanBase tenant = The CPU resources used by the tenant/Total CPU resources of all units assigned to the tenant in the current primary zone
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_cpu_percent Note: The metric indicates the percentage of CPU usage by a tenant in the cluster. By default, when the percentage is greater than 95%, this alert is triggered. |
| Source | SQL: select /*+read_consistency(weak)*/ tenant_name, tenant_id, stat_id, value from v$sysstat, __all_tenant where stat_id IN (140006, 140005) and (con_id > 1000 or con_id = 1) and __all_tenant.tenant_id = v$sysstat.con_id; Note: The value of the value field is used as the value of sysstat_value, and the values of other fields are used as labels. |
| Collected metric | sysstat_value |
| Metric expression | 100 * sum(sysstat_value{metric_group="sysstat",stat_id="140006",@LABELS}) by (@GBLABELS) / sum(sysstat_value{metric_group="sysstat",stat_id="140005",@LABELS}) by (@GBLABELS) Note: |
| Collection cycle | 1 second |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Warning | Tenant |
Alert rule
| Metric | Default threshold (unit: %) | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_cpu_percent{app="OB"} | 95 | 0 seconds | 60 seconds | 5 minutes |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}. The CPU utilization is ${value}%, exceeding the threshold of ${alarm_threshold}%.
Overview example: ob_cluster=C1-1000:tenant_name=tenant-1:svr_ip=xxx.xxx.xxx.xxx. The CPU utilization of the OceanBase tenant exceeds the threshold.
Details example: ob_cluster=C1-1000:tenant_name=tenant-1:svr_ip=xxx.xxx.xxx.xxx. The CPU utilization of the OceanBase tenant reaches 96.0%, exceeding the threshold of 95.0%.
Impact on the system
No impact
Possible causes
This problem is commonly found in the following scenarios:
The application queries a large amount of data or generates hotspot data.
The resource plan of a tenant cannot cope with business requirements or hotspot data is generated.
Suggested solutions
Check whether the load is normal for the application.
Go to the Performance Monitoring page of the tenant that generated the alert in the OceanBase Cloud Platform (OCP) console. View the CPU Utilization curve chart on the Performance & SQL tab. Check whether the CPU utilization at the alert time surged abruptly in comparison with the values in the last one to seven days.
If yes, the CPU utilization is abnormal.
Otherwise, the CPU utilization is normal.
If the CPU utilization surge is caused by normal traffic, you need to increase the unit specifications of the tenant and allocate more CPU resources to it. For more information, see Manage unit specifications.
The high CPU utilization may be caused by querying a large amount of data or hotspot traffic.
You can perform the following steps based on the actual scenarios.
A large amount of data is queried in SQL execution
In the left-side navigation pane, click SQL Diagnosis to go to the TopSQL tab, and then check whether SQL queries with high CPU utilization exist.
If yes, optimize the SQL query that causes high CPU utilization.
Otherwise, perform the next step.
The high CPU utilization is caused by slow SQL queries.
In the left-side navigation pane, click SQL Diagnosis to go to the SlowSQL tab, and then check the diagnosis result.
Optimize the slow SQL queries you found.
The high CPU utilization is caused by hotspot rows of a specific node in the tenant.
Apply throttling to the tenant. For more information, see Apply throttling to an OceanBase cluster.