Description
By default, if the CPU utilization of an OBServer node remains higher than 90% for more than 5 minutes and the CPU utilization of an SQL statement on the OBServer node exceeds 15%, the system triggers an oas_anomaly_sql_from_anomaly_event_analysis_cpu_percent_high alert. Then, OceanBase Cloud Platform (OCP) analyzes the CPU time trends of the SQL statement in the last 24 hours. If the CPU time of the SQL statement is higher than the lowest value in the last 24 hours, the system determines that the execution performance of the SQL statement has degraded and triggers the oas_anomaly_sql_from_anomaly_event_analysis_perf_degradation alert.
Principle
| Parameter | Value |
|---|---|
| Metric | None |
| Source | None |
| Collected metric | None |
| Metric expression | None |
| Collection cycle | None |
Alert rule
By default, this alert is triggered if the CPU utilization of an OBServer node remains higher than 90% for more than 5 minutes, the CPU utilization of an SQL statement on the OBServer node exceeds 15%, and the CPU time of the SQL statement is higher than the lowest value in the last 24 hours. The alerting criteria are the same as the diagnostic criteria for the diagnostic result Deteriorated performance on the Suspicious SQL tab.
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on monitored conditions | Critical | Cluster |
Alert templates
Details
Template:
The CPU utilization of the host ${svr_ip} in the ${ob_cluster_name} cluster reaches ${server_cpu_percent}% during the period from ${start_time} to ${end_time}. The execution performance of the SQL statement ${sql_id} in the ${db_name} database of the ${tenant_name} tenant has degraded. The CPU utilization of the SQL statement is ${sql_cpu_percent}%. The average CPU time is ${sql_cpu_time_ms} ms. The average response time is ${sql_elapsed_time_ms} ms. The baseline period is from ${baseline_start_time} to ${baseline_end_time}. During the baseline period, the average CPU time is ${sql_baseline_cpu_time_ms} ms, and the average response time is ${sql_baseline_elapsed_time_ms} ms. Recommended action: [Refresh Plan Cache] [Throttle].Example:
Overview: alarm_template_id=0:ob_cluster=admin-4:tenant_name=a****:db_name=test:sql_id=6CE3107B0430F8CA5AAE8B5094EC****:recommend_operations=FLUSHING_PLAN_CACHE,RATE_LIMIT. The CPU utilization of the host exceeds the threshold because the execution performance of the SQL statement has degraded. Period: 2023-09-05T23:51:17 to 2023-09-05T23:56:47 Symptom: The CPU utilization of the OBServer node xxx.xxx.xxx.xxx reaches 22.09%. Associated SQL statement: SQL ID: 6CE3107B0430F8CA5AAE8B5094EC****. Database: test. Tenant: a****. Hit rule: The execution performance degrades. Basic information of the SQL statement: The CPU utilization of the SQL statement is 15.01%. The average CPU time is 77.29 ms. The average response time is 77.36 ms. Baseline period: 2023-09-04T23:55:47 to 2023-09-04T23:56:17 Information about the baseline SQL statements: The average CPU time is 0.19 ms. The average response time is 0.26 ms. Recommended action: [Refresh Plan Cache] [Throttle].
Impact on the system
The CPU utilization of the OBServer node is affected.
Possible causes
- The SQL statement is poorly or frequently executed.
- The volume of business data increases or the leader role is switched.
- The SQL execution plan has changed.
Solutions
- Refresh the plan cache. To do so, you can execute the alert clearance plan. For more information, see Execute the alert clearance plan.
- Optimize the SQL statement.
- Perform throttling based on the business assessment.