In database operations, high CPU usage or uneven CPU usage is very common. The causes of this scenario can be divided into two main directions:
The first is cluster environment anomalies leading to CPU anomalies. For example, network card failures, disk failures, driver failures, or third-party services (such as security services) can cause other processes to preempt some CPU resources on the node where the OBServer is located, resulting in CPU anomalies on the monitoring dashboard.
The second is that the observer process itself consumes excessive CPU resources, for example, due to SQL execution, remote execution, background threads, or RPC processing.
When identifying CPU anomalies, the first step is to use Linux system monitoring tools, such as tsar, to rule out the first type of issues. This article focuses on the second type of issues, specifically troubleshooting why the observer process consumes a large amount of CPU resources.
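If tsar is not available on the node, a minimal first pass can be done with `/proc/stat` and POSIX tools alone. The sketch below estimates per-core busy percentages from two snapshots taken one second apart; a single saturated core while the others sit idle points at an environment problem (interrupts, disk, third-party services) rather than database load. The `/tmp/stat*` file names are just scratch paths.

```shell
# Take two per-core snapshots of /proc/stat, one second apart.
grep '^cpu[0-9]' /proc/stat > /tmp/stat1
sleep 1
grep '^cpu[0-9]' /proc/stat > /tmp/stat2

# For each core, compute busy% = (delta busy ticks) / (delta total ticks).
# Fields per /proc/stat line: user nice system idle iowait irq softirq steal ...
paste /tmp/stat1 /tmp/stat2 | awk '{
  busy1 = $2 + $3 + $4 + $7 + $8 + $9;      idle1 = $5 + $6;
  busy2 = $13 + $14 + $15 + $18 + $19 + $20; idle2 = $16 + $17;
  total = (busy2 + idle2) - (busy1 + idle1);
  if (total > 0) printf "%s %.1f%% busy\n", $1, 100 * (busy2 - busy1) / total;
}'
```

If every core on the node is uniformly busy and the observer is the top consumer, proceed with the observer-focused troubleshooting below.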
Before troubleshooting CPU usage issues, we will introduce the general troubleshooting approach and common causes.
By examining the foreground and background load in the ASH Report, we can identify the root causes of the following scenarios:
The leader node of the tenant receives a larger share of internal RPC/inner SQL requests, creating a CPU hotspot.
Troubleshooting approach: Examine the background load in the ASH Report to identify how internal RPC/inner SQL requests are generated and analyze them by SQL dimension.
CPU overload due to improper settings for the degree of parallelism in PX parallel execution.
Troubleshooting approach: In the ASH Report, most database tasks are related to PX parallel execution. Analyze them by SQL dimension and adjust the degree of parallelism.
A series of background tasks, such as minor and major compactions, consume excessive CPU resources.
Troubleshooting approach: Examine the background load in the ASH Report to identify the compaction-related background tasks and analyze them by SQL dimension.
Internal retry events in OceanBase Database consume excessive CPU resources.
Troubleshooting approach: Examine retry-related wait events in the ASH Report and analyze them by SQL dimension.
By drilling down into the ASH Report, we can identify the root causes of the following scenarios:
ODP routing errors cause SQL traffic to concentrate on a single node.
Troubleshooting approach: Examine the foreground and background load in the ASH Report. If the foreground load is concentrated on a single node, or the background load has many events with `Action` set to `OB_REMOTE_SYNC_EXECUTE`, check whether ODP routing hits the right node by examining the `partition_hit` column in `GV$OB_SQL_AUDIT`.

Uneven distribution of table structures leads to uneven SQL pressure.
Troubleshooting approach: Examine the front and background load to identify hot nodes and tenants. Then, examine the tablet_id access patterns in the ASH Report data to find the tablet_id with the highest access count and balance it.
Frequent use of a specific account causes SQL traffic to concentrate on certain partitions.
Troubleshooting approach: Analyze the data in the frequently used account's partitions and plan the data distribution accordingly.
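As a concrete sketch of the `partition_hit` check mentioned above, the query below aggregates requests in `GV$OB_SQL_AUDIT` by node and routing result (the tenant ID is a placeholder; adapt the filter to your environment). A large share of `partition_hit = 0` rows on one node suggests ODP routing misses:

```sql
SELECT svr_ip, partition_hit, COUNT(*) AS cnt
FROM oceanbase.GV$OB_SQL_AUDIT
WHERE tenant_id = 1002 -- placeholder: the tenant under investigation
GROUP BY svr_ip, partition_hit
ORDER BY cnt DESC;
```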
Process
Step 1: Confirm that the process with high CPU usage on the node is an observer process
Before troubleshooting, confirm whether the process with high CPU usage on the specified node is an observer process. The command is as follows:

```shell
ps -eo pid,user,%cpu,cmd --sort=-%cpu
```

Sample output:

```shell
   PID USER   %CPU CMD
124648 user_a 99.9 other_process
  1332 user_b 50.5 observer
```

Based on the output:
If the process with high CPU usage is an observer process, further troubleshooting of OceanBase internal threads is required.
If it is another process (such as other_process), contact the relevant user to make adjustments.
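As a convenience, the same `ps` invocation can be filtered to show only the observer process (the `[o]bserver` bracket trick keeps the `awk` filter itself out of the match; on a node with no observer running, only the header line is printed):

```shell
# Header line plus any observer process, sorted by CPU usage.
ps -eo pid,user,%cpu,cmd --sort=-%cpu | awk 'NR==1 || /[o]bserver/'
```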
Step 2: Check the internal threads of OceanBase Database
Check for suspicious threads on the OBServer node.
If the CPU usage of the observer process is high, you can run the following command to view the thread status:
```shell
top -p `pidof observer` -H
```

This command displays the CPU usage of each thread in the observer process, helping you determine whether the tenant worker threads are consuming excessive resources.
Check the overall load and locate the hot tenant based on the ASH Report.
In the Top Active Tenants section of the ASH Report, you can locate the hot tenant based on the corresponding information.
Determine whether the tenant's average number of active sessions (`Average Active Sessions`) is higher than the tenant's `CPU cores * 4` (estimated as `tenant allocated CPU cores * 4`):

If `Average Active Sessions` is higher than this value, the current system load is at a high level and the current tenant is a hot tenant.

Determine whether the load of the current tenant is concentrated in the foreground or background:

If the `Avg Active Sessions` of the foreground sessions is higher, the load of the current tenant is concentrated in the foreground sessions.

If the `Avg Active Sessions` of the background sessions is higher, the load of the current tenant is concentrated in the background sessions.
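To obtain the comparison baseline, you can look up the tenant's allocated CPU and compute the `* 4` threshold in one query from the sys tenant. This is a sketch; it assumes the `GV$OB_UNITS` view with a `MAX_CPU` column is available in your version, and `1002` is a placeholder tenant ID.

```sql
SELECT tenant_id, svr_ip, max_cpu, max_cpu * 4 AS aas_threshold
FROM oceanbase.GV$OB_UNITS
WHERE tenant_id = 1002; -- placeholder: the tenant under investigation
```

For example, a tenant allocated 8 CPU cores yields a threshold of 32: an `Average Active Sessions` value above 32 marks it as a hot tenant.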
Locate the hot node.
In the Top Node Load section of the ASH Report, you can determine whether the current node is a hot node by comparing the average number of active sessions (Avg Active Sessions) of the foreground and background sessions under the same IP address. If the difference is greater than 20%, the current node is a hot node.
If yes, it indicates that the load is unevenly distributed across nodes in the system.
If not, you do not need to pay special attention to the execution information of any specific node.
Analyze the database time.
After locating the hot tenant or hot node, you can analyze the database time.
If the load is concentrated in the foreground sessions, you need to further analyze the content of the Top Foreground DB Time section.
For foreground sessions, three types of events consume CPU resources: events with an `Event Name` value of `ON CPU`, and a few wait events that consume CPU resources. Specifically:

`ON CPU`: the database considers that the current session is consuming CPU resources to process tasks.

Wait events:

Wait events starting with `retry`: indicate that OceanBase Database is internally retrying a certain type of operation, which consumes some CPU resources.

`exec inner sql wait`: indicates that during execution, the SQL statement is waiting for the results of accessing an internal table, such as user table location and schema information. This process also consumes some CPU resources.
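Following the same pattern as the full queries later in this article, a sketch like the one below attributes `retry`-type wait events to specific SQL statements (the time window, node IP address, and tenant ID are placeholders):

```sql
SELECT sql_id, name.name AS event, COUNT(*) AS cnt
FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'
  AND svr_ip = '192.168.1.1'
  AND tenant_id = 1002
  AND name.name LIKE '%retry%'
GROUP BY sql_id, event
ORDER BY cnt DESC LIMIT 20;
```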
If the load is concentrated in the background sessions, you need to further analyze the content of the Top Background DB Time section.
For background sessions, in addition to the three types of events mentioned for foreground sessions, you also need to pay attention to the properties of background tasks. The `Program`, `Module`, and `Action` fields indicate the properties of background tasks:

If the value of the `Action` field is `OB_DAS_ASYNC_ACCESS`, the background task is a remote DAS execution of an SQL statement.

If the value of the `Program` field is similar to `xxx_PxPool`, the background task is a PX parallel execution.

If the value of the `Program` field is similar to `xxx_DAG`, the background task is a minor or major compaction task.

If the value of the `Program` field is similar to `xxx_RPC_REQUEST`, the background task is an RPC task sent internally by an OBServer node.
Note
The above are common types of background tasks. There are hundreds of other types of background tasks.
After the above analysis, you can determine the nodes, tenants, and specific task types that consume most of the CPU resources in the database system. These troubleshooting steps are usually enough to locate the main direction of the problem and potential bottlenecks. In some scenarios, however, you may want to drill down further for more information.
Based on the previous analysis, if the CPU resources are mainly consumed by PX or DAS execution, you need to know the SQL ID of the corresponding SQL statement. You can run the following SQL statement to query the SQL ID.
Note
Modify the time range (`sample_time`), node IP address (`svr_ip`), and tenant ID (`tenant_id`) in the query based on your actual environment.

```sql
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id AS tenant_id, svr_ip, svr_port, sql_id,
        sql_plan_line_id, -- operator ID
        program, module, action, time_model, name.name AS event,
        count(*) AS cnt, max(sample_time), min(sample_time)
    FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
    LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
    WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00' -- corresponding time period
      AND svr_ip = '192.168.1.1' -- hot node
      AND tenant_id = 1002 -- hot tenant
      AND program LIKE '%xxxx%' -- match the Program value of sessions with an Event Name of ON CPU
    GROUP BY tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, program, module, action, time_model, event
    ORDER BY cnt DESC LIMIT 200;
```

The following examples illustrate this.
Example 1:
If the matched `Program` value is similar to `*_SQL_CMD`, user SQL requests of the current database consume excessive CPU resources. The root cause may be data distribution imbalance, ODP routing errors, and so on. You can run the following SQL statement to locate the high-consumption SQL statements.

```sql
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id AS tenant_id, svr_ip, svr_port, sql_id,
        sql_plan_line_id, -- operator ID
        program, module, action, time_model, name.name AS event,
        count(*) AS cnt, max(sample_time), min(sample_time)
    FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
    LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
    WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'
      AND svr_ip = '192.168.1.1'
      AND tenant_id = 1002
      AND program LIKE '%_SQL_CMD%' -- filter user SQL requests
    GROUP BY tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, program, module, action, time_model, event
    ORDER BY cnt DESC LIMIT 200;
```

Example 2:
If the matched `Program` value is similar to `*_RPC_PROCESS*`, excessive CPU resources are consumed in processing RPC requests. The root cause may be that a series of background tasks, such as minor and major compactions, consume excessive CPU resources, or that the leader node of a tenant receives a larger share of internal RPC/inner SQL requests. You can run the following SQL statement to locate the high-consumption SQL statements.

```sql
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id AS tenant_id, svr_ip, svr_port, sql_id,
        program, module, action, time_model, name.name AS event,
        count(*) AS cnt, max(sample_time), min(sample_time)
    FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
    LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
    WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'
      AND svr_ip = '192.168.1.100'
      AND tenant_id = 1
      AND (program LIKE '%_RPC_PROCESS%')
    GROUP BY tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, event
    ORDER BY cnt DESC LIMIT 200;
```

Example 3:
If the matched `Program` value is similar to `*PX*` or `*_DAS_*`, excessive CPU resources are consumed in PX or remote DAS execution. The root cause may be that the degree of parallelism for PX parallel execution is not set properly. You can run the following SQL statement to locate the high-consumption SQL statements.

```sql
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id AS tenant_id, svr_ip, svr_port, sql_id,
        program, module, action, time_model, name.name AS event,
        count(*) AS cnt, max(sample_time), min(sample_time)
    FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
    LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
    WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'
      AND svr_ip = '192.168.1.1'
      AND tenant_id = 1002
      AND (IN_PX_EXECUTION = 'Y' OR IN_REMOTE_DAS_EXECUTION = 'Y')
    GROUP BY tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, event
    ORDER BY cnt DESC LIMIT 200;
```
Based on the information obtained, refer to the relevant documentation for handling or contact technical support for assistance.