In database operations, high or uneven CPU usage is very common. This scenario can be caused by the following two main factors:
The first factor is cluster environment anomalies leading to CPU issues. For example, network card interrupts, disk failures, driver failures, or third-party services (such as security services) can cause some CPU resources on the OBServer node to be preempted by other processes, resulting in CPU anomalies as observed in monitoring.
The second factor is the observer process itself consuming excessive CPU resources. For example, SQL execution, remote execution, background threads, or RPC execution.
When identifying CPU anomalies, the first step is to use Linux system monitoring tools, such as tsar, to rule out the first type of issues. This article focuses on the second type of issues, specifically troubleshooting why the observer process is consuming a significant amount of CPU resources.
Before troubleshooting CPU usage issues, it's essential to understand the general troubleshooting approach and common causes.
By examining the ASH Report's foreground and background loads, identify the root causes of the following scenarios:
The leader node of a tenant receives more internal RPC/inner SQL requests, creating a CPU hotspot.
Troubleshooting approach: Check the ASH Report's background load to identify how internal RPC/inner SQL requests are generated and analyze them by SQL dimension.
CPU overload due to improper settings for PX parallel execution concurrency.
Troubleshooting approach: In the ASH Report, most database tasks are related to PX parallel execution. Analyze them by SQL dimension and adjust the concurrency settings.
A series of background tasks, such as data dump and merge, consume excessive CPU resources.
Troubleshooting approach: Check the ASH Report's background load to identify how internal RPC/inner SQL requests are generated and analyze them by SQL dimension.
Internal retry events in OceanBase Database consume excessive CPU resources.
Troubleshooting approach: Check the retry-related wait events in the ASH Report and analyze them by SQL dimension.
By drilling down into the ASH Report, identify the root causes of the following scenarios:
ODP routing errors causing SQL traffic to concentrate on a single node.
Troubleshooting approach: Check the ASH Report's foreground and background loads. If the foreground load is concentrated on a single node or there are many
Actionevents with the valueOB_REMOTE_SYNC_EXECUTEin the background load, check if ODP routing is hitting the correct partitions using thepartition_hitcolumn inGV$OB_SQL_AUDIT.Uneven table structure distribution leading to uneven SQL pressure.
Troubleshooting approach: Identify hot nodes and tenants based on the foreground and background loads. Then, check the tablet_id access patterns in the ASH Report data to find the tablet_id with the highest access count and balance it.
Frequent user account usage causing hotspots in specific partitions.
Troubleshooting approach: Analyze the partition data corresponding to frequently used accounts and plan data distribution accordingly.
Process overview
Step 1: Confirm that the high-CPU process is an observer process
Before troubleshooting, confirm whether the high-CPU process on the specified node is an observer process. The SQL statement is as follows:
ps -eo pid,user,%cpu,cmd --sort=-%cpu
Sample output:
PID USER %CPU CMD
124648 user_a 99.9 other_process
1332 user_b 50.5 observer
Based on the returned result:
If the high-CPU process is an observer, further troubleshooting of the OceanBase internal threads is required.
If it is another process (such as other_process), contact the relevant user for adjustment.
Step 2: Check the internal threads of OceanBase Database
Check the suspicious threads in the OBServer node.
If the observer process occupies a high CPU, you can run the following command to view the threads of the observer process:
top -p `pidof observer` -HThis command displays the CPU usage of each thread in the observer process, helping you determine whether the tenant work threads are consuming excessive resources.
Check the overall load based on the ASH report and locate the hot tenants.
Check the Top Active Tenants section in the ASH report to locate the hot tenants.
Determine whether the average active sessions (
Average Active Sessions) of a tenant is higher than theCPU cores * 4of the tenant (estimated based on theCPU cores allocated to the tenant * 4):If the
Average Active Sessionsis high, it indicates that the system load is at a high level, and the tenant is a hot tenant.Determine whether the load of the tenant is concentrated in the foreground or background.
If the
Avg Active Sessionsof the foreground sessions is high, it indicates that the load of the tenant is concentrated in the foreground sessions.If the
Avg Active Sessionsof the background sessions is high, it indicates that the load of the tenant is concentrated in the background sessions.
Locate the hot node.
Check the Top Node Load section in the ASH report. If the difference between the average active sessions of the foreground and background sessions under the same IP address is greater than 20%, the node is a hot node.
If the node is a hot node, it indicates that the load is unevenly distributed at the node level.
If the node is not a hot node, you do not need to pay special attention to the execution information of the node.
Analyze the database time.
After locating the hot tenant or hot node, you can analyze the database time.
If the load is concentrated in the foreground sessions, you need to analyze the content of the Top Foreground DB Time section.
For foreground sessions, three types of events consume CPU resources: events with the
Event Namevalue ofON CPUand a few waiting events with CPU overhead. Specifically:ON CPU: The database considers that the current session is consuming CPU resources to process tasks.Waiting events:
Waiting events starting with
retry: Indicates that OceanBase Database is retrying a certain type of task internally. This retry mechanism consumes some CPU resources.exec inner sql wait: Indicates that the results of an SQL statement that accesses an internal table are returned during execution. For example, the SQL statement retrieves the location of a user table or schema information. This process also consumes some CPU resources.
If the load is concentrated in the background sessions, you need to analyze the content of the Top Background DB Time section.
For background sessions, in addition to the three types of events mentioned for foreground sessions, you also need to pay attention to the attributes of background tasks. The
Program,Module, andActionfields indicate the attributes of background tasks:If the value of the
Actionfield isOB_DAS_ASYNC_ACCESS, the background task is an SQL remote DAS execution task.If the value of the
Programfield is similar toxxx_PxPool, the background task is a PX parallel execution task.If the value of the
Programfield is similar toxxx_DAG, the background task is a major compaction task.If the value of the
Programfield is similar toxxx_RPC_REQUEST, the background task is an RPC task sent internally by an OBServer node.
Note
The above are common types of background tasks. There are hundreds of other types of background tasks.
After the above analysis, you can determine the nodes, tenants, and task types where the CPU overhead is concentrated. You can basically confirm where the main CPU overhead of the database system is. Generally, the above troubleshooting steps can help you locate the main direction of the problem and potential bottlenecks. However, in some scenarios, you may want to further analyze to obtain more information.
If the overhead of background tasks is mainly in PX or DAS execution, you need to know the corresponding sql_id. You can run the following SQL statement to query the sql_id.
Note
Modify the query conditions such as the time range (
sample_time), node IP address (svr_ip), and tenant ID (tenant_id) based on your actual environment.obclient(root@sys)[oceanbase]> SELECT ash.tenant_id as tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, -- Operator ID program, module, action, time_model, name.name as event, count(*) as cnt, max(sample_time), min(sample_time) FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash LEFT JOIN oceanbase.V$EVENT_NAME name on ash.event_no = name.`event#` WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'-- Corresponding time period AND svr_ip = '192.168.1.1' -- Hot node AND tenant_id = 1002 -- Hot tenant AND program LIKE '%xxxx%' -- Match Program with Event Name value of ON CPU GROUP BY tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, program, module, action, time_model, event ORDER BY cnt desc LIMIT 200;The following examples are provided for your reference.
Example 1:
*_SQL_CMDindicates that the user SQL requests of the current database consume excessive CPU resources. The root cause may be data distribution imbalance, ODP routing errors, and so on.You can run the following SQL statement to locate high-consumption SQL statements.
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id as tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, -- Operator ID program, module, action, time_model, name.name as event, count(*) as cnt, max(sample_time), min(sample_time) FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash LEFT JOIN oceanbase.V$EVENT_NAME name on ash.event_no = name.`event#` WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00' AND svr_ip = '192.168.1.1' AND tenant_id = 1002 AND program LIKE '%_SQL_CMD%' -- Filter user SQL requests GROUP BY tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, program, module, action, time_model, event ORDER BY cnt desc LIMIT 200;Example 2:
*_RPC_PROCESS*, which indicates that excessive CPU resources are consumed in processing RPC requests. The root cause may be that a series of background tasks, such as major compactions, consume excessive CPU resources. The leader node of a tenant accepts more internal RPC/inner SQL requests.You can run the following SQL to locate high-consumption SQL statements.
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id as tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, name.name as event, count(*) as cnt, max(sample_time), min(sample_time) FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash LEFT JOIN oceanbase.V$EVENT_NAME name on ash.event_no = name.`event#` WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00' AND svr_ip = '192.168.1.100' AND tenant_id = 1 AND (program LIKE '%_RPC_PROCESS%') GROUP BY tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, event ORDER BY cnt desc LIMIT 200;Example 3:
*PX*or*_DAS_*, which indicates that excessive CPU resources are consumed in remote execution. The root cause may be that the concurrency level of PX parallel execution is improperly set.You can run the following SQL to locate high-consumption SQL statements.
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id as tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, name.name as event, count(*) as cnt, max(sample_time), min(sample_time) FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash LEFT JOIN oceanbase.V$EVENT_NAME name on ash.event_no = name.`event#` WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00' AND svr_ip = '192.168.1.1' AND tenant_id = 1002 AND (IN_PX_EXECUTION = 'Y' or IN_REMOTE_DAS_EXECUTION = 'Y') GROUP BY tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, event ORDER BY cnt desc LIMIT 200;
Based on the information obtained, refer to the relevant documentation for handling or contact technical support for assistance.
