In database operations and maintenance scenarios, high CPU usage or uneven CPU usage is very common. The causes in this scenario can be broadly categorized into the following two directions:
The first is CPU abnormalities caused by cluster environment issues. For example, network interface card interrupts, disk failures, driver failures, or third-party services (such as security services) may cause other processes to preempt part of the CPU resources on the node where the OBServer resides, resulting in abnormal CPU usage shown in monitoring.
The second is that the observer process itself occupies excessive CPU resources. For example, for SQL execution, remote execution, background threads, or RPC execution.
When identifying a scenario with CPU abnormalities, you must first use Linux system monitoring tools, such as tsar, to rule out the first type of problem. This topic focuses on locating the second type of problem, that is, troubleshooting why the observer process occupies a large amount of CPU resources.
Before troubleshooting high CPU usage or uneven CPU usage, let's first introduce the general troubleshooting approach and common causes.
By examining the foreground and background loads in the ASH Report, locate the root cause in the following scenarios:
The tenant's leader node receives more internal RPC/inner SQL requests, forming a CPU hotspot.
Troubleshooting approach: Check the background load in the ASH Report to determine how the internal RPC/inner SQL are generated, and analyze them in relation to the SQL dimension.
Inappropriate concurrency settings for PX parallel execution lead to CPU explosion.
Troubleshooting approach: In the ASH Report, you will see that most database tasks are related to PX parallel execution. Analyze them in relation to the SQL dimension and adjust the concurrency.
A series of background tasks, represented by minor compactions, consume excessive CPU resources.
Troubleshooting approach: Check the background load in the ASH Report to determine how the internal RPC/inner SQL are generated, and analyze them in relation to the SQL dimension.
Internal retry events in OceanBase Database consume excessive CPU resources.
Troubleshooting approach: Check the retry-type wait events in the ASH Report and analyze them in relation to the SQL dimension.
Drill down through the ASH Report to analyze and locate the root cause in the following scenarios:
ODP routing errors cause SQL traffic to concentrate on one node.
Troubleshooting approach: Check the foreground and background loads in the ASH Report to see if the foreground load is concentrated on one node or if there are many events with
Actionset toOB_REMOTE_SYNC_EXECUTEin the background load. Then, based on thepartition_hitcolumns inGV$OB_SQL_AUDIT, check whether the ODP routing hit its target.Uneven table structure distribution leads to uneven SQL pressure.
Troubleshooting approach: After checking the foreground and background loads to locate hot nodes, tenants, etc., then check the tablet_id access patterns in the ASH Report data to identify the tablet_id with the highest access count and perform balancing.
Frequent-use accounts cause user SQL to be hot in certain partitions.
Troubleshooting approach: Analyze the partition data corresponding to the frequently-used accounts and plan data distribution appropriately.
Procedure
Step 1: Confirm whether the process with high CPU usage on the node is the observer process
Before troubleshooting, confirm whether the process with high CPU usage on the specified node is the observer process using the following statement.
ps -eo pid,user,%cpu,cmd --sort=-%cpu
A sample output is as follows:
PID USER %CPU CMD
124648 user_a 99.9 other_process
1332 user_b 50.5 observer
Based on the return result:
If the process with high CPU usage is the observer, further troubleshoot OceanBase internal threads.
If it is another process (such as other_process), contact the relevant user for adjustment.
Step 2: Troubleshoot OceanBase internal threads
Troubleshoot suspicious threads in OBServer.
If the observer process occupies a high CPU, you can use the following command to view its thread status:
top -p `pidof observer` -HThis command displays the CPU usage of each thread in the observer process, helping you determine if tenant worker threads are consuming too much resources.
Analyze the ASH Report for overall load and identify hot tenants.
View the Top Active Tenants section in the ASH Report to locate hot tenants based on the information provided.
Determine whether the average active sessions (
Average Active Sessions) for the same tenant is higher than the tenant'sCPU cores * 4(estimated asCPU cores allocated to the tenant * 4):If
Average Active Sessionsis high, it indicates the current system load is at a high level, and the tenant is a hot tenant.Then, determine whether the tenant's load is concentrated in the foreground or background.
If the
Avg Active Sessionsfor foreground sessions is relatively high, it means the tenant's load is concentrated in foreground sessions.If the
Avg Active Sessionsfor background sessions is relatively high, it means the tenant's load is concentrated in background sessions.
Identify hot nodes.
View the Top Node Load section in the ASH Report. Determine whether the difference between the average active sessions (Avg Active Sessions) for foreground and background sessions under the same IP address exceeds 20% to identify if the current node is a hot node.
If so, it indicates there is node-level load imbalance in the system.
If not, no special attention is required for the execution information of any particular node.
Analyze database time consumption.
After identifying hot tenants or hot nodes, you can analyze the current database time consumption.
If the load is concentrated in foreground sessions, you need to further analyze the content of the Top Foreground DB Time section.
For foreground sessions, three types of events consume CPU resources: events with an
Event Namevalue ofON CPU, and a small number of CPU-intensive wait events. Specifically:ON CPU: The database considers the current session to be consuming CPU resources to process tasks.Wait events:
Wait events starting with
retry: This indicates that some type of retry is in progress within OceanBase Database. The internal retry mechanism of OceanBase Database consumes some CPU resources.exec inner sql wait: This indicates that the result of an SQL query accessing an internal table has returned during execution, such as retrieving user table locations or schema information. This process also consumes some CPU resources.
If the load is concentrated in background sessions, you need to further analyze the content of the Top Background DB Time section.
For background sessions, in addition to paying attention to the three types of events mentioned for foreground sessions, you also need to consider the properties of background tasks. The three fields
Program,Module, andActionlabel the properties of background tasks:If the value corresponding to the
Actionfield isOB_DAS_ASYNC_ACCESS, it indicates that the background task is an SQL remote DAS execution.If the value corresponding to the
Programfield is something likexxx_PxPool, it indicates that the background task is a PX parallel execution.If the value corresponding to the
Programfield is something likexxx_DAG, it indicates that the background task is a minor compaction or major compaction task.If the value corresponding to the
Programfield is something likexxx_RPC_REQUEST, it indicates that the background task is an RPC task sent internally within the OBServer node.
Note
The above are common types of background tasks. In addition, there are hundreds of other types of background tasks.
After the analysis above, you can already identify the nodes, tenants, and specific task types where the CPU overhead is mainly concentrated, and basically confirm where the main CPU overhead of the database system lies. Generally, the troubleshooting steps above can help us locate the general direction of the problem and potential bottleneck points. However, in some scenarios, we still wish to drill down further to obtain more information.
Based on the previous analysis, if the background task overhead is mainly in PX or DAS execution, you need to know what the corresponding sql_id is. You can construct the following SQL query to find it.
Note
Modify the filter conditions in the query based on your actual environment, such as the time range (
sample_time), node IP address (svr_ip), and tenant ID (tenant_id).obclient(root@sys)[oceanbase]> SELECT ash.tenant_id as tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, -- Operator ID program, module, action, time_model, name.name as event, count(*) as cnt, max(sample_time), min(sample_time) FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash LEFT JOIN oceanbase.V$EVENT_NAME name on ash.event_no = name.`event#` WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'--Corresponding time period AND svr_ip = '192.168.1.1' -- Hot node AND tenant_id = 1002 -- Hot tenant AND program LIKE '%xxxx%' -- Matches the Program where the Event Name value is ON CPU GROUP BY tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, program, module, action, time_model, event ORDER BY cnt desc LIMIT 200;The following examples illustrate this.
Example 1:
*_SQL_CMDindicates that the user SQL request in the current database consumes excessive CPU resources. The root cause may be uneven data distribution, ODP routing errors, etc.You can use the following SQL statement to locate high-consumption SQL statements.
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id as tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, -- Operator ID program, module, action, time_model, name.name as event, count(*) as cnt, max(sample_time), min(sample_time) FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash LEFT JOIN oceanbase.V$EVENT_NAME name on ash.event_no = name.`event#` WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00' AND svr_ip = '192.168.1.1' AND tenant_id = 1002 AND program LIKE '%_SQL_CMD%' -- Filter user SQL requests GROUP BY tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, program, module, action, time_model, event ORDER BY cnt desc LIMIT 200;Example 2:
*_RPC_PROCESS*indicates that processing RPC requests consumes excessive CPU resources. The root cause may be that a series of background tasks, such as minor compactions, consume too much CPU resources; the tenant's leader node receives more internal RPC/inner SQL requests.You can use the following SQL statement to locate the high-consumption SQL.
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id as tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, name.name as event, count(*) as cnt, max(sample_time), min(sample_time) FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash LEFT JOIN oceanbase.V$EVENT_NAME name on ash.event_no = name.`event#` WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00' AND svr_ip = '192.168.1.100' AND tenant_id = 1 AND (program LIKE '%_RPC_PROCESS%') GROUP BY tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, event ORDER BY cnt desc LIMIT 200;Example 3:
*PX*or*_DAS_*indicates that remote execution consumes excessive CPU resources. The root cause may be an inappropriate concurrency setting for PX parallel execution.You can use the following SQL statement to locate the high-consumption SQL.
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id as tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, name.name as event, count(*) as cnt, max(sample_time), min(sample_time) FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash LEFT JOIN oceanbase.V$EVENT_NAME name on ash.event_no = name.`event#` WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00' AND svr_ip = '192.168.1.1' AND tenant_id = 1002 AND (IN_PX_EXECUTION = 'Y' or IN_REMOTE_DAS_EXECUTION = 'Y') GROUP BY tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, event ORDER BY cnt desc LIMIT 200;
Based on the information obtained, refer to the relevant documentation for troubleshooting or contact technical support for assistance.
