In database operations, high CPU usage or uneven CPU usage is very common. The causes of this scenario can be divided into two main directions:
The first is cluster environment anomalies leading to CPU anomalies. For example, network card failures, disk failures, driver failures, or third-party services (such as security services) can cause other processes to preempt some CPU resources on the node where the OBServer is located, resulting in CPU anomalies on the monitoring dashboard.
The second is that the observer process itself consumes excessive CPU resources, for example, due to SQL execution, remote execution, background threads, or RPC processing.
When identifying CPU anomalies, the first step is to use Linux system monitoring tools, such as tsar, to rule out the first type of issues. This article focuses on the second type of issues, specifically troubleshooting why the observer process consumes a large amount of CPU resources.
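If tsar is not available on the node, a minimal first pass can be done with `/proc/stat` and POSIX tools alone. The sketch below estimates per-core busy percentages from two snapshots taken one second apart; a single saturated core while the others sit idle points at an environment problem (interrupts, disk, third-party services) rather than database load. The `/tmp/stat*` file names are just scratch paths.

```shell
# Take two per-core snapshots of /proc/stat, one second apart.
grep '^cpu[0-9]' /proc/stat > /tmp/stat1
sleep 1
grep '^cpu[0-9]' /proc/stat > /tmp/stat2

# For each core, compute busy% = (delta busy ticks) / (delta total ticks).
# Fields per /proc/stat line: user nice system idle iowait irq softirq steal ...
paste /tmp/stat1 /tmp/stat2 | awk '{
  busy1 = $2 + $3 + $4 + $7 + $8 + $9;      idle1 = $5 + $6;
  busy2 = $13 + $14 + $15 + $18 + $19 + $20; idle2 = $16 + $17;
  total = (busy2 + idle2) - (busy1 + idle1);
  if (total > 0) printf "%s %.1f%% busy\n", $1, 100 * (busy2 - busy1) / total;
}'
```

If every core on the node is uniformly busy and the observer is the top consumer, proceed with the observer-focused troubleshooting below.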
Before troubleshooting CPU usage issues, we will introduce the general troubleshooting approach and common causes.
By examining the foreground and background load in the ASH Report, we can identify the root causes of the following scenarios:
The leader node of the tenant receives a larger share of internal RPC/inner SQL requests, creating a CPU hotspot.
Troubleshooting approach: Examine the background load in the ASH Report to identify how internal RPC/inner SQL requests are generated and analyze them by SQL dimension.
CPU overload due to improper settings for the degree of parallelism in PX parallel execution.
Troubleshooting approach: In the ASH Report, most database tasks are related to PX parallel execution. Analyze them by SQL dimension and adjust the degree of parallelism.
A series of background tasks, such as minor and major compactions, consume excessive CPU resources.
Troubleshooting approach: Examine the background load in the ASH Report to identify the compaction-related background tasks and analyze them by SQL dimension.
Internal retry events in OceanBase Database consume excessive CPU resources.
Troubleshooting approach: Examine retry-related wait events in the ASH Report and analyze them by SQL dimension.
By drilling down into the ASH Report, we can identify the root causes of the following scenarios:
ODP routing errors cause SQL traffic to concentrate on a single node.
Troubleshooting approach: Examine the foreground and background load in the ASH Report. If the foreground load is concentrated on a single node, or the background load has many events with `Action` set to `OB_REMOTE_SYNC_EXECUTE`, check whether ODP routing hits the right node by examining the `partition_hit` column in `GV$OB_SQL_AUDIT`.

Uneven distribution of table structures leads to uneven SQL pressure.
Troubleshooting approach: Examine the front and background load to identify hot nodes and tenants. Then, examine the tablet_id access patterns in the ASH Report data to find the tablet_id with the highest access count and balance it.
Frequent use of a specific account causes SQL traffic to concentrate on certain partitions.
Troubleshooting approach: Analyze the data in the frequently used account's partitions and plan the data distribution accordingly.
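As a concrete sketch of the `partition_hit` check mentioned above, the query below aggregates requests in `GV$OB_SQL_AUDIT` by node and routing result (the tenant ID is a placeholder; adapt the filter to your environment). A large share of `partition_hit = 0` rows on one node suggests ODP routing misses:

```sql
SELECT svr_ip, partition_hit, COUNT(*) AS cnt
FROM oceanbase.GV$OB_SQL_AUDIT
WHERE tenant_id = 1002 -- placeholder: the tenant under investigation
GROUP BY svr_ip, partition_hit
ORDER BY cnt DESC;
```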
Process
Step 1: Confirm that the process with high CPU usage on the node is an observer process
Before troubleshooting, confirm whether the process with high CPU usage on the specified node is an observer process. The command is as follows:

```shell
ps -eo pid,user,%cpu,cmd --sort=-%cpu
```

Sample output:

```shell
   PID USER   %CPU CMD
124648 user_a 99.9 other_process
  1332 user_b 50.5 observer
```

Based on the output:
If the process with high CPU usage is an observer process, further troubleshooting of OceanBase internal threads is required.
If it is another process (such as other_process), contact the relevant user to make adjustments.
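As a convenience, the same `ps` invocation can be filtered to show only the observer process (the `[o]bserver` bracket trick keeps the `awk` filter itself out of the match; on a node with no observer running, only the header line is printed):

```shell
# Header line plus any observer process, sorted by CPU usage.
ps -eo pid,user,%cpu,cmd --sort=-%cpu | awk 'NR==1 || /[o]bserver/'
```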
Step 2: Check the internal threads of OceanBase Database
Check for suspicious threads on the OBServer node.
If the CPU usage of the observer process is high, you can run the following command to view the thread status:
```shell
top -p `pidof observer` -H
```

This command displays the CPU usage of each thread in the observer process, helping you determine whether the tenant worker threads are consuming excessive resources.
Check the overall load and locate the hot tenant based on the ASH Report.
In the Top Active Tenants section of the ASH Report, you can locate the hot tenant based on the corresponding information.
Determine whether the tenant's average number of active sessions (`Average Active Sessions`) is higher than the tenant's `CPU cores * 4` (estimated as `tenant allocated CPU cores * 4`):

If `Average Active Sessions` is higher than this value, the current system load is at a high level and the current tenant is a hot tenant.

Determine whether the load of the current tenant is concentrated in the foreground or background:

If the `Avg Active Sessions` of the foreground sessions is higher, the load of the current tenant is concentrated in the foreground sessions.

If the `Avg Active Sessions` of the background sessions is higher, the load of the current tenant is concentrated in the background sessions.
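To obtain the comparison baseline, you can look up the tenant's allocated CPU and compute the `* 4` threshold in one query from the sys tenant. This is a sketch; it assumes the `GV$OB_UNITS` view with a `MAX_CPU` column is available in your version, and `1002` is a placeholder tenant ID.

```sql
SELECT tenant_id, svr_ip, max_cpu, max_cpu * 4 AS aas_threshold
FROM oceanbase.GV$OB_UNITS
WHERE tenant_id = 1002; -- placeholder: the tenant under investigation
```

For example, a tenant allocated 8 CPU cores yields a threshold of 32: an `Average Active Sessions` value above 32 marks it as a hot tenant.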
Locate the hot node.
In the Top Node Load section of the ASH Report, you can determine whether the current node is a hot node by comparing the average number of active sessions (Avg Active Sessions) of the foreground and background sessions under the same IP address. If the difference is greater than 20%, the current node is a hot node.
If yes, it indicates that the load is unevenly distributed across nodes in the system.
If not, you do not need to pay special attention to the execution information of any specific node.
Analyze the database time.
After locating the hot tenant or hot node, you can analyze the database time.
If the load is concentrated in the foreground sessions, you need to further analyze the content of the Top Foreground DB Time section.
For foreground sessions, three types of events consume CPU resources: events with an `Event Name` value of `ON CPU`, and a few wait events that consume CPU resources. Specifically:

`ON CPU`: the database considers that the current session is consuming CPU resources to process tasks.

Wait events:

Wait events starting with `retry`: indicate that OceanBase Database is internally retrying a certain type of operation, which consumes some CPU resources.

`exec inner sql wait`: indicates that during execution, the SQL statement is waiting for the results of accessing an internal table, such as user table location and schema information. This process also consumes some CPU resources.
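Following the same pattern as the full queries later in this article, a sketch like the one below attributes `retry`-type wait events to specific SQL statements (the time window, node IP address, and tenant ID are placeholders):

```sql
SELECT sql_id, name.name AS event, COUNT(*) AS cnt
FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'
  AND svr_ip = '192.168.1.1'
  AND tenant_id = 1002
  AND name.name LIKE '%retry%'
GROUP BY sql_id, event
ORDER BY cnt DESC LIMIT 20;
```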
If the load is concentrated in the background sessions, you need to further analyze the content of the Top Background DB Time section.
For background sessions, in addition to the three types of events mentioned for foreground sessions, you also need to pay attention to the properties of background tasks. The `Program`, `Module`, and `Action` fields indicate the properties of background tasks:

If the value of the `Action` field is `OB_DAS_ASYNC_ACCESS`, the background task is a remote DAS execution of an SQL statement.

If the value of the `Program` field is similar to `xxx_PxPool`, the background task is a PX parallel execution.

If the value of the `Program` field is similar to `xxx_DAG`, the background task is a minor or major compaction task.

If the value of the `Program` field is similar to `xxx_RPC_REQUEST`, the background task is an RPC task sent internally by an OBServer node.
Note
The above are common types of background tasks. There are hundreds of other types of background tasks.
After the above analysis, you can determine the nodes, tenants, and specific task types that consume most of the CPU resources in the database system. These troubleshooting steps are usually enough to locate the main direction of the problem and potential bottlenecks. In some scenarios, however, you may want to drill down further for more information.
Based on the previous analysis, if the CPU resources are mainly consumed by PX or DAS execution, you need to know the SQL ID of the corresponding SQL statement. You can run the following SQL statement to query the SQL ID.
Note
Modify the time range (`sample_time`), node IP address (`svr_ip`), and tenant ID (`tenant_id`) in the query based on your actual environment.

```sql
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id AS tenant_id, svr_ip, svr_port, sql_id,
        sql_plan_line_id, -- operator ID
        program, module, action, time_model, name.name AS event,
        count(*) AS cnt, max(sample_time), min(sample_time)
    FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
    LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
    WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00' -- corresponding time period
      AND svr_ip = '192.168.1.1' -- hot node
      AND tenant_id = 1002 -- hot tenant
      AND program LIKE '%xxxx%' -- match the Program value of sessions with an Event Name of ON CPU
    GROUP BY tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, program, module, action, time_model, event
    ORDER BY cnt DESC LIMIT 200;
```

The following examples illustrate this.
Example 1:
If the matched `Program` value is similar to `*_SQL_CMD`, user SQL requests of the current database consume excessive CPU resources. The root cause may be data distribution imbalance, ODP routing errors, and so on. You can run the following SQL statement to locate the high-consumption SQL statements.

```sql
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id AS tenant_id, svr_ip, svr_port, sql_id,
        sql_plan_line_id, -- operator ID
        program, module, action, time_model, name.name AS event,
        count(*) AS cnt, max(sample_time), min(sample_time)
    FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
    LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
    WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'
      AND svr_ip = '192.168.1.1'
      AND tenant_id = 1002
      AND program LIKE '%_SQL_CMD%' -- filter user SQL requests
    GROUP BY tenant_id, svr_ip, svr_port, sql_id, sql_plan_line_id, program, module, action, time_model, event
    ORDER BY cnt DESC LIMIT 200;
```

Example 2:
If the matched `Program` value is similar to `*_RPC_PROCESS*`, excessive CPU resources are consumed in processing RPC requests. The root cause may be that a series of background tasks, such as minor and major compactions, consume excessive CPU resources, or that the leader node of a tenant receives a larger share of internal RPC/inner SQL requests. You can run the following SQL statement to locate the high-consumption SQL statements.

```sql
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id AS tenant_id, svr_ip, svr_port, sql_id,
        program, module, action, time_model, name.name AS event,
        count(*) AS cnt, max(sample_time), min(sample_time)
    FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
    LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
    WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'
      AND svr_ip = '192.168.1.100'
      AND tenant_id = 1
      AND (program LIKE '%_RPC_PROCESS%')
    GROUP BY tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, event
    ORDER BY cnt DESC LIMIT 200;
```

Example 3:
If the matched `Program` value is similar to `*PX*` or `*_DAS_*`, excessive CPU resources are consumed in PX or remote DAS execution. The root cause may be that the degree of parallelism for PX parallel execution is not set properly. You can run the following SQL statement to locate the high-consumption SQL statements.

```sql
obclient(root@sys)[oceanbase]> SELECT ash.tenant_id AS tenant_id, svr_ip, svr_port, sql_id,
        program, module, action, time_model, name.name AS event,
        count(*) AS cnt, max(sample_time), min(sample_time)
    FROM oceanbase.CDB_WR_ACTIVE_SESSION_HISTORY ash
    LEFT JOIN oceanbase.V$EVENT_NAME name ON ash.event_no = name.`event#`
    WHERE sample_time BETWEEN '2025-09-09 20:00:00' AND '2025-09-09 20:10:00'
      AND svr_ip = '192.168.1.1'
      AND tenant_id = 1002
      AND (IN_PX_EXECUTION = 'Y' OR IN_REMOTE_DAS_EXECUTION = 'Y')
    GROUP BY tenant_id, svr_ip, svr_port, sql_id, program, module, action, time_model, event
    ORDER BY cnt DESC LIMIT 200;
```
Based on the information obtained, refer to the relevant documentation for handling or contact technical support for assistance.