ASH (Active Session History) is a core diagnostic tool in OceanBase Database that records the history of active sessions. It helps DBAs quickly identify performance bottlenecks and root causes of issues.
Based on the complexity of troubleshooting, the problems that ASH can address are categorized into two types:
Type 1 issues: These can be identified and resolved by analyzing specific metrics in ASH. Examples include abnormal CPU or I/O resource usage, bottlenecks caused by specific wait events, and resource consumption at the session level.
Type 2 issues: These require correlating information from multiple dimensions in ASH through exploratory analysis. Such issues demand a higher level of expertise from ASH users.
Based on the above categorization, ASH has significant advantages over traditional monitoring in the following scenarios:
Real-time dynamic troubleshooting: When SQL response times suddenly spike or system resources encounter bottlenecks, traditional monitoring cannot observe all active sessions in the database. ASH can capture fine-grained information in real time.
Cross-node troubleshooting in distributed systems: SQL execution in OceanBase Database often involves multiple OBServer nodes. ASH can associate execution information across nodes using identifiers such as `trace_id` or `sql_id`, enabling end-to-end tracing.
SQL execution stage-level diagnostics: ASH can observe SQL execution at the operator level. By combining this with wait events and execution stages, it provides insights that are not available through plan_monitor or other methods.
Resource contention and competition analysis: When multiple sessions contend for the same resource (such as locks or cache) or a single session consumes excessive CPU resources, ASH can precisely troubleshoot by tracking real-time active sessions and associating wait events.
Session-level troubleshooting: For scenarios such as a single session experiencing long-term anomalies, blocking, or long transactions, you can directly filter by session_id to analyze its historical behavior, providing more comprehensive diagnostic information compared to processlist.
Overhead analysis of background tasks: In the ASH model, background tasks are treated as background sessions. They are marked with attributes such as `Program`, `Module`, and `Action`, and their execution overhead can be analyzed in fine-grained detail using dimensions such as wait events.
Cold start and initialization issues: Scenarios such as frequent session connection establishment and disconnection, slow SQL parsing, and slow plan generation are often blind spots for traditional monitoring.
General troubleshooting approach
Step 1: Identify the performance issue
You can identify the performance issue by observing the following signs:
A decrease in QPS, an increase in RT, an abnormal CPU load, or a surge in I/O wait.
Cross-verification with external tools such as OCP monitoring and Linux system tools (tsar/iostat).
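Before generating a full report, a quick query against the ASH view used later in this article can confirm that the anomaly is visible in ASH samples. This is a sketch: the 10-minute window is an illustrative assumption, and you should adjust it to your incident time.

```sql
-- Quick spot check (sketch): count recent ASH samples by wait event to see
-- whether the suspected anomaly window shows an unusual event distribution.
-- The 10-minute window is an arbitrary example; adjust it to your incident time.
SELECT event, count(1) AS cnt
FROM oceanbase.GV$OB_ACTIVE_SESSION_HISTORY
WHERE sample_time > date_sub(now(), INTERVAL 10 MINUTE)
GROUP BY event
ORDER BY cnt DESC;
```

A sudden jump in a single event (for example, a lock or I/O wait) is a strong hint about which report dimensions to drill into in Step 2.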
Step 2: Obtain an ASH report
You can obtain an ASH report based on your preliminary analysis of the current scenario.
When you obtain an ASH report for the first time, we recommend that you do not specify any parameters except for the time range. This way, you can obtain a report for the entire cluster to understand the overall status. Then, based on the preliminary analysis results, you can obtain more detailed reports for specific tenants, nodes, or SQL_IDs. HTML reports provide better readability and interactivity. We recommend that you use HTML reports.
Generate a basic report (for the entire cluster).
At least two parameters are required: the start time and end time of the report.
For example, in the `sys` tenant, execute the following command to obtain an ASH report for the entire cluster.

```sql
obclient(root@sys)[oceanbase]> CALL DBMS_WORKLOAD_REPOSITORY.ASH_REPORT('2025-09-23 17:14:00', '2025-09-30 18:14:00', report_type=>'html') \G
```

Obtain an ASH report for a specific SQL_ID.
Assume that the SQL_ID is `1CA21C86CCF10D8635BF62C17BEA7128`. In the `sys` tenant, execute the following command to obtain an ASH report.

```sql
obclient(root@sys)[oceanbase]> CALL DBMS_WORKLOAD_REPOSITORY.ASH_REPORT('2025-09-23 17:14:00', '2025-09-30 18:14:00', sql_id=>'1CA21C86CCF10D8635BF62C17BEA7128', report_type=>'html') \G
```

Obtain an ASH report for a specific OBServer node.
Assume that the IP address of the OBServer node is `192.168.0.1`. In the `sys` tenant, execute the following command to obtain an ASH report.

```sql
obclient(root@sys)[oceanbase]> CALL DBMS_WORKLOAD_REPOSITORY.ASH_REPORT('2025-09-23 17:14:00', '2025-09-30 18:14:00', svr_ip=>'192.168.0.1', svr_port=>'2828', report_type=>'html') \G
```
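The report can also be narrowed to a single tenant, as mentioned above for drill-down. The following is a sketch that assumes the `ASH_REPORT` procedure in your OceanBase version accepts a `tenant_id` parameter, and `1002` is a hypothetical tenant ID; verify the procedure signature in the documentation for your version.

```sql
-- Sketch: tenant-scoped ASH report. Assumes your version's ASH_REPORT
-- accepts a tenant_id parameter; 1002 is a hypothetical tenant ID.
obclient(root@sys)[oceanbase]> CALL DBMS_WORKLOAD_REPOSITORY.ASH_REPORT('2025-09-23 17:14:00', '2025-09-30 18:14:00', tenant_id=>1002, report_type=>'html') \G
```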
Step 3: Interpret the report and locate the issue
Preliminary issue location.
All resource consumption (CPU, I/O, locks, and so on) is ultimately reflected in wait events. ASH records these events by sampling. Focus on the following report modules:
Top Foreground DB Time: identifies the main time-consuming events in foreground sessions.
Top Blocking Sessions: quickly identifies the blocking source.
Top IO Events: distinguishes the I/O sources (log writing, user SQL, and background tasks).
Typical scenario identification:
Row lock blocking: the `row lock wait` event occupies a significant proportion.
CPU bottleneck: you need to make a comprehensive judgment based on background tasks, parallel execution, retry events, and so on.
I/O bottleneck: the `palf write` event, DAG-related wait events, or disk reads associated with SQL statements appear.
Perform detailed issue location based on the preliminary findings.
Blocking source location. This applies to lock wait scenarios.
When the `row lock wait` event occupies a significant proportion in the current ASH data, you can perform further issue location as follows:
Obtain the top blocking sessions.
Here is an example:
```sql
obclient(root@sys)[oceanbase]> SELECT blocking_session_id, count(1) AS cnt FROM oceanbase.GV$OB_ACTIVE_SESSION_HISTORY WHERE sample_time BETWEEN '2025-09-23 17:14:00' AND '2025-09-30 18:14:00' AND event = 'row lock wait' GROUP BY blocking_session_id ORDER BY cnt DESC;
```

Based on the query result:
If one `blocking_session_id` occupies a significant proportion, it indicates that a long transaction or slow query is holding a lock without releasing it.
If multiple `blocking_session_id` values occupy similar proportions, it indicates concurrent hotspots, which may be caused by frequent updates to a single row, concentrated access to hot partitions, or uneven data distribution among different accounts.
Obtain the transaction ID of the transaction that holds the lock.
Based on the blocking session ID (`blocking_session_id`) obtained in the previous step, obtain the transaction ID of the transaction that holds the lock. Here is an example:

```sql
obclient(root@sys)[oceanbase]> SELECT tx_id, count(1) AS cnt FROM oceanbase.GV$OB_ACTIVE_SESSION_HISTORY WHERE sample_time BETWEEN '2025-09-23 17:14:00' AND '2025-09-30 18:14:00' AND blocking_session_id = xxxx GROUP BY tx_id ORDER BY cnt DESC;
```

Obtain the SQL_ID of the SQL statement that holds the lock.
Based on the transaction ID (`tx_id`) obtained in the previous step, obtain the SQL_ID of the SQL statement that holds the lock. Here is an example:

```sql
obclient(root@sys)[oceanbase]> SELECT * FROM oceanbase.GV$OB_SQL_AUDIT WHERE tx_id = xxx ORDER BY request_time DESC\G
```

By performing the above progressive queries, you can precisely locate the blocking session, transaction, and SQL statement.
Resource hotspot location. This applies to scenarios where CPU and I/O resources are unevenly distributed.
When CPU or I/O issues are obvious, you can perform the following troubleshooting steps:
Check the `partition_hit` column in the `GV$OB_SQL_AUDIT` view to confirm whether ODP routing is balanced.
Analyze the access frequency of `tablet_id` in ASH to identify hot partitions.
Check the Leader distribution. The Leader nodes of a tenant bear more internal RPCs and inner SQL statements. You need to check the background load in ASH.
If a large number of PX tasks appear in the ASH report, you need to adjust the parallelism based on the SQL statements.
Wait events whose `Program` is DAG-related usually indicate I/O or CPU consumption caused by minor and major compactions.
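The hot-partition check above can be sketched as a simple aggregation over the ASH view. The time range and the `tablet_id > 0` filter (to exclude samples not attributed to any tablet) are illustrative assumptions; adjust them to your incident window.

```sql
-- Sketch: identify hot tablets by grouping ASH samples by tablet_id.
-- The time range is an example; tablet_id > 0 filters out unattributed samples.
SELECT tablet_id, count(1) AS cnt
FROM oceanbase.GV$OB_ACTIVE_SESSION_HISTORY
WHERE sample_time BETWEEN '2025-09-23 17:14:00' AND '2025-09-30 18:14:00'
  AND tablet_id > 0
GROUP BY tablet_id
ORDER BY cnt DESC LIMIT 10;
```

A small number of tablets dominating the sample counts suggests hot partitions that the partitioning strategy in the Solutions section should disperse.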
For more information about the troubleshooting steps, see Troubleshoot CPU usage imbalance and Troubleshoot I/O bottlenecks.
Solutions
For row lock blocking, you can try the following methods:
Optimize the business logic that holds the lock to shorten the transaction duration.
Split hot rows or implement application-layer throttling.
Continuously monitor the Top Blocking Sessions in the ASH report.
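As an emergency measure, once the blocking session has been located you can terminate it to release the lock. The sketch below uses the MySQL-compatible KILL statement; `3221225472` is a hypothetical session ID. Note that killing a session rolls back its open transaction, so confirm the impact with the business owner first.

```sql
-- Emergency measure (sketch): terminate the blocking session located earlier.
-- 3221225472 is a hypothetical session ID; its open transaction is rolled back.
obclient(root@sys)[oceanbase]> KILL 3221225472;
```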
For CPU usage imbalance, you can try the following methods:
Correct the ODP routing rules to ensure balanced traffic.
Adjust the partitioning strategy to disperse hot tablets.
Optimize the PX parallelism parameter.
Refactor the business logic of SQL statements with high retry events.
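Parallelism can also be tuned per statement with a hint rather than globally. The following sketch assumes a hypothetical table `t1` and uses the `PARALLEL` hint; a lower degree reduces CPU pressure from PX workers at the cost of longer statement runtime.

```sql
-- Sketch: cap the degree of parallelism for one statement with the PARALLEL hint.
-- t1 is a hypothetical table; 2 is an example degree chosen to limit PX CPU usage.
SELECT /*+ PARALLEL(2) */ count(*) FROM t1;
```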
For I/O bottlenecks, you can try the following methods:
Optimize the execution plans of SQL statements with high disk reads.
Adjust the trigger timing of minor and major compactions.
Expand the I/O resources or use higher-performance storage media.