This article presents three beginner-friendly experiments and three key analysis dimensions, illustrated with real-world failure cases, to show how to quickly identify performance issues such as CPU spikes, I/O bottlenecks, and distributed execution blocking using the [G]V$OB_ACTIVE_SESSION_HISTORY view.
ASH implementation principle
ASH functions like an automatic camera for the database, taking snapshots of all active sessions (such as those executing SQL) every second. These snapshots are stored in the GV$OB_ACTIVE_SESSION_HISTORY system view. The implementation principles are as follows:
Active session filtering mechanism
ASH records all tasks executed in the database and assigns a unique session_id to each, including:
- User client connections to the database for executing SQL requests.
- Internal RPC executions.
- Background thread tasks, such as dump threads, Clog threads, and timer threads.
Only the status of active sessions is recorded; idle sessions are skipped. Specifically:
- Sessions executing SQL are considered active. If a session is in a Sleep state and not processing SQL requests, it is considered idle and will not be recorded.
- Background threads that are not executing tasks or waiting for new task scheduling are considered idle and will not be recorded.
- Sessions waiting for resources (such as locks, disk I/O) are marked with their waiting events (such as db file data read) by ASH.
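The active/waiting distinction above can be observed directly in the view. A minimal sketch (assuming access to the v$ob_active_session_history view) that counts how many recent samples were on CPU versus waiting:

```sql
-- Distribution of recent samples by session state ('ON CPU' vs. 'WAITING')
SELECT
    session_state,
    COUNT(*) AS samples
FROM v$ob_active_session_history
WHERE sample_time > now() - 60  -- last minute
GROUP BY session_state;
```

On a busy system, a large WAITING share is the signal to drill into the event column next.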
Periodic sampling mechanism
Within each OBServer, there is a dedicated ASH thread that samples all active sessions in the database every second, recording their status. Specifically:
- Each row in GV$OB_ACTIVE_SESSION_HISTORY represents the status of one active session at a specific sampling moment.
- If a session's workload is very short (e.g., less than 1 second), it might not be captured by ASH, much like blinking during a photo. For such scenarios, it is recommended to repeat the workload and widen the query time range so that the ASH statistics are more reliable.
Circular buffer design
ASH snapshots are stored in a 30MB circular buffer. When the stored data exceeds 30MB, the oldest data is automatically overwritten. V4.2.5 BP3 introduced automatic archiving of ASH data to the Workload Repository (WR) before it is overwritten; in earlier versions, ASH history could be lost. To preserve the latest ASH records, you can manually trigger a WR snapshot:
CALL DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
After executing this command, the ASH snapshots not yet persisted to WR are persisted at a 10:1 sampling ratio.
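To confirm that the manual snapshot landed, you can list the most recent WR snapshots. This sketch assumes the DBA_WR_SNAPSHOT dictionary view (under the oceanbase schema in MySQL mode; verify the view name against your version's documentation):

```sql
-- List the most recent WR snapshots to confirm the manual snapshot succeeded
SELECT snap_id, begin_interval_time, end_interval_time
FROM oceanbase.DBA_WR_SNAPSHOT
ORDER BY snap_id DESC
LIMIT 5;
```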
Beginner's ASH experiments
The following three beginner-friendly experiments open up the ASH timeline for database analysis and let you quickly experience its power.
Experiment 1: View the database status of the last 10 seconds (real-time monitoring).
Check what the database has been doing in the last 10 seconds:
-- View the system's status in the last 10 seconds
SELECT
sample_time AS Time, -- Timestamp precise to microseconds
session_id AS SessionID, -- Unique session ID
CASE WHEN session_state = 'ON CPU' THEN 'Working' ELSE 'Waiting' END AS Status, -- Working status: CPU busy or waiting for resources
event AS WaitReason -- Specific wait event (such as lock, I/O, etc.)
FROM v$ob_active_session_history
WHERE sample_time > now() - 10 -- Last 10 seconds
AND session_type = 'FOREGROUND'
ORDER BY sample_time DESC;
The results are as follows:
+----------------------------+------------+---------+------------------------+
| Time                       | SessionID  | Status  | WaitReason             |
+----------------------------+------------+---------+------------------------+
| 2025-06-11 20:16:15.307564 | 3221931170 | Waiting | px loop condition wait |
| 2025-06-11 20:16:14.285204 | 3221928286 | Working |                        |
| 2025-06-11 20:16:14.285204 | 3221923503 | Waiting | wait in request queue  |
| 2025-06-11 20:16:14.285204 | 3221923627 | Waiting | db file data read      |
| 2025-06-11 20:16:14.285204 | 3221927472 | Waiting | sync rpc               |
| 2025-06-11 20:16:13.262695 | 3221929034 | Working |                        |
| 2025-06-11 20:16:12.240768 | 3221927472 | Working |                        |
+----------------------------+------------+---------+------------------------+
Observations:
- Session 3221931170 is waiting for a parallel execution (PX) task to complete (px loop condition wait).
- Session 3221923627 is waiting for a disk read to complete (db file data read).
- Session 3221927472 was working at 20:16:12; two seconds later, it was waiting for an RPC to return results (sync rpc).
Experiment 2: Identify the most active sessions in the last 10 minutes (load distribution).
Check for busy sessions in the last 10 minutes:
-- Identify the most active sessions in the last 10 minutes
SELECT
session_id AS SessionID,
COUNT(*) AS WorkingSeconds
FROM v$ob_active_session_history
WHERE sample_time > now() - 600 -- Last 10 minutes
AND session_type='FOREGROUND'
GROUP BY session_id
ORDER BY WorkingSeconds DESC limit 3;
The result is as follows:
+------------+----------------+
| SessionID  | WorkingSeconds |
+------------+----------------+
| 3221977564 |            283 |
| 3221972645 |            142 |
| 3221916432 |             77 |
+------------+----------------+
Session 3221977564 was active for 283 seconds in the last 10 minutes. If only these three sessions were active during this period, session 3221977564 contributed 283 / (283 + 142 + 77) ≈ 56% of the overall database load.
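The percentage calculation above can also be folded into the query itself. A sketch using a window function (window functions are supported in OceanBase MySQL mode 4.x; verify for your version):

```sql
-- Each session's share of total foreground load in the last 10 minutes
SELECT
    session_id,
    COUNT(*) AS WorkingSeconds,
    ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS LoadPct
FROM v$ob_active_session_history
WHERE sample_time > now() - 600
  AND session_type = 'FOREGROUND'
GROUP BY session_id
ORDER BY WorkingSeconds DESC LIMIT 3;
```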
Experiment 3: Identify high-load SQL statements in the historical period (performance analysis).
Check for the most busy SQL statements in a past period:
-- Query the SQL_ID with the highest execution load in the past period
SELECT
SQL_ID,
COUNT(*) AS WorkingSeconds
FROM v$ob_active_session_history
WHERE sample_time BETWEEN
'2025-06-11 10:32:08'
AND '2025-06-11 11:32:07' -- Modify the time to the desired observation period
GROUP BY sql_id
ORDER BY WorkingSeconds DESC limit 3;
The result is as follows:
+----------------------------------+----------------+
| SQL_ID                           | WorkingSeconds |
+----------------------------------+----------------+
| 1D0BA376E273B9D622641124D8C59264 |          91265 |
| 19AAD9F2FE3CE0023298AB83F7E75775 |          13608 |
| 7BE7497CCCFE8978AD6B92A938D43929 |          13098 |
+----------------------------------+----------------+
The busiest SQL statement in the period was 1D0BA376E273B9D622641124D8C59264, which accumulated 91265 active-session seconds (summed across concurrent sessions, which is why the value can exceed the one-hour window).
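A SQL_ID alone does not tell you what the statement is. If the SQL Audit records are still retained, you can look up the statement text; a sketch joining against GV$OB_SQL_AUDIT (column names may vary slightly by version):

```sql
-- Fetch the text of the busiest SQL_ID; one matching audit row is enough
SELECT sql_id, query_sql
FROM GV$OB_SQL_AUDIT
WHERE sql_id = '1D0BA376E273B9D622641124D8C59264'
LIMIT 1;
```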
ASH analysis method
The V$OB_ACTIVE_SESSION_HISTORY view provides second-level snapshots of all sessions, allowing you to trace back the database's operational status for the past 7 days without any prior configuration. The following three dimensions are the core metrics analyzed in this article, collectively depicting the database's operational status:
- AAS (Average Active Session): This represents the average number of active sessions per second, reflecting the database's load. A higher AAS indicates a greater load.
- Wait Events: When active sessions are waiting for certain system resources, such as locks, network RPCs, or I/O, they enter a wait event state. Conversely, sessions not in a wait event state are considered working (ON_CPU).
- Execution Phases: SQL statements go through various phases within the database, such as IN_PARSE, IN_SQL_OPTIMIZE, IN_SQL_EXECUTION, and IN_STORAGE_READ. During execution, SQL also utilizes various resources and enters different states, such as IN_RPC_ENCODE, IN_CONNECTION_MGR, and IN_SEQUENCE_LOAD.
Using the ASH analysis method, you can reconstruct the overall operational status of the database during specific historical periods, not just the microscopic dimensions represented by monitoring metrics. This is the key difference between ASH and traditional monitoring.
Active Session Analysis (AAS)
Given that ASH only records active sessions in the database, we can calculate the number of active sessions per second. A higher number of active sessions indicates a busier, more heavily loaded system. The AAS metric represents the average number of active sessions per second, providing insight into the database's load at a second-level granularity.
To query AAS, you can use the following SQL statement:
SELECT
DATE_FORMAT(SAMPLE_TIME, '%Y-%m-%d %H:%i') AS TIME_SLOT,
COUNT(*) / 60 AS AAS
FROM V$OB_ACTIVE_SESSION_HISTORY
WHERE sample_time BETWEEN
'2025-03-13 10:32:08'
AND '2025-03-13 10:35:07' -- Modify the time range as needed
GROUP BY TIME_SLOT
ORDER BY TIME_SLOT;
The results are as follows:
+------------------+------+
| TIME_SLOT        | AAS  |
+------------------+------+
| 2025-03-13 10:32 |    2 |
| 2025-03-13 10:33 |   13 |
| 2025-03-13 10:34 |   15 |
| 2025-03-13 10:35 |    1 |
+------------------+------+
During the query period, there was a sudden increase in active sessions between 10:33 and 10:34. This could be due to a surge in user SQL traffic or a sudden resource-intensive background task. To determine if there was a resource bottleneck, consider factors such as tenant specifications and system conditions. Note that the V$OB_ACTIVE_SESSION_HISTORY view only stores recent historical snapshots, so ensure that the historical data in the view has not been overwritten before querying.
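Because the circular buffer overwrites old data, it is worth checking how far back the in-memory view still reaches before querying a historical period. A minimal sketch:

```sql
-- Oldest ASH sample still held in the circular buffer
SELECT MIN(sample_time) AS oldest_sample
FROM v$ob_active_session_history;
```

If the period you care about predates this timestamp, query the WR tables (e.g., CDB_WR_ACTIVE_SESSION_HISTORY) instead.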
Let's explore a real-world example to demonstrate how AAS is used in problem diagnosis and troubleshooting.
Case 1: CPU spike caused by background tasks
A company using OceanBase Database experienced a CPU spike at 8:00 AM. The user received an OCP alert and upgraded the tenant configuration to 6c20g by 8:14 AM, but CPU usage remained at 100%. The issue lasted about 2 hours before subsiding, and during this time the QPS shown on OCP did not change significantly.
To identify the cause of the load spike, we can query the ASH view. Since the fault period is known, we can directly analyze the AAS during this period.
SELECT
SVR_IP,
SVR_PORT,
SESSION_TYPE,
COUNT(*) as AAS
-- Raw sample count; divide by the period length in seconds to get the true per-second AAS
FROM V$OB_ACTIVE_SESSION_HISTORY
WHERE sample_time BETWEEN
'2025-03-13 8:00:00'
AND '2025-03-13 10:00:00'
GROUP BY SVR_IP, SVR_PORT, SESSION_TYPE ORDER BY AAS DESC limit 6;
The query results are as follows:
+--------------+----------+--------------+-------+
| SVR_IP       | SVR_PORT | SESSION_TYPE | AAS   |
+--------------+----------+--------------+-------+
| xx.xx.xx.x6  |    46775 | BACKGROUND   | 59633 |
| xx.xx.xx.x6  |    46775 | FOREGROUND   |  1324 |
| xx.xx.xx.x5  |    46779 | FOREGROUND   |   512 |
| xx.xx.xx.x9  |    46771 | FOREGROUND   |   412 |
| xx.xx.xx.x9  |    46771 | BACKGROUND   |   403 |
| xx.xx.xx.x5  |    46779 | BACKGROUND   |   379 |
+--------------+----------+--------------+-------+
Since OceanBase Database is distributed and typically consists of multiple nodes, it is best to include the SVR_IP and SVR_PORT fields when querying ASH; together they uniquely identify an OBServer node. The SESSION_TYPE field indicates whether a session is FOREGROUND or BACKGROUND. Foreground sessions are created when users connect to OceanBase Database and execute the SQL requests those users send. Background sessions handle internal tasks of OceanBase Database, including dumps, RPCs, and PX tasks.

From the query results, it is clear that the load is heavily concentrated on the xx.xx.xx.x6:46775 node, with an AAS of 59633 / (3600 * 2) = 8.28, far exceeding the tenant's original 3-CPU specification. Even after scaling up to 6 CPUs, the background sessions' demand was still not met.

Next, we need to investigate which background tasks on the xx.xx.xx.x6:46775 node are consuming so much CPU. In ASH, several fields identify the attributes of background sessions, including PROGRAM and GROUP_ID: PROGRAM identifies what the session does, while GROUP_ID identifies which resource group the session belongs to. We therefore construct the following SQL query:
SELECT
PROGRAM,
GROUP_ID,
COUNT(*) as CNT
FROM V$OB_ACTIVE_SESSION_HISTORY
WHERE SESSION_TYPE = 'BACKGROUND'
AND SVR_IP = 'xx.xx.xx.x6'
AND SVR_PORT = '46775'
AND sample_time BETWEEN
'2025-03-13 8:00:00'
AND '2025-03-13 10:00:00'
GROUP BY PROGRAM, GROUP_ID ORDER BY CNT DESC limit 10;
The query results are as follows:
+-------------------------+----------+-------+
| PROGRAM                 | GROUP_ID | CNT   |
+-------------------------+----------+-------+
| T1002_RPC_REQUEST       |       28 | 53278 |
| T1002_LogService        |        0 |  2793 |
| T1002_DAG               |        0 |  1425 |
| T1002_IOManager         |        0 |  1268 |
| T1002_LSRecoveryService |        0 |   438 |
+-------------------------+----------+-------+
The investigation showed that the main background load on the xx.xx.xx.x6:46775 node came from RPC execution in resource group 28. Resource group 28 is a dedicated resource group for internal tasks in OBServer and is allocated to the DBMS_SCHEDULER module; its workload is what drove the node's CPU usage up.

Further investigation confirmed that the DBMS_SCHEDULER module ran an SQL statistics collection task at 8:00 AM, and due to a configuration error, all statistics collection tasks were scheduled on the same node, causing the CPU spike.

Troubleshooting this issue by traditional means would have taken much longer: the monitoring system provided no further information, so we would have had to dig through OBServer runtime logs. ASH, by contrast, required no configuration specific to the DBMS_SCHEDULER module, yet pinpointed the root cause by combining AAS metrics with background session attributes.
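For future incidents of this kind, the manual division used above (59633 / 7200 ≈ 8.28) can be folded into the query itself. A sketch for a known two-hour window:

```sql
-- Per-node, per-session-type AAS over a fixed 2-hour window (7200 seconds)
SELECT
    SVR_IP,
    SVR_PORT,
    SESSION_TYPE,
    ROUND(COUNT(*) / 7200, 2) AS AAS
FROM V$OB_ACTIVE_SESSION_HISTORY
WHERE sample_time BETWEEN '2025-03-13 8:00:00' AND '2025-03-13 10:00:00'
GROUP BY SVR_IP, SVR_PORT, SESSION_TYPE
ORDER BY AAS DESC LIMIT 5;
```

An AAS persistently above the tenant's CPU count is the quickest sign that the node is saturated.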
Waiting event analysis
Any active session in the database is either in a waiting state (waiting for a resource to become available or a condition to be met) or in an ON CPU state (e.g., parsing SQL statements, processing SQL operators, or waiting for memory access to complete). In OceanBase Database, wait events indicate that an active session is in a waiting state. Identifying which wait events occurred during a session's execution, and what share of time each accounts for, is crucial for database optimization.
We can use SQL to query the waiting event ratio of foreground sessions during a specific period:
SELECT
SESSION_STATE,
EVENT,
COUNT(*) as CNT
FROM V$OB_ACTIVE_SESSION_HISTORY
WHERE SESSION_TYPE = 'FOREGROUND'
AND SVR_IP = 'xx.xx.xx.x7'
AND SVR_PORT = '2828'
AND sample_time BETWEEN
'2025-06-14 8:15:00'
AND '2025-06-14 10:31:00'
GROUP BY SESSION_STATE, EVENT ORDER BY CNT DESC limit 10;
The query results show that the tenant request queue was severely backed up (wait in request queue) during this period, which significantly increased SQL execution time.
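Before drilling into individual events, it can help to roll samples up by wait class (the same wait_class column is used in the I/O case below) to get a coarser picture first. A sketch:

```sql
-- Coarse breakdown of recent foreground waiting time by wait class
SELECT
    wait_class,
    COUNT(*) AS samples
FROM v$ob_active_session_history
WHERE session_type = 'FOREGROUND'
  AND session_state = 'WAITING'
  AND sample_time > now() - 600
GROUP BY wait_class
ORDER BY samples DESC;
```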
Case 2: Locate I/O bottlenecks
One afternoon, database request latency increased for an education company, leading to customer complaints. The DBA used OCP to monitor and found that the data disk read bandwidth was fully utilized during the fault period. The DBA speculated that the increased latency was due to excessive SQL read operations on the disk. However, the top SQL queries during the fault period showed no significant differences from normal periods, making it difficult to determine which specific SQL query was causing the disk read traffic.
To identify which SQL queries are related to the disk read traffic, you can query the ASH waiting events:
SELECT svr_ip, svr_port, con_id AS tenant_id, sql_id, event, COUNT(*) AS cnt
FROM GV$OB_ACTIVE_SESSION_HISTORY ash
WHERE wait_class IN ('SYSTEM_IO', 'USER_IO')
  AND sample_time BETWEEN '2025-01-29 10:35:45' AND '2025-01-29 10:45:47'
GROUP BY svr_ip, svr_port, tenant_id, sql_id, event
ORDER BY cnt DESC LIMIT 100;
The query results show that four SQL statements experienced a significant number of I/O-related wait events during execution, which greatly slows down their response time.
Execution phase analysis
The execution process of SQL statements is divided into many phases. In OceanBase Database V4.2.5 BP3, 17 execution phases are defined. The v$ob_active_session_history view contains 17 columns to indicate these phases. If an SQL statement is in a specific phase at the sampling moment, the corresponding column value is 'Y'; otherwise, it is 'N'.
- IN_PARSE: The system is parsing the syntax of the SQL statement.
- IN_PL_PARSE: The system is parsing PL/SQL code (such as stored procedures and functions).
- IN_PLAN_CACHE: The system is retrieving or generating the execution plan for the SQL statement.
- IN_SQL_OPTIMIZE: The system is optimizing the execution path of the SQL statement.
- IN_SQL_EXECUTION: The SQL statement is being executed (such as accessing data and computing results).
- IN_PX_EXECUTION: The system is executing the SQL statement in parallel using multiple threads or nodes (parallel query).
- IN_SEQUENCE_LOAD: The system is generating unique values for auto-increment columns (such as ID fields) or sequences.
- IN_COMMITTING: The system is committing the transaction.
- IN_STORAGE_READ: The system is reading data from the storage engine (such as tables and indexes).
- IN_STORAGE_WRITE: The system is writing data to the storage engine.
- IN_REMOTE_DAS_EXECUTION: The system is executing the SQL statement in a distributed manner using the DAS execution engine.
- IN_FILTER_ROWS: The storage engine is executing pushed-down operators (such as filters and aggregations).
- IN_RPC_ENCODE: The system is encoding the request/response data.
- IN_RPC_DECODE: The system is decoding the request/response data.
- IN_CONNECTION_MGR: The system is handling the creation or closure of sessions (connection pool management).
- IN_PLSQL_COMPILATION: The system is compiling PL/SQL code (such as stored procedures).
- IN_PLSQL_EXECUTION: The system is executing PL/SQL code (such as stored procedures).
In OceanBase Database, these execution phases are not mutually exclusive. For example, a session in the IN_SQL_EXECUTION phase may be accessing the storage layer, and thus simultaneously be in the IN_STORAGE_READ or IN_STORAGE_WRITE phase.
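Because each phase is a simple 'Y'/'N' flag column, the share of samples in each phase can be counted directly. A minimal sketch covering four of the 17 phase columns (extend with the others as needed; the boolean-sum idiom assumes MySQL mode):

```sql
-- How many recent samples fall into each execution phase (phases can overlap)
SELECT
    SUM(in_sql_execution = 'Y') AS sql_execution,
    SUM(in_storage_read  = 'Y') AS storage_read,
    SUM(in_storage_write = 'Y') AS storage_write,
    SUM(in_px_execution  = 'Y') AS px_execution
FROM v$ob_active_session_history
WHERE sample_time > now() - 600;
```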
Case 3: Identify the bottleneck in distributed execution
One morning, users of a middleware service experienced operation delays. Monitoring showed that the CPU of the xx.xx.xx.x9 node in the database cluster was continuously at full capacity, leading to a significant increase in SQL execution latency. The DBA tried switching the primary node and restarting the node, which eventually restored CPU usage to normal. After the incident, the audit logs (SQL Audit) were lost due to the restart. To analyze the issue, we examined the ASH data persisted to WR for the period from 10:30 to 10:45. Key findings included:
SELECT svr_ip, tenant_id, IN_REMOTE_DAS_EXECUTION, name.name AS event,
       COUNT(*) * 10 AS cnt  -- WR persists ASH samples at a 10:1 ratio
FROM CDB_WR_ACTIVE_SESSION_HISTORY ash
LEFT JOIN v$event_name name ON ash.event_no = name.`event#`
WHERE sample_time BETWEEN '2024-11-29 10:30:45' AND '2024-11-29 10:45:45'
  AND tenant_id = 1002
GROUP BY svr_ip, tenant_id, IN_REMOTE_DAS_EXECUTION, event
ORDER BY cnt DESC LIMIT 20;
The query results are as follows:
+-------------+-----------+-------------------------+--------------------------+-------+
| svr_ip      | tenant_id | IN_REMOTE_DAS_EXECUTION | event                    | cnt   |
+-------------+-----------+-------------------------+--------------------------+-------+
| xx.xx.xx.x9 |      1002 | Y                       |                          | 48330 |
| xx.xx.xx.x4 |      1002 | N                       | das wait remote response | 28920 |
| xx.xx.xx.x1 |      1002 | N                       | das wait remote response | 20050 |
| xx.xx.xx.x9 |      1002 | N                       | default condition wait   |  2460 |
| xx.xx.xx.x4 |      1002 | N                       | default condition wait   |  1450 |
| xx.xx.xx.x1 |      1002 | N                       | px loop condition wait   |  1420 |
| xx.xx.xx.x9 |      1002 | N                       | px loop condition wait   |  1330 |
| xx.xx.xx.x9 |      1002 | N                       |                          |  1170 |
| xx.xx.xx.x4 |      1002 | N                       |                          |   900 |
| xx.xx.xx.x1 |      1002 | N                       | default condition wait   |   830 |
| xx.xx.xx.x4 |      1002 | N                       | default condition wait   |   570 |
| xx.xx.xx.x4 |      1002 | N                       | px loop condition wait   |   530 |
| xx.xx.xx.x9 |      1002 | Y                       |                          |   520 |
+-------------+-----------+-------------------------+--------------------------+-------+
Key findings:
| Node IP | Key Phase/Wait Event | Sample Count |
|---|---|---|
| xx.xx.xx.x9 | Remote DAS task (IN_REMOTE_DAS_EXECUTION) | 48,330 samples (far higher than other nodes) |
| Other nodes | Waiting for remote response (das wait remote response) | 28,920 and 20,050 samples |
Derived conclusions:
Root Cause:
- The faulty node (xx.xx.xx.x9) is experiencing CPU overload due to handling a large number of remote DAS requests from other nodes.
- Other nodes (such as xx.xx.xx.x4 and xx.xx.xx.x1) are experiencing significant thread contention due to waiting for responses from the faulty node, leading to queue buildup and further system congestion.
Deeper Cause:
SQL-dimension analysis found that two scheduled AP (analytical) business SQL statements (such as batch data synchronization tasks) trigger in the morning; executed concurrently, they consumed excessive CPU resources and crowded out the resources needed by normal business SQL statements.
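The SQL-dimension analysis described above can be sketched as a query that restricts the WR samples to the remote-DAS phase on the overloaded node and groups them by SQL_ID (time range and tenant mirror the earlier query):

```sql
-- Which SQL_IDs drove the remote DAS load on the hot node
SELECT sql_id, COUNT(*) * 10 AS cnt  -- WR persists ASH samples at a 10:1 ratio
FROM CDB_WR_ACTIVE_SESSION_HISTORY
WHERE sample_time BETWEEN '2024-11-29 10:30:45' AND '2024-11-29 10:45:45'
  AND tenant_id = 1002
  AND svr_ip = 'xx.xx.xx.x9'
  AND IN_REMOTE_DAS_EXECUTION = 'Y'
GROUP BY sql_id
ORDER BY cnt DESC LIMIT 10;
```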