Alert description
This alert is triggered when there are hanging XA transactions in the OceanBase tenant. The XA protocol uses the two-phase commit protocol (Prepare and Commit phases). If an XA transaction remains in the Prepared state for more than the threshold (which is 20 minutes by default), it is considered a hanging XA transaction. Currently, only Oracle tenants support the detection of hanging XA transactions.
Note
A hanging transaction is a transaction that has been running for a long time, and the connection between the client and OBProxy has been disconnected, but the connection between OBProxy and OceanBase still exists.
Alerting principle
The following table lists key parameters involved in the alerting monitoring logic.
| Parameter | Value |
|---|---|
| Monitoring metric | xa_trans_max_duration_seconds |
| Data source |
|
| Metrics to be collected | collect_time,ctx_create_time |
| Monitoring expression | max((collect_time - ctx_create_time)/1000000) |
| Collection interval | 60 seconds |
Notice
The system parameter ocp.alarm.datasource.trans-min-duration-seconds determines the threshold at which an alert is generated. Therefore, the alert threshold must not be lower than the specified value of this system parameter, which defaults to 60 seconds.
Alarm rules
| Monitoring metric | Default threshold (unit: seconds) | Monitoring metric source | Alert cycle | Elimination cycle |
|---|---|---|---|---|
| xa_trans_max_duration_seconds | 1200 | Tenant metric | 60 seconds | 5 minutes |
Alarm information
| Alarm trigger method | Alarm level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Severe | Tenant |
Alarm template
Overview
Template: ${alarm_target} ${alarm_name}
Example: ob_cluster=obcluster-1:tenant_name=orac2:trans_hash={hash:10801753558860391353, inc:59202486, addr:"xxx.xxx.xxx.xxx:2882", t:1646993121179509} OceanBase tenant has an XA hanging transaction
Details
Template: Cluster: ${ob_cluster_name}, Tenant: ${tenant_name} has an XA hanging transaction. Session ID: ${session_id}, Transaction ID: ${trans_hash}, Transaction creation time: ${trans_create_time}, Transaction maximum duration: ${value_shown}.
Example: Cluster: obcluster-1, Tenant: orac2 has an XA hanging transaction. Session ID: 3221635048, Transaction ID: {hash:10801753558860391353, inc:59202486, addr:"xxx.xxx.xxx.xxx:2882", t:1646993121179509}, Transaction creation time: 2022-03-11T18:05:21.184+08:00, Transaction maximum duration: 25 days 19 hours 57 minutes 24.66 seconds.
Restoration
- Template: Alarm: ${alarm_name}
- Example: Alarm: OceanBase tenant has an XA hanging transaction
Impact on the system
Hanging transactions hold the reference of MemTable. If the transaction does not end for a long time, the MemStore freezes and cannot be released, which may cause memory exhaustion. Moreover, hanging transactions pose a serious threat to system stability, such as log disk space exhaustion and system write stop.
Possible causes
Unreasonable parameter settings.
The query timeout period (tenant parameter ob_query_timeout, default value 10 seconds) and transaction timeout period (tenant parameter ob_trx_timeout, default value 100 seconds) are set. It is recommended to set these parameters reasonably to avoid long-running transactions caused by improper settings, which affects system stability.
Deadlocks or lock conflicts.
Solution
When a hanging transaction occurs, prioritize terminating the hanging transaction. You can follow these steps:
Query the xid of the tenant that triggered the alert.
The xid consists of three parts: gtrid, bqual, and format_id. Note that gtrid and bqual may be binary types and cannot be displayed as character information. Therefore, they need to be converted into hexadecimal.
In the sys tenant, assume that the target tenant ID is 1003. Execute the following statement to obtain the xid of the vulnerable hanging branch:
SELECT hex(gtrid), hex(bqual), format_id FROM __all_virtual_global_transaction WHERE tenant_id = 1003 AND format_id <> -2 AND state = 3 AND gmt_modified < date_sub(now(), INTERVAL 1800 SECOND);In an Oracle tenant, use the following query method:
SELECT rawtohex(gtrid), rawtohex(bqual), format_id FROM sys.all_virtual_tenant_global_transaction_agent WHERE format_id <> -2 AND state = 3 AND ROUND((sysdate - cast(GMT_MODIFIED as date)) * 86400) > 1800;
In the corresponding tenant, execute the following statement to roll back the transaction corresponding to the xid.
Note
Replace the values of l_xid.formatid, l_xid.gtrid, and l_xid.bqual in the following SQL statement with the IDs obtained in Step 1.
declare l_xid DBMS_XA_XID; l_ret PLS_INTEGER; BEGIN l_xid. formatid := 123; l_xid. gtrid := hextoraw('616263313238'); l_xid. bqual := hextoraw('656667'); l_ret := DBMS_XA.XA_ROLLBACK(xid => l_xid); dbms_output.put_line(l_ret); END; /