Description
This alert is triggered when an eXtended Architecture (XA) transaction of an OceanBase Database tenant is suspended. The XA protocol is a two-phase commit protocol. The two phases are Prepare and Commit. When an XA transaction has stayed in the Prepare phase for a period longer than the alert threshold, which is 20 minutes by default, the XA transaction is considered as a suspended transaction. Currently, the detection on suspended XA transactions is carried out only for Oracle tenants. Note
Suspended transactions are transactions that have timed out for a long time in the case where the OBProxy cluster cannot connect to the client, but can still connect to the OceanBase cluster.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | xa_trans_max_duration_seconds |
| Source | * SQL collection * OceanBase Database of a version earlier than V2.2.7 does not support XA transactions. * When the OceanBase Database version is V2.2.7 to a version earlier than V3.2: sql select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(%d) */ a.tenant_id, b.tenant_id as tenant_id_xa, a.trans_id, `partition`, floor(ctx_create_time) as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no from (select tenant_id, svr_ip, trans_id, `partition`, unix_timestamp(ctx_create_time) *1000000 as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no from __all_virtual_trans_stat where svr_ip = ? and svr_port = ? and is_exiting != 1) a left join __all_virtual_global_transaction b on a.tenant_id = b.tenant_id and a.trans_id = b.trans_id * When the OceanBase Database version is V3.2 or later: sql select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(%d) */ a.tenant_id, b.tenant_id as tenant_id_xa, a.trans_id, `partition`, floor(ctx_create_time) as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no, log_size_byte from (select tenant_id, svr_ip, trans_id, `partition`, unix_timestamp(ctx_create_time) *1000000 as ctx_create_time, session_id, participants, (pending_log_size + flushed_log_size) as log_size_byte, trans_type, part_trans_action, sql_no from __all_virtual_trans_stat where svr_ip = ? and svr_port = ? and is_exiting != 1) a left join __all_virtual_global_transaction b on a.tenant_id = b.tenant_id and a.trans_id = b.trans_id * XA transaction latency: sql select max((collect_time - ctx_create_time)/1000000) as trans_max_duration_seconds from ob_hist_trans_stat_0 where trans_type=3 and ctx_trans_state=3 |
| Collected metric | collect_time,ctx_create_time |
| Metric expression | max((collect_time - ctx_create_time)/1000000) |
| Collection cycle | 60 seconds |
Notice
The system parameter ocp.alarm.datasource.trans-min-duration-seconds specifies the transaction duration threshold, in seconds, that triggers an alert. The value of xa_trans_max_duration_seconds must be equal to or greater than the value of this system parameter. The default value is 60 seconds.
Alert rule
| Metric | Default threshold (unit: s) | Source | Alert cycle | Time before clearance |
|---|---|---|---|---|
| xa_trans_max_duration_seconds | 1200 | Tenant | 60 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Tenant |
Alert templates
Alert overview
Template: ${alarm_target} ${alarm_name}
Example: ob_cluster=obcluster-2:tenant_name=orac2:trans_hash={hash:10801753558860391353, inc:59202486, addr:"192.168.0.1:2882", t:1646993121179509} A suspended XA transaction exists in the OceanBase Database tenant.
Alert details
Template: Cluster: ${ob_cluster_name}; Tenant: A suspended XA transaction exists in the ${tenant_name} tenant; Session ID: ${session_id}; Transaction ID: ${trans_hash}; Transaction Created At: ${trans_create_time}; Maximum Transaction Duration: ${value_shown}
Example: Cluster: obcluster; Tenant: A suspended XA transaction exists in the orac2 tenant. Session ID: 3221635048; Transaction ID: {hash:10801753558860391353, inc:59202486, addr:"192.168.0.1:2882", t:1646993121179509}; Transaction Created At: 2022-03-11T18:05:21.184+08:00;Maximum Transaction Duration: 25 days 19 hours 57 minutes and 24.66 seconds
Impact on the system
A suspended transaction occupies the reference count of the MemTable. If the transaction is not finished, the MemStore cannot be released after it is frozen. Then, the memory utilization may exceed the upper limit. A suspended transaction will also affect system stability. For example, it may lead to high clog disk usage and the pause of data write.
Possible causes
The parameters are set to inappropriate values.
The ob_query_timeout parameter specifies the query timeout period, which is 10 seconds by default. The ob_trx_timeout parameter specifies the transaction timeout period, which is 100 seconds by default. We recommend that you set these parameters to appropriate values to prevent system instability caused by suspended transactions.
Deadlocks exist or lock conflicts occurred.
Solutions
When a transaction is suspended, we recommend that you preferentially terminate this transaction. You can terminate the transaction by performing the following steps:
Query the X/Open identifier (xid) for the target tenant in the alert.
An xid consists of the global transaction ID (gtrid), branch qualifier (bqual), and format ID (format_id). The values of gtrid and bqual may be binary values, and therefore cannot be displayed in character. To display them, convert them to hexadecimal characters.
In a SYS tenant, execute the following statement to query the xid of the transaction that may be suspended. In this example, the ID of the target tenant is 1003.
SELECT hex(gtrid), hex(bqual), format_id FROM __all_virtual_global_transaction WHERE tenant_id = 1003 AND format_id <> -2 AND state = 3 AND gmt_modified < date_sub(now(), INTERVAL 1800 SECOND);In a user tenant in Oracle mode, execute the following statement to query the xid:
SELECT rawtohex(gtrid), rawtohex(bqual), format_id FROM sys.all_virtual_tenant_global_transaction_agent WHERE format_id <> -2 AND state = 3 AND ROUND((sysdate - cast(GMT_MODIFIED as date)) * 86400) > 1800;
Execute the following statement in the corresponding tenant to roll back the transaction corresponding to the xid:
Note
Before you execute the statement, replace the values of the l_xid.formatid, l_xid.gtrid, and l_xid.bqual parameters with those in the returned xid in step 1.
declare l_xid DBMS_XA_XID; l_ret PLS_INTEGER; BEGIN l_xid. formatid := 123; l_xid. gtrid := hextoraw('616263313238'); l_xid. bqual := hextoraw('656667'); l_ret := DBMS_XA.XA_ROLLBACK(xid => l_xid); dbms_output.put_line(l_ret); END; /