Description
This alert is triggered when an XA transaction of an OceanBase Database tenant is suspended. The XA protocol is a two-phase commit protocol. The two phases are Prepare and Commit. When an XA transaction has stayed in the Prepare phase for a period longer than the alert threshold, which is 20 minutes by default, the XA transaction is considered as a suspended transaction. Currently, the detection on suspended XA transactions is only carried out for Oracle tenants. Note
Suspended transactions are transactions that have timed out for a long time in the case where the OBProxy cluster cannot connect to the client, but can still connect to the OceanBase cluster.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_tenant_xa_trans_duration_seconds Note The duration in which the XA transaction is in the Prepare phase. An alert is triggered when the value of this metric exceeds the threshold. By default, the threshold is 20 minutes. |
| Source | unknow SELECT /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ __all_tenant.tenant_id, __all_tenant.tenant_name, trans.xa_trans_duration FROM (SELECT vir_trans.tenant_id, TIMESTAMPDIFF(SECOND , vir_global_trans.gmt_modified , CURRENT_TIMESTAMP) xa_trans_duration FROM __all_virtual_global_transaction vir_global_trans LEFT JOIN __all_virtual_trans_stat vir_trans ON vir_global_trans.trans_id = vir_trans.trans_id WHERE vir_global_trans.format_id <> -2 AND vir_global_trans.state = 3 AND vir_trans.is_exiting != 1 AND vir_trans.svr_ip = ? and svr_port = ? GROUP BY vir_trans.tenant_id HAVING xa_trans_duration > 30) trans LEFT JOIN __all_tenant ON __all_tenant.tenant_id = trans.tenant_id Note The value of trans.xa_trans_duration is assigned to the collected metric, and other values are used as labels. Question marks (?) in the preceding SQL statement are variables. You need to specify them when executing the SQL statement. |
| Collected metric | ob_xa_trans_duration_seconds |
| Metric expression | max(ob_xa_trans_duration_seconds{@LABELS}) by (@GBLABELS) |
| Collection cycle | 1 minute |
Alert rule
| Metric | Default threshold (unit: %) | Duration | Alert cycle | Elimination cycle |
|---|---|---|---|---|
| ob_tenant_xa_trans_duration_seconds | 0 | 0 seconds | 60 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Server |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: cluster: ${ob_cluster}, tenant: ${tenant_name}, host: ${svr_ip} (zone: ${obzone}), alert: The XA transaction may be suspended. The execution time of the transaction is ${value} seconds, exceeding the threshold of ${alarm_threshold} seconds.
Overview example: svr_ip=192.168.1.1 Agent service unavailable
Details example: cluster: obcluster1, tenant: oracle1, host: 192.168.1.1, zone: zone1, alert: The XA transaction may be suspended. The execution time of the transaction is 1,800 seconds, exceeding the threshold of 1,200 seconds.
Impact on the system
A suspended transaction occupies the reference count of the MemTable. If the transaction is not finished, the MemStore cannot be released after it is frozen. Then, the memory utilization may exceed the upper limit. A suspended transaction will also affect system stability. For example, it may lead to high clog disk usage and the pause of data write.
Possible causes
The parameters are set to inappropriate values.
The ob_query_timeout parameter specifies the query timeout period, which is 10 seconds by default. The ob_trx_timeout parameter specifies the transaction timeout period, which is 100 seconds by default. We recommend that you set these parameters to appropriate values to prevent system instability caused by suspended transactions.
Deadlocks exist or lock conflicts happen.
Suggested solutions
When a transaction is suspended, we recommend that you preferentially terminate this transaction. You can terminate the transaction by performing the following steps:
Query the X/Open identifier (xid) for the target tenant in the alert.
An xid consists of the global transaction ID (gtrid), branch qualifier (bqual), and format ID (format_id). The values of gtrid and bqual may be binary values, and therefore cannot be displayed in character. To display them, convert them to hexadecimal characters.
If you are a SYS tenant, execute the following statement to query the xid of the transaction that may be suspended. In this example, the ID of the target tenant is 1003.
SELECT hex(gtrid), hex(bqual), format_id FROM __all_virtual_global_transaction WHERE tenant_id = 1003 AND format_id <> -2 AND state = 3 AND gmt_modified < date_sub(now(), INTERVAL 1800 SECOND);If you are a user tenant (in Oracle mode), execute the following statement to query the xid:
SELECT rawtohex(gtrid), rawtohex(bqual), format_id FROM sys.all_virtual_tenant_global_transaction_agent WHERE format_id <> -2 AND state = 3 AND ROUND((sysdate - cast(GMT_MODIFIED as date)) * 86400) > 1800;
Execute the following statement as the corresponding tenant to roll back the transaction corresponding to the xid:
Note
Before you execute the statement, replace the values of the l_xid.formatid, l_xid.gtrid, and l_xid.bqual parameters with those in the returned xid in step 1.
declare l_xid DBMS_XA_XID; l_ret PLS_INTEGER; BEGIN l_xid. formatid := 123; l_xid. gtrid := hextoraw('616263313238'); l_xid. bqual := hextoraw('656667'); l_ret := DBMS_XA.XA_ROLLBACK(xid => l_xid); dbms_output.put_line(l_ret); END; /