ob_tenant_expired_trans_exist |V3.3.1|OceanBase Cloud Platform| docs|Distributed Database

ob_tenant_expired_trans_exist

Last Updated：2023-08-15 11:20:59 Updated

Description

This alert is triggered when a transaction in the OBServer cluster is suspended in the commit phase for 1200s or longer.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter	Value
Metric	pending_trans_max_duration_seconds
Source	* SQL collection * When the OceanBase Database version is earlier than V2.2.7: sql select /+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(%d) / tenant_id, trans_id, `partition`, floor(unix_timestamp(ctx_create_time) 1000000) as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no from __all_virtual_trans_stat where svr_ip = ? and svr_port = ? and is_exiting != 1 When the OceanBase Database version is V2.2.7 to a version earlier than V3.2: sql select /+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(%d) / a.tenant_id, b.tenant_id as tenant_id_xa, a.trans_id, `partition`, floor(ctx_create_time) as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no from (select tenant_id, svr_ip, trans_id, `partition`, unix_timestamp(ctx_create_time) 1000000 as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no from __all_virtual_trans_stat where svr_ip = ? and svr_port = ? and is_exiting != 1) a left join __all_virtual_global_transaction b on a.tenant_id = b.tenant_id and a.trans_id = b.trans_id When the OceanBase Database version is V3.2 or later: sql select /+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(%d) / a.tenant_id, b.tenant_id as tenant_id_xa, a.trans_id, `partition`, floor(ctx_create_time) as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no, log_size_byte from (select tenant_id, svr_ip, trans_id, `partition`, unix_timestamp(ctx_create_time) 1000000 as ctx_create_time, session_id, participants, (pending_log_size + flushed_log_size) as log_size_byte, trans_type, part_trans_action, sql_no from __all_virtual_trans_stat where svr_ip = ? and svr_port = ? and is_exiting != 1) a left join __all_virtual_global_transaction b on a.tenant_id = b.tenant_id and a.trans_id = b.trans_id Suspended transaction latency: `sql select max((collect_time - ctx_create_time)/1000000) as trans_max_duration_seconds from ob_hist_trans_stat_0 where trans_type != 3 and ctx_trans_state=3`
Collected metric	collect_time,ctx_create_time
Metric expression	max((collect_time - ctx_create_time)/1000000)
Collection cycle	60 seconds

Alert rule

Metric	Default threshold	Source	Detection cycle	Time before clearance
pending_trans_max_duration_seconds	1200	Tenant	60 seconds	5 minutes

Alert information

Trigger method	Alert level	Scope
Based on the expression of the metric	Critical	Tenant

Alert templates

Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-2:tenant_name=orac2:trans_hash={hash:10801753558860391353, inc:59202486, addr:"192.168.0.1:2882", t:1646993121179509} A suspended transaction exists in the OceanBase Database tenant.
Alert details
- Template: Cluster: ${ob_cluster_name}; Tenant: A suspended transaction exists in the ${tenant_name} tenant. Session ID: ${session_id}; Transaction ID: ${trans_hash}; Transaction Type: ${trans_type}; Transaction Created At: ${trans_create_time}; Maximum Transaction Duration: ${value_shown}
- Example: Cluster: obcluster; Tenant: A suspended transaction exists in the orac2 tenant. Session ID: 3221635048; Transaction ID: {hash:10801753558860391353, inc:59202486, addr:"192.168.0.1:2882", t:1646993121179509}; Transaction Type: distribute; Transaction Created At: 2022-03-11T18:05:21.184+08:00; Maximum Transaction Duration: 25 days 19 hours 57 minutes and 24.66 seconds

Impact on the system

When a transaction is suspended, the minor compaction of the MemStore is suspended and the application is stopped.

Possible causes

The transaction is suspended when the cluster is leaderless because the candidates fail to receive the majority of votes from other nodes in the cluster during the leader election (the "minority problem"), the disk is full, or the memory is used up.

Solutions

Check for the minority problem.

The minority problem occurs because of the OBServer exceptions or network failure. It also triggers the ob_cannot_connected alert at the same time.

In that case, you need to first solve the problem of the ob_cannot_connected alert by referring to the corresponding topic. Then, check whether the ob_server_exists_long_lived_trans alert is cleared 5 minutes later.
Check the disk space.

Insufficient disk space also triggers the following alerts at the same time. We recommend that you first solve the problems that caused the following alerts by referring to the respective topics, and then check whether the ob_server_exists_long_lived_trans alert is cleared 5 minutes later.
Check the memory.

Insufficient memory may also trigger the alert of memory usage exceeding the threshold at the same time. For more information, see ob_host_mem_percent_over_threshold.

If the problem persists despite the preceding operations, run the following commands and send the result to OceanBase Technical Support for troubleshooting.

-- View the information of all servers to see if a server is faulty.
select * from __all_server;

-- View the suspended transaction.
SELECT *
FROM __all_virtual_trans_stat
WHERE is_exiting !=1 AND part_trans_action > 2 AND ctx_create_time < DATE_SUB(NOW(), INTERVAL 500 SECOND)
LIMIT 100;

Community Edition

Enterprise Edition