ob_tenant_long_trans_exist

2025-04-02 06:36:32  Updated

Description

This alert is triggered when the duration of a transaction on the OBServer exceeds the threshold. The default threshold is 1,200 seconds.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter Value
Metric long_trans_max_duration_seconds
Source
  • SQL statements for metric collection:
    • If you use OceanBase Database of a version earlier than V2.2.7, execute the following statement: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(%d) */ tenant_id, 0 as tenant_id_xa, trans_id, -1 as tx_id, -1 as ls_id, `partition`, floor(unix_timestamp(ctx_create_time) *1000000) as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no, 0 as log_size_byte, 'NULL' as globalid from __all_virtual_trans_stat where svr_ip = ? and svr_port = ? and is_exiting != 1
    • If you use OceanBase Database V2.2.7 to V3.2, execute the following statement: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(%d) */ a.tenant_id, b.tenant_id as tenant_id_xa, a.trans_id, -1 as tx_id, -1 as ls_id, `partition`, floor(ctx_create_time) as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no, 0 as log_size_byte, 'NULL' as globalid from (select tenant_id, svr_ip, trans_id, `partition`, unix_timestamp(ctx_create_time) *1000000 as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no from __all_virtual_trans_stat where svr_ip = ? and svr_port = ? and is_exiting != 1) a left join __all_virtual_global_transaction b on a.tenant_id = b.tenant_id and a.trans_id = b.trans_id
    • If you use OceanBase Database V3.2 to V4.0, execute the following statement: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(%d) */ a.tenant_id, b.tenant_id as tenant_id_xa, a.trans_id, -1 as tx_id, -1 as ls_id, `partition`, floor(ctx_create_time) as ctx_create_time, session_id, participants, trans_type, part_trans_action, sql_no, log_size_byte, 'NULL' as globalid from (select tenant_id, svr_ip, trans_id, `partition`, unix_timestamp(ctx_create_time) *1000000 as ctx_create_time, session_id, participants, (pending_log_size + flushed_log_size) as log_size_byte, trans_type, part_trans_action, sql_no from __all_virtual_trans_stat where svr_ip = ? and svr_port = ? and is_exiting != 1) a left join __all_virtual_global_transaction b on a.tenant_id = b.tenant_id and a.trans_id = b.trans_id
    • If you use OceanBase Database V4.0 to V4.1, execute the following statement: select /*+ MONITOR_AGENT QUERY_TIMEOUT(%d) */tenant_id, 0 as tenant_id_xa, 'NULL' as trans_id, tx_id, ls_id, 'NULL' as 'partition', floor(unix_timestamp(ctx_create_time) *1000000) as ctx_create_time, session_id, participants, (case tx_type when 'DISTRIBUTED' then 2 when 'LOCAL' then 1 when 'UNDECIDED' then -1 else 0 end) as trans_type, (case action when 'START TASK' then 1 when 'END TASK' then 2 when 'COMMIT' then 3 when 'ABORT' then 4 when 'DIED' then 5 when 'END' then 6 else 0 end) as part_trans_action, -1 as sql_no, (pending_log_size + flushed_log_size) as log_size_byte, 'NULL' as globalid from v$ob_transaction_participants where svr_ip = ? and svr_port = ?
    • If you use OceanBase Database V4.1 or later, execute the following statement: select /*+ MONITOR_AGENT QUERY_TIMEOUT(%d) */tenant_id, 0 as tenant_id_xa, 'NULL' as trans_id, tx_id, ls_id, 'NULL' as 'partition', floor(unix_timestamp(ctx_create_time) *1000000) as ctx_create_time, session_id, participants, (case tx_type when 'DISTRIBUTED' then 2 when 'LOCAL' then 1 when 'UNDECIDED' then -1 else 0 end) as trans_type, (case action when 'START TASK' then 1 when 'END TASK' then 2 when 'COMMIT' then 3 when 'ABORT' then 4 when 'DIED' then 5 when 'END' then 6 else 0 end) as part_trans_action, -1 as sql_no, (pending_log_size + flushed_log_size) as log_size_byte, globalidfrom v$ob_transaction_participants where svr_ip = ? and svr_port = ?
  • Transaction latency:
      select max((collect_time - ctx_create_time)/1000000) as trans_max_duration_seconds from ob_hist_trans_stat where ctx_trans_state=2
Collected metric collect_time,ctx_create_time
Metric expression max((collect_time - ctx_create_time)/1000000)
Collection cycle 60 seconds

Note

The system parameter ocp.alarm.datasource.trans-min-duration-seconds specifies the transaction duration threshold, in seconds, that triggers an alert. The value of long_trans_max_duration_seconds must be equal to or greater than the value of this system parameter. The default value is 60 seconds.

Alert rule

Metric Default threshold (unit: s) Source Detection cycle Time before clearance
long_trans_max_duration_seconds 1200 Tenant 60 seconds 5 minutes

Alert information

Trigger method Alert level Scope
Based on the metric expression Critical Tenant

Alert templates

  • Alert overview

    • Template: ${alarm_target} ${alarm_name}

    • Example: ob_cluster=obcluster-1:tenant_name=orac2:trans_hash={hash:10801753558860391353, inc:59202486, addr:"xxx.xxx.xxx.xxx:2882", t:1646993121179509}. A long-running transaction exists in the OceanBase Database tenant.

  • Alert details

    • Template: Cluster: ${ob_cluster_name}. Tenant: A long-running transaction exists in the ${tenant_name} tenant. Session ID: ${session_id}. Transaction ID: ${trans_hash}. Transaction Type: ${trans_type}. Transaction Created At: ${trans_create_time}. Maximum Transaction Duration: ${value_shown}

    • Example: Cluster: obcluster-1. Tenant: A long-running transaction exists in the orac2 tenant. Session ID: 3221635048. Transaction ID: {hash:10801753558860391353, inc:59202486, addr:"xxx.xxx.xxx.xxx:2882", t:1646993121179509}. Transaction Type: distribute. Transaction Created At: 2022-03-11T18:05:21.184+08:00. Maximum Transaction Duration: 25 days 19 hours 57 minutes and 24.66 seconds.

Impact on the system

Long-running transactions cause serious data jams and lock timeouts, affecting the execution of other transactions.

Possible causes

Long-running transactions are likely to occur when the cluster is leaderless because the candidates fail to receive the majority of votes from other nodes in the cluster during the leader election (the "minority problem"), the disk is full, the memory is used up, or large transactions are under processing.

Solutions

  1. Check for the minority problem.

    The minority problem occurs because of OBServer exceptions or network failure. It also triggers the ob_cannot_connected alert at the same time.

    In that case, you need to first solve the problem of the ob_cannot_connected alert by referring to the corresponding topic. Then, check whether the ob_tenant_long_trans_exist alert is cleared 5 minutes later.

  2. Check the disk space.

    Insufficient disk space also triggers the following alerts at the same time. We recommend that you first solve the problems that caused the following alerts by referring to the respective topics, and then check whether the ob_tenant_long_trans_exist alert is cleared 5 minutes later.

  3. Check the memory.

    Insufficient memory may also trigger the alert of memory usage exceeding the threshold at the same time. For more information, see ob_host_mem_percent_over_threshold.

  4. Check for large transactions.

    On the Performance Diagnostics page of the OceanBase Cloud Platform (OCP) console, click the TopSQL and SlowSQL tabs to check for suspicious SQL statements such as long-running SQL statements.

    If any, analyze the cause of the problem and optimize the SQL statements based on the diagnosis result. You can also consider splitting a large transaction into small transactions as needed.

  5. If the problem persists despite the preceding operations, run the following commands and send the result to OceanBase Technical Support for troubleshooting.

    -- View the information of all servers to see if a server is faulty.
    select * from __all_server;
    
    -- View the long-running transaction.
    SELECT *
    FROM __all_virtual_trans_stat
    WHERE is_exiting !=1 AND part_trans_action <= 2 AND ctx_create_time < DATE_SUB(NOW(), INTERVAL 1200 SECOND)
    LIMIT 100;
    

Contact Us