Alert description
This alert is triggered when transactions remain in the commit stage for 1200 seconds or more within the tenant.
Alerting principle
The following table describes the key parameters used in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring metric | pending_trans_max_duration_seconds |
| Data source | |
| Collected metrics | collect_time, ctx_create_time |
| Monitoring expression | max((collect_time - ctx_create_time)/1000000) |
| Collection interval | 60 seconds |
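The monitoring expression can be read as plain arithmetic. The following is a minimal sketch, assuming that collect_time and ctx_create_time are microsecond timestamps (which the division by 1000000 suggests). The `TransSample` class, the sample values, and the function name are hypothetical and only serve the illustration; they are not part of the product.

```python
from dataclasses import dataclass
from typing import Iterable

MICROSECONDS_PER_SECOND = 1_000_000
DEFAULT_THRESHOLD_SECONDS = 1200  # default threshold of this alert


@dataclass
class TransSample:
    """One collected row (hypothetical container for this sketch)."""
    collect_time: int      # assumed: microseconds since the epoch
    ctx_create_time: int   # assumed: microseconds since the epoch


def pending_trans_max_duration_seconds(samples: Iterable[TransSample]) -> float:
    """Evaluate max((collect_time - ctx_create_time)/1000000) over the samples."""
    return max(
        ((s.collect_time - s.ctx_create_time) / MICROSECONDS_PER_SECOND for s in samples),
        default=0.0,  # no pending transaction collected
    )


# Two made-up samples collected at the same instant.
samples = [
    TransSample(collect_time=1_646_993_200_000_000, ctx_create_time=1_646_993_121_179_509),
    TransSample(collect_time=1_646_993_200_000_000, ctx_create_time=1_646_991_900_000_000),
]

max_duration = pending_trans_max_duration_seconds(samples)
should_alert = max_duration >= DEFAULT_THRESHOLD_SECONDS
print(max_duration, should_alert)
```

In this made-up data the oldest transaction context is 1300 seconds old, so the value exceeds the default threshold of 1200 seconds.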
Rule information
| Monitoring metric | Default threshold | Monitoring metric source | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| pending_trans_max_duration_seconds | 1200 | Tenant metrics | 60 seconds | 5 minutes |
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Critical | Tenant |
Alert template
Alert summary
Template: ${alarm_target} ${alarm_name}
Example: ob_cluster=obcluster-1:tenant_name=orac2:trans_hash={hash:10801753558860391353, inc:59202486, addr:"xxx.xxx.xxx.xxx:2882", t:1646993121179509} OceanBase tenant has a pending transaction
Alert details
Template: Cluster: ${ob_cluster_name}, Tenant: ${tenant_name} has a pending transaction. Session ID: ${session_id}, Transaction ID: ${trans_hash}, Transaction type: ${trans_type}, Transaction creation time: ${trans_create_time}, Maximum duration of the transaction: ${value_shown}.
Example: Cluster: obcluster-1, Tenant: orac2 has a pending transaction. Session ID: 3221635048, Transaction ID: {hash:10801753558860391353, inc:59202486, addr:"xxx.xxx.xxx.xxx:2882", t:1646993121179509}, Transaction type: distribute, Transaction creation time: 2022-03-11T18:05:21.184+08:00, Maximum duration of the transaction: 25 days 19 hours 57 minutes 24.66 seconds.
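To see how the placeholders in the template map to the example, the following sketch fills the template with Python's `string.Template`, whose `${name}` syntax happens to match. This is only an illustration: the product's own template engine is not described here, and `format_duration` is a hypothetical helper that mimics the duration format shown in the example.

```python
from string import Template

# The alert details template, copied from this section.
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, Tenant: ${tenant_name} has a pending transaction. "
    "Session ID: ${session_id}, Transaction ID: ${trans_hash}, "
    "Transaction type: ${trans_type}, "
    "Transaction creation time: ${trans_create_time}, "
    "Maximum duration of the transaction: ${value_shown}."
)


def format_duration(seconds: float) -> str:
    """Format seconds as 'D days H hours M minutes S.SS seconds' (hypothetical helper)."""
    days, rem = divmod(seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, secs = divmod(rem, 60)
    return f"{int(days)} days {int(hours)} hours {int(minutes)} minutes {secs:.2f} seconds"


# Field values taken from the example in this section.
fields = {
    "ob_cluster_name": "obcluster-1",
    "tenant_name": "orac2",
    "session_id": 3221635048,
    "trans_hash": '{hash:10801753558860391353, inc:59202486, '
                  'addr:"xxx.xxx.xxx.xxx:2882", t:1646993121179509}',
    "trans_type": "distribute",
    "trans_create_time": "2022-03-11T18:05:21.184+08:00",
    "value_shown": format_duration(2_231_844.66),
}

rendered = DETAILS_TEMPLATE.substitute(fields)
print(rendered)
```

`substitute` raises `KeyError` if a placeholder has no value, which makes a missing field visible instead of leaving `${...}` in the message.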
Alert recovery
- Template: Alert: ${alarm_name}
- Example: Alert: OceanBase tenant has a pending transaction
Impact on the system
A pending transaction prevents the MemStore from being flushed, and as a result the business stops.
Possible causes
- This alert is usually caused by a minority (only a minority of replicas remains available), a full disk, or insufficient memory.
- The machine clock is out of synchronization by more than 100 ms.
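The clock condition in the list above is a simple pairwise comparison. The following sketch flags node pairs whose clocks differ by more than 100 ms. It assumes that you already have a clock offset per node in milliseconds, for example from your time synchronization tooling; the node names, offsets, and function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MAX_CLOCK_SKEW_MS = 100.0  # limit named in the possible causes


def max_clock_skew_ms(offsets_ms: Dict[str, float]) -> float:
    """Largest difference between any two node clocks, in milliseconds."""
    if not offsets_ms:
        return 0.0
    values = list(offsets_ms.values())
    return max(values) - min(values)


def nodes_out_of_sync(
    offsets_ms: Dict[str, float], limit_ms: float = MAX_CLOCK_SKEW_MS
) -> List[Tuple[str, str]]:
    """Return the node pairs whose clocks differ by more than limit_ms."""
    return [
        (a, b)
        for a, b in combinations(sorted(offsets_ms), 2)
        if abs(offsets_ms[a] - offsets_ms[b]) > limit_ms
    ]


# Made-up offsets relative to a common reference clock.
offsets = {"node-a": 3.2, "node-b": -1.5, "node-c": 140.0}

print(max_clock_skew_ms(offsets))
print(nodes_out_of_sync(offsets))
```

With these made-up offsets, node-c is more than 100 ms away from both other nodes, so both pairs that include it are reported.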
Solution
Check whether a minority occurs, that is, whether only a minority of replicas remains available.
A minority generally occurs because of OBServer node exceptions or network failures. In this case, the ob_cannot_connected (OB server cannot be connected) alert is also reported.
If the ob_cannot_connected alert is reported, refer to its alert documentation to handle the issue, and then check 5 minutes later whether the alert in this section is resolved.
Check whether the disk space is insufficient.
If the disk space is insufficient, the following alerts are reported at the same time. First, refer to the corresponding alert documentation to resolve the issue, and then check whether the alert in this section is resolved 5 minutes later.
Check whether the memory is insufficient.
If the memory is insufficient, the ob_host_mem_percent_over_threshold (OB server memory usage exceeds the threshold) alert may be reported. For more information, see the solution for that alert.
If the issue persists, run the following commands to collect information, and then contact Technical Support:
-- View information about all servers to check whether an OBServer node is abnormal.
SELECT * FROM __all_server;
-- View the current hanging transactions.
SELECT * FROM __all_virtual_trans_stat WHERE is_exiting != 1 AND part_trans_action > 2 AND ctx_create_time < DATE_SUB(NOW(), INTERVAL 500 SECOND) LIMIT 100;