ob_server_exists_long_lived_trans |V3.1.2|OceanBase Cloud Platform| docs|Distributed Database

ob_server_exists_long_lived_trans

Last Updated：2023-08-15 11:21:17 Updated

Description

This alert is triggered when the duration of a transaction on the OBServer exceeds the threshold. The default threshold is 1200 seconds.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter	Value
Metric	ob_server_max_trans_duration_seconds
Source	SQL: `unknow SELECT MAX(timestampdiff(SECOND,ctx_create_time,CURRENT_TIMESTAMP)) as max_trans_duration_seconds FROM __all_virtual_trans_stat WHERE part_trans_action <= 2 AND svr_ip = @svr_ip AND svr_port = rpc_port() AND timestampdiff(SECOND,ctx_create_time,CURRENT_TIMESTAMP) >= 30 HAVING max_trans_duration_seconds IS NOT NULL;`
Collected metric (unit: s)	max_trans_duration_seconds
Metric expression	max(max_trans_duration_seconds{metric_group="max_duration_virtual_trans_stat",@LABELS}) by (@GBLABELS)
Collection cycle	60 seconds

The value of the metric ob_server_max_trans_duration_seconds indicates the maximum duration of a transaction when the SQL statements are executed. When this value is greater than the threshold, this alert is triggered. The default threshold is 1200s.

Alert rule

Metric	Default threshold (unit: s)	Duration	Detection cycle	Time before clearance
ob_server_max_trans_duration_seconds	1200	0 seconds	60 seconds	5 minutes

Alert information

Trigger method	Alert level	Scope
Metric expression	Critical	Server

Alert templates

Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}. The maximum transaction duration is ${value}s, exceeding the threshold of ${alarm_threshold}s.
Overview example: ob_cluster=first:svr_ip=192.168.1.1. The OBServer has a long-running transaction.
Details example: ob_cluster=first:svr_ip=192.168.1.1. The OBServer has a long-running transaction. The maximum transaction duration is 1500.0s, exceeding the threshold of 1200.0s.

${alarm_target} indicates the object that generated the alert, in the ob_cluster=xx:svr_ip=xx format. ob_cluster indicates the name of the OceanBase cluster that generated the alert, and svr_ip indicates the IP address of the OBServer that generated the alert.

Impact on the system

Long-running transactions cause serious data jams and lock timeouts, affecting the execution of other transactions.

Possible causes

Long-running transactions are likely to occur when the cluster is leaderless because the candidates fail to receive the majority of votes from other nodes in the cluster during the leader election (the "minority problem"), the disk is full, the memory is used up, or large transactions are under processing.

Suggested solutions

Check for the minority problem.

The minority problem occurs because of the OBServer exceptions or network failure. It also triggers the ob_cannot_connected alert at the same time.

In that case, you need to first solve the problem of the ob_cannot_connected alert by referring to the corresponding topic. Then, check whether the ob_server_exists_long_lived_trans alert is cleared 5 minutes later.
Check the disk space.

Insufficient disk space also triggers the following alerts at the same time. We recommend that you first solve the problems that caused the following alerts by referring to the respective topics, and then check whether the ob_server_exists_long_lived_trans alert is cleared 5 minutes later.
Check the memory.

Insufficient memory may also trigger the alert of memory usage exceeding the threshold at the same time. For more information, see ob_host_mem_percent_over_threshold.
Check for large transactions.

Go to the performance diagnosis page of the OCP console. Then, check the TopSQL and SlowSQL tabs for abnormal SQL statements that, for example, have been running for a long time.

If any, analyze the diagnosis results and optimize the SQL statements. You can also consider splitting a large transaction into small transactions as needed.

If the problem persists despite the preceding operations, run the following commands and send the result to OCP Technical Support for troubleshooting.

-- View the information of all servers to see if a server is faulty.
select * from __all_server;

-- View the long-running transaction.
SELECT *
FROM __all_virtual_trans_stat
WHERE part_trans_action <= 2 AND ctx_create_time < DATE_SUB(NOW(), INTERVAL 1200 SECOND)
LIMIT 100;