Description
This alert is triggered when a transaction on an OBServer has been suspended in the commit phase for 600 seconds or longer.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_host_expired_trans_count. Note: This metric indicates the number of suspended transactions on the OBServer. The alert is triggered when this number exceeds the threshold. The default threshold is 0. |
| Source | SQL: select count(1) from __all_virtual_trans_stat where part_trans_action > 2 and ctx_create_time < date_sub(now(), interval 600 second) and is_exiting != 1 and svr_ip = @svr_ip and svr_port = rpc_port(); |
| Collected metric | expire_trans_count |
| Metric expression | sum(expire_trans_count{metric_group="all_virtual_trans_stat",@LABELS}) by (@GBLABELS) |
| Collection cycle | 60 seconds |
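
The collection logic in the table can be illustrated with a minimal Python sketch. It is not OCP's implementation: the row dictionaries, the sample label names, and the helper names `expire_trans_count` and `sum_by` are assumptions made for illustration. The first function mirrors the WHERE clause of the Source query for one server, and the second mimics the `sum(...) by (...)` aggregation of the metric expression.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SUSPEND_SECONDS = 600  # age limit used in the Source query


def expire_trans_count(rows, now):
    """Count the rows that the Source query's WHERE clause would match."""
    cutoff = now - timedelta(seconds=SUSPEND_SECONDS)
    return sum(
        1
        for r in rows
        if r["part_trans_action"] > 2
        and r["ctx_create_time"] < cutoff
        and r["is_exiting"] != 1
    )


def sum_by(samples, group_labels):
    """Add up sample values that share the same grouping labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels[name] for name in group_labels)
        totals[key] += value
    return dict(totals)


# Hypothetical rows for one server (illustrative values only).
now = datetime(2024, 1, 1, 12, 0, 0)
old = now - timedelta(seconds=900)
recent = now - timedelta(seconds=30)
rows = [
    {"part_trans_action": 3, "ctx_create_time": old, "is_exiting": 0},     # counted
    {"part_trans_action": 3, "ctx_create_time": recent, "is_exiting": 0},  # too recent
    {"part_trans_action": 1, "ctx_create_time": old, "is_exiting": 0},     # part_trans_action not above 2
    {"part_trans_action": 4, "ctx_create_time": old, "is_exiting": 1},     # already exiting
]
count = expire_trans_count(rows, now)

# Hypothetical samples from two servers, grouped per server.
samples = [
    ({"ob_cluster": "C1", "svr_ip": "10.0.0.1"}, count),
    ({"ob_cluster": "C1", "svr_ip": "10.0.0.2"}, 0),
]
per_server = sum_by(samples, ["ob_cluster", "svr_ip"])
```

With the default threshold of 0, any server whose aggregated value is greater than 0 would raise the alert.
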
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_host_expired_trans_count | 0 | 0 seconds | 60 seconds | 5 minutes |
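
How the values in this rule interact can be sketched as a small state machine. This is an illustration under stated assumptions, not OCP's evaluation engine: it assumes that "Time before clearance" means the metric must stay at or below the threshold for 5 consecutive minutes before the alert clears, and the function name `evaluate` is hypothetical.

```python
THRESHOLD = 0         # default threshold
DETECTION_CYCLE = 60  # seconds between evaluations
CLEAR_AFTER = 300     # time before clearance (5 minutes), assumed to be continuous


def evaluate(values):
    """Return the alert state after each detection cycle."""
    firing = False
    healthy_for = 0
    states = []
    for value in values:
        if value > THRESHOLD:
            # Duration is 0 seconds, so the alert fires on the first breach.
            firing = True
            healthy_for = 0
        elif firing:
            healthy_for += DETECTION_CYCLE
            if healthy_for >= CLEAR_AFTER:
                firing = False
                healthy_for = 0
        states.append(firing)
    return states


# One suspended transaction appears in the second cycle and then disappears.
states = evaluate([0, 1, 0, 0, 0, 0, 0, 0])
```

In this run the alert fires in the second cycle and, under the assumption above, clears five healthy cycles later.
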
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Server |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}. The number of suspended transactions is ${value}.
Overview example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The OBServer has suspended transactions.
Details example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The OBServer has suspended transactions. The number of suspended transactions is 1.0.
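
The `${...}` placeholders in the templates use the same syntax as Python's `string.Template`, so the substitution can be sketched as follows. The values passed in are taken from the example above. The exact strings and punctuation that OCP substitutes for `${alarm_target}` and `${alarm_name}` are assumptions.

```python
from string import Template

# Details template copied from this topic.
DETAILS = Template(
    "${alarm_target} ${alarm_name}. "
    "The number of suspended transactions is ${value}."
)

rendered = DETAILS.substitute(
    alarm_target="ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx",  # assumed target string
    alarm_name="The OBServer has suspended transactions",      # assumed alert name text
    value=1.0,
)
print(rendered)
```
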
Impact on the system
When a transaction is suspended, minor compaction of the MemStore is blocked and the application may stop responding.
Possible causes
A transaction can be suspended for any of the following reasons: the cluster has no leader because the candidates fail to receive a majority of votes from the other nodes during leader election (the "minority problem"), the disk is full, or the memory is exhausted.
Suggested solutions
Check for the minority problem.
The minority problem is caused by OBServer exceptions or network failures. It also triggers the ob_cannot_connected alert at the same time.
In that case, first resolve the ob_cannot_connected alert by referring to the corresponding topic. Then, check whether the ob_host_exists_expired_trans alert is cleared 5 minutes later.
Check the disk space.
Insufficient disk space also triggers disk-related alerts at the same time. We recommend that you first resolve those alerts by referring to their respective topics, and then check whether the ob_host_exists_expired_trans alert is cleared 5 minutes later.
Check the memory.
Insufficient memory may also trigger the ob_host_mem_percent_over_threshold alert at the same time. For more information about how to handle this alert, see ob_host_mem_percent_over_threshold.
If the problem persists after you perform the preceding operations, run the following statements and send the results to OCP Technical Support for troubleshooting.
-- View the information of all servers to see if a server is faulty.
SELECT * FROM __all_server;
-- View the suspended transaction.
SELECT * FROM __all_virtual_trans_stat WHERE part_trans_action > 2 AND ctx_create_time < DATE_SUB(NOW(), INTERVAL 500 SECOND) LIMIT 100;
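
The troubleshooting order described in this section can be summarized in a short Python sketch. The function name `next_checks` and the `disk_alerts` parameter are hypothetical, and the disk alert name in the usage example is a placeholder, not a real alert name. Only ob_cannot_connected and ob_host_mem_percent_over_threshold come from this topic.

```python
def next_checks(active_alerts, disk_alerts=()):
    """Map alerts that fire alongside ob_host_exists_expired_trans to the checks to perform."""
    active = set(active_alerts)
    steps = []
    if "ob_cannot_connected" in active:
        steps.append("minority problem")
    if set(disk_alerts) & active:
        steps.append("disk space")
    if "ob_host_mem_percent_over_threshold" in active:
        steps.append("memory")
    # No related alert is active: fall back to collecting diagnostics.
    return steps or ["collect diagnostics for OCP Technical Support"]


# "example_disk_alert" is a hypothetical placeholder for a disk-related alert.
plan = next_checks(
    {"ob_cannot_connected", "example_disk_alert"},
    disk_alerts={"example_disk_alert"},
)
```
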