Alert description
This alert monitors whether an OceanBase cluster managed by OCP has gone too long without being frozen. The alert is triggered when the time elapsed since the last freeze exceeds the threshold (90000 seconds by default).
Alerting principle
The following table describes the key parameters involved in the alerting monitoring logic.
| Parameter | Value |
|---|---|
| Monitoring metric | ob_cluster_no_frozen_seconds |
| Metric source | SQL: select zone, name, value, time_to_usec(now()) from __all_zone; |
| Collected metric (unit: microseconds) | current_timestamp, zone_value |
| Monitoring expression | (max(current_timestamp{metric_group="all_zone",name="frozen_time",@LABELS}) by (@GBLABELS) - max(zone_value{metric_group="all_zone",name="frozen_time",@LABELS}) by (@GBLABELS)) / 1000000 |
| Collection interval | 1 second |
The monitoring metric ob_cluster_no_frozen_seconds indicates how many seconds have elapsed since the cluster was last frozen. An alert is triggered when this value exceeds the threshold (90000 seconds by default).
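The following is a minimal query sketch, based on the collection SQL above and assuming sys tenant access, that computes the same value directly. It assumes the frozen_time value in __all_zone is the last freeze timestamp in microseconds, as the collected metric unit suggests.

```sql
-- Seconds elapsed since the last freeze, derived from the frozen_time rows in __all_zone.
SELECT zone,
       (TIME_TO_USEC(NOW()) - value) / 1000000 AS no_frozen_seconds
FROM __all_zone
WHERE name = 'frozen_time';
```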
Rule information
| Monitoring metric | Default threshold (unit: seconds) | Duration | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| ob_cluster_no_frozen_seconds | 90000 | 0 seconds | 60 seconds | 5 minutes |
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Critical | Cluster |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-1 OceanBase cluster freezing detection failed
Alert details
- Template: Cluster: ${ob_cluster_name}, Alert: ${alarm_name}, Time since last freeze: ${value_shown} seconds, which exceeds ${alarm_threshold} seconds.
- Example: Cluster: obcluster-1, Alert: OceanBase cluster freezing detection failed, Time since last freeze: 90001.0 seconds, which exceeds 90000.0 seconds.
Alert recovery
- Template: Alert: ${alarm_name}, Time since last freeze: ${value_shown} seconds
- Example: Alert: OceanBase cluster freezing detection failed, Time since last freeze: 70000.0 seconds
Impact on the system
If a freeze or major compaction is not initiated for a long time, disk usage keeps growing and the disk may eventually become full, which affects business writes.
Possible causes
- Daily major compaction is disabled, or a manual major compaction mode is enabled.
- The Root Service is unavailable, for example, it has no leader or its process is abnormal.
Procedure
1. Check whether daily major compaction is enabled. In the Major Compaction Strategy section on the Major Compaction Strategy page, check whether Major Compaction Time is specified (see also the query sketch below).
    - If it is not specified, freezes cannot be triggered automatically. Turn this setting on manually.
    - If it is specified, another cause is likely at fault. Proceed to step 2.
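    A minimal query sketch for checking the daily major compaction schedule from SQL, assuming sys tenant access and that the cluster exposes the major_freeze_duty_time parameter (verify the parameter name for your OceanBase version):

    ```sql
    -- Assumption: major_freeze_duty_time holds the daily major compaction start time.
    -- A time value such as '02:00' means a daily major compaction is scheduled;
    -- 'disable' means no major compaction is triggered automatically.
    SHOW PARAMETERS LIKE 'major_freeze_duty_time';
    ```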
2. Check whether the Root Service is down due to process exceptions.
    If the observer process is down, the alert ob_cluster_exists_inactive_server, which indicates that an OBServer node in the OceanBase cluster is not working, is reported. Handle that alert by following the instructions in its documentation, and then check whether the current alert is still reported after 5 minutes. You can also check the node status from SQL, as sketched below.
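    A minimal query sketch, assuming sys tenant access, for checking OBServer liveness and locating the node that hosts the Root Service; the status and stop_time columns are taken from the __all_server internal table:

    ```sql
    -- Connect to the sys tenant for querying.
    -- A node whose status is not 'active', or whose stop_time is non-zero,
    -- may indicate a stopped or abnormal observer process.
    SELECT svr_ip, svr_port, zone, with_rootserver, status, stop_time
    FROM __all_server;
    ```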
3. Check whether the Root Service is unowned.
    Run the following query to check whether the Root Service is unowned:

    ```sql
    -- Connect to the sys tenant for querying
    SELECT svr_ip, zone, role, member_list FROM __all_virtual_core_meta_table;
    ```

    If the connection is normal but the query fails, the Root Service is in the unowned state. You can attempt to recover it by restarting the OBServer node that hosts the Root Service, or by restarting all the OBServer nodes in the rootservice_list.
    Find the IP address of the OBServer node that hosts the Root Service by using the following queries:

    ```sql
    -- Connect to the sys tenant for querying
    -- Query the OBServer node that hosts the Root Service
    SELECT zone, svr_ip, svr_port FROM __all_server WHERE with_rootserver=1;
    -- Query the rootservice_list
    -- rootservice_list is a string that uses ';' to separate the OBServer nodes.
    SELECT DISTINCT `value` AS rootservice_list FROM __all_virtual_sys_parameter_stat WHERE `name` = 'rootservice_list';
    ```

    Then, in the OBServers area on the Overview page, click the Restart icon of the OBServer node.
    Note
    If you choose to forcibly restart the OBServer node, its process is terminated and the node is restarted directly. A forcible restart is required when the majority of OBServer nodes cannot work; in that case, some replicas may become unowned after the restart. After the OBServer node is restarted, you may need to wait 15 minutes before you can connect to the sys tenant. Business tenants may also experience a similar issue.
    - If the OBServer node has not recovered after a restart, or after a forcible restart followed by a 15-minute wait, perform step 3 again.
    - If the Root Service is not in the unowned state, another cause is likely at fault. Proceed to step 4.
4. Collect log information from the OBServer node and contact technical support for analysis and resolution.