Alert description
This alert monitors whether an OceanBase cluster managed by OCP has gone too long without being frozen. The alert is triggered when the time elapsed since the last freeze exceeds the threshold (90000 seconds by default).
Alerting principle
The following table describes the key parameters involved in the alerting monitoring logic.
| Parameter | Value |
|---|---|
| Monitoring metric | ob_cluster_no_frozen_seconds |
| Metric source | SQL: select zone, name, value, time_to_usec(now()) from __all_zone; |
| Collected metric (unit: microseconds) | current_timestamp, zone_value |
| Monitoring expression | (max(current_timestamp{metric_group="all_zone",name="frozen_time",@LABELS}) by (@GBLABELS) - max(zone_value{metric_group="all_zone",name="frozen_time",@LABELS}) by (@GBLABELS)) / 1000000 |
| Collection interval | 1 second |
The monitoring metric ob_cluster_no_frozen_seconds indicates how many seconds have elapsed since the cluster was last frozen. An alert is triggered when this value exceeds the threshold (90000 seconds by default).
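The following is a minimal query sketch, based on the collection SQL above and assuming sys tenant access, that computes the same value directly. It assumes the frozen_time value in __all_zone is the last freeze timestamp in microseconds, as the collected metric unit suggests.

```sql
-- Seconds elapsed since the last freeze, derived from the frozen_time rows in __all_zone.
SELECT zone,
       (TIME_TO_USEC(NOW()) - value) / 1000000 AS no_frozen_seconds
FROM __all_zone
WHERE name = 'frozen_time';
```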
Rule information
| Monitoring metric | Default threshold (unit: seconds) | Duration | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| ob_cluster_no_frozen_seconds | 90000 | 0 seconds | 60 seconds | 5 minutes |
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Critical | Cluster |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-1 OceanBase cluster freezing detection failed
Alert details
- Template: Cluster: ${ob_cluster_name}, Alert: ${alarm_name}, Time since last freeze: ${value_shown} seconds, which exceeds ${alarm_threshold} seconds.
- Example: Cluster: obcluster-1, Alert: OceanBase cluster freezing detection failed, Time since last freeze: 90001.0 seconds, which exceeds 90000.0 seconds.
Alert recovery
- Template: Alert: ${alarm_name}, Time since last freeze: ${value_shown} seconds
- Example: Alert: OceanBase cluster freezing detection failed, Time since last freeze: 70000.0 seconds
Impact on the system
If a freeze or major compaction is not initiated for a long time, disk usage keeps growing and the disk may eventually become full, which affects business writes.
Possible causes
- Daily major compaction is disabled, or a manual major compaction mode is enabled.
- The Root Service is unavailable, for example, it has no leader or its process is abnormal.
Procedure
1. Check whether daily major compaction is enabled. In the Major Compaction Strategy section on the Major Compaction Strategy page, check whether Major Compaction Time is specified (see also the query sketch below).
    - If it is not specified, freezes cannot be triggered automatically. Turn this setting on manually.
    - If it is specified, another cause is likely at fault. Proceed to step 2.
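    A minimal query sketch for checking the daily major compaction schedule from SQL, assuming sys tenant access and that the cluster exposes the major_freeze_duty_time parameter (verify the parameter name for your OceanBase version):

    ```sql
    -- Assumption: major_freeze_duty_time holds the daily major compaction start time.
    -- A time value such as '02:00' means a daily major compaction is scheduled;
    -- 'disable' means no major compaction is triggered automatically.
    SHOW PARAMETERS LIKE 'major_freeze_duty_time';
    ```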
2. Check whether the Root Service is down due to process exceptions.
    If the observer process is down, the alert ob_cluster_exists_inactive_server, which indicates that an OBServer node in the OceanBase cluster is not working, is reported. Handle that alert by following the instructions in its documentation, and then check whether the current alert is still reported after 5 minutes. You can also check the node status from SQL, as sketched below.
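    A minimal query sketch, assuming sys tenant access, for checking OBServer liveness and locating the node that hosts the Root Service; the status and stop_time columns are taken from the __all_server internal table:

    ```sql
    -- Connect to the sys tenant for querying.
    -- A node whose status is not 'active', or whose stop_time is non-zero,
    -- may indicate a stopped or abnormal observer process.
    SELECT svr_ip, svr_port, zone, with_rootserver, status, stop_time
    FROM __all_server;
    ```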
3. Check whether the Root Service is unowned.
    Run the following query to check whether the Root Service is unowned:

    ```sql
    -- Connect to the sys tenant for querying
    SELECT svr_ip, zone, role, member_list FROM __all_virtual_core_meta_table;
    ```

    If the connection is normal but the query fails, the Root Service is in the unowned state. You can attempt to recover it by restarting the OBServer node that hosts the Root Service, or by restarting all the OBServer nodes in the rootservice_list.
    Find the IP address of the OBServer node that hosts the Root Service by using the following queries:

    ```sql
    -- Connect to the sys tenant for querying
    -- Query the OBServer node that hosts the Root Service
    SELECT zone, svr_ip, svr_port FROM __all_server WHERE with_rootserver=1;
    -- Query the rootservice_list
    -- rootservice_list is a string that uses ';' to separate the OBServer nodes.
    SELECT DISTINCT `value` AS rootservice_list FROM __all_virtual_sys_parameter_stat WHERE `name` = 'rootservice_list';
    ```

    Then, in the OBServers area on the Overview page, click the Restart icon of the OBServer node.
    Note
    If you choose to forcibly restart the OBServer node, its process is terminated and the node is restarted directly. A forcible restart is required when the majority of OBServer nodes cannot work; in that case, some replicas may become unowned after the restart. After the OBServer node is restarted, you may need to wait 15 minutes before you can connect to the sys tenant. Business tenants may also experience a similar issue.
    - If the OBServer node has not recovered after a restart, or after a forcible restart followed by a 15-minute wait, perform step 3 again.
    - If the Root Service is not in the unowned state, another cause is likely at fault. Proceed to step 4.
4. Collect log information from the OBServer node and contact technical support for analysis and resolution.