ob_host_invalid_disk_exists|V4.3.1| docs|Distributed Database

ob_host_invalid_disk_exists

Last Updated: 2024-08-23 10:14:19

Description

This alert is triggered when the observer process detects a bad data disk or log disk of an OBServer node.

During bad disk detection, only blocks that have reference counting enabled on the OBServer node are inspected. Bad blocks without reference counting cannot be detected.
Bad disk detection consumes considerable resources. To reduce this overhead, only blocks that were written to the disk at least two days earlier are inspected.
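
The two selection rules above can be read as a simple filter. The following is only an illustrative sketch: the Block record, its field names, and the blocks_to_inspect helper are hypothetical and do not reflect OBServer internals.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the description: only blocks written at least two days ago are inspected.
MIN_AGE = timedelta(days=2)


@dataclass
class Block:
    """Hypothetical block record, not an actual OBServer structure."""
    block_id: int
    ref_counted: bool      # whether reference counting is enabled for the block
    written_at: datetime   # when the block was written to the disk


def blocks_to_inspect(blocks, now):
    """Return the blocks that qualify for bad disk detection."""
    return [
        b for b in blocks
        if b.ref_counted and now - b.written_at >= MIN_AGE
    ]
```

A block that was written recently, or one without reference counting, is skipped by this filter, which is why a bad block of either kind would go unnoticed.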

Parameter	Value
Metric	ob_disk_invalid_count
Source	Number of bad disk flags: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ count(1) as cnt from __all_virtual_disk_stat where is_disk_valid = 0 and svr_ip = ? and svr_port = ?
Collected metric	ob_disk_invalid_count
Metric expression	sum(ob_disk_invalid_count{@LABELS}) by (@GBLABELS)
Collection cycle	60 seconds
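
The metric expression sums ob_disk_invalid_count over the matching series and groups the result by a set of labels. The sketch below shows that aggregation in plain Python. It assumes that @GBLABELS expands to a list of label names; the label names ob_cluster and host are borrowed from the alert example and are assumptions here, as is the sum_by helper.

```python
from collections import defaultdict


def sum_by(samples, group_labels):
    """Sum sample values, grouped by the given label names.

    samples: iterable of (labels_dict, value) pairs.
    group_labels: label names to group by (stands in for @GBLABELS).
    """
    totals = defaultdict(int)
    for labels, value in samples:
        key = tuple(labels.get(name) for name in group_labels)
        totals[key] += value
    return dict(totals)


# Hypothetical samples: one host reports a bad disk flag, the other does not.
samples = [
    ({"ob_cluster": "TEST", "host": "host-a"}, 1),
    ({"ob_cluster": "TEST", "host": "host-b"}, 0),
]
per_host = sum_by(samples, ["ob_cluster", "host"])
```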

Metric expression	Metric description	Default threshold	Detection cycle	Time before clearance
ob_disk_invalid_count > 0	A bad disk exists on the OBServer node.	0	20 seconds	5 minutes

Trigger method	Alert level	Scope
Based on the expression of the metric	Critical	Host
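
One way to read the rule above is as a small state machine: the alert fires as soon as a sample exceeds the threshold, and it clears only after the condition has stayed false for the clearance period. The sketch below follows that reading. It is an assumption about how the threshold and the time before clearance interact, not a description of how the alerting system is implemented, and the DiskAlert class is hypothetical.

```python
THRESHOLD = 0                # default threshold from the rule table
CLEARANCE_SECONDS = 5 * 60   # time before clearance from the rule table


class DiskAlert:
    """Hypothetical alert state for ob_disk_invalid_count > 0."""

    def __init__(self):
        self.active = False
        self.last_breach = None  # timestamp (seconds) of the last breaching sample

    def evaluate(self, invalid_count, now):
        """Feed one sample taken at time `now` (seconds); return the alert state."""
        if invalid_count > THRESHOLD:
            self.active = True
            self.last_breach = now
        elif self.active and now - self.last_breach >= CLEARANCE_SECONDS:
            # No breach for the whole clearance period: clear the alert.
            self.active = False
        return self.active
```

With samples evaluated every 20 seconds, a single breaching sample keeps the alert active until five minutes have passed without another breach.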

Overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=TEST-1:host=xxx.xxx.xxx.xxx. A bad disk exists on the OBServer node.
Details
- Template: Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}.
- Example: Cluster: TEST. Host: xxx.xxx.xxx.xxx. Alert: A bad disk exists on the OBServer node.
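
The templates use ${name} placeholders. How the alerting system renders them is not described here; as an illustration only, Python's string.Template happens to use the same placeholder syntax, so the Details template can be filled in as follows. The values are taken from the example above.

```python
from string import Template

# Details template as given above.
details = Template("Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}.")

message = details.substitute(
    ob_cluster_name="TEST",
    host="xxx.xxx.xxx.xxx",
    alarm_name="A bad disk exists on the OBServer node",
)
print(message)
```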

If a bad disk is detected, the observer process on the node exits automatically. The possible impact is as follows:

The execution of service requests on the OBServer node is interrupted. Services are not interrupted, provided that a majority of replicas are still available.
The availability is compromised. For example, the number of nodes of the OceanBase cluster is reduced from 3 to 2. In this case, you should fix the issue or add an available node as soon as possible.
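
The majority condition can be checked with simple arithmetic. The sketch below assumes one replica per node, which matches the three-node example above; the has_majority helper is hypothetical.

```python
def has_majority(total_replicas, available_replicas):
    """Return True if strictly more than half of the replicas are available."""
    return available_replicas > total_replicas // 2


# Three-node cluster after one node exits: 2 of 3 replicas remain.
still_serving = has_majority(3, 2)

# If a second node fails before the first is repaired, only 1 of 3 remains.
after_second_failure = has_majority(3, 1)
```

In the three-node case the cluster tolerates the first failure but not a second one, which is why the bad disk should be repaired or a node added promptly.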

None.

Confirm the cause of the disk damage:

select * from __all_virtual_bad_block_table where svr_ip = ? and svr_port = ?

Search the OBServer log for the keyword "error occurs on data disk or slog disk".
If a bad disk is detected, the observer process on the node exits automatically. You must repair the bad block on the disk before you restart the OBServer node.
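
The log search in the troubleshooting steps can be scripted. The following is a minimal sketch: the find_bad_disk_lines helper is hypothetical, and the log file location depends on your deployment, so it is passed in as an argument rather than assumed.

```python
from pathlib import Path

# Keyword named in the troubleshooting steps.
KEYWORD = "error occurs on data disk or slog disk"


def find_bad_disk_lines(log_path):
    """Return (line_number, text) pairs for log lines that contain the keyword."""
    matches = []
    # errors="replace" keeps the scan going if the log contains invalid bytes.
    with Path(log_path).open(encoding="utf-8", errors="replace") as log:
        for line_no, line in enumerate(log, start=1):
            if KEYWORD in line:
                matches.append((line_no, line.rstrip("\n")))
    return matches
```

Calling find_bad_disk_lines with the path of an OBServer log file returns every matching line with its line number, which makes it easier to locate the surrounding context in the log.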