ob_host_invalid_disk_exists OBServer has a bad disk

Version: V4.3.6

Last Updated: 2025-09-08 08:15:43

Description

When a disk used by an OBServer node (data disk or log disk) fails, the node detects the failure and marks the disk as bad.

Bad disk detection applies only to blocks that have reference counts on the OBServer node. For example, if a data block is written with the dd command and that block is damaged, bad disk detection does not detect it.
Bad disk detection consumes resources. To reduce resource consumption, only blocks that were written at least 2 days ago are checked.

Parameter	Value
Monitoring metric	ob_disk_invalid_count
Metric source	Bad disk flags collected with: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ count(1) as cnt from __all_virtual_disk_stat where is_disk_valid = 0 and svr_ip = ? and svr_port = ?
Collected metric	ob_disk_invalid_count
Monitoring expression	sum(ob_disk_invalid_count{@LABELS}) by (@GBLABELS)
Collection interval	60 seconds
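
The monitoring expression sums the collected counts per label group. As a rough illustration only, the Python sketch below mimics that aggregation on in-memory samples. The label names (ob_cluster, host, svr_port), the sample values, and the helper sum_by are assumptions made for this example; they stand in for whatever @LABELS and @GBLABELS expand to and are not part of the product.

```python
from collections import defaultdict


def sum_by(samples, group_by, label_filter=None):
    """Sum sample values per group, similar to `sum(metric{filter}) by (group_by)`."""
    label_filter = label_filter or {}
    totals = defaultdict(float)
    for labels, value in samples:
        # Keep only samples whose labels match every filter entry.
        if any(labels.get(k) != v for k, v in label_filter.items()):
            continue
        key = tuple(labels.get(name, "") for name in group_by)
        totals[key] += value
    return dict(totals)


# Hypothetical ob_disk_invalid_count samples as (labels, value) pairs.
samples = [
    ({"ob_cluster": "TEST", "host": "10.0.0.1", "svr_port": "2882"}, 1),
    ({"ob_cluster": "TEST", "host": "10.0.0.1", "svr_port": "3882"}, 1),
    ({"ob_cluster": "TEST", "host": "10.0.0.2", "svr_port": "2882"}, 0),
]

result = sum_by(
    samples,
    group_by=("ob_cluster", "host"),
    label_filter={"ob_cluster": "TEST"},
)
```

With these samples, the group for host 10.0.0.1 sums to 2 and would satisfy the alert condition, while the group for 10.0.0.2 sums to 0.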

Monitoring expression	Meaning of the monitoring metric	Default threshold	Detection cycle	Elimination cycle
ob_disk_invalid_count > 0	OBServer has a bad disk	0	20 seconds	5 minutes
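
The interaction of the threshold, detection cycle, and elimination cycle can be sketched in Python as follows. This is a simplified model under an assumption the table does not spell out: the rule is evaluated once per detection cycle, and the alert is treated as recovered once the condition has not held for a full elimination cycle. The class BadDiskAlert and its fields are hypothetical names used only for this illustration.

```python
class BadDiskAlert:
    """Tracks alert state for the rule `ob_disk_invalid_count > threshold`."""

    def __init__(self, threshold=0, elimination_seconds=300):
        self.threshold = threshold
        # Assumption: the alert clears after this long without a breach.
        self.elimination_seconds = elimination_seconds
        # Time of the most recent check that exceeded the threshold.
        self.last_breach = None

    def check(self, value, now):
        """Run one detection at time `now` (seconds); return True while the alert is active."""
        if value > self.threshold:
            self.last_breach = now
        if self.last_breach is None:
            return False
        return now - self.last_breach < self.elimination_seconds


alert = BadDiskAlert()
# (time in seconds, metric value) pairs, sampled on some 20-second detection ticks.
checks = [(0, 0), (20, 1), (40, 0), (300, 0), (320, 0)]
states = [alert.check(value, now) for now, value in checks]
```

In this model the alert becomes active at the 20-second check, stays active while fewer than 300 seconds have passed since the last breach, and is inactive again at the 320-second check.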

Alert trigger method	Alert level	Scope
Based on the expression of the monitoring metric	Severe	Host

Alert summary
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=TEST-1:host=xxx.xxx.xxx.xxx OBServer has a bad disk
Alert details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}.
- Example: Cluster: TEST, Host: xxx.xxx.xxx.xxx, Alert: OBServer has a bad disk.
Alert recovery
- Template: Alert: ${alarm_name}, Number of hosts with bad disks: ${recover_value}
- Example: Alert: OBServer has a bad disk, Number of hosts with bad disks: 0
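
The ${name} placeholders in these templates happen to match the syntax of Python's string.Template, so the examples above can be reproduced with a short sketch. The field values are copied from the examples; how the alerting system itself renders templates is not described here, so treat this only as an illustration of the substitution.

```python
from string import Template

SUMMARY = Template("${alarm_target} ${alarm_name}")
DETAILS = Template("Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}.")
RECOVERY = Template("Alert: ${alarm_name}, Number of hosts with bad disks: ${recover_value}")

# Field values taken from the examples in this section.
fields = {
    "alarm_target": "alarm_template_id=0:ob_cluster=TEST-1:host=xxx.xxx.xxx.xxx",
    "alarm_name": "OBServer has a bad disk",
    "ob_cluster_name": "TEST",
    "host": "xxx.xxx.xxx.xxx",
    "recover_value": 0,
}

# substitute() ignores unused keys and raises KeyError if a placeholder is missing.
summary = SUMMARY.substitute(fields)
details = DETAILS.substitute(fields)
recovery = RECOVERY.substitute(fields)
```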

When a bad disk is detected, the OBServer node will exit. Possible impacts:

Business requests on the host with the bad disk are interrupted. As long as a majority of the nodes are still working, the cluster can continue to provide services.
Availability decreases. For example, a three-node OceanBase cluster is reduced to two nodes, which lowers its availability. In this case, fix the issue or add nodes as soon as possible.
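
The majority arithmetic behind these impacts can be checked with a small Python sketch. It assumes one replica per node and a simple majority rule (more than half of the replicas); the function names are hypothetical and chosen only for this example.

```python
def majority(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerable_failures(replicas, alive):
    """How many more nodes may fail before the majority is lost.

    A negative result means the majority is already lost.
    """
    return alive - majority(replicas)


# Three-node cluster, all nodes healthy: one failure can be tolerated.
healthy = tolerable_failures(replicas=3, alive=3)

# One node has exited because of a bad disk: no further failure can be tolerated.
degraded = tolerable_failures(replicas=3, alive=2)
```

Under these assumptions the degraded cluster still has a majority and keeps serving requests, but one more node failure would leave it without one, which is why the disk should be repaired or nodes added promptly.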

There are no common causes for this alert.

Confirm the cause of the bad disk:

select * from __all_virtual_bad_block_table where svr_ip = ? and svr_port = ?

Check whether the OBServer log contains the keyword: error occours on data disk or slog disk. A matching log entry also generates a related log alert.
If a bad disk is detected, the OBServer node exits. Fix the bad disk before restarting the OBServer node.
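
The log check described above can be scripted. The Python sketch below searches an iterable of log lines for the keyword, which is copied verbatim from this page. The sample lines and their format are made up for illustration and do not reflect the real OBServer log layout.

```python
# Keyword copied verbatim from the troubleshooting step above.
KEYWORD = "error occours on data disk or slog disk"


def find_disk_errors(lines, keyword=KEYWORD):
    """Return (line_number, line) pairs for log lines that contain the keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if keyword in line
    ]


# Hypothetical log lines; the real log format may differ.
sample = [
    "[2025-09-08 08:00:00] INFO heartbeat ok\n",
    "[2025-09-08 08:00:20] ERROR error occours on data disk or slog disk\n",
]
hits = find_disk_errors(sample)
```

Because the function accepts any iterable of strings, an open file object for the OBServer log can be passed in place of the sample list.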