ob_host_invalid_disk_exists

2025-03-26 07:47:21  Updated

Description

This alert is triggered when the observer process detects a bad data disk or log disk of an OBServer node.

  • In bad disk detection, only blocks with reference counting enabled in the OBServer node are inspected. Bad blocks for which reference counting is disabled cannot be detected.
  • Bad disk detection consumes considerable resources. To reduce resource consumption, only blocks that have been written to the disk at least two days ago are inspected.

Principle

Parameter Value
Metric ob_disk_invalid_count
Source Number of bad disk flags: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ count(1) as cnt from __all_virtual_disk_stat where is_disk_valid = 0 and svr_ip = ? and svr_port = ?
Collected metric ob_disk_invalid_count
Metric expression sum(ob_disk_invalid_count{@LABELS}) by (@GBLABELS)
Collection cycle 60 seconds

Alert rule

Metric expression Metric description Default threshold Detection cycle Time before clearance
ob_disk_invalid_count > 0 A bad disk exists on the OBServer node. 0 20 seconds 5 minutes

Alert information

Trigger method Alert level Scope
Based on the expression of the metric Critical Host

Alert templates

  • Overview

    • Template: ${alarm_target} ${alarm_name}
    • Example: alarm_template_id=0:ob_cluster=TEST-1:host=xxx.xxx.xxx.xxx. A bad disk exists on the OBServer node.
  • Details

    • Template: Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}.
    • Example: Cluster: TEST. Host: xxx.xxx.xxx.xxx. Alert: A bad disk exists on the OBServer node.

Impact on the system

If a bad disk is detected, the observer process automatically exits the node. The possible impact is as follows:

  1. The execution of service requests on the OBServer node are interrupted. Services are not interrupted provided that the majority of replicas are still available.
  2. The availability is compromised. For example, the number of nodes of the OceanBase cluster is reduced from 3 to 2. In this case, you should fix the issue or add an available node as soon as possible.

Possible causes

None.

Solutions

  1. Confirm the cause of the disk damage:

    select * from __all_virtual_bad_block_table where svr_ip = ? and svr_port = ?
    
  2. Search the OBServer log for the error occurs on data disk or slog disk keyword.

  3. If a bad disk is detected, the observer process automatically exits the node. You must repair the bad block on the disk before you start the OBServer node.

Contact Us