Description
This alert is triggered when the observer process detects a bad data disk or log disk of an OBServer node.
- Bad disk detection inspects only blocks that have reference counting enabled on the OBServer node. Bad blocks for which reference counting is disabled cannot be detected.
- Bad disk detection consumes considerable resources. To reduce resource consumption, only blocks that were written to the disk at least two days earlier are inspected.
Principle
| Parameter | Value |
|---|---|
| Metric | ob_disk_invalid_count |
| Source | Number of bad disk flags: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ count(1) as cnt from __all_virtual_disk_stat where is_disk_valid = 0 and svr_ip = ? and svr_port = ? |
| Collected metric | ob_disk_invalid_count |
| Metric expression | sum(ob_disk_invalid_count{@LABELS}) by (@GBLABELS) |
| Collection cycle | 60 seconds |
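The following is a minimal sketch of what the source query counts. It runs the query text from the table against an in-memory SQLite stand-in, not against OceanBase. The stand-in table holds only the three columns the query references, and the IP addresses, port and rows are hypothetical sample values. SQLite treats the `/*+ ... */` hint as an ordinary comment, so the hint has no effect here.

```python
import sqlite3

# Query text taken from the "Source" row of the table above.
QUERY = (
    "select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ count(1) as cnt "
    "from __all_virtual_disk_stat "
    "where is_disk_valid = 0 and svr_ip = ? and svr_port = ?"
)


def make_stand_in():
    """Build an in-memory stand-in table (assumption: only the queried columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table __all_virtual_disk_stat "
        "(svr_ip text, svr_port integer, is_disk_valid integer)"
    )
    # Hypothetical sample rows: one node with a bad disk flag, one healthy node.
    conn.executemany(
        "insert into __all_virtual_disk_stat values (?, ?, ?)",
        [("10.0.0.1", 2882, 0), ("10.0.0.2", 2882, 1)],
    )
    return conn


def count_invalid_disks(conn, svr_ip, svr_port):
    """Return the number of bad disk flags recorded for one node."""
    row = conn.execute(QUERY, (svr_ip, svr_port)).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = make_stand_in()
    print(count_invalid_disks(conn, "10.0.0.1", 2882))
```

Any value greater than 0 for a node is what the alert rule later compares against its threshold.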
Alert rule
| Metric expression | Metric description | Default threshold | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_disk_invalid_count > 0 | A bad disk exists on the OBServer node. | 0 | 20 seconds | 5 minutes |
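The rule in the table can be illustrated with a small state machine. This is only a sketch under one assumption that the source does not confirm: "Time before clearance" is read as "the alert clears once the metric has stayed at or below the threshold for 5 minutes". The class and method names are hypothetical.

```python
class BadDiskAlert:
    """Illustrative evaluation of `ob_disk_invalid_count > 0` (assumed semantics)."""

    def __init__(self, threshold=0, clearance_seconds=300):
        self.threshold = threshold                  # default threshold from the table
        self.clearance_seconds = clearance_seconds  # 5 minutes before clearance
        self.firing = False
        self.last_breach = None                     # time of the most recent breach

    def observe(self, now, value):
        """Feed one detection result; `now` is a timestamp in seconds."""
        if value > self.threshold:
            self.firing = True
            self.last_breach = now
        elif self.firing and now - self.last_breach >= self.clearance_seconds:
            self.firing = False
        return self.firing


if __name__ == "__main__":
    alert = BadDiskAlert()
    # One detection every 20 seconds; the metric is above 0 only at t=20.
    for t in range(0, 341, 20):
        value = 1 if t == 20 else 0
        print(t, alert.observe(t, value))
```

Under this reading, a single breach keeps the alert active until 5 minutes have passed without another breach.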
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Host |
Alert templates
Overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=TEST-1:host=xxx.xxx.xxx.xxx. A bad disk exists on the OBServer node.
Details
- Template: Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}.
- Example: Cluster: TEST. Host: xxx.xxx.xxx.xxx. Alert: A bad disk exists on the OBServer node.
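To show how the `${...}` placeholders expand, here is a minimal sketch that uses Python's `string.Template`, which happens to share the same placeholder syntax. How the alert templates are actually rendered is not described in this document, so treat this purely as an illustration. The values are taken from the examples above, and the trailing punctuation of the rendered text may differ from the product's output.

```python
from string import Template

# Template texts copied from the Overview and Details entries above.
OVERVIEW = "${alarm_target} ${alarm_name}"
DETAILS = "Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}."


def render(template_text, values):
    """Replace every ${name} placeholder; raises KeyError if a value is missing."""
    return Template(template_text).substitute(values)


# Sample values based on the examples in this section.
values = {
    "alarm_target": "alarm_template_id=0:ob_cluster=TEST-1:host=xxx.xxx.xxx.xxx",
    "alarm_name": "A bad disk exists on the OBServer node",
    "ob_cluster_name": "TEST",
    "host": "xxx.xxx.xxx.xxx",
}

if __name__ == "__main__":
    print(render(OVERVIEW, values))
    print(render(DETAILS, values))
```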
Impact on the system
If a bad disk is detected, the observer process on the node exits automatically. The possible impact is as follows:
- Service requests being executed on the OBServer node are interrupted. Services as a whole are not interrupted provided that the majority of replicas are still available.
- Availability is compromised. For example, the number of nodes in the OceanBase cluster is reduced from 3 to 2. In this case, fix the issue or add an available node as soon as possible.
Possible causes
None.
Solutions
Confirm the cause of the disk damage:
- Query the bad block table: select * from __all_virtual_bad_block_table where svr_ip = ? and svr_port = ?
- Search the OBServer log for the keyword "error occurs on data disk or slog disk".
If a bad disk is detected, the observer process on the node exits automatically. You must repair the bad block on the disk before you start the OBServer node.
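The log search described above can be sketched as follows. This is a minimal example that assumes the log is a plain text file. The log path is not specified in this document, so it is passed in as a command-line argument, and the function name is hypothetical.

```python
import sys

# Keyword named in the solution steps above.
KEYWORD = "error occurs on data disk or slog disk"


def find_bad_disk_lines(lines):
    """Return (line_number, text) pairs for every line containing the keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if KEYWORD in line
    ]


if __name__ == "__main__":
    # Usage: python find_bad_disk.py <path-to-observer-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for number, text in find_bad_disk_lines(log_file):
            print(f"{number}: {text}")
```

The matching lines give the time of the failure, which can then be compared with the rows returned from `__all_virtual_bad_block_table`.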