Description
When an OBServer node detects a failure on a data disk or log disk, it marks that disk as a bad disk.
- Bad disk detection applies only to blocks that have reference counts on the OBServer node. For example, a data block that is written directly with the dd command and is damaged will not be detected.
- Bad disk detection consumes resources. To reduce resource consumption, only blocks that were written at least 2 days ago are checked.
Principle
| Parameter | Value |
|---|---|
| Monitoring metric | ob_disk_invalid_count |
| Metric source | Number of disks marked as bad, collected with: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ count(1) as cnt from __all_virtual_disk_stat where is_disk_valid = 0 and svr_ip = ? and svr_port = ? |
| Collected metric | ob_disk_invalid_count |
| Monitoring expression | sum(ob_disk_invalid_count{@LABELS}) by (@GBLABELS) |
| Collection interval | 60 seconds |
Rule information
| Monitoring expression | Meaning of the monitoring metric | Default threshold | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| ob_disk_invalid_count > 0 | OBServer has a bad disk | 0 | 20 seconds | 5 minutes |
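The monitoring expression sums the collected ob_disk_invalid_count samples per label group, and the rule fires when the result is greater than 0. The following is a minimal Python sketch of that logic, not the way OCP evaluates it internally. The `sum_by` helper, the sample layout and the label names `ob_cluster` and `host` are assumptions made for illustration.

```python
from collections import defaultdict


def sum_by(samples, group_labels):
    """Sum sample values per combination of the given labels.

    Mirrors the shape of sum(metric{...}) by (labels). The sample
    layout (a dict with "labels" and "value") is an assumption.
    """
    totals = defaultdict(int)
    for sample in samples:
        key = tuple(sample["labels"][name] for name in group_labels)
        totals[key] += sample["value"]
    return dict(totals)


# Hypothetical samples of ob_disk_invalid_count.
samples = [
    {"labels": {"ob_cluster": "TEST", "host": "host-a"}, "value": 0},
    {"labels": {"ob_cluster": "TEST", "host": "host-b"}, "value": 1},
    {"labels": {"ob_cluster": "TEST", "host": "host-b"}, "value": 1},
]

totals = sum_by(samples, ["ob_cluster", "host"])

# The rule ob_disk_invalid_count > 0 uses a default threshold of 0.
THRESHOLD = 0
alerting = [key for key, count in totals.items() if count > THRESHOLD]
```

In this sketch only the group for `host-b` would raise the alert, because its summed count is above the threshold.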
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Severe | Host |
Alert template
Alert summary
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=TEST-1:host=xxx.xxx.xxx.xxx OBServer has a bad disk
Alert details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}.
- Example: Cluster: TEST, Host: xxx.xxx.xxx.xxx, Alert: OBServer has a bad disk.
Alert recovery
- Template: Alert: ${alarm_name}, Number of hosts with bad disks: ${recover_value}
- Example: Alert: OBServer has a bad disk, Number of hosts with bad disks: 0
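The templates above use `${name}` placeholders. As an illustration of how the examples follow from the templates, the sketch below renders them with Python's standard `string.Template`, which happens to use the same placeholder syntax. This is only a demonstration of the substitution; it is not a statement about how OCP renders alerts.

```python
from string import Template

# Templates copied from this section.
SUMMARY = Template("${alarm_target} ${alarm_name}")
DETAILS = Template(
    "Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}."
)
RECOVERY = Template(
    "Alert: ${alarm_name}, Number of hosts with bad disks: ${recover_value}"
)

# Values taken from the examples in this section.
values = {
    "alarm_target": "alarm_template_id=0:ob_cluster=TEST-1:host=xxx.xxx.xxx.xxx",
    "alarm_name": "OBServer has a bad disk",
    "ob_cluster_name": "TEST",
    "host": "xxx.xxx.xxx.xxx",
    "recover_value": 0,
}

# substitute() ignores unused keys and converts values with str().
summary = SUMMARY.substitute(values)
details = DETAILS.substitute(values)
recovery = RECOVERY.substitute(values)
```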
Impact on the system
When a bad disk is detected, the OBServer node exits. Possible impacts:
- Business requests on the host with the bad disk are interrupted. As long as a majority of nodes are still working, the OceanBase cluster can continue to provide services.
- Availability decreases. For example, a three-node OceanBase cluster effectively becomes a two-node cluster and can no longer tolerate another node failure. Fix the bad disk or add nodes as soon as possible.
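The majority condition can be checked with simple arithmetic. The sketch below treats the cluster as a single group of equally weighted nodes, which is a simplification made for illustration; `has_majority` is a hypothetical helper, not an OceanBase API.

```python
def has_majority(working_nodes: int, total_nodes: int) -> bool:
    """Return True if strictly more than half of the nodes are working."""
    return 2 * working_nodes > total_nodes


# Three-node cluster after one node exits because of a bad disk.
after_one_failure = has_majority(working_nodes=2, total_nodes=3)

# The same cluster if a second node also fails.
after_two_failures = has_majority(working_nodes=1, total_nodes=3)
```

With one node down the majority still holds, but a second failure would break it, which is why the bad disk should be handled promptly.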
Possible causes
No common causes.
Handling method
Confirm the cause of the bad disk:
- Query the bad block table: select * from __all_virtual_bad_block_table where svr_ip = ? and svr_port = ?
- Check whether the OBServer log contains the keyword error occours on data disk or slog disk, which generates a related log alert.
After a bad disk is detected, the OBServer node exits. The bad disk must be fixed before the OBServer node is restarted.
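The log check can be scripted. The following is a minimal Python sketch that searches log lines for the keyword quoted in this section, with its spelling kept verbatim. The helper name, the sample lines and the commented-out file path are hypothetical.

```python
# Keyword quoted from this document; spelling intentionally unchanged.
KEYWORD = "error occours on data disk or slog disk"


def find_bad_disk_lines(lines):
    """Return (line_number, text) pairs for lines containing the keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if KEYWORD in line
    ]


# Made-up sample lines, only to exercise the function.
sample_lines = [
    "INFO startup finished\n",
    "ERROR error occours on data disk or slog disk\n",
    "INFO heartbeat ok\n",
]
hits = find_bad_disk_lines(sample_lines)

# Against a real log file (the path is an assumption):
# with open("/path/to/observer.log", errors="replace") as log_file:
#     hits = find_bad_disk_lines(log_file)
```

Passing the open file object directly lets the function stream the log line by line instead of loading it into memory first.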