os_kernel_io_hang

2025-03-26 07:47:21  Updated

Description

OceanBase Cloud Platform (OCP) defines three metrics for identifying an I/O hang. This alert is triggered when all three metrics meet their threshold conditions for 30 seconds. The I/O hang metrics are described as follows:

  • io_util >= 99: The I/O queue runs at full load.
  • io_qusize >= 20: The I/O queue has data read or write requests to process.
  • io_read_write_time == 0: I/O writes or reads are stopped (with no response sent).
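
The trigger condition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OCP's implementation: the `Sample` container, the `is_hang_sample` and `should_alert` names, and the one-sample-per-second default are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    """One observation of the three metrics (hypothetical container)."""
    io_util: float             # percent
    io_qusize: float
    io_read_write_time: float


def is_hang_sample(s: Sample) -> bool:
    # Thresholds taken from the metric descriptions above.
    return s.io_util >= 99 and s.io_qusize >= 20 and s.io_read_write_time == 0


def should_alert(samples: Iterable[Sample], hold_seconds: int = 30,
                 sample_period_seconds: int = 1) -> bool:
    """Return True once the condition has held for hold_seconds in a row."""
    needed = hold_seconds // sample_period_seconds
    streak = 0
    for s in samples:
        # Any sample that breaks the condition resets the streak.
        streak = streak + 1 if is_hang_sample(s) else 0
        if streak >= needed:
            return True
    return False
```

With these assumptions, 30 consecutive matching samples raise the alert, while a single healthy sample in between restarts the count.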

Principle

Parameter: Value
Metric: io_util, io_qusize, io_read_write_time
Source: These metrics are basic host monitoring metrics collected by node_exporter. They are equivalent to columns in the iostat -dx command output. Specifically, io_util is equivalent to %util, io_qusize is equivalent to avgqu-sz, and io_read_write_time is equivalent to rkB/s and wkB/s.
Collected metric: io_util, node_disk_io_time_weighted_seconds_total, node_disk_read_time_seconds_total, node_disk_write_time_seconds_total
Metric expression:
  • io_util: avg(io_util{@LABELS}) by (@GBLABELS)
  • io_qusize: sum(delta(node_disk_io_time_weighted_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)
  • io_read_write_time: avg(rate(node_disk_read_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS) + avg(rate(node_disk_write_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)
Collection cycle: 1 second
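
The io_qusize and io_read_write_time expressions are built from cumulative counters. The sketch below shows, for a single device, how two counter snapshots could be turned into those two values. It is a simplification under stated assumptions: `DiskCounters` and `derive_metrics` are hypothetical names, the aggregation over labels is omitted, and the extrapolation and counter-reset handling that PromQL's delta() and rate() perform are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskCounters:
    """Cumulative counters for one device, in seconds (hypothetical container)."""
    io_time_weighted_seconds_total: float
    read_time_seconds_total: float
    write_time_seconds_total: float


def derive_metrics(prev: DiskCounters, curr: DiskCounters,
                   interval_seconds: float) -> dict:
    """Approximate io_qusize and io_read_write_time from two snapshots."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # delta(): plain difference between the two snapshots.
    io_qusize = (curr.io_time_weighted_seconds_total
                 - prev.io_time_weighted_seconds_total)
    # rate(): per-second increase over the interval.
    read_rate = (curr.read_time_seconds_total
                 - prev.read_time_seconds_total) / interval_seconds
    write_rate = (curr.write_time_seconds_total
                  - prev.write_time_seconds_total) / interval_seconds
    return {
        "io_qusize": io_qusize,
        "io_read_write_time": read_rate + write_rate,
    }
```

In this simplified model, a hang would show up as the weighted I/O time still growing while the read and write time counters stay flat, so io_read_write_time evaluates to 0.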

    Alert rule

    Metric expression: io_util >= 99 and io_qusize >= 20 and io_read_write_time == 0
    Metric description:
      • io_util: the percentage of CPU time during which I/O requests were issued to the disk in the unit time. Calculation formula: (r/s + w/s) × svctm/1000. In this formula, r/s means the number of reads per second, w/s means the number of writes per second, and svctm means the I/O response time.
      • io_qusize: the size of the I/O queue.
      • io_read_write_time: the total time consumed by I/O reads and writes.
    Default threshold:
      • io_util: 99%
      • io_qusize: 20
      • io_read_write_time: 0
    Detection cycle: 10 seconds
    Elimination cycle: 5 minutes
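
The io_util calculation formula can be worked through with a small helper. This is a sketch under two assumptions that the text does not state explicitly: svctm is expressed in milliseconds, and the formula yields a fraction of the unit time that is multiplied by 100 to obtain a percentage. The function name `io_util_percent` is hypothetical.

```python
def io_util_percent(reads_per_sec: float, writes_per_sec: float,
                    svctm_ms: float) -> float:
    """Apply (r/s + w/s) × svctm/1000 and express the result as a percentage."""
    # Requests per second times service time in ms, divided by 1000,
    # gives the fraction of each second the device spent serving I/O.
    busy_fraction = (reads_per_sec + writes_per_sec) * svctm_ms / 1000
    return busy_fraction * 100


# 1200 reads/s plus 800 writes/s at 0.5 ms each keep the device busy
# for the whole second, which is above the 99% default threshold.
saturated = io_util_percent(1200, 800, 0.5)
```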

    Alert information

    Trigger method: Based on the expression of the metric
    Alert level: Critical
    Scope: Server

    Alert templates

    • Overview
      • Template: ${alarm_target} ${alarm_name}
      • Example: ob_cluster=obcluster-1631964370:svr_ip=xxx.xxx.xxx.xxx:device=dm-1 os_kernel_io_hang
    • Details
      • Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, OceanBase Database disk: ${ob_device}, device ${device}, io util ${io_util_value_zh_cn} exceeds ${io_util_alarm_threshold}%, io qusize ${io_qusize_value} exceeds ${io_qusize_alarm_threshold}.
      • Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_kernel_io_hang, OceanBase Database disk: ob_data, device dm-1, io util 99.1% exceeds 99%, io qusize 30 exceeds 20.
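
The ${...} placeholders in the templates above happen to match the syntax of Python's `string.Template`, which makes it easy to illustrate how the details message is assembled. This is only a sketch: OCP's own rendering mechanism is not described here, and `render_details` and `example_values` are hypothetical names. The values are copied from the example above.

```python
from string import Template

DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, "
    "OceanBase Database disk: ${ob_device}, device ${device}, "
    "io util ${io_util_value_zh_cn} exceeds ${io_util_alarm_threshold}%, "
    "io qusize ${io_qusize_value} exceeds ${io_qusize_alarm_threshold}."
)


def render_details(values: dict) -> str:
    # substitute() raises KeyError if any placeholder has no value.
    return DETAILS_TEMPLATE.substitute(values)


example_values = {
    "ob_cluster_name": "obcluster",
    "svr_ip": "xxx.xxx.xxx.xxx",
    "alarm_name": "os_kernel_io_hang",
    "ob_device": "ob_data",
    "device": "dm-1",
    "io_util_value_zh_cn": "99.1%",
    "io_util_alarm_threshold": "99",
    "io_qusize_value": "30",
    "io_qusize_alarm_threshold": "20",
}
message = render_details(example_values)
```

Note that the io util value already carries its percent sign, while the threshold gets its percent sign from the template text.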

    Impact on the system

    An I/O hang stops reads and writes. This is fatal to processes because data cannot be written to disk, and the affected OBServers become unavailable.

    In OceanBase Database, disks are classified into the data disk (/data/1), log disk (/data/log1), and installation disk (/home). This alert takes effect for the data and log disks. When an I/O hang occurs on the data or log disk, OBServers will become unavailable.

    Possible causes

    • A disk in OceanBase Database is damaged.
    • Other issues occur.

    Suggested solutions

    Disconnect the faulty OBServer.
