os_tsar_nvme_ioawait

Updated: 2025-03-26 07:47:21

Description

Among the important disk I/O metrics:

  • svctm reflects the disk performance; its value is usually lower than 20 ms.
  • io_await reflects the I/O processing efficiency. A small io_await value means that the I/O queue is not congested. When io_await approaches svctm, requests barely wait in the I/O queue, so an io_await value slightly greater than svctm is acceptable.
  • io_util is the percentage of CPU time during which I/O requests were issued to the device in a unit of time, and indicates how busy the device is.
  • io_qusize indicates the size of the I/O queue. A larger queue size usually indicates a higher io_await value.

This alert is triggered when io_await is greater than or equal to 20 ms, io_util is greater than or equal to 98%, and io_qusize is greater than or equal to 20 for 30 consecutive seconds.
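
To see the same metrics directly on the host, you can run iostat from the sysstat package. This is only a quick reference; the column names vary by sysstat version, so treat the mapping in the comments as an assumption.

  # Extended per-device statistics, refreshed every second, 5 samples.
  # In older sysstat output: await ~ io_await, avgqu-sz ~ io_qusize, %util ~ io_util.
  # Newer sysstat releases split await into r_await/w_await and rename avgqu-sz to aqu-sz.
  iostat -dx 1 5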

Principle

  • Metric: io_util, io_qusize, io_await
  • Source: These are basic host monitoring metrics collected by node_exporter. They are equivalent to metrics in the iostat -dx command output. To be specific, io_util is equivalent to util, io_qusize is equivalent to avgqu-sz, and io_await is equivalent to await.
  • Collected metrics: io_util, node_disk_io_time_weighted_seconds_total, io_await
  • Metric expressions:
    • io_util: avg(io_util{@LABELS}) by (@GBLABELS)
    • io_qusize: sum(delta(node_disk_io_time_weighted_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)
    • io_await: avg(io_await{@LABELS}) by (@GBLABELS)
  • Collection cycle: 1 second
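
As a rough cross-check of the io_qusize expression, you can query the raw node_exporter counter through the Prometheus HTTP API. The address below and the 1m range are assumptions for illustration; @LABELS and @GBLABELS are placeholders in the expressions above.

  # Hypothetical Prometheus address; adjust the host, port, and device label.
  # rate() of the weighted I/O time counter approximates the average queue size
  # (avgqu-sz), which is what io_qusize represents.
  curl -s 'http://127.0.0.1:9090/api/v1/query' \
    --data-urlencode 'query=rate(node_disk_io_time_weighted_seconds_total{device="dm-0"}[1m])'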

Alert rule

  • Metric expression: io_await > 20 and io_util >= 98 and io_qusize >= 20
  • Metric description:
    • io_util: the percentage of CPU time during which I/O requests were issued to the disk in the unit time. Calculation formula: (r/s + w/s) × svctm/1000. In this formula, r/s means the number of reads per second and w/s means the number of writes per second.
    • svctm: the I/O response time of the disk, in ms.
    • io_await: the I/O waiting time of the disk, in ms.
    • io_qusize: the size of the I/O queue.
  • Default thresholds:
    • io_util: 98%
    • io_await: 20 ms
    • io_qusize: 20
  • Detection cycle: 10 seconds
  • Elimination cycle: 5 minutes
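
As a quick sanity check of the io_util formula above, here is a worked example with hypothetical sample values (300 reads/s, 200 writes/s, svctm of 2 ms):

  # io_util = (r/s + w/s) × svctm / 1000
  #         = (300 + 200) × 2 / 1000 = 1.0, that is, the device is 100% busy.
  awk 'BEGIN { rs = 300; ws = 200; svctm = 2; printf "io_util = %.0f%%\n", (rs + ws) * svctm / 1000 * 100 }'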

Alert information

  • Trigger method: based on the metric expression
  • Alert level: Critical
  • Scope: Server

Alert templates

  • Overview
    • Template: ${alarm_target} ${alarm_name}
    • Example: ob_cluster=obcluster-1631964370:svr_ip=xxx.xxx.xxx.xxx:device=dm-0 os_tsar_nvme_ioawait
  • Details
    • Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, OceanBase Database disk: ${ob_device}, device ${device}, io await ${io_await_value_zh_cn} exceeds ${io_await_alarm_threshold} ms, io util ${io_util_value_zh_cn} exceeds ${io_util_alarm_threshold} %, io qusize ${io_qusize_value} exceeds ${io_qusize_alarm_threshold}.
    • Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_nvme_ioawait, OceanBase Database disk: ob_log, device dm-0, io await 30 ms exceeds 20 ms, io util 99 % exceeds 98 %, io qusize 33 exceeds 20.

Impact on the system

If io_util is greater than 98%, the I/O device runs at full load. If io_await is greater than 20 ms, the processing efficiency of the disk is low or a large number of I/O requests are initiated. This may affect the progress of major and minor compactions.

In OceanBase Database, disks are classified into the data disk (/data/1), the log disk (/data/log1), and the installation disk (/home). This alert takes effect for the data and log disks. When an I/O exception occurs on the data disk or log disk, the services of the OBServer are affected.
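
To confirm whether the device named in the alert backs the data disk or the log disk, map it to its mount point. The device name and mount points below are only the examples used in this article; substitute the values from the actual alert.

  # Show the mount point behind the alerting device (dm-0 is an example).
  lsblk /dev/dm-0

  # Or check which devices back the data and log disks directly.
  df -h /data/1 /data/log1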

Possible causes

1. The disk performance is poor.
2. The CPU has poor performance and cannot process a large number of I/O requests.
3. A large number of concurrent I/O requests are initiated.

Suggested solutions

1. Check whether the problem is caused by poor disk or CPU performance. If it is, upgrade the hardware resources to resolve the problem. A sample disk benchmark is sketched after this list.

2. Check whether the request traffic is as expected and identify the cause of the I/O exception.

   # View the I/O statistics.
   iostat -dx

   # View the processes with high I/O.
   iotop

   # View the files opened by a high-I/O process and check whether the write volume is as expected.
   lsof -p <pid>
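
For the hardware check in step 1, a minimal fio benchmark can measure the raw write latency of the suspect disk. This is only a sketch: the file path, size, and runtime are assumptions, and the test adds load, so run it against a scratch file and preferably in a maintenance window.

  # Random-write latency test; path, size, and runtime are examples.
  fio --name=ioawait_check --filename=/path/to/testfile --rw=randwrite \
      --bs=4k --direct=1 --ioengine=libaio --size=1G --runtime=30 \
      --time_based --group_reporting

  # Watch CPU utilization and %iowait while the test runs.
  mpstat 1 5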

Contact Us