os_tsar_sda_ioawait|V4.3.3| docs|Distributed Database

os_tsar_sda_ioawait

Last Updated：2025-01-10 06:15:54 Updated

Description

Among the important disk I/O metrics:

svctm reflects the disk performance, which is usually lower than 20 ms.
io_await reflects the I/O processing efficiency. If io_await is small, the I/O queue is not congested. When io_await approaches svctm, waiting in the I/O queue is almost not required. An io_await value slightly greater than svctm is acceptable.
io_util is the percentage of CPU time during which I/O requests were issued to the device in the unit time, which indicates I/O busyness.

This alert is triggered if io_awai is not smaller than 200 ms and io_util not smaller than 99% in 30 consecutive seconds.

Principle

Parameter	Value
Metric	io_util, io_qusize, io_read_write_time
Source	These metrics are basic host monitoring metrics collected by node_exporter. The metrics are equivalent to metrics in the `iostat -dx` command output. To be specific, io_util is equivalent to util, io_qusize is equivalent to avgqu-sz, and io_read_write_time is equivalent to rkB/s and wkB/s.
Collected metric	io_util, node_disk_io_time_weighted_seconds_total, node_disk_read_time_seconds_total, node_disk_write_time_seconds_total
Metric expression	io_util: avg(io_util{@LABELS}) by (@GBLABELS) io_qusize: sum(delta(node_disk_io_time_weighted_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS) io_read_write_time: avg(rate(node_disk_read_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS) + avg(rate(node_disk_write_time_seconds_total{@LABELS}[@INTERVAL])) by (@GBLABELS)
Collection cycle	1 second

Alert rule

Metric expression	Metric description	Default threshold	Detection cycle	Elimination cycle
io_await >= 200 and io_util >= 99	io_util: the percentage of CPU time during which I/O requests were issued to the disk in the unit time. Calculation formula: (r/s + w/s) × svctm/1000. In this formula, r/s means the number of reads per second and w/s means the number of writes per second. svctm: the I/O response time of the disk, in ms. io_await: the I/O waiting time of the disk, in ms.	io_util: 99% io_await: 200 ms	10 seconds	5 minutes

Alert information

Trigger method	Alert level	Scope
Based on the expression of the metric	Critical	Server

Alert templates

Overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-1631964370:svr_ip=xxx.xxx.xxx.xxx:device=dm-0 os_tsar_sda_ioawait
Alert details
- Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, OceanBase Database disk: ${ob_device}, device ${device}, io await ${io_await_value_zh_cn} exceeds ${io_await_alarm_threshold} ms, io util ${io_util_value_zh_cn} exceeds ${io_util_alarm_threshold} %.
- Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_sda_ioawait, OceanBase Database disk: ob_log, device dm-0, io await 300 ms exceeds 200 ms, io util 99.1% exceeds 99%.

Impact on the system

If io_util is greater than 99%, the I/O device runs at full load. If io_await is greater than 200 ms, the processing efficiency of the disk is low or a large number of I/O requests are initiated. This may affect the progress of major and minor compactions.

In OceanBase Database, disks are classified into the data disk (/data/1), log disk (/data/log1), and installation disk (/home). This alert takes effect for the data and log disks. When an I/O exception occurs on the data disk or log disk, the services of the OBServer are affected.

Possible causes

The disk performance is poor.
The CPU has poor performance and cannot process a large number of I/O requests.
A large number of concurrent I/O requests are initiated.