Description
Among the important disk I/O metrics:
- svctm reflects the disk performance, which is usually lower than 20 ms.
- io_await reflects the I/O processing efficiency. If io_await is small, the I/O queue is not congested. When io_await approaches svctm, waiting in the I/O queue is almost not required. An io_await value slightly greater than svctm is acceptable.
- io_util is the percentage of CPU time during which I/O requests were issued to the device in the unit time, which indicates I/O busyness.
This alert is triggered if io_awai is not smaller than 200 ms and io_util not smaller than 99% in 30 consecutive seconds.
Principle
| Parameter | Value |
|---|---|
| Metric | io_util, io_qusize, io_read_write_time |
| Source | These metrics are basic host monitoring metrics collected by node_exporter. The metrics are equivalent to metrics in the iostat -dx command output. To be specific, io_util is equivalent to util, io_qusize is equivalent to avgqu-sz, and io_read_write_time is equivalent to rkB/s and wkB/s. |
| Collected metric | io_util, node_disk_io_time_weighted_seconds_total, node_disk_read_time_seconds_total, node_disk_write_time_seconds_total |
| Metric expression | |
| Collection cycle | 1 second |
Alert rule
| Metric expression | Metric description | Default threshold | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| io_await >= 200 and io_util >= 99 | 10 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Server |
Alert templates
- Overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-1631964370:svr_ip=xxx.xxx.xxx.xxx:device=dm-0 os_tsar_sda_ioawait
- Alert details
- Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, OceanBase Database disk: ${ob_device}, device ${device}, io await ${io_await_value_zh_cn} exceeds ${io_await_alarm_threshold} ms, io util ${io_util_value_zh_cn} exceeds ${io_util_alarm_threshold} %.
- Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_sda_ioawait, OceanBase Database disk: ob_log, device dm-0, io await 300 ms exceeds 200 ms, io util 99.1% exceeds 99%.
Impact on the system
If io_util is greater than 99%, the I/O device runs at full load. If io_await is greater than 200 ms, the processing efficiency of the disk is low or a large number of I/O requests are initiated. This may affect the progress of major and minor compactions.
In OceanBase Database, disks are classified into the data disk (/data/1), log disk (/data/log1), and installation disk (/home). This alert takes effect for the data and log disks. When an I/O exception occurs on the data disk or log disk, the services of the OBServer are affected.
Possible causes
- The disk performance is poor.
- The CPU has poor performance and cannot process a large number of I/O requests.
- A large number of concurrent I/O requests are initiated.
Suggested solutions
- Check whether the problem is caused by poor disk or CPU performance. If yes, you can upgrade the hardware resources to resolve the problem.
- Check whether the request traffic is as expected and identify the cause of the I/O exception.
# View the I/O statistics. iostat -dx # View the processes with high I/O. iotop # View the files written by the processes and check whether the write volume is as expected. lsof -p <pid>