os_tsar_cpu_util_hwm|V4.3.0| docs|Distributed Database

os_tsar_cpu_util_hwm

Last Updated: 2024-07-30 09:31:29

Description

This alert is triggered when CPU utilization (util + steal) exceeds 90% while the percentage of CPU steal time is lower than 20%.

Parameter	Value
Metric	cpu_used_percent, cpu_steal
Source	This metric is a basic host monitoring metric collected by node_exporter. The metric is equivalent to CPU utilization (util = user + system + nice + irq + softirq) in the top command output.
Collected metric	node_cpu_seconds_total
Metric expression	cpu_used_percent: 100 * (1 - sum(rate(node_cpu_seconds_total{mode=~"idle|iowait", @LABELS}[@INTERVAL])) by (@GBLABELS)); cpu_steal: 100 * sum(rate(node_cpu_seconds_total{mode="steal", @LABELS}[@INTERVAL])) by (@GBLABELS)
Collection cycle	1 second
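
The two metric expressions above can be illustrated with a minimal Python sketch. It assumes a single CPU and two samples of the `node_cpu_seconds_total` counters taken `interval_seconds` apart. The function name `cpu_percentages` and the dictionary layout are hypothetical and only serve the illustration; the monitoring system itself evaluates the expressions in its own query engine.

```python
# Modes that the cpu_used_percent expression treats as "not used".
UNUSED_MODES = {"idle", "iowait"}


def cpu_percentages(prev, curr, interval_seconds):
    """Approximate cpu_used_percent and cpu_steal for one CPU.

    prev and curr map a mode name (user, system, idle, ...) to the value of
    node_cpu_seconds_total at the start and at the end of the window.
    """
    # rate(): per-second increase of each counter over the window.
    rate = {mode: (curr[mode] - prev[mode]) / interval_seconds for mode in curr}
    unused = sum(r for mode, r in rate.items() if mode in UNUSED_MODES)
    return {
        # Everything that is not idle or iowait counts as used, including steal.
        "cpu_used_percent": 100 * (1 - unused),
        "cpu_steal": 100 * rate.get("steal", 0.0),
    }


if __name__ == "__main__":
    modes = ["user", "system", "nice", "irq", "softirq", "steal", "idle", "iowait"]
    prev = {m: 0.0 for m in modes}
    # Hypothetical counter values after a 10-second window.
    curr = {"user": 6.0, "system": 2.5, "nice": 0.0, "irq": 0.1,
            "softirq": 0.4, "steal": 0.5, "idle": 0.4, "iowait": 0.1}
    print(cpu_percentages(prev, curr, 10))
```

Because `steal` is not excluded by the `idle|iowait` filter, it is part of `cpu_used_percent`, which matches the "util + steal" wording in the description.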

Metric expression	Metric description	Default threshold (unit: percentage)	Detection cycle	Elimination cycle
cpu_used_percent > 90 and cpu_steal < 20	CPU steal: the percentage of time that a virtual machine spends waiting for the hypervisor to give it time on a physical CPU. The percentage can be higher than the limit. CPU utilization (util) is the sum of the following: user, the percentage of CPU time spent running user-mode code; system, the percentage of CPU time spent running kernel-mode code; nice, the percentage of CPU time spent running user-mode processes whose priority has been adjusted (the default value is 0); irq, the percentage of CPU time spent handling hardware interrupts; softirq, the percentage of CPU time spent handling software interrupts.	cpu_used_percent: 90%; cpu_steal: 20%	10 seconds	5 minutes

Trigger method	Alert level	Scope
Based on the expression of the metric	Critical	Server
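
As a rough illustration of the alert rule `cpu_used_percent > 90 and cpu_steal < 20`, the following Python sketch evaluates the condition for one pair of metric values. The names `Thresholds` and `should_alert` are hypothetical. The sketch covers only the threshold comparison and does not model the detection cycle or the elimination cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Default thresholds from the table above, in percent."""
    cpu_used_percent: float = 90.0
    cpu_steal: float = 20.0


def should_alert(cpu_used_percent, cpu_steal, thresholds=Thresholds()):
    """Return True when utilization is above its threshold and steal is below its own."""
    return (cpu_used_percent > thresholds.cpu_used_percent
            and cpu_steal < thresholds.cpu_steal)


if __name__ == "__main__":
    print(should_alert(90.186, 0.0))  # high utilization, low steal
    print(should_alert(95.0, 25.0))   # high utilization, but steal is not below 20
```

Both comparisons are strict, so a utilization of exactly 90% or a steal percentage of exactly 20% does not satisfy the condition.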

Overview
- Template: ${alarm_target} ${alarm_name}
- Example: svr_ip=xxx.xxx.xxx.xxx os_tsar_cpu_util_hwm
Details
- Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, CPU utilization ${cpu_used_percent_value_zh_cn} exceeds ${cpu_used_percent_alarm_threshold}%, percentage of CPU steal time ${cpu_steal_value_zh_cn} is lower than ${cpu_steal_alarm_threshold}%.
- Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_cpu_util_hwm, CPU utilization 90.186% exceeds 90%, percentage of CPU steal time 0% is lower than 20%.
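
The `${...}` placeholders in the Details template can be filled in as sketched below. This uses Python's `string.Template`, which happens to share the `${name}` syntax; it is only a stand-in for whatever rendering the alerting system performs, and the helper name `render_details` is hypothetical.

```python
from string import Template

# The Details template from above, copied verbatim.
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, "
    "CPU utilization ${cpu_used_percent_value_zh_cn} exceeds "
    "${cpu_used_percent_alarm_threshold}%, percentage of CPU steal time "
    "${cpu_steal_value_zh_cn} is lower than ${cpu_steal_alarm_threshold}%."
)


def render_details(values):
    """Substitute every placeholder; raises KeyError if a value is missing."""
    return DETAILS_TEMPLATE.substitute(values)


if __name__ == "__main__":
    # Values taken from the Details example above.
    print(render_details({
        "ob_cluster_name": "obcluster",
        "svr_ip": "xxx.xxx.xxx.xxx",
        "alarm_name": "os_tsar_cpu_util_hwm",
        "cpu_used_percent_value_zh_cn": "90.186%",
        "cpu_used_percent_alarm_threshold": "90",
        "cpu_steal_value_zh_cn": "0%",
        "cpu_steal_alarm_threshold": "20",
    }))
```

Note that in this sketch the measured values carry their own percent sign, whereas the thresholds do not, because the template appends `%` after each threshold placeholder.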

Sustained full CPU utilization seriously affects system stability and can even cause the system to stop running.

The percentage of CPU time spent running user-mode code is high. This usually means that application processes are consuming too much CPU. For example, a large number of concurrent requests are being handled or a large amount of data is being processed.
The percentage of CPU time spent running kernel-mode code is high. For example, a large number of system calls are initiated.
The percentage of CPU time spent handling software interrupts is high. For example, the host is under a network attack.
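
To tell which of these cases applies, it can help to break the CPU time down by mode. The sketch below is one possible way to do that on Linux, assuming two snapshots of the aggregate `cpu` line from `/proc/stat` taken a few seconds apart. The helper names are hypothetical, and the sample lines in the usage section are made-up values.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat (Linux).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def mode_breakdown(before_line, after_line):
    """Return the percentage of CPU time spent in each mode between two snapshots."""
    before, after = parse_cpu_line(before_line), parse_cpu_line(after_line)
    delta = {f: after[f] - before[f] for f in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    return {f: 100 * d / total for f, d in delta.items()}


def dominant_busy_mode(breakdown):
    """Return the non-idle, non-iowait mode with the largest share."""
    busy = {m: p for m, p in breakdown.items() if m not in ("idle", "iowait")}
    return max(busy, key=busy.get)


if __name__ == "__main__":
    # Made-up snapshots; on a real host, read the first line of /proc/stat twice.
    before = "cpu  1000 0 500 8000 100 10 40 0 0 0"
    after = "cpu  1700 0 600 8150 110 12 78 0 0 0"
    breakdown = mode_breakdown(before, after)
    print(breakdown)
    print(dominant_busy_mode(breakdown))
```

A dominant `user` share points to the application-load case, a dominant `system` share to the system-call case, and a dominant `softirq` share to the software-interrupt case described above.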