Description
This alert is triggered when CPU utilization (util + steal) exceeds 90%.
Principle
| Parameter | Value |
|---|---|
| Metric | cpu_used_percent, cpu_steal |
| Source | This metric is a basic host monitoring metric collected by node_exporter. The metric is equivalent to CPU utilization (util = user + system + nice + irq + softirq) in the top command output. |
| Collected metric | node_cpu_seconds_total |
| Metric expression | |
| Collection cycle | 1 second |
Alert rule
| Metric expression | Metric description | Default threshold (unit: percentage) | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| cpu_used_percent > 90 and cpu_steal < 20 | CPU steal: the percentage of time when a virtual machine is waiting on the physical CPU for its CPU time. The percentage can be higher than the limit. CPU utilization (util), consisting of the following: |
10 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Server |
Alert templates
- Overview
- Template: ${alarm_target} ${alarm_name}
- Example: svr_ip=xxx.xxx.xxx.xxx os_tsar_cpu_util_hwm
- Details
- Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, CPU utilization ${cpu_used_percent_value_zh_cn} exceeds ${cpu_used_percent_alarm_threshold}%, percentage of CPU steal time ${cpu_steal_value_zh_cn} is lower than ${cpu_steal_alarm_threshold}%.
- Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_cpu_util_hwm, CPU utilization 90.186% exceeds 90%, percentage of CPU steal time 0% is lower than 20%.
Impact on the system
Longtime full CPU utilization will cause serious impact on system stability, or even cause the system to stop running.
Possible causes
- The percentage of CPU time spent running user-mode code is high. This is generally because too many application processes exist. For example, a large number of concurrent requests are sent or a large amount of data is processed.
- The percentage of CPU time spent running kernel-mode code is high. For example, a large number system calls are initiated.
- The percentage of CPU time spent handling software interrupts is high. For example, network attacks occur.
Suggested solutions
- Run the
topcommand to check which CPU utilization metric is high. - If OBServer CPU utilization is high, check whether CPU-consuming operations such as major or minor compactions are performed. If CPU utilization is not as expected, contact the administrator.