Description
This alert is triggered when a single CPU has been handling software interrupts for more than 1 minute. Software interrupts should be distributed across multiple CPUs so that no single CPU is monopolized.
Principle
| Parameter | Value |
|---|---|
| Metric | cpu_softirq_per_cpu, cpu_idle_per_cpu |
| Source | The metrics are basic host monitoring metrics collected by node_exporter. cpu_softirq_per_cpu is equivalent to Cpu-si (percentage of CPU time spent handling software interrupts) in the top command output and cpu_idle_per_cpu is equivalent to Cpu-id (percentage of CPU time spent idle) in the top command output. |
| Collected metric | node_cpu_seconds_total_native |
| Metric expression | |
| Collection cycle | 1 second |
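As a rough illustration of how per-CPU percentages like these can be derived, the sketch below computes the softirq and idle share of CPU time from two samples of the cumulative counters in /proc/stat. It assumes the standard Linux /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical and this is not the collector's actual implementation.

```python
import time

# Column positions in a /proc/stat "cpuN" line, after the CPU name.
IDLE, SOFTIRQ = 3, 6


def parse_proc_stat(text):
    """Return {cpu_name: [counter, ...]} for per-CPU lines (cpu0, cpu1, ...)."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line, non-CPU lines, and short lines.
        if len(fields) < 9 or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        samples[fields[0]] = [int(v) for v in fields[1:]]
    return samples


def softirq_and_idle_percent(before, after):
    """Return {cpu_name: (softirq_pct, idle_pct)} over the sampling interval."""
    result = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        deltas = [n - o for n, o in zip(new, old)]
        # Sum the first 8 fields only: guest time is already counted in user/nice.
        total = sum(deltas[:8])
        if total <= 0:
            continue
        result[cpu] = (100.0 * deltas[SOFTIRQ] / total, 100.0 * deltas[IDLE] / total)
    return result


def read_live(interval=1.0):
    """Sample /proc/stat twice on a Linux host and return the percentages."""
    with open("/proc/stat") as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_proc_stat(f.read())
    return softirq_and_idle_percent(before, after)
```

Calling `read_live()` on a Linux host returns one `(softirq_pct, idle_pct)` pair per CPU, which can be compared with the si and id values shown by top.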
Alert rule
| Metric expression | Metric description | Default threshold (unit: percentage) | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| cpu_softirq_per_cpu > 3 and cpu_idle_per_cpu < 6 | | | 10 seconds | 5 minutes |
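The rule expression can be read as a per-CPU check that both conditions hold at the same time. A minimal sketch, assuming the values are percentages keyed by CPU index; the function name and input shape are hypothetical, and the defaults simply mirror the numbers in the expression above:

```python
def cpus_breaching(per_cpu, softirq_threshold=3.0, idle_threshold=6.0):
    """Return the sorted CPU indexes where softirq is above and idle is below the thresholds.

    per_cpu maps a CPU index to a (softirq_pct, idle_pct) tuple.
    """
    return sorted(
        cpu
        for cpu, (softirq_pct, idle_pct) in per_cpu.items()
        if softirq_pct > softirq_threshold and idle_pct < idle_threshold
    )
```

For example, a CPU at 100% softirq and 0% idle matches, while a CPU at 50% softirq and 40% idle does not, because it still has idle headroom.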
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Server |
Alert templates
- Overview
  - Template: ${alarm_target} ${alarm_name}
  - Example: svr_ip=xxx.xxx.xxx.xxx:cpu=3 os_cpu_irq_error
- Details
  - Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, CPU ${cpu}, percentage of CPU time spent handling software interrupts ${cpu_softirq_per_cpu_value_zh_cn}, percentage of CPU time spent idle ${cpu_idle_per_cpu_value_zh_cn}.
  - Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_cpu_irq_error, CPU 3, percentage of CPU time spent handling software interrupts 100%, percentage of CPU time spent idle 0%.
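To show how the `${...}` placeholders map to the example messages, the sketch below renders both templates with Python's `string.Template`, which happens to use the same placeholder syntax. This is only an illustration; how the product actually renders its templates is not stated here, and the function names are hypothetical.

```python
from string import Template

OVERVIEW = Template("${alarm_target} ${alarm_name}")
DETAILS = Template(
    "Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, "
    "CPU ${cpu}, percentage of CPU time spent handling software interrupts "
    "${cpu_softirq_per_cpu_value_zh_cn}, percentage of CPU time spent idle "
    "${cpu_idle_per_cpu_value_zh_cn}."
)


def render_overview(alarm_target, alarm_name):
    """Fill the overview template; raises KeyError if a placeholder is missing."""
    return OVERVIEW.substitute(alarm_target=alarm_target, alarm_name=alarm_name)


def render_details(values):
    """Fill the details template from a dict of placeholder values."""
    return DETAILS.substitute(values)
```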
Impact on the system
Interrupts are divided into hardware interrupts and software interrupts.
Software interrupts include TIMER, NET_RX, NET_TX, TASKLET, SCHED, RCU, and others. Common software interrupts occur during network packet receiving and transmission. Software interrupts may increase CPU utilization.
Possible causes
Check for the following possible causes:
- Invalid network requests. If a server is offline but traffic is still forwarded to the server, a large number of network packet receiving interrupts occur.
- Network attacks, such as a SYN flood attack.
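One quick hint for the SYN flood case is the number of half-open connections on the host. The sketch below counts entries in the SYN_RECV state (hexadecimal state code 03) in text read from /proc/net/tcp. It assumes the standard Linux layout where the state is the fourth whitespace-separated column; the function name is hypothetical, and a high count is only an indication, since SYN cookies and backlog limits can cap it.

```python
SYN_RECV = "03"  # hexadecimal TCP state code for SYN_RECV in /proc/net/tcp


def count_syn_recv(proc_net_tcp_text):
    """Count sockets in SYN_RECV state in the contents of /proc/net/tcp."""
    count = 0
    # The first line is the column header.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) > 3 and fields[3] == SYN_RECV:
            count += 1
    return count
```

On a Linux host this could be called as `count_syn_recv(open("/proc/net/tcp").read())` and sampled a few times to see whether the count stays unusually high.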
Suggested solutions
Check for changes in /proc/softirqs to identify which type of software interrupt is increasing.
watch -n 5 'cat /proc/softirqs'
For a network interrupt, use System Activity Reporter (sar) or tcpdump for further analysis.
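Reading raw counters by eye can be tedious on hosts with many CPUs. As a sketch, the helper below compares two snapshots of /proc/softirqs and lists the interrupt type and CPU pairs whose counters grew the most. It assumes the usual layout (a header row of CPU names followed by one `NAME: count count ...` row per type); the function names are hypothetical.

```python
def parse_softirqs(text):
    """Return {softirq_name: {cpu_name: count}} from /proc/softirqs contents."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        counts[name.strip()] = dict(zip(cpus, values))
    return counts


def top_growth(before, after, limit=3):
    """Return the (delta, softirq_name, cpu_name) tuples with the largest increase."""
    growth = []
    for name, per_cpu in after.items():
        for cpu, value in per_cpu.items():
            delta = value - before.get(name, {}).get(cpu, 0)
            growth.append((delta, name, cpu))
    growth.sort(reverse=True)
    return growth[:limit]
```

If NET_RX on a single CPU dominates the result, that points to the network receive path, which can then be examined with sar or tcpdump.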