Alert Description
This alert is triggered when the time spent on CPU soft interrupts on a specific CPU exceeds 1 minute. Soft interrupts should be distributed across multiple CPUs to prevent any single CPU from being monopolized by soft interrupts.
Alerting Principles
| Parameter | Value |
|---|---|
| Monitoring metrics | cpu_softirq_per_cpu, cpu_idle_per_cpu |
| Metric source | Host-level basic monitoring, collected from node_exporter. Similar to the CPU usage values shown by the top command, such as CPU-si (percentage of time spent on soft interrupts) and CPU-id (idle percentage). |
| Metrics collection | node_cpu_seconds_total_native |
| Monitoring Expression | |
| Sampling Interval | 1 Second |
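
To make the two metrics more concrete, the following is a minimal sketch of how per-CPU softirq and idle percentages can be derived from two snapshots of /proc/stat, which is the same kernel data that node_exporter and top read. The function name `cpu_softirq_idle_pct` and the snapshot file paths are hypothetical, and the exact computation behind cpu_softirq_per_cpu and cpu_idle_per_cpu is not given in this document, so treat this as an approximation.

```shell
#!/bin/sh
# cpu_softirq_idle_pct BEFORE AFTER   (hypothetical helper, not part of the product)
# BEFORE and AFTER are files holding two snapshots of /proc/stat.
# Prints one line per CPU: "<cpu> <softirq%> <idle%>".
cpu_softirq_idle_pct() {
    awk '
        # Only per-CPU lines (cpu0, cpu1, ...), not the aggregate "cpu" line.
        $1 ~ /^cpu[0-9]+$/ {
            total = 0
            # Fields 2..9: user nice system idle iowait irq softirq steal.
            for (i = 2; i <= 9; i++) total += $i
            if (FNR == NR) {            # first file: remember the baseline
                t0[$1] = total; s0[$1] = $8; i0[$1] = $5
            } else if ($1 in t0) {      # second file: compute the deltas
                dt = total - t0[$1]
                if (dt > 0)
                    printf "%s %.1f %.1f\n", substr($1, 4),
                        100 * ($8 - s0[$1]) / dt, 100 * ($5 - i0[$1]) / dt
            }
        }
    ' "$1" "$2"
}

# Example usage on a Linux host:
#   cat /proc/stat > /tmp/stat.before; sleep 1; cat /proc/stat > /tmp/stat.after
#   cpu_softirq_idle_pct /tmp/stat.before /tmp/stat.after
```

The percentages are computed over the interval between the two snapshots, so a longer interval smooths out short bursts.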
Rule information
| Monitoring Expression | Description | Default Threshold (in Percent) | Detection Cycle | Removal Cycle |
|---|---|---|---|---|
| cpu_softirq_per_cpu > 3 and cpu_idle_per_cpu < 6 | | | 10 seconds | 5 minutes |
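
As an illustration of the expression above, the sketch below applies the same two conditions to a list of per-CPU values. The function name `softirq_rule_hits` and the input format (`<cpu> <softirq%> <idle%>` per line) are assumptions made for this example. The defaults 3 and 6 are taken from the expression in the table.

```shell
#!/bin/sh
# softirq_rule_hits [SOFTIRQ_MIN] [IDLE_MAX]   (hypothetical helper)
# Reads "<cpu> <softirq%> <idle%>" lines on stdin and prints the CPUs
# for which softirq% > SOFTIRQ_MIN and idle% < IDLE_MAX.
softirq_rule_hits() {
    awk -v smin="${1:-3}" -v imax="${2:-6}" '
        # "+ 0" forces a numeric comparison instead of a string comparison.
        $2 + 0 > smin + 0 && $3 + 0 < imax + 0 { print "cpu=" $1 }
    '
}

# Example usage:
#   printf '0 95.0 5.0\n1 0.0 50.0\n' | softirq_rule_hits
```

Both conditions must hold for the same CPU, so a CPU that is busy with other work but handles few soft interrupts does not match.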
Alert Information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Expressions based on monitoring metrics | Urgent | Server |
Alert Template
- Alert Summary
  - Template: ${alarm_target} ${alarm_name}
  - Example: svr_ip=xxx.xxx.xxx.xxx:cpu=3 indicates that server CPU soft interrupts did not split.
- Alert Details
  - Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}. CPU ${cpu}, CPU SoftIRQ Usage ${cpu_softirq_per_cpu_value_zh_cn}, CPU Idle ${cpu_idle_per_cpu_value_zh_cn}.
  - Sample: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: Soft interrupts on CPU 3 of the server are not offloaded. The soft interrupt utilization on CPU 3 is 100 % and the CPU idle rate is 0 %.
- Alert resolved
  - Template: Alert: ${alarm_name}, the softirq utilization rate of each CPU on the server: ${cpu_softirq_per_cpu_value_zh_cn}, the idle rate of each CPU on the server: ${cpu_idle_per_cpu_value_zh_cn}
  - Sample: Alert: SoftIRQs on CPU cores are not offloaded. The usage rate of softIRQs on each CPU is 1%, and the idle rate of each CPU is 8%.
System impact
Interrupts are classified into hardware interrupts and software interrupts (softirqs).
Softirq types include TIMER, NET (rx/tx), TASKLET, SCHED, and RCU, among others. Network rx/tx softirqs are the most common. When softirq processing is concentrated on a single CPU, it can drive that CPU's usage very high.
Possible causes
The cause depends on the specific situation. For example:
- Invalid network requests. If the server has been taken offline but traffic is still being sent to it, a large number of network packets will be dropped.
- Network attacks, such as SYN FLOOD attacks.
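
One rough way to look for signs of a SYN flood is to count half-open connections. The sketch below counts sockets in the SYN_RECV state (state code 03) in a file that uses the /proc/net/tcp format. The function name `count_syn_recv` is hypothetical, and a high count is only a hint: for example, when SYN cookies are active the kernel keeps little per-connection state, so the count can stay low during an attack.

```shell
#!/bin/sh
# count_syn_recv [FILE]   (hypothetical helper)
# Counts sockets in state 03 (SYN_RECV) in a file in /proc/net/tcp format.
# Defaults to /proc/net/tcp (IPv4); pass /proc/net/tcp6 for IPv6.
count_syn_recv() {
    # Line 1 is the column header; field 4 ("st") is the hex socket state.
    awk 'NR > 1 && $4 == "03" { n++ } END { print n + 0 }' "${1:-/proc/net/tcp}"
}

# Example usage on a Linux host:
#   count_syn_recv
#   count_syn_recv /proc/net/tcp6
```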
Procedure
Compare successive readings of /proc/softirqs to identify which type of soft interrupt is increasing:

    watch -n 5 'cat /proc/softirqs'

If the increase comes from network soft interrupts (NET_RX or NET_TX), you can use sar and tcpdump for further analysis.
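
Comparing counters by eye is tedious on hosts with many CPUs. As a sketch of the same comparison, the function below takes two saved snapshots of /proc/softirqs and prints the counters that grew, largest first. The function name `softirq_delta` and the snapshot paths are hypothetical.

```shell
#!/bin/sh
# softirq_delta BEFORE AFTER   (hypothetical helper)
# BEFORE and AFTER are files holding two snapshots of /proc/softirqs.
# Prints "<type> <cpu> <increase>" for every counter that grew,
# sorted by increase in descending order.
softirq_delta() {
    awk '
        FNR == 1 {                      # header row: "CPU0 CPU1 ..."
            for (i = 1; i <= NF; i++) cpu[i + 1] = $i
            next
        }
        FNR == NR {                     # first snapshot: store the counters
            for (i = 2; i <= NF; i++) base[$1, i] = $i
            next
        }
        {                               # second snapshot: print the growth
            name = $1
            sub(/:$/, "", name)         # "NET_RX:" becomes "NET_RX"
            for (i = 2; i <= NF; i++) {
                d = $i - base[$1, i]
                if (d > 0) printf "%s %s %d\n", name, cpu[i], d
            }
        }
    ' "$1" "$2" | sort -k3,3nr
}

# Example usage on a Linux host:
#   cat /proc/softirqs > /tmp/softirqs.before; sleep 5
#   cat /proc/softirqs > /tmp/softirqs.after
#   softirq_delta /tmp/softirqs.before /tmp/softirqs.after | head
```

If one type, such as NET_RX, dominates the top of the output and always appears with the same CPU, that matches the condition this alert describes.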