Description
This alert is triggered when CPU utilization (util + steal) exceeds 98%.
Principle
| Parameter | Value |
|---|---|
| Metric | cpu_used_percent |
| Source | This metric is a basic host monitoring metric collected by node_exporter. The metric is equivalent to CPU utilization (util = user + system + nice + irq + softirq) in the top command output. |
| Collected metric | node_cpu_seconds_total |
| Metric expression | 100 * (1 - sum(rate(node_cpu_seconds_total{mode=~"idle|iowait",@LABELS}[@INTERVAL])) by (@GBLABELS)) |
| Collection cycle | 1 second |
Alert rule
| Metric expression | Metric description | Default threshold (unit: percentage) | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| cpu_used_percent >= 98 | CPU utilization, which consists of the following: |
98% | 10 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Server |
Alert templates
- Overview
- Template: ${alarm_target} ${alarm_name}
- Example: svr_ip=xxx.xxx.xxx.xxx os_tsar_cpu_util_full
- Details
- Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, CPU utilization ${value_shown} exceeds ${alarm_threshold}%.
- Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_cpu_util_full, CPU utilization 99.178% exceeds 98%.
Impact on the system
Longtime full CPU utilization will cause serious impact on system stability, or even cause the system to stop running.
Possible causes
- The percentage of CPU time spent running user-mode code is high. This is generally because too many application processes exist. For example, a large number of concurrent requests are sent or a large amount of data is processed.
- The percentage of CPU time spent running kernel-mode code is high. For example, a large number system calls are initiated.
- The percentage of CPU time spent handling software interrupts is high. For example, network attacks occur.
Suggested solutions
- Run the
topcommand to check which CPU utilization metric is high. - If OBServer CPU utilization is high, check whether CPU-consuming operations such as major or minor compactions are performed. If CPU utilization is not as expected, contact the administrator.