os_tsar_cpu_util_hwm

2024-07-30 09:31:29  Updated

Description

This alert is triggered when CPU utilization (util + steal) exceeds 90%.

Principle

Parameter Value
Metric cpu_used_percent, cpu_steal
Source This metric is a basic host monitoring metric collected by node_exporter.
The metric is equivalent to CPU utilization (util = user + system + nice + irq + softirq) in the top command output.
Collected metric node_cpu_seconds_total
Metric expression
  • cpu_used_percent: 100 * (1 - sum(rate(node_cpu_seconds_total{mode=~“idle|iowait”,@LABELS}[@INTERVAL])) by (@GBLABELS))
  • cpu_steal: 100 * sum(rate(node_cpu_seconds_total{mode=“steal”, @LABELS}[@INTERVAL])) by (@GBLABELS)
  • Collection cycle 1 second

    Alert rule

    Metric expression Metric description Default threshold (unit: percentage) Detection cycle Elimination cycle
    cpu_used_percent > 90 and cpu_steal < 20 CPU steal: the percentage of time when a virtual machine is waiting on the physical CPU for its CPU time. The percentage can be higher than the limit.
    CPU utilization (util), consisting of the following:
  • user: the percentage of CPU time spent running user-mode code.
  • system: the percentage of CPU time spent running kernel-mode code.
  • nice: the percentage of CPU time spent changing process priority. The default value is 0.
  • irq: the percentage of CPU time spent handling hardware interrupts.
  • softirq: the percentage of CPU time spent handling software interrupts.
  • cpu_used_percent: 90%
  • cpu_steal: 20%
  • 10 seconds 5 minutes

    Alert information

    Trigger method Alert level Scope
    Based on the expression of the metric Critical Server

    Alert templates

    • Overview
      • Template: ${alarm_target} ${alarm_name}
      • Example: svr_ip=xxx.xxx.xxx.xxx os_tsar_cpu_util_hwm
    • Details
      • Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: ${alarm_name}, CPU utilization ${cpu_used_percent_value_zh_cn} exceeds ${cpu_used_percent_alarm_threshold}%, percentage of CPU steal time ${cpu_steal_value_zh_cn} is lower than ${cpu_steal_alarm_threshold}%.
      • Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: os_tsar_cpu_util_hwm, CPU utilization 90.186% exceeds 90%, percentage of CPU steal time 0% is lower than 20%.

    Impact on the system

    Longtime full CPU utilization will cause serious impact on system stability, or even cause the system to stop running.

    Possible causes

    • The percentage of CPU time spent running user-mode code is high. This is generally because too many application processes exist. For example, a large number of concurrent requests are sent or a large amount of data is processed.
    • The percentage of CPU time spent running kernel-mode code is high. For example, a large number system calls are initiated.
    • The percentage of CPU time spent handling software interrupts is high. For example, network attacks occur.

    Suggested solutions

    1. Run the top command to check which CPU utilization metric is high.
    2. If OBServer CPU utilization is high, check whether CPU-consuming operations such as major or minor compactions are performed. If CPU utilization is not as expected, contact the administrator.

    Contact Us