ob_host_cpu_percent_over_threshold

2025-03-28 08:05:55  Updated

Description

This alert is triggered when CPU utilization of the OBServer exceeds the threshold.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter Value
Metric ob_host_cpu_percent
Source The data is collected by using the node_exporter process.
Collected metric node_cpu_seconds_total
Metric expression 100 * (1 - sum(rate(node_cpu_seconds_total{mode="idle", @LABELS}[@INTERVAL]) by (@GBLABELS)) by (@GBLABELS) / sum(rate(node_cpu_seconds_total{@LABELS}[@INTERVAL]) by (@GBLABELS)) by (@GBLABELS))
Note
Data in the metric expression is identified based on labels. The following labels are used:
  • user
  • idle
  • system
  • iowait
  • irq
  • nice
  • softirq
  • steal
  • The idle label indicates the idle time of the CPU, and the sum of the time of the CPU in other states is referred to as the CPU running time.
    Collection cycle 1 second

    The value of the metric ob_host_cpu_percent indicates the CPU utilization of the OBServer. When the CPU utilization exceeds the threshold, this alert is triggered. The default threshold is 100%.

    Alert rule

    Metric Default threshold (unit: %) Duration Detection cycle Time before clearance
    ob_host_cpu_percent 100 60 seconds 60 seconds 5 minutes

    Alert information

    Trigger method Alert level Scope
    Metric expression Critical Server

    Alert templates

    • Overview: ${alarm_target} ${alarm_name}

    • Details: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: The CPU utilization is ${value}%, exceeding the threshold of ${alarm_threshold}%.

    • Overview example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The CPU utilization of the OBServer exceeds the threshold.

    • Details example: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: The CPU utilization is 101.0%, exceeding the threshold of 100.0%.

    Impact on the system

    Transient high CPU utilization has little impact on the system. However, lasting high CPU utilization reduces the system throughput and increases the request latency.

    Possible causes

    This problem is commonly found in the following scenarios:

    • The OBServer executes complex SQL statements.

    • Other applications running on the host use too many CPU resources.

    Suggested solutions

    1. Check whether the alert is caused by high CPU utilization of the observer process.

      Run the top command on the OBServer where the alert is triggered to check for processes that use too many CPU resources.

      • If the observer process uses too many CPU resources, the following alerts may be simultaneously triggered:

        Resolve the preceding two alerts based on the suggested solutions provided in the respective topics, and then check whether the ob_host_cpu_percent_over_threshold alert persists.

        • If the alert persists, go to the next step.

        • If the alert is cleared, the issue is fixed.

      • If other processes use too many CPU resources, go to the next step.

    2. Contact the database administrator or an O&M engineer to disable processes irrelevant to the application.

    Contact Us