Alert description
This alert indicates that the CPU utilization of an OCP node has exceeded the threshold.
Alert principle
The following table lists the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring Metrics | ocp_system_cpu_usage: the CPU utilization of the OCP node. An alert is triggered when the utilization exceeds the threshold. |
| Monitoring Expression | 100 * sum(system_cpu_usage{@LABELS}) by (@GBLABELS) |
| Metric Collection | system_cpu_usage |
| Metric Source | The OCP process uses the spring-boot-starter-actuator component to collect CPU data. This metric is typically collected using tools or libraries provided by the operating system. For example, on Linux, you can run the `uptime` command or read the `/proc/stat` file to obtain it. |
| Collection Cycle | 5 Seconds |
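
To make the monitoring expression concrete, the following minimal Python sketch mimics what `100 * sum(system_cpu_usage{@LABELS}) by (@GBLABELS)` computes: filter samples by label, group them, sum each group, and scale to a percentage. It assumes that `system_cpu_usage` is reported as a fraction between 0 and 1, which is how Spring Boot Actuator exposes it. The function name `evaluate`, the label names, and the sample values are hypothetical and are not part of OCP.

```python
from collections import defaultdict


def evaluate(samples, label_filter, group_by):
    """Mimic 100 * sum(metric{label_filter}) by (group_by) on in-memory samples.

    samples: list of (labels_dict, value) pairs, value being a 0..1 fraction.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # Keep only samples whose labels match every filter entry (the @LABELS part).
        if all(labels.get(k) == v for k, v in label_filter.items()):
            # Build the grouping key from the selected labels (the @GBLABELS part).
            key = tuple(labels.get(name) for name in group_by)
            totals[key] += value
    # Convert each summed fraction to a percentage.
    return {key: 100 * total for key, total in totals.items()}


# Hypothetical samples for two OCP nodes.
samples = [
    ({"svr_ip": "10.0.0.1", "svr_port": "8080"}, 0.75),
    ({"svr_ip": "10.0.0.2", "svr_port": "8080"}, 0.25),
]
result = evaluate(samples, {"svr_port": "8080"}, ["svr_ip", "svr_port"])
```

Each entry in `result` is the CPU utilization, in percent, of one node. This is the value that is compared with the alert threshold.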
Rule information
| Monitoring Metrics | Default Threshold (Unit: %) | Duration | Detection Cycle | Elimination Cycle |
|---|---|---|---|---|
| ocp_system_cpu_usage | This metric has two default thresholds: | 60 Seconds | 10 Seconds | 5 Minutes |
Alert information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Based on monitoring metric expressions | | service |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:svr_ip=xx.xx.xx.xx:svr_port=8080 OCP Node CPU Usage Exceeds Threshold
Alert Details
- Template: Alert: ${alarm_name}, CPU percentage ${value_shown} exceeds ${alarm_threshold} %.
- Example: Alert: OCP Node CPU Usage Exceeds Threshold, CPU Percentage 99% Exceeds 95%.
Alert Recovery
- Template: Alert: ${alarm_name}, OCP Node CPU Usage Exceeds Threshold: ${value_shown}
- Example: Alert: OCP Node CPU Usage Exceeds Threshold, OCP Node CPU Usage Exceeds Threshold: 10 %
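
The templates above use `${name}` placeholders. As an illustration of how such placeholders expand, the following sketch renders the overview and details templates with Python's standard `string.Template` class, which happens to use the same placeholder syntax. OCP uses its own rendering logic, so this is not OCP's implementation, and the substituted values are taken from the examples above or are hypothetical.

```python
from string import Template

# Templates copied from the alert overview and alert details above.
OVERVIEW = Template("${alarm_target} ${alarm_name}")
DETAILS = Template(
    "Alert: ${alarm_name}, CPU percentage ${value_shown} exceeds ${alarm_threshold} %."
)

# Illustrative values; a real alert fills these in from the triggering sample.
values = {
    "alarm_target": "alarm_template_id=0:svr_ip=xx.xx.xx.xx:svr_port=8080",
    "alarm_name": "OCP Node CPU Usage Exceeds Threshold",
    "value_shown": "99%",
    "alarm_threshold": "95",
}

# substitute() raises KeyError if a placeholder has no value, which helps
# catch a misspelled variable name early.
overview = OVERVIEW.substitute(values)
details = DETAILS.substitute(values)
```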
Impact on the system
When the CPU usage of an OCP node exceeds a certain limit, the system may take longer to process other normal requests, which may cause request failures and affect user experience.
Possible causes
- CPU utilization spikes due to OCP scale-in.
- Certain parameters were modified in parameter management, causing OCP monitoring collection, alert detection, backup and recovery, or cluster operations and maintenance to consume excessive CPU.
Solution
- Log in to OCP, and choose System Management > Platform Monitoring from the left-side navigation pane to view the performance monitoring and HTTP request monitoring of the OCP platform. Observe whether related performance metrics such as memory, disk, and system load are normal.
- On the OCP host, run the `top` command to check the CPU utilization. You can try to locate the issue at the machine or network level.
- Add CPU resources for OCP. After restarting OCP, observe whether the issue is resolved.
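
As a scripted complement to checking CPU utilization on the host, the following sketch derives overall CPU utilization from two snapshots of `/proc/stat` on Linux. It is not how OCP collects the metric. The helper names `parse_cpu_line`, `cpu_percent`, and `sample_host` are hypothetical, and idle time is assumed to be the sum of the `idle` and `iowait` counters.

```python
import time


def parse_cpu_line(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Columns: user nice system idle iowait irq softirq steal.
            # The guest columns are already counted in user/nice, so skip them.
            counters = [int(x) for x in fields[1:9]]
            idle = counters[3] + (counters[4] if len(counters) > 4 else 0)
            total = sum(counters)
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percent(before, after):
    """Compute CPU utilization (%) between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def sample_host(interval=1.0):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.read()
    return cpu_percent(before, after)
```

Calling `sample_host()` on the OCP host returns the utilization over the sampling interval. If the value stays above the alert threshold across repeated samples, use `top` to find the processes that consume the most CPU.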
