Description
When the average number of errors per second of an OCP HTTP URI exceeds the threshold, this alert is triggered.
Principle
| Parameter | Value |
|---|---|
| Metric | ocp_http_request_error_count |
| Source | OCP server tracking data |
| Collected metric | ocp_http_request_error_count |
| Metric expression | sum(rate(http_server_requests_seconds_count{status=~"4..|5..",@LABELS}[@INTERVAL])) by (@GBLABELS)\ |
| Collection cycle | 60 seconds |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Warning | OCP node |
Alert rule
| Metric | Default threshold | Source | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| ocp_http_request_error_count | 1 | OCP server tracking data | 60 seconds | 5 minutes |
Alert template
- Alert overview
- Template: ${alarm_target} ${alarm_name}
- Sample: alarm_template_id=0:svr_ip=xxx.xxx.xxx.xxx:status=500:uri=root:method=GET ocp_http_request_too_many_errors_occur
- Alert details
- Template: [${alarm_name}] OCP server: ${svr_ip}, an average of ${value_shown} HTTP requests failed per second, uri=${uri}, method=${method}
- Sample: [ocp_http_request_too_many_errors_occur] OCP server: xxx.xxx.xxx.xxx, an average of 1.2 HTTP requests failed per second, uri=root, method=GET
Impact on the system
If too many errors occur on an OCP URI, this URI may have been disabled.
Possible causes
- The network is unstable.
- The system status is abnormal, such as insufficient memory or full CPU utilization.
- An external request attack is initiated.
- A call error occurs in internal requests.
Solutions
Based on the request URI in the alert details, check whether too many call errors have occurred due to configuration errors. If yes, correct the configuration errors.
Manual operations will not frequently cause so many error requests. If this alert is triggered, provide related logs and alert records to the administrator for troubleshooting.