Alert description
OCP-Server collects monitoring data from the exporter of OCP-Agent. This alert is triggered when the collection failure rate exceeds the threshold. The collection failure rate is the number of failed collections divided by the total number of collections, expressed as a percentage.
Alert principle
| Parameter | Value |
|---|---|
| Monitoring metric | collect_metric_failure_rate |
| Source of the metric | OCP self-monitoring. You can request http://OCP-IP:8080/api/v2/actuator/prometheus to view it. |
| Collected metric | ocp_monitor_collect_request_errors_total, ocp_monitor_collect_request_duration_ms_count |
| Monitoring expression | 100 * sum(rate(ocp_monitor_collect_request_errors_total{app="OCP"}[60])) by (svr_ip) / sum(rate(ocp_monitor_collect_request_duration_ms_count{app="OCP"}[60])) by (svr_ip) |
| Collection cycle | 60 seconds |
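The monitoring expression above is a ratio of two counters measured over the same window. The following is a minimal sketch of that arithmetic, assuming you have already read the increase of both counters over one window (for example, from the self-monitoring endpoint listed in the table). The function name `failure_rate` is hypothetical and is not part of OCP.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCP): compute the collection failure rate
# as a percentage from the increase of the two counters over the same window.
#   $1 = increase of ocp_monitor_collect_request_errors_total
#   $2 = increase of ocp_monitor_collect_request_duration_ms_count
failure_rate() {
  awk -v errors="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "n/a"; exit }   # no collections in the window
    printf "%.2f\n", 100 * errors / total
  }'
}

# Example: 5 failed collections out of 40 in the window.
failure_rate 5 40
```

With these sample numbers the result is above the default threshold of 10, so the rule would match.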
Rule information
| Monitoring expression | Meaning of the monitoring metric | Default threshold | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| collect_metric_failure_rate > 10 | Collection error rate of the exporter | 10 | 15 seconds | 5 minutes |
Alert information
| Alert triggering method | Alert level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Severe | Service |
Alert template
Alert summary
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:app=OCP OCP-Server collection failure rate is too high
Alert details
- Template: Alert: ${alarm_name}. The collection failure rate is: ${value_shown}.
- Example: Alert: OCP-Server collection failure rate is too high. The collection failure rate is: 75.757 %.
Alert recovery
- Template: Alert: ${alarm_name}, collection failure rate: ${value_shown}
- Example: Alert: OCP-Server collection failure rate is too high, collection failure rate: 5 %
Impact on the system
Monitoring services such as GUI monitoring, alerting, diagnostics, and inspection may be affected, including:
- The GUI monitoring is interrupted or no monitoring data is displayed.
- Alerts are not triggered when an exception occurs.
- The diagnostic report and inspection report cannot locate the cause of the exception.
Possible causes
The resources of the MetaDB tenant of OCP are insufficient, which causes frequent full GC in OCP. You can refer to the documentation for troubleshooting and appropriately scale out the resources. For more information, see MonitorDB resources.
The ocp_monagent process of OCP-Agent is abnormal, so requests to its exporter (API) fail. Check the error information in /home/admin/ocp_agent/log/monagent.log for confirmation.
Check the OCP-Server log (ocp.log) to locate the cause of the error. For example, if the network is unstable, an access timeout error may be reported (a request is treated as timed out if the API does not respond within 1 second).
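The log checks described above can be sketched as a small filter. This is only an illustration: the search pattern `error|timeout` is an assumption about how failures appear in the log, not a documented log format, and `recent_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCP): print the last N lines of a log file
# that mention "error" or "timeout" (case-insensitive).
#   $1 = log file path
#   $2 = maximum number of lines to print (default: 20)
recent_errors() {
  grep -iE 'error|timeout' "$1" | tail -n "${2:-20}"
}

# Example usage (path taken from the text above):
# recent_errors /home/admin/ocp_agent/log/monagent.log 50
```

The same function can be pointed at ocp.log on the OCP-Server host; adjust the pattern if the actual messages use different wording.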
OCP-Agent collects an excessive amount of data, which causes the API to time out. Run the following commands on the host whose monitoring data is missing to view the number of rows of monitoring data returned. If the number is excessive (more than 10,000 rows), upgrade OCP to V4.2.0 or later.
sudo curl -s --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$(cat /home/admin/ocp_agent/run/ocp_monagent.pid).sock http://unix-socket-server/metrics/ob/basic | wc -l
sudo curl -s --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$(cat /home/admin/ocp_agent/run/ocp_monagent.pid).sock http://unix-socket-server/metrics/ob/extra | wc -l
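To make the comparison with the 10,000-row guideline explicit, the row count from either command can be passed to a small check. This is a minimal sketch; `check_metric_rows` is a hypothetical helper name and is not part of OCP or OCP-Agent.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCP): compare a row count with the
# 10,000-row guideline mentioned above.
#   $1 = number of rows reported by one of the "curl ... | wc -l" commands
check_metric_rows() {
  rows=$(printf '%s' "$1" | tr -d '[:space:]')   # tolerate padded wc output
  if [ "$rows" -gt 10000 ]; then
    echo "excessive: $rows rows"
  else
    echo "ok: $rows rows"
  fi
}

# Example with a literal count; in practice, pass the output of the commands above.
check_metric_rows 12500
```

Run the check once for the basic endpoint and once for the extra endpoint, since either one can exceed the guideline on its own.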
Resolution
- If the issue is due to insufficient resources, you need to scale out the ocp_monitor and ocp_meta tenants.
- If the issue is due to OCP-Agent, it is often accompanied by other alerts such as monitor_exporter_unavaliable, monagent_log_alarm, obagent_dead, and host_unavailable. You can view the suggested solutions for these alerts and, in emergencies, try restarting or reinstalling OCP-Agent.
- Upgrade the OCP-Agent version. In OCP V4.2.1, zero-value data in diagnostic information is filtered out. This data may account for a large proportion and significantly impact collection performance.