ocp_collect_metric_failure_rate_high OCP-Server collects metrics with a high failure rate|V4.3.6| docs|Distributed Database

ocp_collect_metric_failure_rate_high OCP-Server collects metrics with a high failure rate

Last Updated: 2025-09-08 08:15:43

Alert description

OCP-Server collects metrics from the exporter of OCP-Agent. This alert is triggered when the collection failure rate exceeds the threshold. The collection failure rate is the number of failed collections divided by the total number of collections.

Alert principle

Parameter	Value
Monitoring metric	collect_metric_failure_rate
Source of the metric	OCP self-monitoring. You can request `http://OCP-IP:8080/api/v2/actuator/prometheus` to view it.
Collected metric	ocp_monitor_collect_request_errors_total, ocp_monitor_collect_request_duration_ms_count
Monitoring expression	100 * sum(rate(ocp_monitor_collect_request_errors_total{app="OCP"}[60])) by (svr_ip) / sum(rate(ocp_monitor_collect_request_duration_ms_count{app="OCP"}[60])) by (svr_ip)
Collection cycle	60 seconds
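
As a rough illustration of the monitoring expression, the following sketch computes a per-host failure rate from two snapshots of the self-monitoring output. It assumes the endpoint returns the standard Prometheus text format and that each sample carries the `app` and `svr_ip` labels used in the expression. The helper names (`parse_counters`, `failure_rate`) are hypothetical. It divides the raw counter increases instead of calling `rate()`, which yields the same ratio because both counters cover the same interval.

```python
import re
from collections import defaultdict

ERRORS = "ocp_monitor_collect_request_errors_total"
TOTAL = "ocp_monitor_collect_request_duration_ms_count"

# Matches sample lines such as: metric_name{label="v",...} 12.0
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_counters(text, app="OCP"):
    """Sum the two collection counters by svr_ip from Prometheus text output."""
    sums = {ERRORS: defaultdict(float), TOTAL: defaultdict(float)}
    for line in text.splitlines():
        m = SAMPLE_RE.match(line.strip())
        if not m or m.group(1) not in sums:
            continue  # comment line, unlabeled sample, or unrelated metric
        labels = dict(LABEL_RE.findall(m.group(2)))
        if labels.get("app") != app:
            continue
        sums[m.group(1)][labels.get("svr_ip", "")] += float(m.group(3))
    return sums


def failure_rate(before, after):
    """Return {svr_ip: failure rate in percent} between two snapshots."""
    rates = {}
    for ip, total_after in after[TOTAL].items():
        total_delta = total_after - before[TOTAL].get(ip, 0.0)
        if total_delta <= 0:
            continue  # no collections in the window, or the counter was reset
        error_delta = after[ERRORS].get(ip, 0.0) - before[ERRORS].get(ip, 0.0)
        rates[ip] = 100.0 * error_delta / total_delta
    return rates
```

To use it, save the endpoint output twice, about one collection cycle apart, and pass both texts through `parse_counters` before calling `failure_rate`. A result above 10 for a host corresponds to the default threshold.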

Rule information

Monitoring expression	Meaning of the monitoring metric	Default threshold	Detection cycle	Elimination cycle
collect_metric_failure_rate > 10	Collection error rate of the exporter	10	15 seconds	5 minutes

Alert information

Alert triggering method	Alert level	Scope
Based on the expression of the monitoring metric	Severe	Service

Alert template

Alert summary
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:app=OCP OCP-Server collection failure rate is too high
Alert details
- Template: Alert: ${alarm_name}. The collection failure rate is: ${value_shown}.
- Example: Alert: OCP-Server collection failure rate is too high. The collection failure rate is: 75.757 %.
Alert recovery
- Template: Alert: ${alarm_name}, collection failure rate: ${value_shown}
- Example: Alert: OCP-Server collection failure rate is too high, collection failure rate: 5 %
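
The `${...}` placeholders in the templates above can be tried out with Python's `string.Template`, which happens to use the same placeholder syntax. This is only a sketch for previewing the rendered text, not how OCP renders alerts internally. The `render` helper is hypothetical, and the values are taken from the examples above.

```python
from string import Template

# Templates copied from the alert details and alert recovery entries.
DETAILS = Template(
    "Alert: ${alarm_name}. The collection failure rate is: ${value_shown}."
)
RECOVERY = Template("Alert: ${alarm_name}, collection failure rate: ${value_shown}")


def render(template, **variables):
    """Fill in the ${...} placeholders; raises KeyError if one is missing."""
    return template.substitute(**variables)


details = render(
    DETAILS,
    alarm_name="OCP-Server collection failure rate is too high",
    value_shown="75.757 %",
)
recovery = render(
    RECOVERY,
    alarm_name="OCP-Server collection failure rate is too high",
    value_shown="5 %",
)
print(details)
print(recovery)
```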

Impact on the system

Monitoring services such as GUI monitoring, alerting, diagnostics, and inspection may be affected, including:

GUI monitoring is interrupted, or no monitoring data is displayed.
Alerts cannot be triggered when exceptions occur.
Diagnostic and inspection reports cannot locate the cause of an exception.

Possible causes

The MetaDB tenant of OCP has insufficient resources, which causes frequent full GC in OCP. Refer to the documentation to troubleshoot, and scale out the resources as needed. For more information, see MonitorDB resources.
The ocp_monagent process of OCP-Agent is abnormal, so requests to its exporter (API) fail. Check the error information in /home/admin/ocp_agent/log/monagent.log to confirm.
Other errors are recorded in the OCP-Server log (ocp.log); check it to locate the cause. For example, an unstable network may cause access timeout errors (a request is treated as timed out if the API does not respond within 1 second).

OCP-Agent collects too much data, which causes API requests to time out. Run the following commands on the host that has no monitoring data to check the number of rows of monitoring data returned. If the number is excessive (more than 10,000 rows), upgrade OCP to V4.2.0 or later.

sudo curl -s --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$(cat /home/admin/ocp_agent/run/ocp_monagent.pid).sock http://unix-socket-server/metrics/ob/basic | wc -l
sudo curl -s --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$(cat /home/admin/ocp_agent/run/ocp_monagent.pid).sock http://unix-socket-server/metrics/ob/extra | wc -l
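
If the exporter output has been saved to a file instead of being piped to `wc -l`, the same check can be sketched as follows. The function names are hypothetical, and the 10,000-row limit is the figure quoted above.

```python
ROW_LIMIT = 10_000  # "excessive" threshold quoted in the text


def count_rows(output):
    """Count rows the way `wc -l` does: one per newline character."""
    return output.count("\n")


def is_excessive(output, limit=ROW_LIMIT):
    """Return True if the exporter output has more rows than the limit."""
    return count_rows(output) > limit
```

For example, redirect the output of each curl command to a file, read the file into a string, and pass it to `is_excessive`.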

Resolution

If the issue is due to insufficient resources, you need to scale out the ocp_monitor and ocp_meta tenants.
If the issue is due to OCP-Agent, it is often accompanied by other alerts such as monitor_exporter_unavaliable, monagent_log_alarm, obagent_dead, and host_unavailable. You can view the suggested solutions for these alerts and, in emergencies, try restarting or reinstalling OCP-Agent.
Upgrade OCP-Agent. In OCP V4.2.1, zero-value data in diagnostic information is filtered out. Such data can make up a large share of the collected data and significantly slow down collection.
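
To gauge how much of the collected data is zero-valued before upgrading, a sketch like the following can be applied to saved exporter output. It assumes the output is in Prometheus text format with the value as the last field on each line (no trailing timestamps). The function name is hypothetical, and it only measures the share of zero values; it does not reproduce the filtering done by OCP.

```python
def zero_value_share(output):
    """Return the fraction of samples whose value is zero (0.0 if no samples)."""
    total = zeros = 0
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        try:
            # Assumption: the sample value is the last whitespace-separated field.
            value = float(line.rsplit(None, 1)[-1])
        except ValueError:
            continue  # not a sample line
        total += 1
        if value == 0:
            zeros += 1
    return zeros / total if total else 0.0
```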