ocp_collect_metric_failure_rate_high|V4.3.3| docs|Distributed Database

ocp_collect_metric_failure_rate_high

Last Updated: 2025-01-10 06:15:54

Description

OCP-Server collects monitoring data from the exporter of OCP-Agent. This alert is triggered when the collection failure rate of OCP-Server exceeds the threshold. The collection failure rate is expressed as a percentage: Collection failure rate = 100 × Number of failed collection attempts / Total number of collection attempts.

Principle

Parameter	Value
Metric	collect_metric_failure_rate
Source	OceanBase Cloud Platform (OCP). You can check the failure rate at `http://OCP-IP:8080/api/v2/actuator/prometheus`.
Collected metric	ocp_monitor_collect_request_errors_total and ocp_monitor_collect_request_duration_ms_count
Metric expression	100 * sum(rate(ocp_monitor_collect_request_errors_total{app="OCP"}[60])) by (svr_ip) / sum(rate(ocp_monitor_collect_request_duration_ms_count{app="OCP"}[60])) by (svr_ip)
Collection cycle	60 seconds
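
The metric expression above divides the rate of failed requests by the rate of all requests over the same window, which is equivalent to dividing the increase of one counter by the increase of the other. The following is a minimal sketch of that arithmetic, not OCP's own implementation. It assumes the actuator endpoint returns the Prometheus text format without timestamps, it sums all label sets instead of grouping by svr_ip, and the function names `sum_metric` and `failure_rate_pct` are hypothetical.

```shell
#!/bin/sh
# Sum every sample of one metric from Prometheus text-format input on stdin.
# Usage: sum_metric METRIC_NAME < scrape.txt
# Assumption: each sample line ends with its value (no trailing timestamp).
sum_metric() {
    awk -v name="$1" '
        /^#/ { next }                    # skip HELP/TYPE comment lines
        {
            metric = $1
            sub(/[{].*/, "", metric)     # drop the {label="..."} part
            if (metric == name) total += $NF
        }
        END { printf "%.0f\n", total }
    '
}

# Failure percentage between two snapshots of the counters.
# Args: errors_before errors_after total_before total_after
failure_rate_pct() {
    awk -v e0="$1" -v e1="$2" -v t0="$3" -v t1="$4" 'BEGIN {
        dt = t1 - t0
        if (dt <= 0) { print "0.000"; exit }   # no attempts in the window
        printf "%.3f\n", 100 * (e1 - e0) / dt
    }'
}

# Example with made-up counter values: 25 new errors out of 100 new requests.
# failure_rate_pct 10 35 100 200    # prints 25.000
```

To try it against a live system, you could save two scrapes of `http://OCP-IP:8080/api/v2/actuator/prometheus` taken 60 seconds apart, run `sum_metric` on each for both counters, and pass the four totals to `failure_rate_pct`. Whether the endpoint requires authentication depends on your deployment.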

Alert rule

Metric expression	Metric description	Default threshold	Detection cycle	Time before clearance
collect_metric_failure_rate > 10	Exporter collection failure rate	10	15 seconds	5 minutes
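
As an illustration of the rule above, the comparison is a strict "greater than" against the default threshold of 10. The sketch below only shows that comparison. The function name `exceeds_threshold` is hypothetical, and the detection cycle and clearance time are not modeled.

```shell
#!/bin/sh
# Return success (0) when the failure rate is strictly above the threshold.
# Args: failure_rate [threshold]   (threshold defaults to 10, as in the rule)
exceeds_threshold() {
    awk -v v="$1" -v t="${2:-10}" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example: the value from the alert example in this topic.
if exceeds_threshold 75.757; then
    echo "collect_metric_failure_rate is above the threshold"
fi
```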

Alert information

Trigger method	Alert level	Scope
Metric expression	Critical	Service

Alert templates

Overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:app=OCP. The collection failure rate of OCP-Server is high.
Details
- Template: Alert: ${alarm_name}. Monitoring data collection failure rate: ${value_shown}.
- Example: Alert: The collection failure rate of OCP-Server is high. Monitoring data collection failure rate: 75.757%.

Impact on the system

The monitoring, alerting, diagnosis, and inspection features on the GUI may be affected:

- The monitoring statistics on the GUI may be intermittent or not displayed.
- Alerts may not be generated when an exception occurs.
- Diagnostic and inspection reports may not identify the cause of an exception.

Possible causes

- The resources of the OCP_meta tenant are insufficient, which results in frequent full garbage collection (GC) in OCP. Check the resource specification of the tenant and scale out the tenant. For more information, see MonitorDB resources.
- The ocp_monagent process of OCP-Agent is abnormal, which causes access to the exporter API to fail. Check the monagent.log file in the /home/admin/ocp_agent/log/ directory for information about exceptions.
- Access to the exporter API times out, for example, because of a network issue. A request times out if the API does not respond within 1 second. Check the ocp.log file of OCP-Server for the cause.
- OCP-Agent is collecting a large amount of data, which causes an API response timeout. Run the following commands on the host that lacks monitoring data to check the amount of collected monitoring data. If the amount of data is massive, for example, more than 10,000 rows, upgrade OCP to a version later than V4.2.0.

sudo curl -s --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$(cat /home/admin/ocp_agent/run/ocp_monagent.pid).sock http://unix-socket-server/metrics/ob/basic | wc -l
sudo curl -s --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$(cat /home/admin/ocp_agent/run/ocp_monagent.pid).sock http://unix-socket-server/metrics/ob/extra | wc -l
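
The row-count check described above can be wrapped in a small helper so that the result is compared with the 10,000-row guideline automatically. This is a sketch only. The function name `check_row_count` is hypothetical, and the limit is the example figure from this topic rather than a hard product limit.

```shell
#!/bin/sh
# Count lines on stdin and compare the count with a limit (default 10000).
# Prints the row count; returns 1 when the count exceeds the limit.
check_row_count() {
    limit="${1:-10000}"
    rows=$(wc -l | tr -d ' ')
    if [ "$rows" -gt "$limit" ]; then
        echo "rows=$rows exceeds $limit"
        return 1
    fi
    echo "rows=$rows within $limit"
}

# Usage: pipe the exporter output from the commands above into the helper
# instead of into "wc -l", for example:
#   sudo curl -s --unix-socket <socket path> http://unix-socket-server/metrics/ob/basic | check_row_count
```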