Description
OCP-Server collects monitoring data from the exporter of OCP-Agent. This alert is triggered when the collection failure rate of OCP-Server exceeds the threshold. Collection failure rate = Number of failed collection attempts / Total number of collection attempts.
Principle
| Parameter | Value |
|---|---|
| Metric | collect_metric_failure_rate |
| Source | OceanBase Cloud Platform (OCP). You can check the failure rate at http://OCP-IP:8080/api/v2/actuator/prometheus. |
| Collected metric | ocp_monitor_collect_request_errors_total and ocp_monitor_collect_request_duration_ms_count |
| Metric expression | 100 * sum(rate(ocp_monitor_collect_request_errors_total{app="OCP"}[60])) by (svr_ip) / sum(rate(ocp_monitor_collect_request_duration_ms_count{app="OCP"}[60])) by (svr_ip) |
| Collection cycle | 60 seconds |
Alert rule
| Metric expression | Metric description | Default threshold | Detection cycle | Time before clearance |
|---|---|---|---|---|
| collect_metric_failure_rate > 10 | Exporter collection failure rate | 10 | 15 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Service |
Alert templates
Overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:app=OCP. The collection failure rate of OCP-Server is high.
Details
- Template: Alert: ${ alarm_name}. Monitoring data collection failure rate: ${value_shown}.
- Example: Alert: The collection failure rate of OCP-Server is high. Monitoring data collection failure rate: 75.757 %.
Impact on the system
The monitoring, alerting, diagnosis, and inspection features on the GUI may be affected.
- The monitoring statistics on the GUI may be intermittent or not displayed.
- No alerts are generated when an exception occurs.
- You cannot locate the causes for the exceptions in the diagnostic or inspection report.
Possible causes
The resources for the OCP_meta tenant are insufficient, which results in frequent full garbage collection (GC) exceptions of OCP. You can check the resource specification of the tenant and scale out the tenant. For more information, see MonitorDB resources.
The ocp_monagent process of OCP-Agent is abnormal, which causes an access exception of the exporter API. You can check the monagent.log file in the
/home/admin/ocp_agent/log/directory for information about exceptions.Check the ocp.log of OCP-Server for the causes. For example, a network issue may cause an access timeout. If the API fails to respond within 1 second, the API response times out.
OCP-Agent is collecting a large amount of data, which causes an API response timeout. You can run the following command on the host that lacks monitoring data to check the amount of collected monitoring data. If the amount of data is massive, for example, more than 10,000 rows, upgrade OCP to a version later than V4.2.0.
sudo curl -s --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$(cat /home/admin/ocp_agent/run/ocp_monagent.pid).sock http://unix-socket-server/metrics/ob/basic | wc -l sudo curl -s --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$(cat /home/admin/ocp_agent/run/ocp_monagent.pid).sock http://unix-socket-server/metrics/ob/extra | wc -l
Suggested solutions
- If the alert is caused by resource insufficiency, scale out the ocp_monitor and ocp_meta tenants.
- If the alert is caused by OCP-Agent, other alerts, such as monitor_exporter_unavailable, monagent_log_alarm, obagent_dead, and host_unavailable are also reported at the same time. In this case, you can view the alert handling suggestions and take corresponding measures. In case of emergency, you can restart or reinstall OCP-Agent.
- Upgrade OCP-Agent. In OCP V4.2.1, zero values in the diagnostic data are filtered. Zero values may account for a large proportion of diagnostic data and have a significant impact on data collection performance.