monagent_process_stop|V4.3.3| docs|Distributed Database

monagent_process_stop

Last Updated：2025-01-10 06:15:54 Updated

Description

This alert is triggered when the ocp_monagent process stops.

The ocp_agent process monitors the uptime of the processes of components in OceanBase Database. When the uptime of a process is 0, the process is stopped. This alert indicates that the process has stopped at the moment rather than for a long time. In the latter case, other status alerts are triggered.

Relevant alerts:

observer process: observer_process_stop
obproxy process: obproxy_process_stop
obproxyd.sh process: obproxyd_process_stop
ocp_agentd process: agentd_process_stop
ocp_mgragent process: mgragent_process_stop
ocp_monagent process: monagent_process_stop

Principle

Parameter	Value
Metric	monagent_uptime_delta_seconds
Source	The uptime is the difference between the current time and the process creation time displayed in the 14th column of the `stat` file in the `/proc/[pid]` directory.
Collected metric	process_uptime_seconds
Metric expression	0 - min(delta(process_uptime_seconds{name="ocp_monagent",@LABELS}[@INTERVAL])) by (@GBLABELS)
Collection cycle	5 seconds

Alert rule

Metric expression	Metric description	Default threshold	Detection cycle	Time before clearance
monagent_uptime_delta_seconds > 0	The negative offset value of uptime is used. A metric value greater than 0 indicates that the process has stopped.	0 seconds	10 seconds	5 minutes

Alert information

Trigger method	Alert level	Scope
Based on the expression of the metric	Warning	Host

Alert templates

Overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=AdminMETA-12:host=xxx.xxx.xxx.xxx. The ocp_monagent process stopped.
Details
- Template: Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}. The process has been running for ${value_shown}.
- Example: Cluster: AdminMETA. Host: xxx.xxx.xxx.xxx. Alert: The ocp_monagent process stopped. The process has been running for 34 minutes 8 seconds.

Impact on the system

OCP-Agent uses the ocp_monagent process to collect monitoring data.

If the ocp_monagent process is restarted by the ocp_agentd process soon after it stops, the impact on the OCP service is negligible. The ocp_agentd process can try at most three times to start the ocp_monagent process. Monitoring data for a few seconds may be lost, and the monitoring data cached by the process will be lost. You can rebuild the cache by querying internal tables of OceanBase Database. If the process restarts frequently, the cache implementation will lead to cache penetration, which increases the workload of OBServer nodes. Usually, the monitoring data is collected at a rate less than 20 QPS.
If the ocp_monagent process is not restarted after it stops, a large amount of monitoring data will be lost, and alerts that depend on the monitoring data become invalid. However, alerts that do not depend on the monitoring data are still active. In this case, OBServer nodes cannot retain the monitoring history.

Possible causes

If the ocp_monagent process consumes more than 2 GB of memory, it is restarted by the ocp_agentd process. This may be caused by the following reasons:
- Monitoring data is collected at a high rate, which occupies large memory space.
- The process memory leaks.
A process bug causes an execution error and the process unexpectedly exits.
A monitoring configuration error, such as a YAML syntax error of a custom configuration in the module_config file stored in the /home/admin/ocp_agent/conf/ directory.
An expected O&M action. In this case, you can restart the OCP-Agent in the OCP console.

Solutions

If the memory usage of the process exceeds the threshold, you can temporarily increase the threshold and then start the process. To do so, open the agentd.yaml file in the /home/admin/ocp_agent/conf/ directory, and set the value of the ocp_monagent.limit.memoryQuota parameter to, for example, 2.5 GB.
If the process exits due to a bug, you can search the ocp_monagent.error.log file for the panic keyword to check the bug. In this case, if the ocp_agentd process fails to restart the ocp_monagent process, you can manually restart it.
```
/home/admin/ocp_agentctl service start ocp_monagent
```
If the process fails to restart due to a configuration error, you can correct the custom configuration and then start the process.