Description
This alert is triggered when the ocp_monagent process stops.
The ocp_agent process monitors the uptime of the processes of components in OceanBase Database. When the uptime of a process is 0, the process is stopped. This alert indicates that the process has stopped at the moment rather than for a long time. In the latter case, other status alerts are triggered.
Relevant alerts:
- observer process: observer_process_stop
- obproxy process: obproxy_process_stop
- obproxyd.sh process: obproxyd_process_stop
- ocp_agentd process: agentd_process_stop
- ocp_mgragent process: mgragent_process_stop
- ocp_monagent process: monagent_process_stop
Principle
| Parameter | Value |
|---|---|
| Metric | monagent_uptime_delta_seconds |
| Source | The uptime is the difference between the current time and the process creation time displayed in the 14th column of the stat file in the /proc/[pid] directory. |
| Collected metric | process_uptime_seconds |
| Metric expression | 0 - min(delta(process_uptime_seconds{name="ocp_monagent",@LABELS}[@INTERVAL])) by (@GBLABELS) |
| Collection cycle | 5 seconds |
Alert rule
| Metric expression | Metric description | Default threshold | Detection cycle | Time before clearance |
|---|---|---|---|---|
| monagent_uptime_delta_seconds > 0 | The negative offset value of uptime is used. A metric value greater than 0 indicates that the process has stopped. | 0 seconds | 10 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Warning | Host |
Alert templates
Overview
- Template:
${alarm_target} ${alarm_name} - Example:
alarm_template_id=0:ob_cluster=AdminMETA-12:host=xxx.xxx.xxx.xxx. The ocp_monagent process stopped.
- Template:
Details
- Template:
Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}. The process has been running for ${value_shown}. - Example:
Cluster: AdminMETA. Host: xxx.xxx.xxx.xxx. Alert: The ocp_monagent process stopped. The process has been running for 34 minutes 8 seconds.
- Template:
Impact on the system
OCP-Agent uses the ocp_monagent process to collect monitoring data.
If the ocp_monagent process is restarted by the ocp_agentd process soon after it stops, the impact on the OCP service is negligible. The ocp_agentd process can try at most three times to start the ocp_monagent process. Monitoring data for a few seconds may be lost, and the monitoring data cached by the process will be lost. You can rebuild the cache by querying internal tables of OceanBase Database. If the process restarts frequently, the cache implementation will lead to cache penetration, which increases the workload of OBServer nodes. Usually, the monitoring data is collected at a rate less than 20 QPS.
If the ocp_monagent process is not restarted after it stops, a large amount of monitoring data will be lost, and alerts that depend on the monitoring data become invalid. However, alerts that do not depend on the monitoring data are still active. In this case, OBServer nodes cannot retain the monitoring history.
Possible causes
If the ocp_monagent process consumes more than 2 GB of memory, it is restarted by the ocp_agentd process. This may be caused by the following reasons:
- Monitoring data is collected at a high rate, which occupies large memory space.
- The process memory leaks.
A process bug causes an execution error and the process unexpectedly exits.
A monitoring configuration error, such as a YAML syntax error of a custom configuration in the
module_configfile stored in the/home/admin/ocp_agent/conf/directory.An expected O&M action. In this case, you can restart the OCP-Agent in the OCP console.
Solutions
If the memory usage of the process exceeds the threshold, you can temporarily increase the threshold and then start the process. To do so, open the
agentd.yamlfile in the/home/admin/ocp_agent/conf/directory, and set the value of theocp_monagent.limit.memoryQuotaparameter to, for example, 2.5 GB.If the process exits due to a bug, you can search the
ocp_monagent.error.logfile for thepanickeyword to check the bug. In this case, if the ocp_agentd process fails to restart the ocp_monagent process, you can manually restart it./home/admin/ocp_agentctl service start ocp_monagentIf the process fails to restart due to a configuration error, you can correct the custom configuration and then start the process.