monagent_process_stop

2025-01-10 06:15:54  Updated

Description

This alert is triggered when the ocp_monagent process stops.

The ocp_agent process monitors the uptime of the processes of components in OceanBase Database. When the uptime of a process is 0, the process is stopped. This alert indicates that the process has stopped at the moment rather than for a long time. In the latter case, other status alerts are triggered.

Relevant alerts:

  • observer process: observer_process_stop
  • obproxy process: obproxy_process_stop
  • obproxyd.sh process: obproxyd_process_stop
  • ocp_agentd process: agentd_process_stop
  • ocp_mgragent process: mgragent_process_stop
  • ocp_monagent process: monagent_process_stop

Principle

Parameter Value
Metric monagent_uptime_delta_seconds
Source The uptime is the difference between the current time and the process creation time displayed in the 14th column of the stat file in the /proc/[pid] directory.
Collected metric process_uptime_seconds
Metric expression 0 - min(delta(process_uptime_seconds{name="ocp_monagent",@LABELS}[@INTERVAL])) by (@GBLABELS)
Collection cycle 5 seconds

Alert rule

Metric expression Metric description Default threshold Detection cycle Time before clearance
monagent_uptime_delta_seconds > 0 The negative offset value of uptime is used. A metric value greater than 0 indicates that the process has stopped. 0 seconds 10 seconds 5 minutes

Alert information

Trigger method Alert level Scope
Based on the expression of the metric Warning Host

Alert templates

  • Overview

    • Template: ${alarm_target} ${alarm_name}
    • Example: alarm_template_id=0:ob_cluster=AdminMETA-12:host=xxx.xxx.xxx.xxx. The ocp_monagent process stopped.
  • Details

    • Template: Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}. The process has been running for ${value_shown}.
    • Example: Cluster: AdminMETA. Host: xxx.xxx.xxx.xxx. Alert: The ocp_monagent process stopped. The process has been running for 34 minutes 8 seconds.

Impact on the system

OCP-Agent uses the ocp_monagent process to collect monitoring data.

  1. If the ocp_monagent process is restarted by the ocp_agentd process soon after it stops, the impact on the OCP service is negligible. The ocp_agentd process can try at most three times to start the ocp_monagent process. Monitoring data for a few seconds may be lost, and the monitoring data cached by the process will be lost. You can rebuild the cache by querying internal tables of OceanBase Database. If the process restarts frequently, the cache implementation will lead to cache penetration, which increases the workload of OBServer nodes. Usually, the monitoring data is collected at a rate less than 20 QPS.

  2. If the ocp_monagent process is not restarted after it stops, a large amount of monitoring data will be lost, and alerts that depend on the monitoring data become invalid. However, alerts that do not depend on the monitoring data are still active. In this case, OBServer nodes cannot retain the monitoring history.

Possible causes

  • If the ocp_monagent process consumes more than 2 GB of memory, it is restarted by the ocp_agentd process. This may be caused by the following reasons:

    • Monitoring data is collected at a high rate, which occupies large memory space.
    • The process memory leaks.
  • A process bug causes an execution error and the process unexpectedly exits.

  • A monitoring configuration error, such as a YAML syntax error of a custom configuration in the module_config file stored in the /home/admin/ocp_agent/conf/ directory.

  • An expected O&M action. In this case, you can restart the OCP-Agent in the OCP console.

Solutions

  1. If the memory usage of the process exceeds the threshold, you can temporarily increase the threshold and then start the process. To do so, open the agentd.yaml file in the /home/admin/ocp_agent/conf/ directory, and set the value of the ocp_monagent.limit.memoryQuota parameter to, for example, 2.5 GB.

  2. If the process exits due to a bug, you can search the ocp_monagent.error.log file for the panic keyword to check the bug. In this case, if the ocp_agentd process fails to restart the ocp_monagent process, you can manually restart it.

    /home/admin/ocp_agentctl service start ocp_monagent
    
  3. If the process fails to restart due to a configuration error, you can correct the custom configuration and then start the process.

Contact Us