Alert description
Monitors whether the ocp_mgragent process has stopped.
The OCP-Agent process monitors the start time of each OceanBase component process. If the start time changes, the process has stopped or restarted. This alert only indicates that the process stopped at some point; it does not mean that the process has been down for a long time. If the process stays down for a long time, a separate status-related alert is triggered.
The relevant processes include:
- observer: alert item is observer_process_stop
- obproxy: alert item is obproxy_process_stop
- obproxyd.sh: alert item is obproxyd_process_stop
- ocp_agentd: alert item is agentd_process_stop
- ocp_mgragent: alert item is mgragent_process_stop
- ocp_monagent: alert item is monagent_process_stop
Alert principle
| Parameter | Value |
|---|---|
| Monitoring metric | mgragent_boot_time_delta_seconds |
| Source of the metric | The start time of the process is the system boot time plus the offset between system boot and process start. The system boot time is the btime value in the output of cat /proc/stat. The offset is the value in the 22nd column of the output of cat /proc/pid/stat (where pid is the process ID), divided by 100. |
| Sampling metric | process_boot_time_seconds |
| Monitoring expression | max(delta(process_boot_time_seconds{name="ocp_mgragent",@LABELS}[@INTERVAL])) by (@GBLABELS) |
| Sampling interval | 5 seconds |
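The calculation described in the "Source of the metric" row can be sketched as follows. This is a minimal illustration, not the OCP-Agent implementation: the function names are hypothetical, and the divisor of 100 from the table is exposed as a parameter because it corresponds to the kernel clock tick rate, which is an assumption about the host.

```python
def parse_btime(proc_stat_text):
    """Return the system boot time (epoch seconds) from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("btime line not found in /proc/stat content")


def parse_starttime_ticks(pid_stat_text):
    """Return column 22 (start offset in clock ticks) from /proc/<pid>/stat content."""
    # Column 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' and count from column 3 onward.
    rest = pid_stat_text.rsplit(")", 1)[1].split()
    return int(rest[19])  # rest[0] is column 3, so column 22 is rest[19]


def process_start_time(proc_stat_text, pid_stat_text, ticks_per_second=100):
    """Process start time = system boot time + start offset since boot."""
    offset_seconds = parse_starttime_ticks(pid_stat_text) / ticks_per_second
    return parse_btime(proc_stat_text) + offset_seconds


def process_start_time_for_pid(pid):
    """Read the live /proc files for a given PID (Linux only)."""
    with open("/proc/stat") as f:
        proc_stat_text = f.read()
    with open(f"/proc/{pid}/stat") as f:
        pid_stat_text = f.read()
    return process_start_time(proc_stat_text, pid_stat_text)
```

Calling process_start_time_for_pid with the PID of ocp_mgragent before and after a restart should yield two different values, which is the change that this alert detects.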
Rule information
| Monitoring expression | Description of the monitoring metric | Default threshold | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| mgragent_boot_time_delta_seconds > 0 | If the monitoring metric is greater than 0, it indicates that the process has stopped. | 0 seconds | 10 seconds | 5 minutes |
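The rule above can be illustrated with a small sketch that applies the same logic to sampled values of process_boot_time_seconds. The function names and the sample data layout are hypothetical, and the delta here is a plain last-minus-first difference over the window, which simplifies what the monitoring system's delta function may actually compute.

```python
def boot_time_delta(samples):
    """Difference between the last and first sample in the window."""
    if len(samples) < 2:
        return 0.0
    return samples[-1] - samples[0]


def max_boot_time_delta(series):
    """Largest delta across all series in one group (for example, one host)."""
    return max((boot_time_delta(s) for s in series.values()), default=0.0)


def is_process_stopped(series, threshold=0.0):
    """Apply the rule: alert when the delta exceeds the threshold (0 seconds)."""
    return max_boot_time_delta(series) > threshold
```

With a constant start time the delta stays at 0 and no alert fires; a restart moves the start time forward, so the delta becomes positive and the rule matches.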
Alert information
| Alert triggering method | Alert level | Scope |
|---|---|---|
| Monitoring expression | Warning | Host |
Alert template
Alert summary
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=AdminMETA-12:host=xxx.xxx.xxx.xxx ocp_mgragent process stopped
Alert details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}.
- Example: Cluster: AdminMETA, Host: xxx.xxx.xxx.xxx, Alert: ocp_mgragent process stopped.
Alert recovery
- Template: Alert: ${alarm_name}, ocp_mgragent process start time change: ${recover_value}
- Example: Alert: ocp_mgragent process stopped, ocp_mgragent process start time change: 0 seconds
Impact on the system
ocp_mgragent is the OCP-Agent process that manages OBServer and OBProxy. If the ocp_mgragent process stops, the following issues may occur:
- Ongoing maintenance operations are terminated. These operations may appear as maintenance tasks in OCP-Server. The tasks ultimately fail and can be retried.
- If the process is not restarted by the guardian process (ocp_agentd), new maintenance tasks cannot be executed. In this case, the host_unavailable alert is triggered.
Possible causes
- Process bugs causing execution errors and unexpected exits.
- Monitoring configuration errors, such as syntax (YAML) errors in custom configurations in /home/admin/ocp_agent/conf/module_config.
- Scheduled maintenance, such as GUI-based agent restarts.
Solution
Process bugs are logged in ocp_mgragent.error.log. Search for the panic keyword to confirm. Then try to start the process. The daemon process attempts to restart it automatically; if that fails, start it manually:
/home/admin/ocp_agentctl service start ocp_mgragent
If the restart fails because of configuration errors, correct the custom configuration and then start the process.
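Searching the error log for the panic keyword can be scripted. The following is a minimal sketch; the function name is hypothetical, and the log path is passed in by the caller because the directory that holds ocp_mgragent.error.log depends on the installation.

```python
def find_panic_lines(log_path, keyword="panic"):
    """Return (line_number, text) pairs for log lines that contain the keyword."""
    matches = []
    # errors="replace" keeps the scan going if the log contains invalid bytes.
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if keyword in line:
                matches.append((lineno, line.rstrip("\n")))
    return matches
```

An empty result suggests that the stop was not caused by a panic, in which case the configuration and scheduled maintenance causes are worth checking next.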