Alert description
OCP-Agent is a collection of service processes deployed on the host. ocp_monagent is responsible for collecting monitoring data of the host and OBServer nodes, ocp_mgragent is responsible for O&M operations on the OBServer nodes, and ocp_agentd is the guardian process of the two services. If the guardian process does not exist, OCP-Agent cannot continuously provide services.
This alert monitors whether the OCP-Agent processes on the host are running normally. If any of them is abnormal, the alert is triggered.
Note
This alert can be reported even while OCP-Agent is still collecting monitoring data. In that case, the guardian process no longer exists, but the process responsible for monitoring is still running normally.
Alert principle
| Parameter | Value |
|---|---|
| Monitoring metric | host_agent_process_status. The value of this metric indicates the process status: 1 indicates normal, and 0 indicates abnormal. |
| Source of the metric | The status of ocp_agentd, ocp_monagent, and ocp_mgragent processes can be obtained by requesting the ocp_mgragent status interface (http://ip:62888/api/v1/agent/status). |
| Metric to be collected | host_agent_process_status |
| Monitoring expression | min(host_agent_process_status{@LABELS}) by (@GBLABELS) |
| Collection interval | 30 seconds |
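To inspect the metric source by hand, the status interface listed in the table can be requested directly from the alert host. The following is a minimal sketch, not the way OCP itself collects the metric: the function names `agent_status_url` and `fetch_agent_status` are hypothetical, the port and path are taken from the table above, and the response format and any authentication the interface may require are not described in this topic, so the raw response is printed as-is.

```shell
# Build the ocp_mgragent status URL for a host IP
# (port 62888 and the path come from the table above).
agent_status_url() {
    printf 'http://%s:62888/api/v1/agent/status\n' "$1"
}

# Request the status interface with a 5-second timeout and print the raw response.
# Assumption: the interface may require credentials that are not shown here.
fetch_agent_status() {
    curl --silent --show-error --max-time 5 "$(agent_status_url "$1")"
}
```

For example, run `fetch_agent_status xxx.xxx.xxx.xxx` with the IP address of the alert host.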
Rule information
| Monitoring metric | Default threshold | Duration | Detection interval | Elimination interval |
|---|---|---|---|---|
| host_agent_process_status | 0 | 0 seconds | 15 seconds | 5 minutes |
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Service interruption | Server |
Alert templates
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: svr_ip=xxx.xxx.xxx.xxx Agent service unavailable
Alert details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: agent process unavailable. Process name: ${agent_process}, Process status: ${process_status}.
- Example: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: agent process unavailable. Process name: ocp_monagent, Process status: unavailable.
Alert recovery
- Template: Alert: ${alarm_name}, Agent process status: ${process_status}
- Example: Alert: Agent service unavailable, Agent process status: 1
Impact on the system
- If the ocp_agentd process does not exist, the OCP-Agent process may stop unexpectedly and will not be automatically started, which can cause the following issues:
  - The ocp_monagent process may fail to collect monitoring data, leading to an inability to report alerts and preventing users from identifying risks in the system.
  - The ocp_mgragent process may fail to perform maintenance operations on OBServer nodes.
Possible causes
- The ocp_agentd process may be unexpectedly terminated due to reasons such as full disk space or insufficient memory.
- A bug in the program may cause the ocp_agentd process to fail to start several times in a row, after which no further start attempts are made.
Solutions
Check if the remaining disk space or memory is insufficient.
Log in to the alert host and run the following commands to check the disk and memory usage.
    # Check the disk usage, particularly for the disk where /home/admin is located, to see if it is approaching 100%.
    df -B1
    # Check the memory usage to see if the remaining memory is approaching 0.
    free -g
If the disk usage is approaching 100%, clean up logs or expand the disk space.
Check the directories in /home/admin/logs/ that are taking up the most disk space.
    [root]# du /home/admin/logs/
    4        /home/admin/logs/obproxy/minidump
    32       /home/admin/logs/obproxy/etc
    1261772  /home/admin/logs/obproxy/log
    1261812  /home/admin/logs/obproxy
    1261816  /home/admin/logs/
Enter the directory and delete logs that are more than a year old.
    [root]# ll /home/admin/logs/obproxy/log
    [root]# cd /home/admin/logs/obproxy/log
    [root]# rm obproxy.67344.log.20210902*
Restart the ocp_agentd process.
    [root]# cd /home/admin/ocp_agent
    [root]# python ocp_agent_ctl.py recover
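The disk check and log cleanup steps above can also be scripted. The following is a minimal sketch under stated assumptions: the function names `disk_use_percent`, `largest_dirs`, and `list_old_logs` are hypothetical helpers, not part of OCP-Agent, and the 365-day cutoff mirrors the "more than a year old" guidance. The last helper only lists candidates and deletes nothing, so the list can be reviewed before any file is removed.

```shell
# Print the usage percentage (without the % sign) of the filesystem that holds a path.
disk_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the N largest directories under a path, smallest first (sizes in KiB).
largest_dirs() {
    du -k "$1" | sort -n | tail -n "$2"
}

# List regular files under a directory that were last modified more than N days ago.
# Nothing is deleted here.
list_old_logs() {
    find "$1" -type f -mtime +"$2" -print
}
```

For example, `disk_use_percent /home/admin` prints the usage of the disk that holds /home/admin, `largest_dirs /home/admin/logs 5` shows where the space is going, and `list_old_logs /home/admin/logs/obproxy/log 365` lists the log files that are candidates for deletion.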
If the remaining memory is approaching 0, release the memory and then restart the process.
Release memory
    [root]# sync
    [root]# echo 1 > /proc/sys/vm/drop_caches
    [root]# echo 0 > /proc/sys/vm/drop_caches
Restart the ocp_agentd process.
    [root]# cd /home/admin/ocp_agent
    [root]# python ocp_agent_ctl.py recover
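Because `free -g` rounds down to whole gigabytes, a host with a few hundred megabytes left shows 0. As a finer-grained check before and after releasing memory, the following sketch reads the available memory in MiB. It assumes a Linux kernel that reports the `MemAvailable` field in /proc/meminfo (older kernels do not, in which case nothing is printed), and the function name `mem_available_mib` is hypothetical.

```shell
# Print MemAvailable in MiB from a meminfo-format file (default: /proc/meminfo).
# Prints nothing if the MemAvailable field is absent.
mem_available_mib() {
    awk '$1 == "MemAvailable:" { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}
```

Running `mem_available_mib` without arguments reads /proc/meminfo on the current host.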
If both the remaining memory and disk space are sufficient, the issue may be due to other reasons.
For other reasons, collect the ocp_agentd.log, ocp_monagent.error.log, and ocp_mgragent.error.log files in /home/admin/ocp_agent/log, along with the alert details, and contact technical support to locate the issue.
The ocp_monagent.error.log file may contain the keyword "panic" which indicates a program crash. Search for this keyword and provide the relevant information to technical support.
[root]# tail -1000 /home/admin/ocp_agent/log/ocp_monagent.error.log | grep -10 panic
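To hand the three log files named above to technical support in one piece, they can be bundled into a single archive. This is a minimal sketch: the function name `collect_agent_logs` and the output path in the usage example are hypothetical, and the archive contains only the files listed in this topic.

```shell
# Bundle the three OCP-Agent log files from a log directory into a tar.gz archive.
# Arguments: log directory, output archive path.
# tar reports an error if any of the three files is missing.
collect_agent_logs() {
    tar -czf "$2" -C "$1" ocp_agentd.log ocp_monagent.error.log ocp_mgragent.error.log
}
```

For example, `collect_agent_logs /home/admin/ocp_agent/log /tmp/ocp_agent_logs.tar.gz` writes the archive to /tmp, assuming that location has enough free space.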