Description
OCP-Agent is a service program installed on the host. It collects monitoring data from the host and OBServers and performs O&M on OBServers. The daemon process of OCP-Agent is ocp_agentd. If the daemon does not exist, OCP-Agent cannot work properly.
This alert is triggered when OCP-Agent detects that the ocp_agentd process does not exist on the host.
Note
This alert is generated when the daemon does not exist but the monitoring process of OCP-Agent works normally.
Principle
| Parameter | Value |
|---|---|
| Metric | ocp_agentd_exists Note: The value of the ocp_agentd_exists metric indicates whether the ocp_agentd process exists. The value 1 indicates that the process exists and the value 0 indicates otherwise. |
| Source | ps -ef|grep -w ocp_agentd|grep -v grep|wc -l |
| Collected metric | ocp_agentd_exists |
| Metric expression | min(ocp_agentd_exists{@LABELS}) by (@GBLABELS) |
| Collection cycle | 1 second |
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| ocp_agentd_exists | 0 | 0 seconds | 10s | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Stopped | Server |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}
Overview example: svr_ip=xxx.xxx.xxx.xxx. Agent service is unavailable.
Details example: svr_ip=xxx.xxx.xxx.xxx. Agent service is unavailable.
Impact on the system
If the ocp_agentd process does not exist, the processes related to OCP-Agent may stop and cannot automatically start, which causes the following consequences:
If the monitoring data cannot be collected and the alert cannot be generated, you cannot identify system risks.
You cannot perform O&M on OBServers.
Possible cause
The ocp_agentd process may be unexpectedly terminated because the host memory or disk is full.
Suggested solutions
Check the usage of the disk or memory.
Log on to the alerting host and run the following commands:
# Check whether the usage of the disk under /home/admin is close to 100%. df -B1 # Check whether the available memory space is close to 0. free -gIf the disk usage is close to 100%, perform the following steps to clear logs or add more disks.
View large sub-directories in the /home/admin/logs/ directory.
[root]# du /home/admin/logs/ 4 /home/admin/logs/obproxy/minidump 32 /home/admin/logs/obproxy/etc 1261772 /home/admin/logs/obproxy/log 1261812 /home/admin/logs/obproxy 1261816 /home/admin/logs/Enter the log directory and delete obsolete logs.
[root]# ll /home/admin/logs/obproxy/log [root]# cd /home/admin/logs/obproxy/log [root]# rm obproxy.67344.log.20210902*Restart the ocp_agentd process.
[root]# cd /home/admin/ocp_agent [root]# python ocp_agent_ctl.py recover
If the available memory space is close to 0, perform the following steps to release the memory, and restart the process.
Release the memory space.
[root]# sync [root]# echo 1 > /proc/sys/vm/drop_caches [root]# echo 0 > /proc/sys/vm/drop_cachesRestart the ocp_agentd process.
[root]# cd /home/admin/ocp_agent [root]# python ocp_agent_ctl.py recover
If the remaining memory and disk space are sufficient, proceed to the next step.
When the ocp_agentd is terminated due to other reasons, find the ocp_agentd.log file in the /home/admin/ocp_agent/log/ directory and send it with the alert details to OCP technical support to locate the issue.