obagent_dead Agent service unavailable (V4.3.6)

Last Updated: 2025-09-08 08:15:43

Alert description

OCP-Agent is a collection of service processes deployed on the host. ocp_monagent is responsible for collecting monitoring data of the host and OBServer nodes, ocp_mgragent is responsible for O&M operations on the OBServer nodes, and ocp_agentd is the guardian process of the two services. If the guardian process does not exist, OCP-Agent cannot continuously provide services.

This alert monitors whether the OCP-Agent processes on the host are running normally. If they are not, the alert is triggered.

Note

This alert can be reported while OCP-Agent is still collecting monitoring data. This happens when the guardian process does not exist but the process responsible for monitoring is still running normally.

Alert principle

Parameter	Value
Monitoring metric	host_agent_process_status. The value of this metric indicates the process status: 1 indicates normal, and 0 indicates abnormal.
Source of the metric	The status of ocp_agentd, ocp_monagent, and ocp_mgragent processes can be obtained by requesting the ocp_mgragent status interface (`http://ip:62888/api/v1/agent/status`).
Metric to be collected	host_agent_process_status
Monitoring expression	min(host_agent_process_status{@LABELS}) by (@GBLABELS)
Collection interval	30 seconds
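The status interface named in the table can also be queried by hand when you want to see what the agent reports. The following is a minimal sketch, not the collector's actual implementation. The helper names `agent_status_url` and `fetch_agent_status` are hypothetical, the response body is returned unparsed because its schema is not described here, and the sketch assumes the interface can be reached without extra authentication, which may not hold in your deployment.

```python
import urllib.request

# Port taken from the status interface URL in the table above.
AGENT_STATUS_PORT = 62888


def agent_status_url(ip: str) -> str:
    """Build the ocp_mgragent status URL for one host."""
    return f"http://{ip}:{AGENT_STATUS_PORT}/api/v1/agent/status"


def fetch_agent_status(ip: str, timeout: float = 5.0) -> str:
    """Return the raw response body of the status interface.

    The body is not parsed: its format is not documented on this page.
    Assumption: no additional authentication is required.
    """
    with urllib.request.urlopen(agent_status_url(ip), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, `print(fetch_agent_status("xxx.xxx.xxx.xxx"))` prints whatever the agent on that host returns, and raises an exception if the agent cannot be reached.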

Rule information

Monitoring metric	Default threshold	Duration	Detection interval	Elimination interval
host_agent_process_status	0	0 seconds	15 seconds	5 minutes
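Because the monitoring expression takes the minimum of host_agent_process_status, a single abnormal process is enough to pull the aggregated value down to 0. The sketch below illustrates that behavior only. It assumes that the statuses of the three processes fall into the same group and that the alert fires when the aggregated value equals the threshold; neither assumption is confirmed by this page, and the function names are hypothetical.

```python
from typing import Mapping


def aggregated_status(process_status: Mapping[str, int]) -> int:
    """Mimic min(host_agent_process_status) over one group of processes."""
    return min(process_status.values())


def should_alert(process_status: Mapping[str, int], threshold: int = 0) -> bool:
    """Assumption: the alert fires when the aggregated value equals the threshold."""
    return aggregated_status(process_status) == threshold


# The guardian process is missing while the other two processes are still normal.
statuses = {"ocp_agentd": 0, "ocp_monagent": 1, "ocp_mgragent": 1}
print(should_alert(statuses))
```

Under these assumptions the example prints `True`, which matches the note above that the alert can be reported even while monitoring data is still being collected.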

Alert information

Alert trigger method	Alert level	Scope
Based on the expression of the monitoring metric	Service interruption	Server

Alert templates

Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: svr_ip=xxx.xxx.xxx.xxx Agent service unavailable
Alert details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: agent process unavailable. Process name: ${agent_process}, Process status: ${process_status}.
- Example: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: agent process unavailable. Process name: ocp_monagent, Process status: unavailable.
Alert recovery
- Template: Alert: ${alarm_name}, Agent process status: ${process_status}
- Example: Alert: Agent service unavailable, Agent process status: 1
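To see how the template variables map to the example text, the alert details template can be filled in with Python's `string.Template`, which uses the same `${name}` placeholder syntax. This is only an illustration of the substitution; how OCP renders templates internally is not described here, and `render_details` is a hypothetical helper.

```python
from string import Template

# Alert details template copied from the list above.
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, Host: ${host}, Alert: agent process unavailable. "
    "Process name: ${agent_process}, Process status: ${process_status}."
)


def render_details(**values: str) -> str:
    """Substitute every ${name} placeholder; raises KeyError if one is missing."""
    return DETAILS_TEMPLATE.substitute(values)


print(render_details(
    ob_cluster_name="obcluster-1",
    host="xxx.xxx.xxx.xxx",
    agent_process="ocp_monagent",
    process_status="unavailable",
))
```

With these values, the rendered text is identical to the alert details example.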

Impact on the system

If the ocp_agentd process does not exist, the other OCP-Agent processes may stop unexpectedly and will not be restarted automatically, which can cause the following issues:

- The ocp_monagent process may fail to collect monitoring data, so alerts cannot be reported and users cannot identify risks in the system.
- The ocp_mgragent process may fail to perform maintenance operations on OBServer nodes.

Possible causes

- The ocp_agentd process may be terminated unexpectedly, for example because the disk is full or memory is insufficient.
- A bug in the program may cause the ocp_agentd process to fail to start several times in a row, after which no further start attempts are made.

Solutions

Check if the remaining disk space or memory is insufficient.

# Check the disk usage, particularly for the disk where /home/admin is located, to see if it is approaching 100%.
df -B1

# Check the memory usage to see if the remaining memory is approaching 0.
free -g
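The same two checks can be scripted, for example to run them on several hosts. The sketch below is one possible approach using only the Python standard library. The function names are hypothetical. Note that the disk percentage is computed as used/total and can differ slightly from the Use% column of `df`, which excludes reserved blocks from its calculation.

```python
import shutil
from typing import Optional


def disk_used_percent(path: str) -> float:
    """Percentage of the file system containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def parse_mem_available_kib(meminfo_text: str) -> Optional[int]:
    """Extract MemAvailable (in KiB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None  # very old kernels do not report MemAvailable


if __name__ == "__main__":
    print(f"/home/admin disk usage: {disk_used_percent('/home/admin'):.1f}%")
    with open("/proc/meminfo", encoding="utf-8") as handle:
        print("MemAvailable (KiB):", parse_mem_available_kib(handle.read()))
```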

If the disk usage is approaching 100%, clean up logs or expand the disk space.

Check the directories in /home/admin/logs/ that are taking up the most disk space.

[root]#  du /home/admin/logs/
4       /home/admin/logs/obproxy/minidump
32      /home/admin/logs/obproxy/etc
1261772 /home/admin/logs/obproxy/log
1261812 /home/admin/logs/obproxy
1261816 /home/admin/logs/
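When a log directory has many subdirectories, sorting them by size makes the largest ones easier to spot. The following sketch does this with hypothetical helpers `dir_size_bytes` and `largest_subdirs`. It sums apparent file sizes in bytes, so the numbers will not match the block counts printed by `du` exactly.

```python
import os
from typing import List, Tuple


def dir_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_subdirs(root: str, top: int = 5) -> List[Tuple[int, str]]:
    """Return (size, path) for the largest immediate subdirectories of `root`."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size_bytes(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    for size, path in largest_subdirs("/home/admin/logs"):
        print(f"{size:>15d}  {path}")
```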

Enter the directory and delete logs that are more than a year old.

[root]# ll /home/admin/logs/obproxy/log
[root]# cd /home/admin/logs/obproxy/log
[root]# rm obproxy.67344.log.20210902*
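Before deleting anything, it can help to list which files actually qualify as more than a year old. The sketch below selects files by modification time and only reports them; it does not delete anything. `old_log_files` is a hypothetical helper, and using the modification time instead of the date in the file name is an assumption of this example.

```python
import os
import time
from typing import List, Optional

ONE_YEAR_SECONDS = 365 * 24 * 3600


def old_log_files(log_dir: str,
                  max_age_seconds: float = ONE_YEAR_SECONDS,
                  now: Optional[float] = None) -> List[str]:
    """Return regular files in `log_dir` whose mtime is older than the limit."""
    now = time.time() if now is None else now
    result = []
    with os.scandir(log_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                result.append(entry.path)
    return sorted(result)


if __name__ == "__main__":
    # Dry run: review this list, then delete the files manually.
    for path in old_log_files("/home/admin/logs/obproxy/log"):
        print(path)
```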

Restart the ocp_agentd process.

[root]# cd /home/admin/ocp_agent
[root]# python ocp_agent_ctl.py recover

If the remaining memory is approaching 0, release the memory and then restart the process.

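Release the memory.

# Flush dirty pages to disk, and then drop the page cache.
[root]# sync
[root]# echo 1 > /proc/sys/vm/drop_caches
# Reset the value. Some newer kernels may reject 0 with "Invalid argument"; in that case, this step can be skipped.
[root]# echo 0 > /proc/sys/vm/drop_caches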

Restart the ocp_agentd process.

[root]# cd /home/admin/ocp_agent
[root]# python ocp_agent_ctl.py recover
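After a restart, you may want to confirm that all three processes are present again. One way to do this, sketched below, is to scan /proc for the process names used in this topic. This assumes that the kernel-reported command names match ocp_agentd, ocp_monagent, and ocp_mgragent exactly, which is not confirmed here; the function names are hypothetical.

```python
import os
from typing import List, Set

# Process names as used in this topic (assumed to match /proc/<pid>/comm).
AGENT_PROCESS_NAMES = ("ocp_agentd", "ocp_monagent", "ocp_mgragent")


def running_agent_processes(proc_root: str = "/proc") -> Set[str]:
    """Return the agent process names found under `proc_root`."""
    found = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        comm_path = os.path.join(proc_root, entry, "comm")
        try:
            with open(comm_path, encoding="utf-8", errors="replace") as handle:
                name = handle.read().strip()
        except OSError:
            continue  # process exited or is not readable
        if name in AGENT_PROCESS_NAMES:
            found.add(name)
    return found


def missing_agent_processes(proc_root: str = "/proc") -> List[str]:
    """Return the agent process names that were not found."""
    running = running_agent_processes(proc_root)
    return [name for name in AGENT_PROCESS_NAMES if name not in running]


if __name__ == "__main__":
    print("Missing agent processes:", missing_agent_processes() or "none")
```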

If both the remaining memory and disk space are sufficient, the issue may be due to other reasons.

In this case, collect the ocp_agentd.log, ocp_monagent.error.log, and ocp_mgragent.error.log files in /home/admin/ocp_agent/log, along with the alert details, and contact technical support to locate the issue.
- The ocp_monagent.error.log file may contain the keyword "panic" which indicates a program crash. Search for this keyword and provide the relevant information to technical support.
```
[root]# tail -n 1000 /home/admin/ocp_agent/log/ocp_monagent.error.log | grep -C 10 panic
```
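If you want to save the matching excerpt to attach to a support request, the same search can be done in Python. The sketch below reads the last 1000 lines of a log file and keeps each line containing "panic" together with 10 lines of context on either side, similar to the command above. `tail_lines` and `lines_with_context` are hypothetical helper names.

```python
from collections import deque
from typing import List


def tail_lines(path: str, count: int = 1000) -> List[str]:
    """Return the last `count` lines of a text file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return list(deque(handle, maxlen=count))


def lines_with_context(lines: List[str], keyword: str = "panic",
                       context: int = 10) -> List[str]:
    """Keep lines containing `keyword` plus `context` lines before and after."""
    keep = set()
    for index, line in enumerate(lines):
        if keyword in line:
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            keep.update(range(start, end))
    return [lines[i] for i in sorted(keep)]


if __name__ == "__main__":
    log_path = "/home/admin/ocp_agent/log/ocp_monagent.error.log"
    print("".join(lines_with_context(tail_lines(log_path))), end="")
```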