Description
OCP-Agent is a collection of service programs installed on the host: ocp_monagent collects monitoring data from the hosts and OBServer nodes, ocp_mgragent performs O&M operations on OBServer nodes, and ocp_agentd is the daemon of the two services. If the daemon does not exist, OCP-Agent cannot work properly.
This alert is triggered when an OCP-Agent process on the host is abnormal.
Note
This alert is generated when the daemon does not exist but the monitoring process of OCP-Agent works normally.
Principle
| Parameter | Value |
|---|---|
| Metric | host_agent_process_status. The value indicates the process status: 1 indicates normal and 0 indicates abnormal. |
| Source | The ocp_mgragent status API http://ip:62888/api/v1/agent/status is called to obtain the status of the ocp_agentd, ocp_monagent, and ocp_mgragent processes. |
| Collected metric | host_agent_process_status |
| Metric expression | min(host_agent_process_status{@LABELS}) by (@GBLABELS) |
| Collection cycle | 30s |
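
The metric expression above takes the minimum of host_agent_process_status within each label group, so a single abnormal process (value 0) is enough to drive the group's value to 0. The following Python sketch illustrates only that aggregation. The sample records, the `svr_ip` and `process` label names, and the `min_by` helper are hypothetical, because the actual labels behind @LABELS and @GBLABELS are not listed here.

```python
from collections import defaultdict

def min_by(samples, group_labels):
    """Group samples by the given labels and keep the minimum value per group."""
    grouped = defaultdict(list)
    for labels, value in samples:
        key = tuple(labels[name] for name in group_labels)
        grouped[key].append(value)
    return {key: min(values) for key, values in grouped.items()}

# Hypothetical samples: (labels, value), where 1 = normal and 0 = abnormal.
samples = [
    ({"svr_ip": "10.0.0.1", "process": "ocp_agentd"}, 1),
    ({"svr_ip": "10.0.0.1", "process": "ocp_monagent"}, 0),
    ({"svr_ip": "10.0.0.2", "process": "ocp_agentd"}, 1),
]

# Aggregate per host: one abnormal process makes the whole host report 0.
result = min_by(samples, ["svr_ip"])
print(result)
```

In this sketch, the host with one abnormal process aggregates to 0 and the healthy host to 1, which is why the default threshold of 0 is sufficient to detect any single failed process.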
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| host_agent_process_status | 0 | 0 seconds | 15 seconds | 5 minutes |
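
As a rough illustration of how the rule above might behave over time, the sketch below samples the metric once per detection cycle, triggers as soon as the value equals the threshold (duration 0 seconds), and clears after 5 minutes without an abnormal value. That reading of "Time before clearance" and the `evaluate` helper are assumptions made for this example, not a description of OCP's internal implementation.

```python
THRESHOLD = 0             # default threshold from the alert rule
DETECTION_CYCLE_S = 15    # detection cycle
CLEAR_AFTER_S = 5 * 60    # time before clearance (assumed semantics)

def evaluate(values, cycle_s=DETECTION_CYCLE_S, clear_after_s=CLEAR_AFTER_S):
    """Return (time_s, event) pairs for metric values sampled once per cycle."""
    events = []
    firing = False
    last_abnormal = None
    for i, value in enumerate(values):
        now = i * cycle_s
        if value == THRESHOLD:
            last_abnormal = now
            if not firing:
                # Duration is 0 seconds, so the first abnormal sample triggers.
                firing = True
                events.append((now, "triggered"))
        elif firing and now - last_abnormal >= clear_after_s:
            firing = False
            events.append((now, "cleared"))
    return events

# One abnormal check followed by 20 normal checks (20 * 15 s = 300 s).
events = evaluate([1, 0] + [1] * 20)
print(events)
```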
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Stopped | Server |
Alert templates
Overview template: ${alarm_target} ${alarm_name}
Details template: Cluster: ${ob_cluster_name}; Host: ${host}; Alert: OCP-Agent process unavailable; Process name: ${agent_process}; Process status: ${process_status}.
Overview example: svr_ip=xxx.xxx.xxx.xxx OCP-Agent process unavailable
Details example: Cluster: obcluster-1; Host: xxx.xxx.xxx.xxx; Alert: OCP-Agent process unavailable; Process name: ocp_monagent; Process status: unavailable.
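
To show how the placeholders map to the examples above, the following sketch fills both templates by using Python's standard `string.Template`, which happens to use the same `${name}` syntax. How OCP renders templates internally is not described here, so treat this only as an illustration of the substitution.

```python
from string import Template

OVERVIEW_TEMPLATE = Template("${alarm_target} ${alarm_name}")
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}; Host: ${host}; "
    "Alert: OCP-Agent process unavailable; "
    "Process name: ${agent_process}; Process status: ${process_status}."
)

# Values taken from the overview and details examples above.
overview = OVERVIEW_TEMPLATE.substitute(
    alarm_target="svr_ip=xxx.xxx.xxx.xxx",
    alarm_name="OCP-Agent process unavailable",
)
details = DETAILS_TEMPLATE.substitute(
    ob_cluster_name="obcluster-1",
    host="xxx.xxx.xxx.xxx",
    agent_process="ocp_monagent",
    process_status="unavailable",
)
print(overview)
print(details)
```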
Impact on the system
- If the ocp_agentd process does not exist, the other OCP-Agent processes may stop and cannot restart automatically, with the following consequences:
  - If the ocp_monagent process is abnormal, monitoring data cannot be collected and alerts cannot be generated. As a result, you cannot identify system risks in a timely manner.
  - If the ocp_mgragent process is abnormal, you cannot perform O&M operations on OBServer nodes.
Possible causes
- The ocp_agentd process may be unexpectedly terminated because the host memory or disk is full.
- The program has a bug. When ocp_agentd fails several times to start a process, it stops trying.
Solutions
Check the usage of the disk or memory.
Log on to the alerting host and run the following commands:

    # Check whether the usage of the disk under /home/admin is close to 100%.
    df -B1
    # Check whether the available memory space is close to 0.
    free -g

If the disk usage is close to 100%, perform the following steps to clear logs or add more disks.

View large sub-directories in the /home/admin/logs directory.

    [root]# du /home/admin/logs/
    4 /home/admin/logs/obproxy/minidump
    32 /home/admin/logs/obproxy/etc
    1261772 /home/admin/logs/obproxy/log
    1261812 /home/admin/logs/obproxy
    1261816 /home/admin/logs/

Enter the log directory and delete obsolete logs.

    [root]# ll /home/admin/logs/obproxy/log
    [root]# cd /home/admin/logs/obproxy/log
    [root]# rm obproxy.67344.log.20210902*

Restart the ocp_agentd process.

    [root]# cd /home/admin/ocp_agent
    [root]# python ocp_agent_ctl.py recover
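
If you prefer to script the disk check instead of reading the df output by eye, a minimal Python sketch might look like the following. The helper names and the 95% limit are hypothetical choices for this example; pick a limit that matches what you consider "close to 100%".

```python
import shutil

def usage_percent(used, free):
    """Return used space as a percentage of used + free space."""
    total = used + free
    return 100.0 * used / total if total else 0.0

def is_nearly_full(path, limit_percent=95.0):
    """Return True if the file system holding `path` is at or above the limit."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage_percent(usage.used, usage.free) >= limit_percent
```

On the alerting host, you could call `is_nearly_full("/home/admin")` and clear logs as described above when it returns `True`.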
If the available memory space is close to 0, perform the following steps to release the memory, and restart the process.
Release the memory space.

    [root]# sync
    [root]# echo 1 > /proc/sys/vm/drop_caches
    [root]# echo 0 > /proc/sys/vm/drop_caches

Restart the ocp_agentd process.

    [root]# cd /home/admin/ocp_agent
    [root]# python ocp_agent_ctl.py recover
If the remaining memory and disk space are sufficient, proceed to the next step.
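
The memory check can also be scripted. The sketch below parses text in the format of /proc/meminfo and computes the fraction of memory that is still available. It assumes a Linux kernel that reports the `MemAvailable` field; the helper names and the sample figures are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Field: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def available_fraction(info):
    """Return MemAvailable as a fraction of MemTotal."""
    return info["MemAvailable"] / info["MemTotal"]

# Hypothetical sample in /proc/meminfo format.
sample = """MemTotal:       16384000 kB
MemFree:          204800 kB
MemAvailable:     819200 kB
"""
info = parse_meminfo(sample)
fraction = available_fraction(info)
print(f"{fraction:.0%} of memory is available")
```

On a real host, you would pass the contents of /proc/meminfo to `parse_meminfo` instead of the sample text.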
Collect the relevant logs from the ocp_agentd.log, ocp_monagent.error.log, and ocp_mgragent.error.log files in the /home/admin/ocp_agent/log/ directory, together with the related alert information, and send them to OceanBase Technical Support to locate the issue.

The ocp_monagent.error.log file may contain the keyword panic, which indicates that the program may have crashed. You can search for the keyword and provide the matching logs to OceanBase Technical Support for troubleshooting.

    [root]# tail -1000 /home/admin/ocp_agent/log/ocp_monagent.error.log | grep -10 panic
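
The keyword search above can be approximated in Python when you want to post-process the matches, for example to attach them to a support ticket. The sketch below looks at the last 1000 lines and returns each match with 10 lines of context on either side. Unlike grep, it does not merge overlapping context blocks, and the `find_keyword_context` helper is a hypothetical name used only for this example.

```python
def find_keyword_context(lines, keyword="panic", tail=1000, context=10):
    """Return one block per match in the last `tail` lines, with surrounding context."""
    recent = lines[-tail:]
    blocks = []
    for i, line in enumerate(recent):
        if keyword in line:
            start = max(0, i - context)
            blocks.append(recent[start:i + context + 1])
    return blocks

# Hypothetical log content with a single panic line.
sample_lines = [f"line {n}" for n in range(50)]
sample_lines[30] = "panic: runtime error"
blocks = find_keyword_context(sample_lines, context=2)
print(blocks)
```

To run it against the real log, read the file into a list first, for example with `open(path).read().splitlines()`, where `path` points to ocp_monagent.error.log in the /home/admin/ocp_agent/log/ directory.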