obagent_dead

2024-03-22 09:34:34  Updated

Description

OCP-Agent is a collection of service programs installed on the host: ocp_monagent collects monitoring data from the hosts and OBServer nodes, ocp_mgragent performs O&M operations on OBServer nodes, and ocp_agentd is the daemon of the two services. If the daemon does not exist, OCP-Agent cannot work properly.

This alert is triggered when an OCP-Agent process on the host is abnormal.

Note

This alert is generated when the daemon does not exist but the monitoring process of OCP-Agent works normally.

Principle

Parameter Value
Metric host_agent_process_status
The value of host_agent_process_status indicates the process status. The value 1 indicates normal and 0 indicates abnormal.
Source The ocp_mgragent status API http://ip:62888/api/v1/agent/status is called to obtain the status of the ocp_agentd, ocp_monagent, and ocp_mgragent processes.
Collected metric host_agent_process_status
Metric expression min(host_agent_process_status{@LABELS}) by (@GBLABELS)
Collection cycle 30s

Alert rule

Metric Default threshold Duration Detection cycle Time before clearance
host_agent_process_status 0 0 seconds 15 seconds 5 minutes

Alert information

Trigger method Alert level Scope
Based on the expression of the metric Stopped Server

Alert templates

  • Overview template: ${alarm_target} ${alarm_name}

  • Details Template: Cluster: ${ob_cluster_name}; Host: ${host}; Alert: OCP-Agent process unavailable; Process name: ${agent_process}; Process status: ${process_status}.

  • Overview example: svr_ip=xxx.xxx.xxx.xxx OCP-Agent process unavailable

  • Details example: Cluster: obcluster-1; Host: xxx.xxx.xxx.xxx; Alert: OCP-Agent process unavailable; Process name: ocp_monagent; Process status: unavailable.

Impact on the system

  1. If the ocp_agentd process does not exist, the processes related to OCP-Agent may stop and cannot automatically start, which causes the following consequences:
  • If the ocp_monagent process is abnormal, monitoring data cannot be collected and alerts cannot be generated. As a result, you cannot identify system risks in a timely manner.

  • If the ocp_monagent process is abnormal, you cannot perform O&M on OBServer nodes.

Possible causes

  1. The ocp_agentd process may be unexpectedly terminated because the host memory or disk is full.
  2. The program has a bug. When ocp_agentd fails several times to start a process, it stops trying.

Solutions

  1. Check the usage of the disk or memory.

    Log on to the alerting host and run the following commands:

    # Check whether the usage of the disk under /home/admin is close to 100%. 
    df -B1
    
    # Check whether the available memory space is close to 0. 
    free -g
    
    • If the disk usage is close to 100%, perform the following steps to clear logs or add more disks.

      1. View large sub-directories in the /home/admin/logs directory.

        [root]#  du /home/admin/logs/
        4       /home/admin/logs/obproxy/minidump
        32      /home/admin/logs/obproxy/etc
        1261772 /home/admin/logs/obproxy/log
        1261812 /home/admin/logs/obproxy
        1261816 /home/admin/logs/
        
      2. Enter the log directory and delete obsolete logs.

        [root]# ll /home/admin/logs/obproxy/log
        [root]# cd /home/admin/logs/obproxy/log
        [root]# rm obproxy.67344.log.20210902*
        
      3. Restart the ocp_agentd process.

        [root]# cd /home/admin/ocp_agent
        [root]# python ocp_agent_ctl.py recover
        
    • If the available memory space is close to 0, perform the following steps to release the memory, and restart the process.

      1. Release the memory space.

        [root]# sync
        [root]# echo 1 > /proc/sys/vm/drop_caches
        [root]# echo 0 > /proc/sys/vm/drop_caches
        
      2. Restart the ocp_agentd process.

        [root]# cd /home/admin/ocp_agent
        [root]# python ocp_agent_ctl.py recover
        
    • If the remaining memory and disk space are sufficient, proceed to the next step.

  2. Collect related logs in the ocp_agentd.log, ocp_monagent.error.log, and ocp_mgragent.error.log files in the file/home/admin/ocp_agent/log/ directory and related alert information, and send them to OceanBase Technical Support to locate the issue.

    • The ocp_monagent.error.log file may contain the keyword panic, which may cause the program to crash. You can search for the keyword and provide the logs to OceanBase Technical Support for troubleshooting.

      [root]# tail -1000 /home/admin/ocp_agent/log/ocp_monagent.error.log | grep -10 panic
      

Contact Us