host_agent_open_fd_count_over_threshold The number of open file descriptors for the server agent exceeds the threshold.

2025-09-08 08:15:43  Updated

Alert description

This alert is triggered when the number of file handles of the Agent exceeds the threshold.

The Agent service consists of two parts: the monitoring Agent (ocp_monagent) and the O&M Agent (ocp_mgragent). It is an important tool for managing and monitoring OceanBase Database.

Alerting principle

Parameter              Value
Monitoring metric      host_agent_open_fd_count, which indicates the number of file handles of the Agent process.
Metric source          Collected from the process itself through process monitoring provided by Prometheus. The process monitoring endpoints are as follows:
                         • http://localhost:62888/metrics/stat for the O&M Agent.
                         • http://localhost:62889/metrics/stat for the monitoring Agent.
Collected metric       process_open_fds
Monitoring expression  max(process_open_fds{@LABELS}) by (@GBLABELS)
Collection interval    1 minute
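
To cross-check the collected value by hand, the following sketch reads Prometheus text-format output on standard input and prints the largest process_open_fds sample, loosely mirroring the max() in the monitoring expression. The function name max_open_fds is hypothetical. The sketch assumes that the endpoints listed above return the standard Prometheus text format, with the sample value as the last field on each line.

```shell
# Hypothetical helper: print the largest process_open_fds sample found in
# Prometheus text-format metrics read from standard input.
# Returns a non-zero status if no such sample is present.
max_open_fds() {
  awk '
    # Match the bare metric name or the name followed by a label set,
    # but not longer names that merely share the prefix.
    /^process_open_fds[ {]/ {
      v = $NF + 0
      if (!seen || v > max) { max = v; seen = 1 }
    }
    END { if (seen) print max; else exit 1 }
  '
}
```

For example, `curl -s http://localhost:62889/metrics/stat | max_open_fds` would print the current value for the monitoring Agent, assuming the endpoint is reachable from the local host without additional authentication.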

    Alert information

    Alert trigger method                    Alert level  Scope
    Expression based on monitoring metrics  Warning      Server

    Rule information

    Monitoring metric         Default threshold  Monitoring metric source        Detection cycle  Elimination cycle
    host_agent_open_fd_count  1000               Self-monitoring of the process  60 seconds       5 minutes

    Alert template

    • Alert overview

      • Template: ${alarm_target} ${alarm_name}
      • Example: svr_ip=xxx.xxx.xxx.xxx:process=ocp_monagent The number of open file handles for the server Agent has exceeded the limit.
    • Alert details

      • Template: Server: cluster: ${ob_cluster_name}, host: ${host}, alert: Agent process: ${process}, the number of open file handles ${value} has exceeded the limit of ${alarm_threshold}.
      • Example: cluster: obcluster-1, host: xxx.xxx.xxx.xxx, alert: Agent process: ocp_monagent, the number of open file handles 1200 has exceeded the limit of 1000.
    • Alert recovery

      • Template: Alert: ${alarm_name}, server Agent file handle count: ${value}
      • Example: Alert: The number of open file handles for the server Agent has exceeded the limit, server Agent file handle count: 950

    Impact on the system

    The Agent process is an important tool that OCP uses to operate and monitor OceanBase Database, so its stability is crucial. The number of open file handles is an important indicator of process stability. If the number keeps increasing, the Agent process may be leaking file handles.

    Possible causes

    1. Collection tasks in the monitoring Agent may not close resources promptly, for example when reading from or writing to databases, log files, or configuration files.

    2. The O&M Agent processes and tracks OceanBase Database logs, which may leave file handles unreleased.

    Resolution

    When an alert is triggered, check the alert details to confirm the memory usage and the number of open file handles of the Agent.

    • If the memory usage is excessively high (exceeding 10 GB) or the number of open file handles exceeds the system threshold (65,535), immediately restart the Agent process to prevent issues from affecting the normal operation of OceanBase Database components.

    • If the memory usage by the Agent is within an acceptable range (such as 2 GB or less), it will not affect the operation of OceanBase Database. In this case, you can perform the following actions:

      1. Save the environment context information and then immediately restart the Agent.

      2. Provide the environment context information to the O&M personnel. The information includes:

        • The memory usage of the current process and its parent process (ocp_agentd is the parent process of the current process).

        • The memory performance analysis file of the current process.

          PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)
          # Unix socket of the running ocp_monagent process
          SOCKET=/home/admin/ocp_agent/run/ocp_monagent.$PID.sock
          # Goroutine (coroutine) performance data
          curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/goroutine?debug=1" --output /tmp/goroutine.txt
          # CPU performance sampling data (30-second sample)
          curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/profile?seconds=30" --output pprof.profile.gz
          # Memory sampling data
          curl --unix-socket "$SOCKET" http://localhost/debug/pprof/heap --output pprof.heap.gz
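
The memory usage of the current process and its parent, together with the open file handle count, can also be captured before the restart. The following is a minimal sketch that assumes a Linux host with /proc mounted and a user allowed to read the Agent's /proc entries. The function names and the PROC_ROOT variable are hypothetical and are not part of the Agent or OCP.

```shell
# PROC_ROOT is a hypothetical override that makes the helpers easy to test;
# on a real Linux host it defaults to /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Number of open file descriptors: one entry per descriptor in <pid>/fd.
count_open_fds() {
  ls "$PROC_ROOT/$1/fd" | wc -l | tr -d ' '
}

# Soft limit on open files, from the "Max open files" row of <pid>/limits.
soft_fd_limit() {
  awk '/^Max open files/ { print $4 }' "$PROC_ROOT/$1/limits"
}

# Value of one field (for example PPid or VmRSS) from <pid>/status.
status_field() {
  awk -v key="$2:" '$1 == key { print $2 }' "$PROC_ROOT/$1/status"
}

# One-line summary: open descriptors, soft limit, resident memory in kB.
agent_summary() {
  echo "pid=$1 open_fds=$(count_open_fds "$1") soft_limit=$(soft_fd_limit "$1") rss_kb=$(status_field "$1" VmRSS)"
}
```

With PID set as in the commands above, `agent_summary "$PID"` summarizes the current process, and `agent_summary "$(status_field "$PID" PPid)"` summarizes its parent process, ocp_agentd.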
          
