host_agent_open_fd_count_over_threshold

2025-03-26 07:47:21  Updated

Description

This alert is triggered when the number of file handles of an Agent process exceeds the threshold.

The Agent service consists of the monitoring Agent process ocp_monagent and the O&M Agent process ocp_mgragent. It is an important tool for managing and monitoring OceanBase databases in OceanBase Cloud Platform (OCP).

Principle

Parameter Value
Metric host_agent_open_fd_count The number of file handles of an Agent process.
Source Depends on the process-exporter provided by Prometheus. Metrics are collected from the process-exporter.
  • The source for the O&M Agent process: http://localhost:62888/metrics/stat
  • The source for the monitoring Agent process: http://localhost:62889/metrics/stat
  • Collected metric process_open_fds
    Metric expression max(process_open_fds{@LABELS}) by (@GBLABELS)
    Collection cycle 1 minute

    Alert information

    Trigger method Alert level Scope
    Based on the expression of the metric Warning Server

    Alert rule

    Metric Default threshold Source Detection cycle Elimination cycle
    host_agent_open_fd_count 1000 process-exporter 60 seconds 5 minutes

    Alert templates

    • Overview: ${alarm_target} ${alarm_name}

    • Details: Server: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: Agent process: ${process}. The number of file handles is ${value}, exceeding the threshold of ${alarm_threshold}.

    • Overview example: svr_ip=xxx.xxx.xxx.xxx:process=ocp_monagent Excessive number of Agent file handles on the server

    • Details example:

      Server: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: Agent process: ocp_monagent. The number of file handles is 1200, exceeding the threshold of 1000.

    Impact on the system

    Agent processes are important tools for OceanBase database O&M and monitoring in OCP. It is crucial to keep Agent processes stable. The number of file handles is an important metric to evaluate the stability of a process. An increasing number of file handles indicates that the system may encounter handle leakage.

    Possible causes

    1. The Agent process for monitoring may not release resources promptly after completing operations such as read and write of databases, logs, and configuration files in a collection task.

    2. Resource errors may occur when the O&M Agent process handles or tracks the logs of OceanBase databases.

    Suggested solutions

    When the alert is triggered, view the alert details to check the memory or file handles used by the Agent process.

    • If more than 10 GB of memory or more than 65,535 file handles are used, restart the Agent process immediately to avoid impact on OceanBase database running.

    • If the Agent process uses an acceptable amount of memory, such as memory within 2 GB, which does not affect OceanBase database running, you can perform the following operations:

      1. Save the environment context and then restart the Agent process immediately.

      2. Send the following context information to the O&M engineer:

        • The memory used by the Agent process and its parent process ocp_agentd.

        • The memory performance analysis file for the Agent process.

          PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)
          SOCKET=$PID
          # Goroutine performance data
          curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://11/debug/pprof/goroutine?debug=1 --output /tmp/goroutine.txt
          # CPU performance sampling data
          curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://localhost/debug/pprof/profile?seconds=30 --output pprof.profile.gz
          # Memory sampling data
          curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://localhost/debug/pprof/heap --output pprof.heap.gz
          

    Contact Us