Description
This alert is triggered when the number of file handles of an Agent process exceeds the threshold.
The Agent service consists of the monitoring Agent process ocp_monagent and the O&M Agent process ocp_mgragent. It is an important tool for managing and monitoring OceanBase databases in OceanBase Cloud Platform (OCP).
Principle
| Parameter | Value |
|---|---|
| Metric | host_agent_open_fd_count The number of file handles of an Agent process. |
| Source | Depends on the process-exporter provided by Prometheus. Metrics are collected from the process-exporter. http://localhost:62888/metrics/stathttp://localhost:62889/metrics/stat |
| Collected metric | process_open_fds |
| Metric expression | max(process_open_fds{@LABELS}) by (@GBLABELS) |
| Collection cycle | 1 minute |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Warning | Server |
Alert rule
| Metric | Default threshold | Source | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| host_agent_open_fd_count | 1000 | process-exporter | 60 seconds | 5 minutes |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: Server: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: Agent process: ${process}. The number of file handles is ${value}, exceeding the threshold of ${alarm_threshold}.
Overview example: svr_ip=xxx.xxx.xxx.xxx:process=ocp_monagent Excessive number of Agent file handles on the server
Details example:
Server: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: Agent process: ocp_monagent. The number of file handles is 1200, exceeding the threshold of 1000.
Impact on the system
Agent processes are important tools for OceanBase database O&M and monitoring in OCP. It is crucial to keep Agent processes stable. The number of file handles is an important metric to evaluate the stability of a process. An increasing number of file handles indicates that the system may encounter handle leakage.
Possible causes
The Agent process for monitoring may not release resources promptly after completing operations such as read and write of databases, logs, and configuration files in a collection task.
Resource errors may occur when the O&M Agent process handles or tracks the logs of OceanBase databases.
Suggested solutions
When the alert is triggered, view the alert details to check the memory or file handles used by the Agent process.
If more than 10 GB of memory or more than 65,535 file handles are used, restart the Agent process immediately to avoid impact on OceanBase database running.
If the Agent process uses an acceptable amount of memory, such as memory within 2 GB, which does not affect OceanBase database running, you can perform the following operations:
Save the environment context and then restart the Agent process immediately.
Send the following context information to the O&M engineer:
The memory used by the Agent process and its parent process ocp_agentd.
The memory performance analysis file for the Agent process.
PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid) SOCKET=$PID # Goroutine performance data curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://11/debug/pprof/goroutine?debug=1 --output /tmp/goroutine.txt # CPU performance sampling data curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://localhost/debug/pprof/profile?seconds=30 --output pprof.profile.gz # Memory sampling data curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://localhost/debug/pprof/heap --output pprof.heap.gz