Alert description
This alert is triggered when the number of file handles of the Agent exceeds the threshold.
The Agent service consists of two parts: the monitoring Agent (ocp_monagent) and the O&M Agent (ocp_mgragent). It is an important tool for managing and monitoring OceanBase Database.
Alerting principle
| Parameter | Value |
|---|---|
| Monitoring metric | host_agent_open_fd_count indicates the number of file handles of the Agent process. |
| Metric source | It is collected from the process itself through process monitoring provided by Prometheus. The process monitoring endpoints are as follows: http://localhost:62888/metrics/stat for the O&M Agent and http://localhost:62889/metrics/stat for the monitoring Agent. |
| Collected metric | process_open_fds |
| Monitoring expression | max(process_open_fds{@LABELS}) by (@GBLABELS) |
| Collection interval | 1 minute |
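
To see the raw value behind this metric on a host, the endpoints listed above can be queried by hand. The following is a minimal sketch, not part of OCP: the helper name `parse_open_fds` is hypothetical, and it assumes the endpoint returns Prometheus text format without timestamps, is reachable without extra authentication, and uses no spaces inside label values.

```shell
# Print the value of the first process_open_fds sample found in
# Prometheus text-format input on stdin. "# HELP" and "# TYPE" lines are
# skipped because they do not start with the metric name.
# Assumption: label values (if any) contain no spaces, so the sample
# value is always the second whitespace-separated field.
parse_open_fds() {
  awk '/^process_open_fds([{ ]|$)/ { print $2; exit }'
}

# Hypothetical usage on a host that runs the Agent:
#   curl -s http://localhost:62888/metrics/stat | parse_open_fds   # O&M Agent
#   curl -s http://localhost:62889/metrics/stat | parse_open_fds   # monitoring Agent
```

The monitoring expression takes the maximum of these samples per label group, so a single hand-collected value corresponds to one data point of one process.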
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Expression based on monitoring metrics | Warning | Server |
Rule information
| Monitoring metric | Default threshold | Monitoring metric source | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| host_agent_open_fd_count | 1000 | Self-monitoring of the process | 60 seconds | 5 minutes |
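
The rule compares each sample against the threshold. As a rough illustration only (OCP's actual rule evaluator is not described here, and the function name `alert_state` is hypothetical), the comparison can be sketched as follows, with the default of 1000 taken from the table above:

```shell
# Print "firing" when the observed handle count is strictly above the
# threshold, otherwise "ok". The threshold defaults to 1000, the default
# listed in the rule table. This only illustrates the comparison; it does
# not model the 60-second detection cycle or the 5-minute elimination cycle.
alert_state() {
  local value=$1 threshold=${2:-1000}
  if [ "$value" -gt "$threshold" ]; then
    echo firing
  else
    echo ok
  fi
}

# Example calls:
#   alert_state 1200        # above the default threshold
#   alert_state 950         # below the default threshold
#   alert_state 1200 2000   # custom threshold
```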
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: svr_ip=xxx.xxx.xxx.xxx:process=ocp_monagent The number of open file handles for the server Agent has exceeded the limit.
Alert details
- Template: Server: cluster: ${ob_cluster_name}, host: ${host}, alert: Agent process: ${process}, the number of open file handles ${value} has exceeded the limit of ${alarm_threshold}.
- Example: cluster: obcluster-1, host: xxx.xxx.xxx.xxx, alert: Agent process: ocp_monagent, the number of open file handles 1200 has exceeded the limit of 1000.
Alert recovery
- Template: Alert: ${alarm_name}, server Agent file handle count: ${value}
- Example: Alert: The number of open file handles for the server Agent has exceeded the limit, server Agent file handle count: 950
Impact on the system
The Agent process is an important tool that OCP uses to operate and monitor OceanBase Database, so its stability is crucial. The number of open file handles is an important indicator of process stability. If the number keeps increasing, the process may be leaking file handles.
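A steadily rising count is easier to judge from several samples than from a single alert value. The sketch below assumes a Linux host where `/proc/<pid>/fd` is readable by the current user; the helper names and the 60-second spacing in the usage comment are illustrative choices, not OCP behavior.

```shell
# Count the file descriptors currently open in a process (Linux only).
count_open_fds() {
  ls "/proc/$1/fd" | wc -l
}

# Succeed when every argument is strictly greater than the previous one,
# that is, when the samples show uninterrupted growth.
is_strictly_increasing() {
  local prev=$1 cur
  shift
  for cur in "$@"; do
    [ "$cur" -gt "$prev" ] || return 1
    prev=$cur
  done
  return 0
}

# Hypothetical usage for the monitoring Agent:
#   PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)
#   s1=$(count_open_fds "$PID"); sleep 60
#   s2=$(count_open_fds "$PID"); sleep 60
#   s3=$(count_open_fds "$PID")
#   is_strictly_increasing "$s1" "$s2" "$s3" && echo "handle count is still growing"
```

Three samples are only a hint; a longer series gives a more reliable picture of whether handles are being released.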
Possible causes
Collection tasks in the monitoring Agent may not close resources promptly, for example when reading or writing databases, log files, or configuration files.
The O&M Agent processes and tracks OceanBase Database logs, which may also leave file handles unreleased.
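To narrow down which of these causes applies, it can help to group the open handles by what they point to. This is a minimal sketch assuming a Linux host and permission to read `/proc/<pid>/fd`; the category names and the `.log` pattern are illustrative guesses rather than anything defined by OCP.

```shell
# Map the target of a /proc/<pid>/fd symlink to a coarse category.
classify_fd_target() {
  case $1 in
    socket:*)      echo socket ;;
    pipe:*)        echo pipe ;;
    anon_inode:*)  echo anon_inode ;;
    *.log|*.log.*) echo log ;;
    *)             echo file ;;
  esac
}

# Print a count per category for one process, largest first.
summarize_fds() {
  local pid=$1 fd target
  for fd in /proc/"$pid"/fd/*; do
    # A descriptor may be closed between listing and readlink; skip it.
    target=$(readlink "$fd") || continue
    classify_fd_target "$target"
  done | sort | uniq -c | sort -rn
}

# Hypothetical usage:
#   summarize_fds "$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)"
```

A category that dominates the output and keeps growing between runs points to the kind of resource that is not being closed.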
Resolution
When an alert is triggered, check the alert details to confirm the Agent's memory usage or number of open file handles.
If the memory usage is excessively high (exceeding 10 GB) or the number of open file handles exceeds the system threshold (65,535), immediately restart the Agent process to prevent issues from affecting the normal operation of OceanBase Database components.
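The 65,535 figure above is the value given in this document; the limit actually in effect for a running process can differ per host. A hedged way to read it on Linux is to parse `/proc/<pid>/limits`. The helper name below is hypothetical.

```shell
# Print the soft "Max open files" limit from /proc/<pid>/limits content
# supplied on stdin. The line has the form:
#   Max open files   <soft>   <hard>   files
# so the soft limit is the fourth field. The value may also be "unlimited".
parse_max_open_files() {
  awk '/^Max open files/ { print $4; exit }'
}

# Hypothetical usage:
#   PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)
#   parse_max_open_files < "/proc/$PID/limits"
```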
If the memory usage by the Agent is within an acceptable range (such as 2 GB or less), it will not affect the operation of OceanBase Database. In this case, you can perform the following actions:
- Save the environment context information and then immediately restart the Agent.
- Provide the environment context information to the O&M personnel. The information includes:
  - The memory usage of the current process and its parent process (ocp_agentd is the parent process of the current process).
  - The memory performance analysis file of the current process.
    PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)
    SOCKET=/home/admin/ocp_agent/run/ocp_monagent.$PID.sock
    # The host part of each URL is not used for the connection when --unix-socket is given.
    # URLs are quoted so that the shell does not treat "?" as a glob character.
    # Coroutine (goroutine) performance data
    curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/goroutine?debug=1" --output /tmp/goroutine.txt
    # CPU performance sampling data
    curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/profile?seconds=30" --output pprof.profile.gz
    # Memory sampling data
    curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/heap" --output pprof.heap.gz
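
The memory usage of the current process and of its parent (ocp_agentd), listed above as part of the context information, can be read from `/proc/<pid>/status` on Linux. The following is a sketch under that assumption; the helper names and the output file mentioned in the usage comment are hypothetical.

```shell
# Print the value of one field (for example VmRSS or PPid) from
# /proc/<pid>/status content supplied on stdin.
status_field() {
  awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Report resident memory (in kB, as shown by the kernel) for a process
# and for its parent process.
report_mem_with_parent() {
  local pid=$1 ppid
  ppid=$(status_field PPid < "/proc/$pid/status")
  echo "pid=$pid rss_kb=$(status_field VmRSS < "/proc/$pid/status")"
  echo "parent_pid=$ppid rss_kb=$(status_field VmRSS < "/proc/$ppid/status")"
}

# Hypothetical usage, saving the result next to the pprof files:
#   report_mem_with_parent "$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)" > /tmp/agent_mem.txt
```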