Alert description
This alert is triggered when the number of goroutines (lightweight threads managed by the Go runtime) in an Agent process exceeds the threshold.
OCP provides two types of Agent services: ocp_monagent for monitoring and ocp_mgragent for O&M. These are essential for managing and monitoring OceanBase Database.
Alert principle
| Parameter | Value |
|---|---|
| Monitoring metric | host_agent_goroutine_count indicates the number of Go coroutines (goroutines) in the Agent process. |
| Metric source | The Go process monitoring provided by Prometheus, collected from the self-monitoring endpoints of the Agent processes: http://localhost:62888/metrics/stat and http://localhost:62889/metrics/stat |
| Metric collection | go_goroutines |
| Monitoring expression | max(go_goroutines{@LABELS}) by (@GBLABELS) |
| Collection cycle | 1 minute |
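As a rough illustration of how the metric can be inspected by hand, the following shell sketch reads Prometheus text-format output, takes the maximum `go_goroutines` sample (mirroring the `max(...)` in the monitoring expression), and compares it with the default threshold of 3000. The helper names `goroutine_count` and `exceeds_threshold` are hypothetical, and the sketch assumes that the self-monitoring endpoint is reachable from the host without authentication and that label values contain no spaces.

```shell
#!/usr/bin/env bash

# Read Prometheus text-format metrics on stdin and print the largest
# go_goroutines sample. Comment lines (# HELP, # TYPE) are skipped because
# their first field is "#". Assumption: label values contain no spaces.
goroutine_count() {
  awk '$1 ~ /^go_goroutines([{].*)?$/ {
         if (!found || $2 + 0 > max) { max = $2 + 0 }
         found = 1
       }
       END { if (found) printf "%d\n", max }'
}

# Succeed (exit status 0) when the count is strictly greater than the
# threshold. The default of 3000 is the default alert threshold.
exceeds_threshold() {
  local count=$1 threshold=${2:-3000}
  [ "$count" -gt "$threshold" ]
}

# Usage sketch (which port serves which Agent process is not assumed here):
#   count=$(curl -s http://localhost:62888/metrics/stat | goroutine_count)
#   if exceeds_threshold "$count"; then
#     echo "goroutine count $count exceeds the threshold"
#   fi
```

The two helpers are kept separate so that the parsing step can be checked against saved metrics output without contacting a running Agent.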
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Warning | Server |
Rule information
| Monitoring metric | Default threshold | Monitoring metric source | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| host_agent_goroutine_count | 3000 | Self-monitoring of the process | 60 seconds | 5 minutes |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: svr_ip=xxx.xxx.xxx.xxx:process=ocp_monagent The number of Agent goroutines exceeds the limit.
Alert details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: Agent process: ${process}, the number of goroutines ${value} exceeds the limit of ${alarm_threshold}.
- Example: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: Agent process: ocp_monagent, the number of goroutines 3500 exceeds the limit of 3000.
Alert recovery
- Template: Alert: ${alarm_name}, the number of Agent goroutines: ${value}
- Example: Alert: The number of Agent goroutines exceeds the limit, the number of Agent goroutines: 950
Impact on the system
The Agent process is an important tool for OCP operations and monitoring of OceanBase Database. Its stability is crucial. When the Agent process consumes excessive system resources, it can affect the operation of OceanBase Database.
The Agent process is written in Go, which offers low resource consumption, high concurrency, and high performance. Goroutines consume fewer resources than threads: a single goroutine typically uses memory on the order of kilobytes. Even so, the number of goroutines should not grow too large. If it exceeds several hundred thousand, it can cause performance issues such as high load, slow garbage collection (GC), and slow response. Under normal circumstances, the Agent process runs fewer than 100 goroutines. If the number of goroutines grows uncontrollably, the problem must be addressed.
Possible causes
- Goroutines are leaked, for example through improper handling of locks, concurrency, or channels.
- In complex scenarios or environments, the Agent process may consume excessive resources, which can lead to resource leaks.
Solution
When an alert is triggered, check the alert details to confirm the memory usage of the Agent process.
If the memory usage is excessively high (more than 10 GB) or the number of goroutines exceeds 100,000, restart the Agent process immediately so that the issue does not affect the normal operation of OceanBase Database components.
If the memory usage of the Agent process is within an acceptable range (such as 2 GB), the operation of OceanBase Database is not affected. In this case, the increase in process memory can be attributed to a goroutine leak. Perform the following actions:
Save the environment context information and immediately restart the Agent process.
Provide the environment context information to the O&M personnel, which includes:
The memory usage of the current process and its parent process (ocp_agentd is the parent process of the current process).
The performance analysis files (goroutine, CPU, and memory) of the current process, which can be collected as follows:
    PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)
    SOCKET=/home/admin/ocp_agent/run/ocp_monagent.$PID.sock
    # Goroutine performance data
    curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/goroutine?debug=1" --output /tmp/goroutine.txt
    # CPU performance sampling data (sampled for 30 seconds)
    curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/profile?seconds=30" --output pprof.profile.gz
    # Memory sampling data
    curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/heap" --output pprof.heap.gz
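The triage rule described in this section can be sketched as a pair of small shell helpers. This is a minimal sketch, not part of OCP: the names `triage` and `mem_usage` are hypothetical, resident set size (RSS) as reported by `ps` in KB is assumed to be the relevant memory figure, and 10 GB is interpreted as 10 × 1024 × 1024 KB.

```shell
#!/usr/bin/env bash

# Limits taken from the solution steps. Assumption: 10 GB is read as
# 10 * 1024 * 1024 KB, matching the KB unit that ps uses for RSS.
MEM_LIMIT_KB=$((10 * 1024 * 1024))
GOROUTINE_LIMIT=100000

# Print the suggested action for a given RSS (in KB) and goroutine count.
#   restart-now          : memory or goroutine count is beyond the hard limits
#   collect-then-restart : save the context information first, then restart
triage() {
  local rss_kb=$1 goroutines=$2
  if [ "$rss_kb" -gt "$MEM_LIMIT_KB" ] || [ "$goroutines" -gt "$GOROUTINE_LIMIT" ]; then
    echo "restart-now"
  else
    echo "collect-then-restart"
  fi
}

# Print PID, parent PID, RSS (KB), and command name for a process and for
# its parent, one line each. Each field gets its own -o option so that the
# empty header form ("pid=") is parsed consistently.
mem_usage() {
  local pid=$1 ppid
  ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
  ps -o pid= -o ppid= -o rss= -o comm= -p "$pid"
  ps -o pid= -o ppid= -o rss= -o comm= -p "$ppid"
}
```

For example, `mem_usage "$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)"` would record the memory usage of ocp_monagent and its parent process, and `triage 2097152 3500` prints `collect-then-restart` because neither limit is exceeded.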