Description
This alert is triggered when the number of goroutines of an Agent process exceeds the threshold. A goroutine in Go is similar to a thread in other programming languages.
The Agent service consists of the monitoring Agent process ocp_monagent and the O&M Agent process ocp_mgragent. It is an important tool for managing and monitoring OceanBase databases in OceanBase Cloud Platform (OCP).
Principle
| Parameter | Value | |-------------------|--| | Metric | host_agent_goroutine_count The number of goroutines of the Agent process. | | Source | Depends on the process-exporter for Go provided by Prometheus. Metrics are collected from the process-exporter.
- The source for the O&M Agent process: http://localhost:62888/metrics/stat
- The source for the monitoring Agent process: http://localhost:62889/metrics/stat
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Warning | Server |
Alert rule
| Metric | Default threshold | Source | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| host_agent_goroutine_count | 3000 | process-exporter | 60 seconds | 5 minutes |
Alert templates
- Overview: ${alarm_target} ${alarm_name}
- Details: Server: ${svr_ip}, Agent process: ${process}. The number of goroutines is ${value}, exceeding the threshold of ${alarm_threshold}.
- Overview example: svr_ip=192.168.0.1:process=ocp_monagent Excessive number of Agent goroutines on the server
- Details example: Server: 192.168.0.1, Agent process: ocp_monagent. The number of goroutines is 3500, exceeding the threshold of 3000.
Impact on the system
Agent processes are important tools for OceanBase database O&M and monitoring in OCP. It is crucial to keep Agent processes stable. If Agent processes use excessive system resources, OceanBase database running will be affected.
Agent processes are implemented based on the Go language. The processes feature little resource usage but high concurrency and performance. Compared with threads, goroutines use less resources. Generally, a single goroutine only uses resources of less than 1 MB. However, an excessive number of goroutines may also cause performance problems. For example, hundreds of thousands of goroutines will lead to high system load, slow garbage collection (GC), and slow response. An Agent process often has no more than 100 goroutines. You need to pay attention to an uncontrollable increase of goroutines.
Possible causes
The goroutine resources of the Agent process were leaked, causing improper processing of locks, concurrency, or channels.
The Agent process improperly used resources in complex scenarios, which might lead to resource leakage.
Suggested solutions
When the alert is triggered, view the alert details to check the memory used by the Agent process.
If more than 10 GB of memory or more than 100000 goroutines are used, restart the Agent process immediately to avoid impact on OceanBase database running.
If the Agent process uses an acceptable amount of memory, such as memory within 2 GB, which does not affect OceanBase database running, it can be confirmed that the goroutine leakage has resulted in the increase of process memory usage. You can perform the following operations:
Save the environment context and then restart the Agent process immediately.
Send the following context information to the O&M engineer:
The memory used by the Agent process and its parent process ocp_agentd.
The memory performance analysis file for the Agent process.
PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid) SOCKET=$PID # Goroutine performance data curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://11/debug/pprof/goroutine?debug=1 --output /tmp/goroutine.txt # CPU performance sampling data curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://localhost/debug/pprof/profile?seconds=30 --output pprof.profile.gz # Memory sampling data curl --unix-socket /home/admin/ocp_agent/run/ocp_monagent.$PID.sock http://localhost/debug/pprof/heap --output pprof.heap.gz