Description
This alert is triggered when the sql_audit data loss rate of OCP-Agent exceeds the threshold.
Principle
| Parameter | Value |
|---|---|
| Metric | sql_audit_collect_lost_percent |
| Source | Statistics of ocp_monagent |
| Collected metric | |
| Metric expression | 100 * sum(rate(sql_audit_input_collect_lost_total{@LABELS}[@INTERVAL])) by (@GBLABELS) / sum(rate(sql_audit_input_collect_total_total{@LABELS}[@INTERVAL])) by (@GBLABELS) |
| Collection cycle | 1 minute |
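
As a worked example of the metric expression: both rates are taken over the same interval, so for a single host the ratio reduces to the increase in lost records divided by the increase in total records. The sketch below assumes you already have those two increases. The function name `loss_percent` and the sample numbers are hypothetical and not part of OCP.

```shell
# loss_percent LOST_DELTA TOTAL_DELTA
# Prints 100 * LOST_DELTA / TOTAL_DELTA with two decimals.
# Prints NaN when nothing was collected in the window (an assumption of
# this sketch, chosen to mirror a division by zero).
loss_percent() {
  awk -v lost="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "NaN"; exit }
    printf "%.2f\n", 100 * lost / total
  }'
}

# Hypothetical sample: 240 records lost out of 1000 collected in the window.
loss_percent 240 1000
```

With these sample numbers the function prints 24.00, which is above the default 20% threshold.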
Alert rule
| Metric expression | Metric description | Default threshold (unit: percentage) | Detection cycle | Time before clearance |
|---|---|---|---|---|
| sql_audit_collect_lost_percent | The loss rate of the sql_audit data | 20% | 60 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Warning | Server |
Alert templates
- Overview
  - Template: ${alarm_target} ${alarm_name}
  - Example: host=xxx.xxx.xxx.xxx. The sql_audit data loss rate exceeds the threshold.
- Details
  - Template: Cluster: ${ob_cluster_name}. Host: ${host}. Alert: The sql_audit data loss rate is ${value}, which exceeds the ${alarm_threshold}% threshold.
  - Example: Cluster: TEST. Host: xxx.xxx.xxx.xxx. Alert: The sql_audit data loss rate is 23.238, which exceeds the 20% threshold.
Impact on the system
The collected sql_audit data is used for SQL diagnostics of Autonomy Service. If the data loss rate is high, some SQL information is missing from SQL diagnostics.
Possible causes
- The traffic on the host exceeds the processing capacity of the ocp_monagent process, which results in the loss of the sql_audit data.
- The tenant receiving massive SQL requests on the OBServer node is not the tenant that executes these SQL requests.
Suggested solutions
Check the CPU utilization and memory usage of the ocp_monagent process on the OBServer node. If either value is high or has reached its threshold, collect the runtime context of the process and the monagent.log file in the ocp_agent/log/ directory, and provide them to an O&M engineer. You can collect the runtime context as follows:
# Read the PID of the running ocp_monagent process
PID=$(cat /home/admin/ocp_agent/run/ocp_monagent.pid)
# Path of the Unix socket that serves the debug endpoints
SOCKET=/home/admin/ocp_agent/run/ocp_monagent.$PID.sock
# Goroutine performance data
curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/goroutine?debug=1" --output /tmp/goroutine.txt
# CPU performance sampling data (samples for 30 seconds)
curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/profile?seconds=30" --output pprof.profile.gz
# Memory sampling data
curl --unix-socket "$SOCKET" "http://localhost/debug/pprof/heap" --output pprof.heap.gz
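
The CPU and memory check described in this section can be done with standard tools. The following is a minimal sketch that assumes a `ps` implementation supporting the POSIX `pcpu` and `rss` output fields. The helper name `proc_usage` is hypothetical. The PID file path is the one used in the commands in this section.

```shell
# PID file of the ocp_monagent process
PID_FILE=/home/admin/ocp_agent/run/ocp_monagent.pid

# proc_usage PID
# Prints the CPU utilization (percent) and the resident memory size
# (as reported by ps, typically in KiB) of the given process on one line.
proc_usage() {
  ps -o pcpu= -o rss= -p "$1"
}

# Report usage only if the PID file is readable on this host.
if [ -r "$PID_FILE" ]; then
  proc_usage "$(cat "$PID_FILE")"
fi
```

What counts as high depends on the host. The sketch only reports the values and does not apply a threshold.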