Description
This alert is triggered when the total physical memory usage of the OBServer exceeds the threshold.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_host_mem_percent |
| Source | The data is collected by using the node_exporter process. |
| Collected metrics | node_memory_MemFree_bytes, node_memory_Cached_bytes, node_memory_Buffers_bytes, and node_memory_MemTotal_bytes |
| Metric expression | (1 - (avg(node_memory_MemFree_bytes{@LABELS}) by (@GBLABELS) + avg(node_memory_Cached_bytes{@LABELS}) by (@GBLABELS) + avg(node_memory_Buffers_bytes{@LABELS}) by (@GBLABELS)) / avg(node_memory_MemTotal_bytes{@LABELS}) by (@GBLABELS)) * 100 |
| Collection cycle | 1 second |
Note
The metric source of this alert is special: OCP-Agent does not collect the data directly but obtains it through the node_exporter process.
The value of the metric ob_host_mem_percent indicates the memory usage of the OBServer. When this value exceeds the threshold, this alert is triggered. The default threshold is 90%.
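As a rough illustration of the metric expression, the following Python sketch recomputes the percentage from `/proc/meminfo`-style text, which is where node_exporter reads the `node_memory_*_bytes` values. The function names and the sample figures are hypothetical, and the `avg(...) by (...)` label aggregation of the real expression is not reproduced here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in bytes."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        amount = int(fields[0])
        # /proc/meminfo reports these fields in kB; convert to bytes.
        if len(fields) > 1 and fields[1] == "kB":
            amount *= 1024
        values[name.strip()] = amount
    return values


def host_mem_percent(mem_free, cached, buffers, mem_total):
    """(1 - (MemFree + Cached + Buffers) / MemTotal) * 100"""
    return (1 - (mem_free + cached + buffers) / mem_total) * 100


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
SAMPLE = """\
MemTotal:       1000000 kB
MemFree:          30000 kB
Buffers:          10000 kB
Cached:           50000 kB
"""

info = parse_meminfo(SAMPLE)
percent = host_mem_percent(
    info["MemFree"], info["Cached"], info["Buffers"], info["MemTotal"]
)
print(f"ob_host_mem_percent is about {percent:.1f}%")
```

With these sample values, free, cached, and buffer memory add up to 9% of the total, so the computed usage is about 91%, which is above the default threshold of 90%.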
Alert rule
| Metric | Default threshold (unit: %) | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_host_mem_percent | 90 | 0 seconds | 60 seconds | 5 minutes |
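To make the interplay of the rule fields concrete, here is a minimal Python sketch of one possible evaluation loop. It is not OCP's implementation. It assumes that the metric is checked once per detection cycle, that a duration of 0 seconds means a single sample above the threshold triggers the alert, and that "Time before clearance" means the value must stay at or below the threshold for 5 minutes before the alert clears. All names are hypothetical.

```python
THRESHOLD = 90.0          # default threshold, in percent
DURATION_S = 0            # time the value must stay above the threshold
DETECTION_CYCLE_S = 60    # interval between evaluations
CLEARANCE_S = 5 * 60      # assumed: quiet time required before the alert clears


def evaluate(samples):
    """Return the alert state ('firing' or 'ok') after each detection cycle.

    samples: one ob_host_mem_percent value per detection cycle.
    """
    states = []
    firing = False
    above_for = None  # seconds continuously above the threshold, None if not above
    below_for = None  # seconds continuously at or below it, None if not below
    for value in samples:
        if value > THRESHOLD:
            above_for = 0 if above_for is None else above_for + DETECTION_CYCLE_S
            below_for = None
            if above_for >= DURATION_S:
                firing = True
        else:
            below_for = 0 if below_for is None else below_for + DETECTION_CYCLE_S
            above_for = None
            if firing and below_for >= CLEARANCE_S:
                firing = False
        states.append("firing" if firing else "ok")
    return states


# One sample per minute: a single spike, then six quiet cycles, then another spike.
states = evaluate([85, 91, 89, 89, 89, 89, 89, 89, 92])
print(states)
```

Under these assumptions, the alert fires on the first sample above 90 and clears only on the sixth consecutive quiet sample, which is 5 minutes after the value was first seen back under the threshold.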
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Server |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}. The memory usage is ${value}%, exceeding the threshold of ${alarm_threshold}%.
Overview example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The memory usage of the OBServer exceeds the threshold.
Details example: Cluster: obcluster-1, Host: xxx.xxx.xxx.xxx, Alert: The memory usage of the OBServer exceeds the threshold. The memory usage is 91%, exceeding the threshold of 90%.
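The placeholder substitution can be tried out locally. The sketch below is only an illustration: it uses Python's `string.Template`, which happens to share the `${name}` placeholder syntax, and makes no claim about how OCP renders templates internally. The values are taken from the example above.

```python
from string import Template

# The details template from this topic, copied verbatim.
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}. "
    "The memory usage is ${value}%, exceeding the threshold of ${alarm_threshold}%."
)

details = DETAILS_TEMPLATE.substitute(
    ob_cluster_name="obcluster-1",
    host="xxx.xxx.xxx.xxx",
    alarm_name="The memory usage of the OBServer exceeds the threshold",
    value=91,
    alarm_threshold=90,
)
print(details)
```

`substitute` raises `KeyError` when a placeholder has no value, which is a convenient way to catch a misspelled variable name while drafting a custom template.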
Impact on the system
The OBServer cannot work properly if the server memory is insufficient.
Possible causes
A non-observer process is executing a memory-intensive task.
An OBServer malfunctions. For example, the memory usage of the sys tenant (ID = 500) of the OceanBase cluster exceeds the threshold.
Suggested solutions
Confirm the OBServer status.
In the OBServers list on the Overview page of the cluster, view the status of the server. Alternatively, you can connect to the sys tenant of the OceanBase cluster and execute the following SQL statement:
select status from __all_server where svr_ip='your server ip';
If the status is Inactive, restart the observer process.
If the status is Running, go to the next step.
Verify whether the excessive memory usage is caused by daily business operations.
On the Host Overview page of the OCP console, find the OBServer that triggered the alert and go to its details page. Then, choose Monitoring > Host Resources to view the memory curve chart.
If the memory usage rises sharply at the alert time, the increase is abnormal. Go to the next step for troubleshooting.
If the memory usage rises gradually up to the alert time, the increase reflects routine usage by the applications.
You can replace the OBServer with one that has a larger memory size. For more information, see Replace an OBServer.
Run the following command to check for processes that use the most memory resources:
# Find out the top five processes that use the most memory resources and sort them by memory usage in descending order.
ps -o %mem,pid,cmd -ax | sort -rn | head -5
If a non-observer process uses too many memory resources, analyze the corresponding application to find the cause of the high memory usage.
Alternatively, you can stop the processes that use the most memory resources.
If an observer process uses too many memory resources, the OBServer may have encountered an error.
In this case, the ob_tenant500_mem_hold_over_threshold alert is usually triggered at the same time. We recommend that you resolve that alert first. Then, check whether the ob_host_mem_percent_over_threshold alert recurs 5 minutes later.
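The process check in the last step can also be scripted. The following Python sketch parses output in the format produced by `ps -o %mem,pid,cmd -ax` and reports whether the top consumer is an observer process. The function names and the sample output are hypothetical, and the sketch assumes that the OBServer process is the one whose executable is named `observer`.

```python
def top_memory_processes(ps_output, limit=5):
    """Parse `ps -o %mem,pid,cmd -ax` output and return the top consumers."""
    rows = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        try:
            mem, pid = float(fields[0]), int(fields[1])
        except ValueError:
            continue  # header line (%MEM PID CMD) or malformed row
        rows.append((mem, pid, fields[2]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


def is_observer(cmd):
    """Assumption: the executable of the OBServer process is named 'observer'."""
    executable = cmd.split()[0]
    return executable.rsplit("/", 1)[-1] == "observer"


# Hypothetical sample output, for illustration only.
SAMPLE_PS = """\
%MEM   PID CMD
 0.3  1201 /usr/sbin/sshd -D
62.5  4312 /home/admin/oceanbase/bin/observer
 4.1  2277 /usr/bin/python3 backup.py
"""

top = top_memory_processes(SAMPLE_PS)
for mem, pid, cmd in top:
    kind = "observer" if is_observer(cmd) else "non-observer"
    print(f"{mem:5.1f}%  pid={pid}  {kind}  {cmd}")
```

On a real host, the same functions could be fed the text captured from the `ps` command, for example through `subprocess.run(..., capture_output=True, text=True)`.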