ob_host_mem_percent_over_threshold

2023-08-15 11:20:56  Updated

Description

This alert is triggered when the total physical memory usage of the OBServer exceeds the threshold.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter Value
Metric ob_host_mem_percent
Source The data is collected by using the node_exporter process.
Collected metrics node_memory_MemFree_bytes, node_memory_Cached_bytes, node_memory_Buffers_bytes, and node_memory_MemTotal_bytes
Metric expression (1 - (avg(node_memory_MemFree_bytes{@LABELS}) by (@GBLABELS) + avg(node_memory_Cached_bytes{@LABELS}) by (@GBLABELS) + avg(node_memory_Buffers_bytes{@LABELS}) by (@GBLABELS)) / avg(node_memory_MemTotal_bytes{@LABELS}) by (@GBLABELS)) * 100
Collection cycle 1 second

Note

The metric source of this alert is special. OCP-Agent collects monitoring data by using the node_exporter process.

The value of the metric ob_host_mem_percent indicates the memory usage of the OBServer. When this value exceeds the threshold, this alert is triggered. The default threshold is 90%.

Alert rule

Metric Default threshold (unit: %) Duration Detection cycle Time before clearance
ob_host_mem_percent 90 0 seconds 60 seconds 5 minutes

Alert information

Trigger method Alert level Scope
Metric expression Critical Server

Alert templates

  • Overview: ${alarm_target} ${alarm_name}

  • Details: ${alarm_target} ${alarm_name}. The memory usage is ${value}%, exceeding the threshold of ${alarm_threshold}%.

  • Overview example: ob_cluster=C1-1000:svr_ip=192.168.1.1. The memory usage of the OBServer exceeds the threshold.

  • Details example: ob_cluster=C1-1000:svr_ip=192.168.1.1. The memory usage of the OBServer is 91.0%, exceeding the threshold of 90.0%.

Impact on the system

The OBServer cannot properly work if the server memory is insufficient.

Possible causes

  • A non-observer process is executing a memory-intensive task.

  • An OBServer malfunctions. For example, the memory usage of the sys tenant (ID = 500) of the OceanBase cluster exceeds the threshold.

Suggested solutions

  1. Confirm the OBServer status.

    In the OBServers list on the Overview page of the cluster, view the status of the server. Alternatively, you can connect to the sys tenant of the OceanBase cluster and execute the following SQL statement:

    select status from __all_sever where svr_ip='your server ip';
    
    • If the status is Inactive, restart the observer process.

    • If the status is Running, go to the next step.

  2. Verify whether the excessive memory usage is caused by daily business operations.

    On the Host Overview page of the OCP console, find the OBServer that triggered the alert and go to its details page. Then, choose Monitoring > Host Resources to view the memory curve chart.

    • If the memory usage suddenly increases at the alert time, it is not normal. Go to the next step for troubleshooting.

    • If the memory usage mildly increases at the alert time, it indicates the routine usage by the applications.

      You can replace the OBServer with one that has a larger memory size. For more information, see Replace an OBServer.

  3. Run the following command to check for processes that use the most memory resources:

    # Find out the top five processes that use the most memory resources and sort them by memory usage in descending order.
    ps -o %mem,pid,cmd  -ax | sort -rn | head -5
    
    • If a non-observer process uses too many memory resources, analyze the corresponding application to find the cause of the high memory usage.

      Otherwise, you can stop the processes that use the most memory resources.

    • If an observer process uses too many memory resources, the OBServer may have encountered an error.

      In this case, the ob_tenant500_mem_hold_over_threshold alert is usually triggered at the same time. We recommend that you resolve that alert first. Then, check whether the ob_host_mem_percent_over_threshold alert recurs 5 minutes later.

Contact Us