This topic describes how to troubleshoot memory overrun.
Symptom
If the memory usage of a tenant in a cluster exceeds the upper limit or an exception occurs, the client returns an error message: Over tenant memory limits. The error code is 4030 or 4013. OceanBase Cloud Platform (OCP) receives a log alert shown in the following figure:
After you receive the alert, you need to log on to the sys tenant of the OceanBase cluster where the object that triggers the alert resides, and locate the cause.
Troubleshooting logic
In most cases, a tenant memory usage exception is related to user behaviors, such as slow SQL queries, large transactions, and improper configurations. Therefore, you can locate the cause by using the performance trend and SQL monitoring features of OCP. If the exception cannot be resolved by addressing these common causes, contact OceanBase Technical Support.
Check tenant performance trends
Before you check the tenant-level performance monitoring information, log on to the sys tenant of the cluster, obtain the value of tenant_id from the error log, and run the following SQL statement to obtain the name of the tenant for which the memory usage alert is reported. In the log alert in the preceding figure, tenant_id is 1006. The corresponding SQL statement is:
obclient> SELECT tenant_name FROM __all_tenant WHERE tenant_id IN (1006);
After you obtain the name of the tenant, go to the Performance Monitoring page for the tenant in OCP. For more information, see Monitor and diagnose the tenant performance.
Pay attention to the QPS, Response time, TPS, Transaction response time, and MemStore metrics on the Performance Monitoring page of the tenant. Identify the point in time when the alert was triggered and check how these metrics change around that time. The variations usually match one of the following cases:
The transaction response time suddenly increases significantly, while other metrics show no obvious changes, and the MemStore metric stays below the threshold for a major compaction. This indicates that large queries exist and have consumed more memory than the specified threshold, which triggers the alert.
The transaction response time suddenly increases significantly and the response time also rises, while other metrics do not change much, and the MemStore metric stays below the threshold for a major compaction. This indicates that the tenant has executed some large transactions, which have consumed more memory than the specified threshold, triggering the alert and affecting query performance.
Both QPS and TPS decrease and the transaction response time increases, while the MemStore metric has reached the threshold for a major compaction but does not drop. This indicates that the memory of the tenant cannot be released through minor or major compactions. The log alert is probably caused by a minor compaction exception or a major compaction exception. For more information, see Troubleshoot major compaction exceptions. To confirm which case applies, you can also check the MemStore status directly, as shown in the sketch below.
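The following is a minimal sketch that assumes an OceanBase 3.x cluster where the gv$memstore view is available in the sys tenant (the view name and column layout may differ in other versions); tenant_id 1006 is the example tenant from the alert above:
obclient> SELECT * FROM gv$memstore WHERE tenant_id = 1006;
If the total MemStore usage stays at or above the freeze trigger without dropping, minor or major compactions are probably not releasing memory, which matches the last case above.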
Check SQL monitoring information
If the log alert is caused by large queries or large transactions rather than by major compaction exceptions, use the SlowSQL monitoring feature in OCP to identify the root cause.
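Outside of OCP, you can also look for suspicious statements in the SQL audit view from the sys tenant. The following sketch assumes that the gv$sql_audit view and its elapsed_time and request_memory_used columns are available in your version (in OceanBase 4.x the view is named GV$OB_SQL_AUDIT), and again uses tenant_id 1006 as an example:
obclient> SELECT svr_ip, sql_id, elapsed_time, request_memory_used, query_sql FROM gv$sql_audit WHERE tenant_id = 1006 ORDER BY elapsed_time DESC LIMIT 10;
Statements with a long elapsed_time or a large request_memory_used near the alert time are the first candidates for optimization.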
Troubleshooting methods
If the memory of a tenant is insufficient, check whether the issue is caused by common causes such as slow SQL queries, large transactions, and improper configurations. If not, collect relevant information and contact OceanBase Technical Support.
Common causes
The solution to a tenant memory overrun exception varies with the root cause. The following solutions are listed in descending order of preference and benefit:
Large queries: For a tenant memory overrun caused by large queries or bad SQL statements, locate the SQL statement that causes the issue and optimize the SQL statement to reduce the memory consumption, thus preventing the alert.
Large transactions: For an "insufficient tenant memory" error caused by large transactions or batch data import or export, locate the large transaction that causes the issue and optimize the transaction processing to split the large transaction into smaller transactions, as shown in the sketch below.
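For example, a one-shot DELETE that removes millions of rows keeps all of its uncommitted data in memory until the transaction commits. A common workaround is to delete in bounded batches. The following sketch uses a hypothetical order_history table and an illustrative batch size of 1,000 rows:
obclient> DELETE FROM order_history WHERE created_at < '2021-01-01' LIMIT 1000;
Repeat the statement from a client-side loop until it affects zero rows. With autocommit enabled, each batch commits as an independent small transaction, so the tenant never holds the whole data set in memory at once.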
Improper tenant variable configurations: If the SQL statements and transaction sizes have already been minimized as far as possible, set ob_sql_work_area_percentage to a larger value for the tenant to increase the memory allowed for a session. The default value is 5, which indicates 5% of a single memory resource unit of the tenant. The variable can be configured at the global level. The recommended value range is [5,80]. Sample statement:
Global level
obclient> SET GLOBAL ob_sql_work_area_percentage = xxx;
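For example, to check the current value before the change and then raise it (a sketch; 10 is only an illustrative value within the recommended range):
obclient> SHOW VARIABLES LIKE 'ob_sql_work_area_percentage';
obclient> SET GLOBAL ob_sql_work_area_percentage = 10;
This increases the memory that SQL operators such as sorts and hash joins can use in a session, and correspondingly leaves less memory for other components of the tenant.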
Scale up the memory resources for the tenant: If the alert is not triggered by any of the preceding causes, the issue may be caused by insufficient tenant memory. In this scenario, you can scale up the memory resources for the tenant. For more information, see Overview of tenant management. If the alert is caused by a minor compaction exception or a major compaction exception, see Troubleshoot major compaction exceptions.
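If you scale up with SQL from the sys tenant rather than from the OCP console, the statement depends on your OceanBase version. The following sketch assumes an OceanBase 3.x cluster and a hypothetical resource unit named unit_xxx; in OceanBase 4.x the memory of a unit is set with MEMORY_SIZE instead of MAX_MEMORY and MIN_MEMORY:
obclient> ALTER RESOURCE UNIT unit_xxx MAX_MEMORY '20G', MIN_MEMORY '20G';
After the unit configuration is changed, confirm in OCP that the memory usage of the tenant drops back below the alert threshold.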
Collect relevant information
Locate the module whose memory usage keeps growing and monitor the memory allocation for this module. After the exception recurs, identify the cause of the growing memory consumption or leakage. Perform the following steps:
1. Locate the problematic module.
Query the __all_virtual_memory_info table. Sample command:
SELECT * FROM __all_virtual_memory_info ORDER BY used DESC LIMIT 10;
You can also view the observer.log file and find the modules that consume the largest proportion of memory. The following figure shows the memory consumption of modules in the observer.log file.
2. Enable memory allocation monitoring for the module. Sample statement:
ALTER SYSTEM SET leak_mod_to_check = 'xxx';
The following sample statement shows how to obtain the memory allocation code path of the OB_TEST2_PCODE module in the libeasy context:
ALTER SYSTEM SET leak_mod_to_check = 'OB_TEST2_PCODE';
3. Observe the exception recurrence.
Monitor the memory usage of the module and wait for the exception to recur, either by checking the observer.log file or by querying the __all_virtual_memory_info table as instructed in Step 1.
4. Execute the following statement and save the returned results in plain text:
SELECT * FROM __all_virtual_mem_leak_checker_info ORDER BY alloc_count;
5. Stop memory allocation monitoring.
The returned results in Step 4 contain the memory allocation code path. The path is a hex value of the memory address and indicates the cause of the growing memory consumption or leakage. Execute the following statement to stop memory allocation monitoring:
ALTER SYSTEM SET leak_mod_to_check = '';
6. Run the addr2line -pCfe $observer $back_trace command to print the stack information in the back_trace field obtained in Step 4. Example:
addr2line -pCfe /home/admin/oceanbase/bin/observer $back_trace
7. Send the information obtained in Steps 3, 4, and 6 to OceanBase Technical Support for assistance.