This topic describes how to troubleshoot overloaded OBServer nodes.
Symptom
Assume that the physical resources on a server are fully allocated to an OBServer node. If the CPU utilization then rises above 85%, the system may reach a bottleneck, and the OBServer node cannot obtain sufficient CPU resources to execute requests. As a result, requests sent to OceanBase Database, such as tenant requests and system requests, are delayed, blocked, or fail. To prevent or resolve the issue, analyze the causes and take action based on the following logic.
Troubleshooting logic
OceanBase Database is a distributed database system based on a multi-tenant architecture. Applications connect to the database and operate on it through tenants. When you create an OceanBase Database tenant, you need to allocate a resource unit to the tenant to configure its CPU, memory, and other resources. If the resource unit of the tenant is not enough to handle the load, requests accumulate in the request queue of the tenant. This delays or interrupts requests and degrades application performance.
If request execution is delayed or slowed down on an OBServer node and the CPU utilization is high, you need to check whether the system has reached a performance bottleneck. Perform the following operations:
Check the OBServer node for resource shortage.
Run the top command to view the real-time CPU utilization, the tsar command to view historical CPU utilization, or the top -H command to view the threads that consume the most CPU resources. Based on the results, check whether the tenant threads or the background threads consume more CPU resources, and perform the following analysis:
a. The business load changes.
Observe the CPU utilization of the threads, including the number of concurrent threads and the types of threads, and check whether the business load is increasing. If yes, the issue may be caused by slow SQL queries. Locate and optimize the corresponding SQL statements, or throttle the business load.
b. There is no significant change in the business load.
If the CPU utilization of the OBServer node suddenly increases and you cannot identify a direct cause, the issue may be caused by an internal exception in OceanBase Database. Contact OceanBase Technical Support to obtain OBStack for further analysis of the thread stack.
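The thread-level check in this step can be partly automated. The following minimal sketch sums the %CPU column by thread name from the text that top -H -b -n 1 prints, which makes it easier to see which group of threads dominates. It assumes the default procps column layout, where COMMAND is the last column. The function name cpu_by_thread_name and the thread names in the sample are made up for illustration and are not actual OceanBase Database thread names.

```python
from collections import defaultdict

# Illustrative batch-mode output; the thread names are hypothetical.
SAMPLE = """\
top - 10:00:00 up 1 day,  1 user,  load average: 8.00, 7.50, 7.00
Threads: 4 total,   3 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 88.0 us,  5.0 sy,  0.0 ni,  7.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1001 admin     20   0   10.0g   8.0g  10000 R  95.0  50.0   1:00.00 worker_a
   1002 admin     20   0   10.0g   8.0g  10000 R  90.5  50.0   1:00.00 worker_a
   1003 admin     20   0   10.0g   8.0g  10000 R  40.0  50.0   0:30.00 bg_task
   1004 admin     20   0   10.0g   8.0g  10000 S   0.0  50.0   0:00.10 idle_thr
"""


def cpu_by_thread_name(top_output):
    """Return (thread name, total %CPU) pairs, highest consumer first."""
    lines = top_output.splitlines()
    # Locate the column header row instead of assuming a fixed line number.
    header_idx = next(
        (i for i, line in enumerate(lines)
         if "%CPU" in line.split() and "COMMAND" in line.split()),
        None,
    )
    if header_idx is None:
        raise ValueError("no top column header found")
    columns = lines[header_idx].split()
    cpu_col = columns.index("%CPU")
    cmd_col = columns.index("COMMAND")

    totals = defaultdict(float)
    for line in lines[header_idx + 1:]:
        # Limit the split so a thread name containing spaces stays whole.
        fields = line.split(None, len(columns) - 1)
        if len(fields) < len(columns):
            continue
        try:
            cpu = float(fields[cpu_col])
        except ValueError:
            continue  # skip rows whose %CPU value is not a plain number
        totals[fields[cmd_col].strip()] += cpu
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for name, cpu in cpu_by_thread_name(SAMPLE):
    print(f"{name:<12} {cpu:6.1f}")
```

To use it on a live server, capture the output of top -H -b -n 1 for the observer process and pass the text to the function. Comparing the totals over several samples helps separate a steady increase in tenant load from a sudden spike in a single group of threads.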
The request execution in a tenant is slow, delayed, or interrupted, but the overall CPU utilization of the OBServer node does not soar.
In this case, requests may have accumulated in the request queue of the tenant. You can search for the dump tenant info keywords in the observer.log file. If the value of the req_queue field of the corresponding tenant is not 0, requests have accumulated in the queue. You can perform the following analysis:
a. The resource specifications of the tenant are too low.
When a tenant is first put into use, resources are allocated to it based on the expected business volume. If the resource specifications of the tenant are too low, requests to the tenant tend to accumulate, and the tenant or the OBServer node can become overloaded. If sufficient resources are available, you can upgrade the resource specifications or increase the number of units to enhance the tenant capacity.
b. The system is processing resource-intensive SQL queries that change frequently.
Use the TopSQL feature of OceanBase Cloud Platform (OCP) to diagnose queries.
c. There is no significant change in the business load.
In this case, if the resource specifications of the tenant are reasonable, the issue may be caused by an internal exception in OceanBase Database. Contact OceanBase Technical Support to obtain OBStack for further analysis of the thread stack.
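The req_queue check described in this step can be scripted when the log is large. The following minimal sketch scans log lines for the dump tenant info keywords and reports the tenants whose most recent req_queue value is not 0. The line layout it parses (a tenant ID written as id:N and a queue size written as req_queue:total_size=N) is an assumption that may not match your version, so adjust the two regular expressions to what your observer.log actually contains. The function name backlogged_tenants and the sample lines are illustrative only.

```python
import re

# Assumed field layout; verify against real log lines before relying on it.
TENANT_ID_RE = re.compile(r"\bid:(\d+)")
REQ_QUEUE_RE = re.compile(r"req_queue:total_size=(\d+)")

# Made-up lines in the assumed layout, shortened for readability.
SAMPLE_LOG = [
    "INFO dump tenant info(tenant={id:1001, req_queue:total_size=0 queue[0]=0})",
    "INFO dump tenant info(tenant={id:1002, req_queue:total_size=37 queue[0]=37})",
    "INFO some unrelated message",
    "INFO dump tenant info(tenant={id:1002, req_queue:total_size=12 queue[0]=12})",
]


def backlogged_tenants(log_lines):
    """Return {tenant_id: latest req_queue size} for tenants with a backlog."""
    latest = {}
    for line in log_lines:
        if "dump tenant info" not in line:
            continue
        tenant = TENANT_ID_RE.search(line)
        size = REQ_QUEUE_RE.search(line)
        if tenant is None or size is None:
            continue  # the line does not follow the assumed layout
        # Later lines overwrite earlier ones, so the newest value wins.
        latest[int(tenant.group(1))] = int(size.group(1))
    return {tid: n for tid, n in latest.items() if n != 0}


print(backlogged_tenants(SAMPLE_LOG))
```

An open file object can be passed in place of the sample list, because the function only iterates over lines. Keeping only the latest value per tenant reflects the current state of the queue. To see how a backlog develops over time, collect every value instead of overwriting it.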