The general process of handling database emergencies consists of several steps, including event awareness, abnormal information collection, analysis and diagnostics, and decision-making. This topic provides several analysis and decision-making methods to help you quickly locate and identify the root cause of common issues.
Event awareness
Emergency events are usually identified through application system alerts, database alerts, user feedback, or routine inspections. Regardless of the method used, the first step in handling an emergency event is to synchronize fault information (including current impact and the person taking over the case) and quickly collect error information, to provide a basis for subsequent analysis and emergency handling.
Collect exception information
In most cases, exception information should be organized based on the following dimensions:
Exception manifestation: Generally, the first level consists of business-related information, such as failed user logins, order submission failures, and page access failures. The first step in collecting exception information is to locate database-related alerts from business alarms and monitoring information, for example, database-related exceptions reported in the application's DAL layer logs, middleware logs, and client logs.
Time of exception: The specific start time and duration of the fault or exception need to be identified.
Scope of exception: Determine whether the exception affects a single application or all applications, and whether the OB cluster is globally affected, or only specific clusters or tenants. For a global impact, infrastructure exceptions are often the starting point for troubleshooting. For local faults, the specific cluster, tenant, database, and obproxy details need to be confirmed.
Error-related logs: Collect error logs generated during the exception, generally from application DAL layer logs and client logs.
Changes or releases: Confirm whether there were any business changes or releases before and after the fault time. If so, specific information needs to be collected. In actual scenarios, most faults and exceptions are caused by changes.
Impact on associated businesses: Determine the impact scope of the current fault and whether it will affect upstream or downstream systems, and use this information to decide whether to proactively implement degradation or throttling. This goes beyond the database itself and is part of the overall site reliability engineering (SRE) strategy.
Analysis, diagnostics, and decision tree
As mentioned earlier, the sources of emergency events can be classified into two categories: application exceptions, and database inspections and alerts. The latter generally involve specific database events, whose detailed emergency handling methods are analyzed in the following sections of this chapter. This section focuses on business-side errors and exceptions and provides a step-by-step analysis and judgment process to help you quickly locate the problem.
Quick troubleshooting
Investigation approach
Perform a quick health check of the OB cluster.
Before you troubleshoot any emergency, rule out the most basic exceptions as early as possible to avoid wasted troubleshooting effort. The following are the most basic software and hardware factors that affect the normal operation of OceanBase Database:
Whether the NTP clock is synchronized.
Whether the server is down.
Whether the log disk or data disk is full.
Whether the network in the IDC is jittery.
Whether the load balancing device or component (such as F5 or LVS) is faulty.
For the basic health check items of the cluster, OceanBase Cloud Platform provides a quick health inspection feature. For more information, see the User Guide. A SQL-based sketch of these basic checks is shown below.
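The following is a minimal health-check sketch that can be run from the sys tenant. The view names follow OceanBase Database V4.x (DBA_OB_SERVERS and GV$OB_SERVERS), and the disk-related column names are assumptions that may differ across versions. NTP synchronization, IDC network quality, and load balancer status are usually verified at the OS or infrastructure level (or through the OCP inspection feature) rather than through SQL.

```sql
-- Minimal cluster health-check sketch (run in the sys tenant).
-- View and column names follow OceanBase Database V4.x and may differ
-- in earlier versions (for example, oceanbase.__all_server in V3.x).

-- 1. Confirm that every OBServer node is ACTIVE and has not stopped service.
SELECT svr_ip, svr_port, zone, status, stop_time, start_service_time
FROM oceanbase.DBA_OB_SERVERS;

-- 2. Roughly gauge data-disk and log-disk usage per node
--    (the disk columns below are assumptions; verify them in your version).
SELECT svr_ip, svr_port,
       data_disk_in_use, data_disk_capacity,
       log_disk_in_use,  log_disk_capacity
FROM oceanbase.GV$OB_SERVERS;
```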
Check whether the business traffic surges during the exception period.
After you eliminate the most basic external exceptions, check whether the number of external business requests surged during the exception period, that is, whether a surge in business traffic caused a resource shortage. A sketch based on the SQL audit view is shown below.
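The following hedged sketch estimates the per-minute request volume around the exception window from the SQL audit view. GV$OB_SQL_AUDIT is the V4.x name (gv$sql_audit in V3.x), the tenant_id value is a placeholder, and the query should be restricted to a narrow time range because scanning the audit view broadly can itself be expensive.

```sql
-- Per-minute request counts over the last 30 minutes for one tenant.
-- request_time is in microseconds; usec_to_time/time_to_usec convert
-- between microsecond timestamps and datetime values.
SELECT DATE_FORMAT(usec_to_time(request_time), '%Y-%m-%d %H:%i') AS req_minute,
       COUNT(*) AS request_cnt
FROM oceanbase.GV$OB_SQL_AUDIT
WHERE tenant_id = 1001   -- placeholder tenant ID
  AND request_time >= time_to_usec(NOW() - INTERVAL 30 MINUTE)
GROUP BY req_minute
ORDER BY req_minute;
```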
Analyze database-related errors or exceptions from the application layer.
In actual production, only part of the information can be confirmed quickly (such as obvious hardware failures, network disconnection, full disks, and traffic surges). Many other factors cannot be confirmed quickly at the beginning (such as I/O and network jitter, data distribution changes, and rarely executed business logic arriving in a concentrated burst). In this case, you need to analyze the problem further. Generally, database-related errors and exceptions exposed at the application layer can be classified into the following categories:
The application connection pool is full.
Application requests time out.
Application connection failure.
Application write failure.
Application lock conflict.
The following sections describe several scenarios in detail.
Application connection pool is full/Application requests time out
A full connection pool is one of the most common database-related exceptions observed on the application side. Many exceptions propagate layer by layer until the application connection pool fills up and new connections can no longer be established. The direct cause of a full connection pool is usually that application requests time out. In practice, the underlying factors, roughly ranked by frequency, are as follows:
There are slow SQL statements with abnormal execution plans in the database, which increase the response time. As a result, the application keeps retrying connection establishment.
In this case, you can intervene in the execution plans of the slow SQL statements or directly throttle them. For more information, see Exceptions caused by SQL queries. An example query for spotting such statements is sketched after this list.
SQL statements in the database have a long response time (but no obvious abnormal execution plans are found), or a transaction has not been committed for a long time. In this case, the following situations generally occur:
Check the requests of ongoing sessions and determine the transaction execution status based on business information. This is generally caused by I/O, memory, or network exceptions on OBServer nodes. In this case, you can isolate the suspected nodes to see whether the symptoms recover, and then apply degradation measures for the specific exception. For more information, see Exceptions caused by other hardware and network issues.
No clear hardware issues are found, but the I/O load of the suspected nodes or the network card load is very high, which causes SQL request timeout. For more information, see High network card load on a node and High I/O load on a node.
Another factor that causes abnormal SQL query time is a unique phenomenon in databases with the LSM-tree architecture, which is called "buffer table issue" in OceanBase Database. For more information, see Buffer table issue.
The database node has insufficient CPU or memory resources. Insufficient CPU resources cause tenant queue backlogs, and full memory causes write failures; both increase SQL response time.
In this case, first check the CPU utilization and tenant memory usage, and then scale up or take other corresponding measures as needed. For more information, see Tenant queue backlogs and Tenant memory full.
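The following sketch lists the SQL statements with the longest average execution time during the exception window, based on the SQL audit view (GV$OB_SQL_AUDIT in V4.x, gv$sql_audit in V3.x). The tenant_id value and time range are placeholders; elapsed_time is in microseconds.

```sql
-- Top 10 SQL statements by average elapsed time in the last 10 minutes.
SELECT sql_id,
       COUNT(*)                       AS execs,
       ROUND(AVG(elapsed_time))       AS avg_us,
       MAX(elapsed_time)              AS max_us,
       SUBSTR(MIN(query_sql), 1, 100) AS sample_sql
FROM oceanbase.GV$OB_SQL_AUDIT
WHERE tenant_id = 1001   -- placeholder tenant ID
  AND request_time >= time_to_usec(NOW() - INTERVAL 10 MINUTE)
GROUP BY sql_id
ORDER BY avg_us DESC
LIMIT 10;
```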
Application connection failure
Application connection failure can be divided into two cases: failure to connect to obproxy and failure to connect to observer. Connection timeouts are more common than direct connection failures.
Failure to connect to obproxy
If an application fails to connect to obproxy, first rule out network link issues between the application and obproxy. Then check whether a load balancing device or component is configured in front of obproxy and whether it is faulty. If neither issue exists, the problem might lie with obproxy itself, or high traffic might have saturated the obproxy threads (with error messages similar to "too many sessions"). For obproxy thread saturation, see ODP Thread Full. For issues with obproxy itself, see ODP Endpoint Fault.
Failure to connect to observer
If an application fails to connect to observer, first rule out network issues between obproxy and observer. In a stable infrastructure environment, network link issues are actually less likely to occur. If there are no clear indicators pointing to network issues, there is no need to spend too much time analyzing them. More often than not, the issue lies with observer itself. In this case, you can analyze and locate the issue through the following paths:
Check whether there is queue backpressure in the sys tenant or other internal tenants. You can determine this from the observer logs. The OceanBase cluster processes connection requests through the sys tenant first, so a failure to process connections in the sys tenant results in connection failures. For more information, see Queue Backpressure in the sys Tenant.
Check whether the CPU or I/O of the node is spiking, and check the observer logs to determine whether there is queue backpressure in user tenants. Insufficient resources in user tenants can also cause connection timeouts. For more information, see Queue Backpressure in User Tenants. A quick per-tenant session count is sketched after this list.
Check if abnormal SQL queries are causing resource usage issues. For more information, see Abnormal SQL Queries.
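As a first pass, the following sketch counts active sessions per tenant from the process list to spot request pile-ups (GV$OB_PROCESSLIST is the V4.x name; __all_virtual_processlist in V3.x). The column names and COMMAND values are assumptions that may vary by version, and queue backpressure details themselves are only visible in the observer logs.

```sql
-- Active (non-idle) sessions per tenant and node; a sudden pile-up on one
-- tenant or node often accompanies queue backpressure or resource spikes.
SELECT tenant, svr_ip, COUNT(*) AS active_sessions
FROM oceanbase.GV$OB_PROCESSLIST
WHERE command <> 'Sleep'
GROUP BY tenant, svr_ip
ORDER BY active_sessions DESC;
```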
Application write failure
Write-related issues in OceanBase Database fall into several broad categories, most of which are related to memory. In actual production scenarios, you can analyze and diagnose them in the following order, roughly from most to least frequent:
Memory full due to bulk write operations (such as bulk import, correction, and deletion) in a short period.
This is the most common scenario, and the error message is usually similar to "Exceeded tenant memory limit". OceanBase Database is a quasi-in-memory database based on the LSM-tree structure. Every write operation consumes memory, which can be released only after a minor or major compaction. Therefore, if the compaction speed cannot keep up with the memory consumption speed, memory is eventually exhausted and application writes fail. For more information, see Tenant Memory Full.
Memory full due to stuck long transactions or hung transactions, which block compactions and prevent memory from being released.
Similar to the previous scenario, uncommitted transactions can prevent active memory from being frozen and compacted, eventually exhausting memory. You can identify stuck long transactions by looking for statements that have not completed for a long time or by running SQL queries, as sketched after this list.
Other modules in the tenant consuming excessive memory, making it unavailable for MemStore allocation, resulting in a memory full error.
In this scenario, the MemStore is not full, but write operations fail. For more information, see Insufficient Memory in System Internal Modules or Memory Leak.
Clog disk space full, preventing the majority of replicas from synchronizing, leaving the partition leaderless and blocking writes.
Clogs are the transaction logs of OceanBase Database, which uses the write-ahead logging (WAL) mechanism. When the clog disk is full, transaction commits inevitably fail. For more information, see Node Log Disk Space Full.
RootService (RS) module exception, which causes the freeze or compaction to get stuck and blocks writes during that period.
This scenario is relatively rare. When the number of written rows in OceanBase Database reaches a certain threshold, or before the daily major compaction, a major freeze is performed: the active memory is frozen and then a minor or major compaction is executed. Normally, the freeze takes only about one second, but in exceptional situations it can take much longer, and writes are blocked during this period. For emergency handling of this issue, you can manually switch over or restart the RS. For more information, see Freeze or Compaction of the Cluster Stuck.
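For the memory-related scenarios above, the following hedged sketch checks how close each tenant's MemStore is to its limits and looks for sessions that have been executing the same statement for a long time, which often indicates a stuck or uncommitted long transaction. View names follow V4.x (gv$memstore and __all_virtual_processlist in V3.x), and the MemStore column names are assumptions to verify against your version.

```sql
-- 1. MemStore usage per tenant and node; writes start failing once usage
--    approaches the memstore limit faster than compactions release memory.
SELECT tenant_id, svr_ip, active_span, freeze_trigger,
       memstore_used, memstore_limit
FROM oceanbase.GV$OB_MEMSTORE;

-- 2. Sessions that have been running the same statement for more than 60s,
--    a common sign of stuck long transactions or hung transactions.
SELECT id, tenant, user, svr_ip, time AS active_seconds, state,
       SUBSTR(info, 1, 100) AS current_sql
FROM oceanbase.GV$OB_PROCESSLIST
WHERE command <> 'Sleep' AND time > 60
ORDER BY time DESC;
```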
Application lock conflict
In some scenarios, concurrent updates on the same rows are likely to cause row lock conflicts, which increase request wait times and can eventually fill up the application connection pool. Lock conflicts generally occur in the following situations:
The SQL response time (RT) on the business side increases, leading to lock conflicts. In this case, you need to address the high RT first. For more information, see the section Application connection pool is full/Application requests time out earlier in this topic.
Uncommitted transactions or hanging transactions exist on the rows.
Lock conflicts occur due to concurrent updates on hot rows, which in turn leads to high RT.
Hotspot row updates are a common issue in relational databases, and there is often no good way to handle them during an emergency: because the issue is rooted in the business logic, emergency handling typically comes down to throttling or service degradation. For long-term optimization, consider adjusting the transaction logic, for example by using SELECT ... FOR UPDATE WAIT 1, as sketched below. OceanBase Database has been actively optimizing hotspot concurrency and has already implemented effective solutions. For more details, see the Early Row Lock Release feature, which improves single-row concurrency performance.
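The following is a minimal sketch of the SELECT ... FOR UPDATE WAIT pattern mentioned above. The account table and its columns are hypothetical; the WAIT clause bounds how long the statement waits for the row lock, so the session fails fast with a lock-wait error instead of piling up in the connection pool.

```sql
-- Hypothetical hot-row update; table and column names are made up.
BEGIN;
SELECT balance
FROM account
WHERE id = 1
FOR UPDATE WAIT 1;   -- gives up after about 1 second instead of queueing

UPDATE account SET balance = balance - 100 WHERE id = 1;
COMMIT;
```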
Change implementation/rollback & considerations
After reading the preceding sections, you should have a better understanding of how to handle common emergencies. What remains is to implement the necessary emergency measures to restore the database to normal operation. When handling emergencies in OceanBase Database, pay attention to the following points:
Clearly document and verify each emergency change operation, which can serve as a reference for further diagnostics if the recovery fails.
Preserve as much on-site diagnostic information as possible, such as observer logs and core files from the abnormal period.
By default, the emergency recovery operations discussed in this chapter are performed without compromising the majority of the cluster data or compromising data consistency. If you need to activate a single replica or handle clogs in multiple replicas, contact OceanBase Technical Support.
After the emergency actions are completed and the business is restored, it is recommended that you roll back any parameter changes and conduct a post-mortem of the issue or failure based on the on-site information preserved during the emergency, so that the root cause can be identified.