The general process of handling database emergencies includes event awareness, exception information collection, analysis and diagnostics, and decision-making. This topic describes several analysis and decision-making methods that can help you quickly locate and identify the root cause of common problems.
Event awareness
Emergency events are usually identified through application system alerts, database alerts, user feedback, or routine inspections. Regardless of the method used, the first step in handling an emergency event is to synchronize fault information (including current impact and the person taking over the case) and quickly collect error information, providing a basis for subsequent analysis and handling.
Collect the exception information
In most cases, collect exception information from the following dimensions:
Exception manifestation: This is usually business-level information, such as users being unable to log in, orders failing to be submitted, or specific pages failing to load. The first step in collecting exception information is to trace business alerts and monitoring data down to database-related errors, for example, database-related exceptions reported in the application's DAL layer logs, the middleware, or the client.
Exception time: Identify the specific start time and duration of the fault or exception.
Exception scope: Determine whether the exception affects a single application or all applications, and whether the OB cluster is affected globally or only a specific cluster or tenant is impacted. For global issues, infrastructure exceptions are often the starting point for troubleshooting. For local issues, identify the specific cluster, tenant, database, and obproxy involved.
Error-related logs: Collect error logs generated during the exception, typically from application DAL layer logs and client logs.
Changes or releases: Check whether any business releases or changes were made around the fault time. If so, collect the specific details. In practice, a large share of faults and exceptions are caused by changes.
Impact on associated businesses: Assess the impact of the current fault and determine whether it will affect upstream or downstream systems. Use this information to decide whether to proactively apply degradation or throttling. This goes beyond the database itself and is part of the broader site reliability engineering (SRE) strategy.
Analysis, diagnostics, and decision tree
As mentioned earlier, emergency incidents come from two main sources: abnormalities detected by applications, and routine inspections and alerts within the database itself. The latter generally point to specific database events, whose handling is analyzed in detail in subsequent sections of this chapter. This section focuses on errors and anomalies reported from the business side and presents approaches for progressively narrowing down and identifying the issue as quickly as possible.
Quick troubleshooting
Investigation approach
Perform a quick health check of the OB cluster.
Before you troubleshoot any emergencies, it is a good practice to rule out the most basic exceptions as early as possible to avoid ineffective troubleshooting. The most basic software and hardware factors that affect the normal operation of OceanBase Database include:
Is the NTP clock synchronized?
Has a server crashed?
Is the log disk or data disk full?
Is the IDC network jittery?
Is the load balancing device or component (such as F5 or LVS) faulty?
For basic health check items of a cluster, OceanBase Cloud provides the quick health inspection feature. For more information, see User Guide.
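If you prefer to verify these basics directly with SQL, the following is a minimal sketch run against the sys tenant. It assumes the OceanBase Database 4.x system views DBA_OB_SERVERS and GV$OB_SERVERS; view and column names may differ in other versions, and OS-level items such as NTP offset still need to be checked on the servers themselves.
-- Are all OBServer nodes active and in service?
SELECT SVR_IP, SVR_PORT, ZONE, STATUS, STOP_TIME, START_SERVICE_TIME
FROM oceanbase.DBA_OB_SERVERS;
-- How full are the data disk and log disk on each node?
SELECT SVR_IP,
       ROUND(DATA_DISK_IN_USE / DATA_DISK_CAPACITY, 2) AS data_disk_used_ratio,
       ROUND(LOG_DISK_IN_USE / LOG_DISK_CAPACITY, 2) AS log_disk_used_ratio
FROM oceanbase.GV$OB_SERVERS;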
Check whether the business traffic surges during the exception period.
After you rule out the most basic external exceptions, you need to check whether the external business request volume surges during the exception period, that is, whether the surge in normal business traffic causes resource shortages.
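To quantify a suspected surge, you can count requests per second in the SQL audit view during the exception window and compare the result with the normal baseline. A minimal sketch, assuming the 4.x view GV$OB_SQL_AUDIT (timestamps are in microseconds, the time window below is only illustrative, and audit records are kept in memory and may already have been evicted):
-- Requests per second per tenant during the exception window.
SELECT TENANT_NAME,
       FROM_UNIXTIME(FLOOR(REQUEST_TIME / 1000000)) AS req_second,
       COUNT(*) AS requests
FROM oceanbase.GV$OB_SQL_AUDIT
WHERE REQUEST_TIME BETWEEN UNIX_TIMESTAMP('2024-01-01 10:00:00') * 1000000
                       AND UNIX_TIMESTAMP('2024-01-01 10:05:00') * 1000000
GROUP BY TENANT_NAME, req_second
ORDER BY requests DESC
LIMIT 20;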
Analyze database-related errors or exceptions from the application layer.
In actual production, only part of this information can be confirmed quickly (such as an obvious hardware crash, a network disconnection, a full disk, or a traffic surge). Many factors cannot be confirmed at the outset (such as I/O or network jitter, changes in data distribution, or a concentrated burst of infrequently executed business logic), and further analysis is needed. Generally, database-related errors and exceptions exposed at the application layer fall into the following categories:
Application connection pool full.
Application request timeout.
Application connection failure.
Application write failure.
Application lock conflict.
The following sections describe several scenarios in detail.
Application connection pool full/Application request timeout
A full connection pool is one of the most common database exceptions observed on the application side. Many lower-level exceptions propagate layer by layer and ultimately manifest as a connection pool that is full and cannot establish new connections. The direct trigger is usually application request timeouts. In practice, the underlying factors, ranked roughly from most to least frequent, are as follows:
There are slow SQL statements with abnormal execution plans in the database, prolonging response time and causing continuous connection retries from the application side.
In this case, you can intervene in the execution plans of the slow SQL statements or directly throttle them (see the SQL sketch after this list for one way to locate such statements). For more information, see Exceptions caused by SQL queries.
SQL statements in the database respond slowly (without obviously abnormal execution plans), or a transaction cannot be committed for a long time. This generally falls into one of the following situations:
View the ongoing session requests and determine the transaction execution status based on business information. This situation is generally caused by I/O, memory, or network problems on the OBServer nodes. You can try isolating the suspicious nodes to see whether the system recovers, or apply degradation for the specific exception. For more information, see Hardware and network exceptions.
No clear hardware problems are found, but the I/O load of the nodes or the network card load is excessively high, causing SQL request timeouts. For more information, see High network card load on nodes and High I/O load on nodes.
Another situation that causes abnormal SQL query time is a phenomenon specific to databases with the LSM-tree architecture, which is called "buffer table issue" in OceanBase Database. For more information, see Buffer table issue.
The database node has insufficient CPU or memory resources. Insufficient CPU resources cause queue backlogs for tenants, and full memory causes write failures. This leads to increased SQL response time.
In this case, first check the tenant's CPU usage and memory level. If resources are insufficient, scale up or take other corresponding measures. For more information, see Tenant queue backlogs and Tenant memory full.
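For the slow-SQL case described in the first item of this list, the SQL audit view can quickly surface the statements with the longest response time and their retry counts. A minimal sketch, again assuming the 4.x view GV$OB_SQL_AUDIT (the start time is illustrative; column names may differ across versions):
-- Top SQL by average elapsed time since the exception started; a changed PLAN_ID or a high retry count hints at plan problems.
SELECT SQL_ID, TENANT_NAME, PLAN_ID,
       COUNT(*) AS executions,
       ROUND(AVG(ELAPSED_TIME) / 1000, 1) AS avg_elapsed_ms,
       ROUND(MAX(ELAPSED_TIME) / 1000, 1) AS max_elapsed_ms,
       SUM(RETRY_CNT) AS total_retries
FROM oceanbase.GV$OB_SQL_AUDIT
WHERE REQUEST_TIME >= UNIX_TIMESTAMP('2024-01-01 10:00:00') * 1000000
GROUP BY SQL_ID, TENANT_NAME, PLAN_ID
ORDER BY avg_elapsed_ms DESC
LIMIT 20;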
Application connection failure
Application connection failure can be divided into two cases: failure to connect to obproxy and failure to connect to observer. Connection timeouts are more common than direct connection failures.
Failure to connect to obproxy
If an application fails to connect to obproxy, first rule out network link issues between the application and obproxy, and then check whether a load balancing device or component is deployed in front of obproxy and whether it is faulty. If these are ruled out, the problem may lie with obproxy itself, or high traffic may have exhausted the obproxy threads (with error messages similar to too many sessions). For obproxy thread exhaustion, see ODP Thread Exhaustion. For obproxy failure, see ODP Endpoint Failure.
Failure to connect to observer
If an application fails to connect to observer, first rule out network issues on the path from obproxy to observer. In environments with stable infrastructure, network issues are relatively unlikely, so unless clear indicators point to the network, there is no need to spend much time analyzing it. More often, the issue lies with observer itself. In this case, you can analyze and locate the issue along the following paths:
Check whether there is queue backpressure in the sys tenant or other internal tenants, which can be determined from the observer logs. New connections are first processed by the sys tenant, so backpressure or a failure there can cause connection issues. For more information, see Queue Backpressure in the sys Tenant.
Check whether CPU or I/O on the node has spiked, and whether the observer logs show queue backpressure in user tenants. Insufficient resources for a user tenant can also cause connection timeouts (see the SQL sketch after this list for a quick session check). For more information, see Queue Backpressure in User Tenants.
Check if abnormal SQL queries are causing resource usage issues. For more information, see Abnormal SQL Queries.
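As a quick first look at whether requests are piling up, you can count sessions per tenant, node, and state. A minimal sketch, assuming the 4.x view GV$OB_PROCESSLIST (detailed per-tenant queue lengths are typically found in the periodic "dump tenant info" records of the observer log):
-- A large number of long-lived ACTIVE sessions on one node or tenant often indicates queue backpressure or a resource shortage.
SELECT TENANT, SVR_IP, STATE,
       COUNT(*) AS sessions,
       MAX(TIME) AS max_time_seconds
FROM oceanbase.GV$OB_PROCESSLIST
GROUP BY TENANT, SVR_IP, STATE
ORDER BY sessions DESC;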
Application write failure
Write-related issues in OceanBase Database are mostly related to memory. In actual production, you can analyze and diagnose them, roughly in order of frequency, from the following aspects:
Memory full for user tenants due to bulk writes (such as bulk imports, corrections, and deletions) within a short time.
This is the most common situation, with error messages typically similar to Over tenant memory limits. OceanBase Database is a quasi-in-memory database based on the LSM-tree architecture: every write consumes MemStore memory, which cannot be released until a minor or major compaction completes. If the compaction speed cannot keep up with the memory consumption, memory is eventually exhausted and application writes fail. For more information, see Tenant Memory Full.
Memory full due to long-running or hung transactions blocking the compaction process and preventing memory release.
Similar to the previous case, uncommitted transactions can prevent inactive memory from being frozen and compacted, eventually exhausting tenant memory. You can check for long-running statements or run SQL queries to identify ongoing long transactions (see the SQL sketch at the end of this list).
High memory usage in other modules of the tenant, preventing the allocation of memory for the MemStore, resulting in a memory full error.
In this case, the MemStore itself is not full, but SQL issues have driven up memory usage in non-MemStore modules of the tenant. The symptom is that the tenant's MemStore usage is not high, yet writes still report errors. For more information, see Insufficient Memory in System Internal Modules or Memory Leak.
Clog disk full, preventing the majority of replicas from synchronizing logs, leaving partitions leaderless and blocking business writes.
Clogs are the transaction logs of OceanBase Database, which uses a write-ahead logging (WAL) mechanism. When the clog disk is full, transaction commits will fail (see the SQL sketch at the end of this list for a disk usage check). For more information, see Node Log Disk Full.
RS module exception causing the freeze/compaction process to get stuck, preventing writes during that period.
This scenario is relatively rare. When the number of rows written in the OceanBase cluster reaches a certain threshold, or before the daily major compaction, the system first performs a major freeze, which freezes the active memory and then triggers a minor compaction. Normally the freeze takes only about one second, but in exceptional cases it can take much longer, and writes are blocked during that time. In such cases, you can manually switch the RS leader or restart the RS to resolve the issue. For more information, see Cluster Freeze/Compaction Stuck.
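The following minimal SQL sketch covers the checks referenced in the list above: MemStore usage, long transactions, memory held by other modules, and clog disk usage. It assumes the 4.x views GV$OB_MEMSTORE, GV$OB_TRANSACTION_PARTICIPANTS, GV$OB_MEMORY, and GV$OB_SERVERS; view and column names may differ across versions.
-- 1. MemStore usage per tenant and node: compare MEMSTORE_USED with FREEZE_TRIGGER and MEMSTORE_LIMIT.
SELECT TENANT_ID, SVR_IP,
       ROUND(MEMSTORE_USED / 1024 / 1024 / 1024, 2) AS used_gb,
       ROUND(FREEZE_TRIGGER / 1024 / 1024 / 1024, 2) AS freeze_trigger_gb,
       ROUND(MEMSTORE_LIMIT / 1024 / 1024 / 1024, 2) AS limit_gb
FROM oceanbase.GV$OB_MEMSTORE;
-- 2. Transactions that have been open for more than 10 minutes and may block freezing and compaction.
SELECT TENANT_ID, SVR_IP, SESSION_ID, TX_ID, STATE, CTX_CREATE_TIME
FROM oceanbase.GV$OB_TRANSACTION_PARTICIPANTS
WHERE CTX_CREATE_TIME < DATE_SUB(NOW(), INTERVAL 10 MINUTE)
ORDER BY CTX_CREATE_TIME;
-- 3. Memory held by non-MemStore modules, to spot an abnormal module.
SELECT TENANT_ID, SVR_IP, CTX_NAME, MOD_NAME,
       ROUND(HOLD / 1024 / 1024, 1) AS hold_mb
FROM oceanbase.GV$OB_MEMORY
ORDER BY HOLD DESC
LIMIT 20;
-- 4. Clog disk usage per node.
SELECT SVR_IP,
       ROUND(LOG_DISK_IN_USE / LOG_DISK_CAPACITY, 2) AS log_disk_used_ratio
FROM oceanbase.GV$OB_SERVERS;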
Application lock conflict
When concurrent updates are performed on specific rows in some scenarios, row lock conflicts are likely to occur, which can significantly increase the waiting time for various requests. This, in turn, can lead to application connection pool full issues. Lock conflicts generally occur in the following scenarios:
The SQL response time (RT) on the business side increases, leading to row lock conflicts. In this case, you need to address the high RT first. For more information, see Application connection pool full/Application request timeout earlier in this topic.
There are uncommitted transactions or orphan transactions holding locks on the rows.
Hotspot row lock conflicts occur due to concurrent updates, which in turn increase the RT.
Hotspot row updates are a common issue in relational databases. In emergency situations, there is often no quick fix (given the nature of the business, the general approaches are throttling and degradation). From a long-term optimization perspective, you can consider optimizing the transaction logic, for example by using SELECT FOR UPDATE WAIT 1, as sketched below. OceanBase Database is also actively exploring solutions to hotspot concurrency issues and has implemented effective optimization measures. For more information, see the Early release of row locks feature in OceanBase Database, which can improve the concurrency performance for single-row operations.
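The sketch below shows the SELECT FOR UPDATE WAIT pattern mentioned above; the table and column names are illustrative, and syntax support may depend on your version and compatibility mode.
-- Try to lock the hot row, but give up after 1 second instead of queuing indefinitely.
BEGIN;
SELECT balance FROM account WHERE id = 42 FOR UPDATE WAIT 1;
-- If the lock is acquired in time, apply the change and commit;
-- if the wait times out, an error is returned and the application can retry later or degrade gracefully.
UPDATE account SET balance = balance - 10 WHERE id = 42;
COMMIT;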
Change implementation, rollback, and considerations
Having read the preceding sections, you should now have a better understanding of how to handle common emergency situations. The remaining task is to implement the appropriate emergency measures to restore the database to normal operation. When handling emergencies in OceanBase Database, pay attention to the following points:
Clearly record and verify each emergency change operation, which can serve as a reference for further diagnostics if the recovery fails.
We recommend that you record and back up relevant information during the emergency, including observer logs and core files generated during the abnormal period.
The emergency recovery operations discussed in this chapter assume that the majority of replicas remain intact and that data consistency is not compromised. If you need to perform operations such as activating a single replica or handling clogs across multiple replicas, contact OceanBase Technical Support.
After the emergency actions are completed and the business is stabilized, we recommend that you roll back the parameter changes and conduct a post-mortem analysis of the issue or failure based on the on-site information preserved during the emergency. This will facilitate the root cause analysis.