The emergency procedure for troubleshooting a database exception is divided into the following steps: event notification, exception information collection, analysis and diagnosis, and decision-making. This topic describes methods that you can use to locate and troubleshoot the root cause of common exceptions.
Event notification
You can be informed of an emergency event in many ways, such as application alerts, database alerts, user feedback, and routine inspections. After you notice an emergency event, immediately notify the troubleshooter of the event and its impact, and collect error messages to support analysis and handling.
Exception information collection
We recommend that you collect and sort the information about an exception based on the following dimensions:
Phenomenon: the business-side symptoms, such as failures to log in, submit orders, or open a specific page. Locate the database error based on business alerts and monitoring information, such as reports from the application's data access layer (DAL), middleware, and clients.
Exception time: the specific point in time at which the fault or exception occurs, and the duration of the fault or exception.
Scope of the exception: Confirm whether the exception occurs in a single application or all applications and whether the exception affects all OceanBase clusters or a specific cluster or tenant. For a global exception, troubleshoot the hardware infrastructure. For a local exception, confirm the details of the specific cluster, tenant, database, or ODP.
Error logs: Collect the logs that are generated when the exception is triggered. You can collect logs generated by the application DAL and the client.
Changes: Confirm whether any business changes were made or new business was released around the time the exception occurred. If so, collect the details. In practice, a large number of faults and exceptions are caused by changes.
Impact on associated business: Confirm the impact scope of the exception and whether the exception can affect upstream and downstream systems if it persists. You can determine whether to downgrade tasks or throttle the traffic in advance based on the information available. This involves not only the database but also the overall strategy of site reliability engineering (SRE), which is not covered in this topic.
Analysis, diagnosis, and decision tree
As mentioned in the preceding sections, emergency events fall into two types: exceptions detected by applications, and exceptions identified through database inspections and alerts. Events of the latter type point to specific database exceptions, whose emergency procedures are described in detail in the following sections. This section describes the logic and methods for troubleshooting errors and exceptions reported from the business side, based on a step-by-step analysis.
Basic troubleshooting
Troubleshooting logic
Perform a health check on the OceanBase cluster.
To avoid wasted troubleshooting effort, rule out the most common exceptions before you focus on a specific emergency event. The normal operation of OceanBase Database may be affected by the following hardware and software exceptions:
NTP clock offset
Server failure
Full usage of the clog disk or data disk
Network jitter in the IDC
Faults of F5 load balancer, Linux Virtual Server (LVS), or other load balancing components
OceanBase Cloud Platform (OCP) provides the health check feature to help you perform the basic cluster health check. For more information, see OCP User Guide.
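In addition to the OCP health check, you can run a quick manual check from the sys tenant. The following sketch assumes an OceanBase Database V4.x cluster, where server status is exposed through the DBA_OB_SERVERS view (V3.x clusters expose similar data through the __all_server internal table); adjust the view name to your version.

    -- Run in the sys tenant. Any node whose STATUS is not ACTIVE, or that has a
    -- non-NULL STOP_TIME or an empty START_SERVICE_TIME, deserves attention.
    SELECT SVR_IP, SVR_PORT, ZONE, STATUS, STOP_TIME, START_SERVICE_TIME
    FROM oceanbase.DBA_OB_SERVERS;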
Check for any traffic surges when the exception occurs.
After the most common exceptions are ruled out, check whether the volume of external requests increased significantly when the exception occurred. Surging business traffic can exhaust resources.
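One hedged way to confirm a traffic surge is to count recent requests in the SQL audit view. The sketch below assumes a V4.x MySQL-mode deployment where the view is named GV$OB_SQL_AUDIT (V3.x uses gv$sql_audit); note that the audit view is an in-memory ring buffer, so its retention window is limited.

    -- Requests per minute recorded in the audit buffer; a sharp jump around the
    -- exception time points to a traffic surge. REQUEST_TIME is in microseconds.
    SELECT FROM_UNIXTIME(FLOOR(REQUEST_TIME / 1000000 / 60) * 60) AS minute_ts,
           COUNT(*) AS request_count
    FROM oceanbase.GV$OB_SQL_AUDIT
    GROUP BY minute_ts
    ORDER BY minute_ts DESC
    LIMIT 30;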
Analyze database errors or exceptions related to applications
In the previous two steps, you can identify only some of the exceptions, such as server failures, network disconnections, full disks, and traffic surges. Further analysis is required to confirm exceptions that are less noticeable, such as I/O exceptions, network jitter, data distribution changes, and rarely triggered business logic. In general, application-related database errors and exceptions are classified into the following types:
Exhaustion of the application connection pool
Application request timeout
Application connection failure
Application write failure
Application lock contention
The following sections describe the preceding exceptions.
Exhaustion of the application connection pool or application request timeout
The exhaustion of the connection pool is one of the most common database exceptions observed on the application side. Many exceptions can trigger a chain of consequences that ends in a fully occupied connection pool, which means that the application cannot establish new connections to the database. Application request timeout is the direct cause of connection pool exhaustion. The exceptions that can cause a request timeout are described below in descending order of occurrence frequency:
A slow SQL query is executed in the database, which results in prolonged response time and constant connection retries by the application.
In this case, you can either troubleshoot the execution plan of the slow SQL query or throttle the business traffic. For more information, see Exceptions of SQL queries.
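For the slow-SQL case above, a hedged sketch for locating the slowest recent statements through the SQL audit view (V4.x name GV$OB_SQL_AUDIT; V3.x uses gv$sql_audit) is shown below; times are in microseconds.

    -- Top recent statements by elapsed time over roughly the last 10 minutes.
    SELECT SQL_ID, ELAPSED_TIME, QUEUE_TIME, EXECUTE_TIME, QUERY_SQL
    FROM oceanbase.GV$OB_SQL_AUDIT
    WHERE REQUEST_TIME > (UNIX_TIMESTAMP(NOW()) - 600) * 1000000
    ORDER BY ELAPSED_TIME DESC
    LIMIT 20;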
The SQL response time is long but no faulty execution plan is found, or long-running transactions exist. In this case, use one of the following methods:
View the session requests that are being executed (a sketch follows this list), and analyze the transaction status based on the information provided by the business side. The possible causes are exceptions on an OBServer node, such as I/O, memory, or network exceptions. You can try to isolate the suspected node and check whether the issue is resolved. If the issue persists, apply downgrades as appropriate for the specific exception. For more information, see Other exceptions related to hardware and network.
If no hardware fault is identified, but the disk I/O load is high or the network interface controller (NIC) is overloaded at the node, you can troubleshoot the disk I/O and NIC load. For more information, see Overloaded NIC on an OBServer and High disk I/O on an OBServer.
Troubleshoot the issue of buffer tables, which causes exceptions in all databases of the log-structured merge-tree (LSM-tree) architecture. For more information, see Buffer tables.
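For the first method in the list above (viewing the session requests that are being executed), a minimal sketch follows. It assumes a V4.x deployment where active sessions are exposed through GV$OB_PROCESSLIST; on older versions, SHOW FULL PROCESSLIST on each ODP or OBServer node provides similar information.

    -- Sessions that have been executing a statement for more than 10 seconds.
    SELECT ID, USER, HOST, DB, COMMAND, TIME, STATE, INFO, SVR_IP
    FROM oceanbase.GV$OB_PROCESSLIST
    WHERE COMMAND <> 'Sleep' AND TIME > 10
    ORDER BY TIME DESC;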
Insufficient CPU or memory resources at a database node. Insufficient CPU causes requests to accumulate in the tenant queue, and full memory usage can cause write failures. Both exceptions increase the SQL response time.
In this case, we recommend that you check the CPU and memory resources of the tenant, and scale up the tenant or take other measures as needed. For more information, see Accumulated request queue of a tenant and Full occupation of tenant memory.
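A hedged way to review the tenant's resource specification before scaling is to query the unit configs from the sys tenant. The sketch assumes V4.x view names, and the unit config name my_unit_config in the ALTER statement is illustrative.

    -- Current CPU and memory specification of each unit config (sys tenant, V4.x).
    SELECT UNIT_CONFIG_ID, NAME, MAX_CPU, MIN_CPU, MEMORY_SIZE
    FROM oceanbase.DBA_OB_UNIT_CONFIGS;

    -- Scale up the config that backs the affected tenant (values are illustrative).
    ALTER RESOURCE UNIT my_unit_config MAX_CPU = 8, MEMORY_SIZE = '32G';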
Application connection failure
An application may fail to establish a connection to an ODP or an OBServer node. Connection timeouts occur more frequently than outright failures to establish a connection.
Failed to establish a connection to an ODP.
If an application fails to establish a connection to an ODP, first check whether the network trace between the application and the ODP is normal. Then, check whether the load balancing devices configured for the ODP are free of faults. If both checks pass, errors may have occurred in the ODP itself, or all ODP threads may be in use because of a traffic increase. If all threads are used, an error message such as too many sessions is returned. For more information about the emergency procedure, see ODP faults and Full usage of ODP threads.
Failed to establish a connection to an OBServer node.
Check whether the network trace between the ODP and OBServer is free of faults. In an environment with a stable network infrastructure, the network is normal in most cases. If no obvious indications of network faults are found, move on and check the connection to the OBServer node. You can troubleshoot the following exceptions:
Accumulated request queue of the sys tenant or a user tenant. You need to view the observer log to check whether the request queue of the sys tenant is accumulated. The connection with the sys tenant is necessary for processing requests in an OceanBase cluster. If exceptions occur in the sys tenant, the connection may fail. For more information about the emergency procedure, see Accumulated request queue of the SYS tenant.
High CPU utilization or I/O load at the node. You need to view the observer log to check whether the request queue of any user tenants is accumulated. Insufficient resources of a user tenant can also cause connection timeout. For more information about the emergency procedure, see Accumulated request queue of a tenant.
Resource occupation due to exceptions of SQL queries. For more information about the emergency procedure, see Exceptions of SQL queries.
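Besides the observer log, a rough SQL-side signal of request queueing is the QUEUE_TIME column of the audit view; consistently large values for a tenant suggest its worker queue is backed up. This sketch again assumes the V4.x view name GV$OB_SQL_AUDIT (V3.x: gv$sql_audit).

    -- Average and peak queueing time (microseconds) per tenant and node over the last 5 minutes.
    SELECT TENANT_ID, SVR_IP,
           AVG(QUEUE_TIME) AS avg_queue_us, MAX(QUEUE_TIME) AS max_queue_us
    FROM oceanbase.GV$OB_SQL_AUDIT
    WHERE REQUEST_TIME > (UNIX_TIMESTAMP(NOW()) - 300) * 1000000
    GROUP BY TENANT_ID, SVR_IP;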
Application write failure
Most data write exceptions in OceanBase Database are related to memory usage. The following sections describe the diagnosis of data write exceptions in descending order of occurrence frequency:
Full usage of user tenant memory due to batch writes in a short period, such as batch import, correction, and deletion.
The error message Over tenant memory limits is returned for this exception. OceanBase Database is built on an LSM-tree architecture: all write operations consume memory, and the occupied memory is released only after minor or major compactions. If minor compactions lag behind memory consumption, the memory can become fully occupied and data writes from the application may fail. For more information about the emergency procedure, see Full usage of tenant memory.
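To see how close a tenant's MemStore is to its limit, you can query the memory views from the sys tenant. The sketch assumes V4.x names (GV$OB_MEMSTORE); V3.x exposes similar data through gv$memstore, so treat the exact column names as version-dependent.

    -- Writes start failing once MEMSTORE_USED approaches MEMSTORE_LIMIT; FREEZE_TRIGGER
    -- is the point at which a minor compaction (freeze) should normally kick in.
    SELECT TENANT_ID, SVR_IP, ACTIVE_SPAN, FREEZE_TRIGGER, MEMSTORE_USED, MEMSTORE_LIMIT
    FROM oceanbase.GV$OB_MEMSTORE;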
Full usage of memory due to blocked minor compactions caused by long-running or suspended transactions.
If long-running or suspended transactions cannot be committed, the active memory cannot be frozen for a minor compaction, which results in full memory usage. You can check for statements that have been executing for a long time, or execute SQL statements to identify long-running transactions, as in the sketch below.
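A hedged sketch for identifying long-running transactions on a V4.x cluster is shown below; it relies on the GV$OB_TRANSACTION_PARTICIPANTS view, and the exact view and column names vary across versions, so adjust them to your deployment.

    -- Transaction participants whose context was created more than 10 minutes ago.
    SELECT TENANT_ID, TX_ID, SESSION_ID, STATE, CTX_CREATE_TIME
    FROM oceanbase.GV$OB_TRANSACTION_PARTICIPANTS
    WHERE CTX_CREATE_TIME < DATE_SUB(NOW(), INTERVAL 10 MINUTE);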
Full memory usage due to the failure of memory allocation to the MemStore.
In this case, memory is occupied by modules other than the MemStore, typically because of exceptions of SQL queries. As a result, errors may occur during data writes even though the MemStore usage is not high. For more information, see Memory insufficiency or leakage of internal modules.
Full usage of the clog disk, which causes the majority of Paxos group members to fall out of sync and leaves partitions leaderless.
The clog disk stores transaction logs of OceanBase Database based on the write-ahead logging (WAL) mechanism. If the disk is full, the data write fails. For more information about the emergency procedure, see Full usage of OBServer clog disk.
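The disk-usage thresholds that govern when log writes stop are controlled by cluster parameters whose names differ between versions (for example, clog_disk_usage_limit_percentage in V3.x and log_disk_utilization_limit_threshold in V4.x), so a simple hedged check is to list everything that matches:

    -- Run in the sys tenant; shows the current log-disk thresholds and their values.
    SHOW PARAMETERS LIKE '%log_disk%';
    SHOW PARAMETERS LIKE '%clog_disk%';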
Blocked major freeze due to RootServer exceptions.
This exception is rare. OceanBase Database freezes the major data version after the amount of data compacted by minor compactions reaches the specified threshold, or before the daily major compaction. In other words, the active memory is frozen before it can be compacted by a minor or major compaction. In most cases, the freeze completes within 1 second. However, it takes longer when exceptions occur, and you cannot perform data writes while the freeze is in progress. In this case, you can manually switch over or restart the RootServer. For more information, see Blocked cluster freeze or major compaction.
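A minimal sketch of the related checks and actions from the sys tenant, assuming V4.x view names, is shown below; whether a manual major freeze is appropriate depends on the diagnosis, so treat it as an illustration rather than a fixed procedure.

    -- Locate the node that currently hosts the RootService.
    SELECT SVR_IP, SVR_PORT, ZONE, WITH_ROOTSERVER
    FROM oceanbase.DBA_OB_SERVERS;

    -- After the RootServer issue is resolved, a major freeze can be triggered manually.
    ALTER SYSTEM MAJOR FREEZE;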
Application lock contention
Row lock contention may occur when specific rows are updated by concurrent requests. This can not only prolong the lock wait time of requests but also result in the exhaustion of the application connection pool. Lock contention is caused by the following exceptions:
The row lock cannot be released in time because the SQL response time has increased. In this case, reduce the response time. For more information, see the preceding section Exhaustion of the application connection pool or application request timeout.
The row is associated with long-running or suspended transactions.
Concurrent update of hotspot rows, which in turn increases the response time.
Exceptions during the update of hotspot rows are common to all relational databases, and no ideal emergency procedure is available beyond traffic throttling and task downgrade. From the perspective of long-term optimization, you can adjust the transaction logic, for example by executing the select for update wait 1 statement, as in the sketch below. OceanBase Database also provides effective optimizations for concurrent updates of hotspot rows. For more information about how to improve the concurrent update of a single row, see Early lock release.
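The transaction-logic adjustment mentioned above bounds the lock wait instead of letting requests queue indefinitely on a hot row; the table and column names in this sketch are illustrative.

    -- Give up after waiting 1 second for the row lock instead of blocking the connection.
    SELECT balance FROM account WHERE id = 42 FOR UPDATE WAIT 1;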
Modification, rollback, and precautions
The preceding analysis is intended to help you understand the emergency procedure for handling common exceptions. You can now take response actions to restore the database service. However, take note of the following items when you perform the emergency procedure:
Record and review each step. This helps you in further diagnosis if the restoration fails.
Back up and preserve as much material as possible, including the observer logs generated during the exception and any core files.
The emergency procedure provided in this topic is expected to maintain the majority of the cluster data and ensure data consistency. If you need to activate a single replica due to the fault of multiple replicas, or process the clogs of multiple replicas, contact OceanBase Technical Support.
After you perform the emergency procedure and restore the service, we recommend that you roll back any temporary parameter changes and review the exception or fault elimination process. Then, you can start troubleshooting the root cause based on the information obtained.