7 Key Technologies to Ensure High Availability in OceanBase Database


This article describes the high-availability technologies used in OceanBase Database and answers the following questions about OceanBase Database:

  • A distributed system readily tolerates on-off faults, such as downtime, but suffers from non-on-off faults, such as NIC packet loss and I/O hangs. How does OceanBase Database address this pain point?
  • A distributed system usually has a “thick” client to solve routing problems. How does the OceanBase Database client solve routing and disaster recovery problems in various scenarios?
  • A distributed system usually has a large number of RPC interactions between nodes. How does OceanBase Database maintain the blacklist?
  • A multi-tenancy system usually has the pain point of resource preemption during peak hours. How does OceanBase Database isolate resources so that tenants do not affect each other?
  • A distributed system manages its metadata in a master structure or a P2P structure. How does OceanBase Database manage metadata and ensure high availability?
  • SQL statements cause most database failures. How does OceanBase Database support SQL tuning?
  • How does OceanBase prevent data errors caused by program or hardware issues?

Background

Physics answers two questions: what is the universe made of, and how are its constituents coupled together, namely by the four fundamental forces?

As a distributed database system, what problems does OceanBase Database solve? One is the scalability problem, and the other is the high availability problem.

Distributed storage systems share a set of design commonalities. Data partitioning is used to implement horizontal scaling, but it requires cross-partition coordination, which is addressed by the two-phase commit (2PC) protocol. A multi-replica architecture is used to achieve high availability, but it introduces data consistency problems, which are addressed by the Paxos distributed consensus protocol. The cluster topology, data partitioning, and multi-replica architecture also bring challenges such as the management of massive metadata and client routing. The distributed architecture further introduces problems such as membership changes, dynamic load balancing, resource isolation, and blast radius control.

This article describes the solutions of OceanBase Database for common issues of distributed storage systems and its highly available design.

Multi-replica architecture

To achieve high availability, traditional databases adopt either the primary/secondary architecture or the shared storage architecture. The primary/secondary architecture involves a trade-off between strong consistency and high availability. For example, Oracle Data Guard allows you to choose the maximum protection mode, in which the primary database becomes unavailable as soon as the standby database does, resulting in low availability. If you choose the maximum availability mode instead, data may be lost during an abnormal failover to the standby database, which is unacceptable. This is a typical CAP trade-off.

In a shared storage architecture, such as Oracle RAC, the business commits transactions to an instance, which writes data and transaction logs to the shared storage. When an instance fails, its traffic can quickly be taken over by another instance because the instances are stateless. However, this solution is costly and cannot be deployed across IDCs because it depends on shared storage, and it still involves a trade-off between strong consistency and high availability.

OceanBase Database adopts the Paxos protocol and can be deployed on cost-effective commodity servers. Each commit is persisted on a majority of replicas (in a deployment of at least three servers), so no data is lost when a minority of servers fail, ensuring a recovery point objective (RPO) of 0. If the leader fails, the remaining healthy servers automatically elect a new leader and resume service within a short period, without external tools or manual intervention, ensuring a recovery time objective (RTO) of less than 30 seconds. This design lets OceanBase Database balance strong consistency and high availability. OceanBase Database supports flexible deployment modes, including three IDCs in the same city, three IDCs across two regions, and five IDCs across three regions, to implement lossless disaster recovery at all levels, including the IDC level and the city level.
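
To make the majority rule concrete, here is a minimal Python sketch (illustrative only, not OceanBase code) of the quorum arithmetic behind RPO = 0: a record counts as committed only once a majority of replicas has persisted it, so any surviving majority still holds every committed record.

```python
def majority(n: int) -> int:
    """Quorum size for n replicas: 2 of 3, 3 of 5, and so on."""
    return n // 2 + 1

def is_committed(acks: list) -> bool:
    """A log record is committed once a majority of replicas acked it."""
    return sum(acks) >= majority(len(acks))

# Three replicas: one server may fail without losing committed data.
print(is_committed([True, True, False]))   # True  -> committed, RPO = 0
print(is_committed([True, False, False]))  # False -> not yet committed
```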

A distributed system is susceptible to non-on-off faults, such as NIC packet loss caused by a hardware failure, and to a “ping-pong effect” in which leadership is switched back and forth between a normal OBServer and an abnormal one because of the leader priority configuration on RootService.

To solve these problems, OceanBase Database uses an election blacklist. After the leader role is revoked from a replica, that replica cannot be elected as the leader again for a specified period of time, which is controlled by a global parameter.

For a cluster deployed across multiple regions, the leader is preferentially switched to another replica in the same region to avoid a cross-region leader switchover; this rule is ignored only if all other replicas in the same region are in the election blacklist. System administrators or DBAs need to physically isolate and replace the servers in the election blacklist as soon as possible, so that the business is not interrupted again when the blacklist entries expire.
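
As an illustration, the following sketch shows how such a blacklist breaks the ping-pong cycle while still preferring same-region replicas. All names and the 30-minute window are invented stand-ins, not OceanBase internals:

```python
import time

BLACKLIST_WINDOW_SEC = 1800  # stand-in for the global parameter

class ElectionBlacklist:
    def __init__(self):
        self._expiry = {}  # replica id -> time the ban lifts

    def revoke_leader(self, replica: str) -> None:
        """Once a replica loses leadership, ban it for a window."""
        self._expiry[replica] = time.time() + BLACKLIST_WINDOW_SEC

    def eligible(self, replica: str) -> bool:
        return time.time() >= self._expiry.get(replica, 0.0)

def pick_leader(candidates, regions, local_region, bl: ElectionBlacklist):
    """Prefer eligible same-region replicas; fall back to any region
    only if every same-region candidate is blacklisted."""
    local = [c for c in candidates if regions[c] == local_region]
    for pool in (local, candidates):
        ok = [c for c in pool if bl.eligible(c)]
        if ok:
            return ok[0]
    return None  # no eligible replica anywhere
```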

If an OBServer is unstable, it may fail to provide services from time to time and may have various abnormal conditions, such as slow response. If a client connects to this OBServer, the client may suffer from unstable database services. OceanBase Database provides the stop server command to isolate such OBServers. Upon isolation, these OBServers no longer provide external services, and client requests are no longer routed to them. This is a powerful tool for O&M and emergency operations, allowing DBAs to perform diagnostics, server replacement, and maintenance operations in a secure way.

The stop server command checks whether all the remaining replicas meet the majority requirement and whether logs are synchronized. If the conditions are not met, the stop server command returns a failure. If the conditions are met, the stop server command returns success after all the leaders on the faulty server are switched to normal servers. OceanBase Database also provides the isolate server command, which allows you to isolate servers. Compared with the stop server command, the isolate server command only switches all leaders on the faulty server to other servers. It can avoid execution failures of the stop server command in some extreme scenarios.

After the stop server command succeeds, the observer process on the stopped server can be safely killed; this is not the case after the isolate server command.

OceanBase Database adopts a checksum verification mechanism to address bad data blocks caused by “silent errors”, that is, corruption introduced by hardware during data persistence. Data persistence in OceanBase Database covers redo logs and SSTables. The checksum of each redo log entry is recorded in its header and is forcibly verified during network transmission and log replay, ensuring that the redo logs synchronized to the three replicas, and the data replayed into memory, are 100% correct. An SSTable is composed of macroblocks: each macroblock has a fixed length of 2 MB and consists of microblocks of variable length, around 16 KB by default. A macroblock is the smallest write I/O unit, and a microblock is the smallest read I/O unit.

The header of each microblock records the checksum of the entire microblock, which is verified each time the microblock is read. Likewise, the header of each macroblock records the checksum of the entire macroblock, which is verified during data replication and migration to prevent the spread of silent errors. This mechanism has caught bad data blocks caused by disk firmware bugs many times in the Ant Financial environment.
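
The read-path verification can be pictured as follows. This is a hedged sketch: the real block layout and checksum algorithm are OceanBase internals, and zlib.crc32 is only an illustrative stand-in:

```python
import zlib

def write_block(payload: bytes) -> bytes:
    """Prepend a 4-byte CRC32 'header' to the payload."""
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload

def read_block(block: bytes) -> bytes:
    """Verify the checksum on read; refuse to return corrupt data."""
    stored = int.from_bytes(block[:4], "big")
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("checksum mismatch: silent disk error detected")
    return payload

blk = write_block(b"row data...")
assert read_block(blk) == b"row data..."
corrupt = blk[:-1] + bytes([blk[-1] ^ 0xFF])  # flip one bit
# read_block(corrupt) now raises IOError instead of returning bad data.
```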

To address data errors caused by program bugs during major compactions, OceanBase Database also uses checksums. The row checksum is the cumulative checksum of the row data in an SSTable, and the column checksum is the checksum of each column's data. During a major compaction, the database checks whether the column checksum of each index column is consistent with that of the corresponding column in the primary table, ensuring consistency between the primary table and its indexes. It also checks whether the row checksums and column checksums of the replicas of each partition are consistent, ensuring consistency between those replicas. If a checksum error is found, the major compaction is suspended immediately and an error flag is set, so that manual intervention is triggered. Because the earlier version of the SSTable still exists, the business is not affected.
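
The cross-replica consistency check can be sketched like this, with invented names and data:

```python
def verify_partition(replica_checksums: dict) -> None:
    """replica_checksums maps replica id -> (row_cksum, col_cksums...).
    All replicas of a partition must agree after a major compaction."""
    distinct = set(replica_checksums.values())
    if len(distinct) > 1:
        raise RuntimeError(
            f"checksum divergence across replicas: {replica_checksums}; "
            "suspending major compaction for manual intervention"
        )

verify_partition({"r1": (101, 7, 9), "r2": (101, 7, 9), "r3": (101, 7, 9)})
# verify_partition({"r1": (101, 7, 9), "r2": (101, 7, 8)})  # would raise
```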


OceanBase data links

Let’s take a look at the data links of the OceanBase Database. A data link in OceanBase Database connects the app, OBProxy, and OBServer. The app uses any MySQL driver to connect to the OBProxy to send requests. As a distributed database, OceanBase Database stores user data in partitions, with multiple replicas for each partition across multiple OBServers. The OBProxy forwards the user requests to the most appropriate OBServer for execution and then returns the execution result to the user.

Each OBServer can also forward requests: if an OBServer cannot execute a request it has received, it forwards the request to the correct OBServer. This scenario involves remote execution and therefore has a longer response time (RT).

As a reverse proxy server of OceanBase Database, the OBProxy provides routing and disaster recovery from the client to the database, so that users are unaware of the distributed architecture of the database. An OBServer has a complete set of SQL engines, storage engines, and distributed transaction engines. It is responsible for parsing SQL statements to generate and execute physical execution plans and committing transactions based on the Paxos protocol. OBServers provide database services that feature high performance, high availability, and high scalability.

OBProxies and OceanBase Database are loosely coupled, so incorrect routing is tolerated when the routing information on an OBProxy is not updated in time: the OBServer that receives a misrouted request forwards it internally and, when returning the result, notifies the OBProxy to refresh its routing information.

Routing

Routing is a core feature of OBProxies. An OBProxy implements a simple SQL Parser module to parse the database name, table name, and hint in an SQL statement, so as to select the most suitable OBServer for the request according to the SQL statement, routing rules, and OBServer status.

Strong-consistency reads and DML statements are expected to be sent to the OBServer where the replica leader is located, whereas weak-consistency reads can be sent to either the leader or a follower. For an OceanBase cluster deployed in multiple regions, OBProxies also provide the logical data center (LDC) routing mechanism, which sends user requests to an OBServer in the same IDC whenever possible, then to an OBServer in the same city, and only then to an OBServer in another city. For an OceanBase cluster deployed in read/write splitting mode, OBProxies provide rules such as RZone-first read, RZone-only read, and non-major-compaction-first read, for you to choose based on your business characteristics. OBProxies also provide the anti-sequential-fallback feature for weak reads.
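
The LDC locality preference can be sketched as a simple sort. The field names are invented, and the real OBProxy also weighs replica roles, blacklist state, and read/write-splitting rules:

```python
def ldc_order(replicas, client_idc: str, client_city: str):
    """Sort candidate replicas by locality; each replica is a dict
    with 'server', 'idc', and 'city' keys."""
    def rank(r):
        if r["idc"] == client_idc:
            return 0  # same IDC: best
        if r["city"] == client_city:
            return 1  # same city, different IDC
        return 2      # remote city: last resort
    return sorted(replicas, key=rank)

replicas = [
    {"server": "10.0.3.1", "idc": "hz-3", "city": "hangzhou"},
    {"server": "10.0.1.1", "idc": "hz-1", "city": "hangzhou"},
    {"server": "10.1.1.1", "idc": "sh-1", "city": "shanghai"},
]
print([r["server"] for r in ldc_order(replicas, "hz-1", "hangzhou")])
# ['10.0.1.1', '10.0.3.1', '10.1.1.1']
```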

The key to routing is obtaining the routing information. The OBProxy caches routing information to locate the replicas involved in user SQL statements. The routing information is fetched synchronously on the first request; for subsequent requests, the OBProxy uses a passively triggered update mechanism, in which business SQL statements trigger an update and the routing information is refreshed asynchronously, without adding to the execution time of the business SQL statements. A routing information update is triggered in the following scenarios (a sketch of the mechanism follows the list):

  • The actual routing is not as expected. For example, the leader is expected but a follower is selected, or a follower is expected but the leader is selected.
  • The server list of the cluster has changed. For example, the server in the routing information no longer belongs to the cluster, or a server has been added to or removed from the cluster.
  • A read-only replica has changed or a replica has been migrated. The OBProxy learns of a tenant's replica changes, for example, the addition of a read-only replica or the migration of a follower, from internal tables, and then invalidates all routing information of that tenant.
  • The OBServer returns a specific error, such as “tenant not in this server”.
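
Here is the sketch referenced above. In this passive scheme, the response path only marks a routing entry dirty, and a background thread performs the actual refresh, so business SQL statements never wait on it. All names are invented:

```python
import queue
import threading
import time

dirty_tables = queue.Queue()  # table names whose routing is stale

def on_sql_feedback(table, routed_to_leader, expected_leader):
    """Called on the response path; it only *marks* the entry dirty."""
    if routed_to_leader != expected_leader:
        dirty_tables.put(table)  # async: adds no latency to this SQL

def refresher():
    while True:
        table = dirty_tables.get()
        # ...fetch fresh partition locations, update the route cache...
        print(f"refreshed routing for {table}")

threading.Thread(target=refresher, daemon=True).start()
on_sql_feedback("orders", routed_to_leader=False, expected_leader=True)
time.sleep(0.2)  # give the background refresh a moment to run
```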

Table routing information is also cached on each OBServer and managed by the location cache, a basic module of the OBServer. It caches partition location information and replica-level metadata, which can be accessed by other modules of the OBServer, such as the storage, transaction, and SQL modules, as well as by OBProxies. In essence, the location cache maintains the server list and the blacklist. All modules of an OBServer share one location cache.

The location cache refreshes and caches information passively, driven by the calls of these modules, and supports two refresh modes: RPC refresh and SQL refresh. An RPC refresh consumes few resources and can quickly detect the leader, the Paxos member list, and the locations of read-only replicas directly cascaded under the Paxos members in the same region; however, it cannot detect changes of read-only replicas cascaded two or more levels deep, nor changes of read-only replicas cascaded under other regions. An SQL refresh can detect the locations of all replicas, but consumes more resources and relies on logic such as the meta tables (described below), the SQL engine, and the underlying reporting; it also fails in the case of network exceptions. Therefore, the RPC refresh mode is preferred, and an SQL refresh is periodically forced after successful RPC refreshes to correct inaccurate location information of read-only replicas. The location cache also supports batch refreshes, which speed up refreshes in RTO scenarios and benefit daily SQL execution.

The OBProxy can request the location cache of any OBServer and preferentially sends such requests to OBServers in the same IDC. This mechanism scatters traffic and protects any single OBServer from being overloaded by massive synchronous location cache refreshes caused by business traffic surges, such as those upon an application restart.

Disaster recovery

The OBProxy must adapt to OBServer access control during off-peak major compactions, upgrades, leader switchovers, downtime, and server starts and stops, so that business traffic is not affected by hardware exceptions or internal traffic peaks. To this end, the OBProxy maintains two blacklists for each cluster: a down blacklist and an alive-but-unavailable blacklist.

When an OBProxy receives feedback from OBServers about dead nodes, it records them in the down blacklist, which is periodically refreshed. A blacklisted OBServer is filtered out and becomes inaccessible; access is restored once a later refresh removes the OBServer from the blacklist.

When an OBProxy forwards a request to an OBServer, and the OBServer returns a specific system error or fails to respond M times within N seconds, the OBServer is added to the alive-but-unavailable blacklist. If a zone is in the process of a major compaction or an upgrade, all OBServers in the zone are added to the blacklist; when the major compaction or upgrade completes and the zone status is updated, they are removed from the blacklist and can be accessed again. The system periodically tries to access each OBServer in the alive-but-unavailable blacklist to check whether it has recovered and can be removed from the blacklist.
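
The “M failures within N seconds” rule can be sketched with a sliding window; the thresholds and the probe hook are invented stand-ins for OBProxy internals:

```python
import time
from collections import defaultdict, deque

M_FAILURES, N_SECONDS = 5, 10.0  # invented thresholds

class AliveButUnavailable:
    def __init__(self):
        self._fails = defaultdict(deque)  # server -> failure timestamps
        self._banned = set()

    def record_failure(self, server: str) -> None:
        now = time.time()
        q = self._fails[server]
        q.append(now)
        while q and now - q[0] > N_SECONDS:  # keep a sliding window
            q.popleft()
        if len(q) >= M_FAILURES:
            self._banned.add(server)

    def usable(self, server: str) -> bool:
        return server not in self._banned

    def probe_succeeded(self, server: str) -> None:
        """Periodic trial access: remove a recovered server."""
        self._banned.discard(server)
        self._fails[server].clear()
```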

The down blacklist has a higher priority than the alive-but-unavailable blacklist, and the zone-level state set by a major compaction or an upgrade has a higher priority than the state of an individual OBServer. When no OBServers are available, the system forcibly accesses blacklisted OBServers.

RootService

RootService (RS) is not a single process but a group of services that run on the OBServer hosting the leader of the __all_core_table table. When this leader becomes a follower, RS is stopped on that replica, ensuring that only one RS runs in a cluster. The __all_core_table table is the first table generated when the cluster starts and stores the information necessary for RS startup; all other partitions can be indexed from it. RS runs in the sys tenant, which hosts all system tables and various background services. RS manages metadata, cluster resources, and major compactions, and provides the entry point for O&M commands.

RS manages the metadata of OBServers: it monitors the status of each OBServer in the cluster through the heartbeat mechanism, updates system tables in real time, and handles exceptions. RS also propagates information such as configuration and schema changes to each OBServer via heartbeat packets.

RS also manages the metadata of partitions, which is stored in system tables referred to as meta tables. The __all_root_table table contains the partition information of all system tables, and the __all_tenant_meta_table table contains the partition information of all user tables.

RS keeps this metadata accurate by fetching it over RPC and by running scheduled tasks. The location cache caches the data of the meta tables, allowing other internal modules of the OBServer and OBProxies to query partition locations and replica-level metadata. At present, the location cache supports only passive refresh; active refresh, in which replica changes are broadcast to all caches, is planned.

To avoid the circular dependency that the queuing and thread pool models could create when location information is fetched across multiple servers, separate queues are used for the system tables at each level. In other words, the __all_core_table table, the __all_root_table table, other system tables, and user tables each have their own queues and worker threads.


OceanBase RootService

RS manages cluster resources, covering leader management, partition load balancing, and resource unit load balancing. Leader management includes scenarios such as switching the leaders of all partitions in a partition group to the same OBServer, switching leaders to the primary zone, and rotating compaction. Partition load balancing changes the distribution of partition groups across a tenant's resource units, also known as units, to minimize the load differences between the units. Unit load balancing schedules the distribution of units across OBServers to minimize the load differences between OBServers.

Unlike a minor freeze, also known as a minor compaction, which each OBServer performs independently, a major freeze, or major compaction, is coordinated and initiated by RS as a two-phase distributed transaction involving RS and all partition leaders; a leaderless partition therefore causes the major freeze to fail. A major compaction can be triggered by business writes when the number of minor compactions reaches a threshold controlled by a global parameter, on a schedule, such as a daily major compaction scheduled for off-peak hours, or manually. RS also controls rotating compactions to reduce the impact of compactions on the business.

RS provides the entry point for management commands, such as BOOTSTRAP, ALTER SYSTEM, and DDL commands. The BOOTSTRAP command is run during system bootstrap to create system tables and initialize system configurations. DDL commands, such as CREATE TABLE, CREATE INDEX, and DROP TABLE, bypass the optimizer and are sent to RS. Schema changes resulting from DDL commands are saved in the system tables and updated in memory, and all online OBServers are then notified of the new version so that they refresh their schemas.

ALTER SYSTEM is a group of general O&M commands for general and emergency operations such as tenant creation, tenant resource or deployment adjustment, leader switchover, server stoppage or isolation, unit migration, execution plan binding, and throttling.

The multi-replica mechanism of OceanBase Database ensures that RS is not affected by the failure of a minority of nodes. However, RS still carries the risk of a single point of failure (SPOF): if bugs cause RS exceptions, the availability of OceanBase Database services is affected, leading to serious consequences such as clients failing to establish connections, major compaction exceptions, and the inability to respond to emergencies. The stability of RS is crucial to the stability of an OceanBase cluster, because RS is essentially the MetaServer of OceanBase Database, and the MetaServer is the Achilles heel of any distributed storage system: if an error occurs in the MetaServer, the whole cluster may fail. It is therefore necessary to reduce path dependence and the blast radius, and we have taken many actions to ensure RS robustness.

OceanBase Database supports a detection mechanism that can detect RS exceptions and run O&M commands to perform RS switchover and recover service. RS exceptions include leaderless RS, failure of leader RS switchover, stuck RS threads, exceptions in snapshot point recovery, parameter exceptions, full usage of worker threads, and full usage of DDL threads. These exceptions can be classified into the following two categories: RS exceptions that do not affect the request processing, and RS exceptions that make RS unavailable. When an exception of the former category occurs, you can run the stop server command to isolate the exception and switch RS to a healthy server. When an exception of the latter category occurs, you can use external administration tools to forcibly switch RS to a healthy server, and isolate the faulty server after the new RS starts working.

RS provides the command execution entry point through which you collect system status data and run O&M and emergency commands; an RS exception, however, can make this entry point fail. To address this, OceanBase Database provides a diagnostic tenant. The diagnostic tenant uses exclusive resources, such as login threads, worker threads, and memory, so that monitoring queries and simple O&M SQL statements keep working after an exception occurs, even when the sys tenant or a user tenant suffers thread or memory exceptions.

Resource isolation

Multitenancy is a key feature of OceanBase Database. A tenant is an important unit for managing database objects and resources. An OceanBase cluster supports simultaneous access by multiple tenants, each of which behaves as if it exclusively owned the cluster resources. Multitenancy brings many benefits. For example, it consolidates multiple database instances into one cluster, which makes database management easier and lowers O&M costs. And when business systems that peak at different hours of the day are deployed in the same cluster, the utilization of system resources can be maximized to support more workloads.

Multitenancy also brings challenges, the hardest of which is resource isolation among tenants: exceptions in a single tenant must be contained so that other tenants are unaffected and the blast radius stays small.

A tenant contains one or more units. Units of the same tenant are distributed across different nodes, and one node hosts units of different tenants. A unit is the smallest granule for allocating CPU and memory resources, and each unit has independent resource specifications.

The kernel of OceanBase Database supports resource isolation between tenants at three levels:

  • Across nodes, the resource quotas for units of the same tenant are dynamically adjusted so that units storing more hotspot data receive more computing resources.
  • On one node, the resource quotas for units of different tenants are dynamically adjusted so that each tenant receives a portion of computing resources matching its quota, and surplus quota is released when the workload is light.
  • Within a unit, resources are isolated. For example, resources for large and small requests are separated to prevent large requests from jamming the queue and exhausting worker threads, ensuring that the system can respond to small requests in time.

The memory of OceanBase Database is divided into several areas, such as the SQL work area, MemStores, the KV cache, and system memory. The SQL work area is the memory used by operators during SQL execution. MemStores store written data and cannot be repurposed; the KV cache caches SSTable data and can be reclaimed for other purposes; the system memory supports the operation of various system threads. The SQL work area, MemStores, and KV cache of each tenant are isolated from those of other tenants, while the system memory is shared by all tenants.

OceanBase Database is single-process software that runs with tenant threads and system threads. Tenant threads are worker threads that process SQL statements and transaction requests. A tenant thread is exclusively used by a tenant. System threads include EasyRPC threads that process network I/O, background timer threads, and threads that handle other system tasks, such as disk I/O, major compactions, writing clogs, and leader election. A system thread is shared by all tenants. The number of worker threads of a tenant depends on the unit specification of the tenant. The number of system threads is specified by the corresponding parameter.

An EasyRPC thread receives a request from the client, parses the tenant information, and dispatches the request to the task queue of the tenant. A tenant thread consumes the requests in the queue and executes the SQL statement and transaction in the following sequence: parse the SQL statement > generate the execution plan > start_trans > start_stmt > execute > end_stmt > end_trans(). A two-phase commit (2PC) is performed during the execution of end_trans(), where tenant and system threads work together to drive the 2PC state machine to completion. During the transaction commit, local logs are first written to the I/O buffer and then flushed to disk by the ClogWriter thread, while RPC messages such as SQL messages, transaction messages, and Paxos log messages are sent by EasyRPC threads.

The most common issue of a queuing system is that large requests can jam the queue and block the processing of small requests. For example, if large requests batch-processed by background tasks coexist in one tenant with small transactional requests, the large requests may occupy all worker threads, delaying or even starving the small requests. To address this issue, OceanBase Database supports queue prioritization.

The task queues of an OceanBase tenant are divided into large request queues and general queues, and general queues are further divided into multiple levels based on request priority; requests in each queue are processed by corresponding worker threads. You can specify the priority in a user request. In OceanBase Database, an SQL query whose execution duration exceeds a threshold, controlled by a global parameter, is labeled as a large request. Set the threshold large enough to avoid mislabeling a general query as a large request because of jitter. The execution plan of a large SQL query is labeled as a large request plan, and any subsequent SQL query that hits this plan is forwarded directly to the large request queue.

By default, OceanBase Database dedicates 30% of CPU resources to processing large requests. If a request is labeled as a large request for the first time during execution, the system checks whether the CPU resources occupied by the threads processing large requests exceed the specified threshold. If so, the system suspends the current thread and creates an additional worker thread to keep processing general requests.
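
A minimal sketch of this labeling-and-dispatch flow, with invented thresholds and names (the real implementation works on worker threads, not Python lists):

```python
LARGE_QUERY_THRESHOLD_MS = 100.0  # stand-in for the global parameter
large_plans = set()               # plan ids labeled as "large"

def dispatch(plan_id: str, general_q: list, large_q: list, request) -> None:
    """A query whose plan carries the 'large' label skips the general queue."""
    (large_q if plan_id in large_plans else general_q).append(request)

def on_execution_finished(plan_id: str, elapsed_ms: float) -> None:
    """The first slow execution labels the plan for all subsequent hits."""
    if elapsed_ms > LARGE_QUERY_THRESHOLD_MS:
        large_plans.add(plan_id)

gq, lq = [], []
dispatch("p1", gq, lq, "req-1")     # unknown plan -> general queue
on_execution_finished("p1", 350.0)  # turned out to be a large request
dispatch("p1", gq, lq, "req-2")     # now routed to the large queue
print(len(gq), len(lq))             # 1 1
```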

In a distributed queuing system, a thread deadlock can occur when, for example, new RPC requests are initiated while an RPC request is being processed: highly concurrent requests can exhaust the worker threads and deadlock the system. To address this issue, OceanBase Database supports multi-level nested queues. Tenant requests are leveled based on the RPC source: user requests are defined at level 0, requests between OBServers at level 1, RPC requests within an OBServer at level 2, and so on. A different number of worker threads then serves the requests at each level.

An RPC request issued by an OBServer is always one level higher than the request being processed and is handled by threads at that next level, which avoids circular dependency. When the request level exceeds a certain threshold, an error is reported.
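
A minimal sketch of leveled dispatch, with an invented maximum level: every nested RPC runs one level above its parent and is served from its own queue, so a burst at one level cannot consume the threads a deeper RPC needs to finish:

```python
MAX_LEVEL = 5  # invented threshold for "nesting too deep"

class LeveledQueues:
    def __init__(self):
        self.queues = [[] for _ in range(MAX_LEVEL + 1)]

    def submit(self, request, level: int) -> None:
        if level > MAX_LEVEL:
            raise RuntimeError("nesting too deep: rejecting request")
        self.queues[level].append(request)

    def nested_rpc(self, parent_level: int, request) -> None:
        """An RPC issued while serving a request runs one level higher."""
        self.submit(request, parent_level + 1)

q = LeveledQueues()
q.submit("user SQL", level=0)        # user requests enter at level 0
q.nested_rpc(0, "inter-server RPC")  # served by level-1 threads
```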

If a request waits too long in a queue or takes too long to execute, it should be aborted to avoid unnecessary consumption of server resources: such a time-consuming request will trigger various timeout errors and eventually fail regardless of whether its execution completes. Moreover, while the request waits at the server, it occupies the corresponding client thread, which may exhaust application threads and make things worse. A fail-fast mechanism is therefore required to keep the client responsive. To this end, OceanBase Database supports attribute-based throttling. This feature lets you throttle requests that consume excessive resources by specifying thresholds for four attributes, CPU utilization, I/O, network, and logical reads, or by specifying a maximum waiting period at the server. With the former method, a request is throttled as soon as any attribute reaches its threshold.

Attribute-based throttling complements the time-based request classification, which considers only execution duration and, once a large request has been identified, still leaves the worker threads of the server and the client occupied.
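
A minimal sketch of the throttling decision, with invented threshold values:

```python
# Invented stand-ins for the configurable thresholds.
LIMITS = {"cpu_ms": 5000, "io_ops": 10000,
          "net_kb": 65536, "logical_reads": 1_000_000}
MAX_QUEUE_WAIT_MS = 2000

def should_abort(usage: dict, queue_wait_ms: float) -> bool:
    """Abort when any tracked attribute crosses its threshold,
    or when the request has already waited too long in queue."""
    if queue_wait_ms > MAX_QUEUE_WAIT_MS:
        return True  # fail fast: keep the client thread responsive
    return any(usage.get(k, 0) > v for k, v in LIMITS.items())

print(should_abort({"cpu_ms": 120, "logical_reads": 2_000_000}, 10))  # True
print(should_abort({"cpu_ms": 120, "logical_reads": 300}, 10))        # False
```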

SQL execution

After an SQL statement is sent from the client, it is routed by an OBProxy and waits in a queue. Then, the system queries the plan cache to find a matching execution plan. If no plan in the plan cache is hit, the optimizer generates an optimized plan. Then, the plan is executed to return the result set.

An exception in any of these steps, such as the OBProxy sending the query to a remote OBServer, queue jamming, hard parsing of SQL statements due to insufficient plan cache memory, or a full table scan due to an inefficient execution plan, can result in longer SQL execution times and hung application threads. This section describes the high availability design of execution plans.

An execution plan is the execution process generated by the optimizer for an SQL statement, generally expressed as a tree of operators. It describes the access path (via the primary table or a secondary index), the join order, the join algorithms (nested loop join, merge join, or hash join), the query rewrite logic, and operators such as SORT and AGGREGATE. Execution plans are classified into local, remote, and distributed plans based on the number and hosting nodes of the partitions involved. To generate an optimal execution plan for an SQL statement, the optimizer takes into account various factors, such as SQL semantics, data statistics, and physical data distribution, and supports both rule-based and cost-based models.

Generating an execution plan is a time-consuming step of SQL execution. To avoid repeated plan generation, the first generated execution plan is stored in the plan cache and reused when the same SQL statement is executed again. You can run the EXPLAIN command to view the execution plan that the optimizer generates for an SQL statement in real time, and query internal tables to view the cached plan that is actually used for the statement.
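
Because OceanBase Database speaks the MySQL protocol, any MySQL driver can run EXPLAIN. A hedged example with PyMySQL follows; the host, credentials, and table are placeholders (2883 is the conventional OBProxy listening port, and the user name conventionally takes the form user@tenant#cluster):

```python
import pymysql

# Placeholder connection details for an OBProxy endpoint.
conn = pymysql.connect(host="obproxy.example.com", port=2883,
                       user="root@sys#obcluster", password="...",
                       database="test")
with conn.cursor() as cur:
    # Real-time plan generated by the optimizer for this statement:
    cur.execute("EXPLAIN SELECT * FROM orders WHERE order_id = 42")
    for row in cur.fetchall():
        print(row)
conn.close()
```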

OceanBase Database allows you to control the behavior of the optimizer. You can make the optimizer generate an execution plan the way you want by adding hints to the SQL statement. In addition, you can bind a set of hints to an SQL statement online by performing DDL operations, so that the optimizer generates a better execution plan based on those hints. Such a set of hints is referred to as an outline; you can create an outline for an SQL statement to facilitate plan binding.

You can also throttle an SQL query by adding an outline. Compared with the aforementioned attribute-based throttling, outline-based throttling is more precise but takes longer, because the SQL statement must be identified first.

OceanBase Database provides a comprehensive toolkit for SQL performance diagnostics. The sql_audit system view records the source, execution status, and statistics of each SQL query. The sql_trace view shows the execution process of the last executed SQL query and the duration of each phase. Plan cache-related views record the cache status of execution plans, execution statistics, and information about the plans. The slow query log records SQL queries whose total execution time exceeds a threshold and the duration of each phase in the latest execution. The tenant thread log records the request backlog of each queue, the thread status, and the thread statistics of each tenant.

When RT jitter occurs, you can locate the time-consuming SQL query by using the sql_audit view. After filtering out queries whose total execution time increased only because of long queue waits, you can analyze the remaining records further. If a query was retried many times, likely causes are lock contention and leader switchover. If getting the execution plan took too long, the query probably missed the plan cache. If execution consumed excessive CPU time, check for unusual wait events and analyze them specifically. If the number of logical reads is large, analyze the plan cache-related views to determine whether the execution plan is optimal and whether outlines should be added.

The execution plan generated by the optimizer may not be optimal, because of flaws in the optimizer model or inaccurate statistics. To address this issue, OceanBase Database introduces SQL Plan Management (SPM) and Adaptive Cursor Sharing (ACS). SPM controls plan evolution by analyzing the performance of a new plan generated by the optimizer: if performance regresses, the new plan is rejected, ensuring that plan performance only gets better.
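
The evolution rule can be pictured as follows; the acceptance margin is an invented stand-in for SPM's actual criterion:

```python
def evolve(baseline_ms: float, candidate_ms: float,
           margin: float = 0.95) -> bool:
    """Accept the candidate plan only if it does not regress
    (here: at least ~5% faster than the current baseline)."""
    return candidate_ms <= baseline_ms * margin

print(evolve(baseline_ms=80.0, candidate_ms=40.0))  # True: adopt new plan
print(evolve(baseline_ms=80.0, candidate_ms=95.0))  # False: keep baseline
```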

ACS allows the optimizer to store multiple execution plans for each parameterized SQL statement and to select an appropriate plan based on the selectivity of the predicates. ACS is a great solution for mixed small and large queries: it lets large queries go through the primary table and small queries through the index, avoiding the high cost of primary-key table lookups that large queries would incur if they went through the index.

Summary

HA is a sophisticated capability that involves many details along the entire link, and an exception at any point may affect the service. Improving it therefore takes great patience, one small step at a time. OceanBase Database V1.x was released in 2016 and the OceanBase open-source program was launched in 2021, but we have been working on HA technology since OceanBase development started in 2010. Our respect goes to the arduous work of the OceanBase pioneers.

Distributed databases are complex systems, and it is impossible to cover all the HA details of OceanBase Database in one article.

HA also requires systemic support. Although OceanBase Database provides many mechanisms to prevent failures and keep the kernel highly available, manual actions or external tools are still necessary for recovery from some failures.

Looking forward to seeing your comments below!
