OceanBase Database adopts a shared-nothing distributed cluster architecture in which all nodes are equal. Each node has its own SQL engine, storage engine, and transaction engine, and the database runs on a cluster of commodity servers. This architecture offers high scalability, high availability, high performance, cost-effectiveness, and strong compatibility with mainstream databases. For more information, see Deployment architectures.

OceanBase Database uses commodity server hardware and relies on local storage. In a distributed deployment, all servers are peers and require no special hardware. The database adopts the shared-nothing architecture, and its SQL execution engine has distributed execution capabilities.
OceanBase Database runs an observer process on each server as the running instance of the database. The observer process stores data and transaction redo logs in local files.
When you deploy an OceanBase cluster, you must configure availability zones (zones). Each zone consists of multiple servers. A zone is a logical concept that represents a group of nodes in the cluster with similar hardware availability; its exact meaning depends on the deployment mode. For example, if the cluster is deployed in a single IDC, the nodes in a zone may belong to the same rack or connect to the same switch. If the cluster is deployed across IDCs, each zone can correspond to one IDC.
The data in a distributed cluster can be stored in multiple replicas for disaster recovery and read scaling. Within a zone, a tenant's data has only one replica; replicas of the same data are distributed across different zones. The Paxos protocol ensures data consistency among the replicas.
OceanBase Database provides the multi-tenant feature. Each tenant has its own database instance, and the distributed deployment mode can be configured at the tenant level.
From the bottom up, the collaborating components of an OceanBase database instance are the multi-tenant layer, storage layer, replication layer, load balancing layer, transaction layer, SQL layer, and access layer.
Multi-tenant layer
To simplify the management of multiple business databases in large-scale deployments and reduce resource costs, OceanBase Database offers the unique multi-tenant feature. In an OceanBase cluster, you can create many isolated "instances" of databases, which are called tenants. From the perspective of applications, each tenant is a separate database. In addition, each tenant can be configured to be MySQL- or Oracle-compatible. After an application is connected to a MySQL-compatible tenant, you can create users and databases in the tenant. The user experience is similar to that with a standalone MySQL database. In the same way, after an application is connected to an Oracle-compatible tenant, you can create schemas and manage roles in the tenant. The user experience is similar to that with a standalone Oracle database. After a new cluster is initialized, a tenant named sys is created. The sys tenant is a MySQL-compatible tenant that stores the metadata of the cluster.
To isolate resources among tenants, each observer process can host multiple virtual containers, called resource units (units), for different tenants. The CPU and memory resources allocated to a tenant across multiple nodes form the tenant's resource pool.
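The relationship between tenants, resource units, and resource pools can be pictured with a small sketch. This is an illustrative model only; the class names, field names, and numbers are assumptions for the example, not OceanBase internals.

```python
from dataclasses import dataclass, field

# Illustrative model only: names and fields are hypothetical, not OceanBase internals.

@dataclass
class ResourceUnit:
    tenant: str
    node: str           # the observer node the unit lives on
    cpu_cores: float
    memory_gb: float

@dataclass
class ResourcePool:
    """A tenant's resource pool: the set of its resource units across nodes."""
    tenant: str
    units: list = field(default_factory=list)

    def total_cpu(self) -> float:
        return sum(u.cpu_cores for u in self.units)

    def total_memory_gb(self) -> float:
        return sum(u.memory_gb for u in self.units)

# One unit per node for a hypothetical tenant "trade" on a three-node cluster.
pool = ResourcePool("trade", [ResourceUnit("trade", n, cpu_cores=4, memory_gb=16)
                              for n in ("obs1", "obs2", "obs3")])
print(pool.total_cpu(), pool.total_memory_gb())   # 12.0 48.0
```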
Storage layer
The storage layer provides data storage and access on a table or partition basis. Each partition corresponds to a tablet (shard), and a non-partitioned table corresponds to one tablet.
A tablet internally uses a layered storage structure with four levels: the MemTable, L0-level mini SSTables, L1-level minor SSTables, and the major SSTable. DML operations such as insert, update, and delete first write data to the MemTable. When the MemTable reaches the specified size, its data is flushed to disk as an L0-level mini SSTable. When the number of L0-level mini SSTables reaches a threshold, they are merged into an L1-level minor SSTable. During off-peak hours, the system merges all MemTables, L0-level mini SSTables, and L1-level minor SSTables into a major SSTable.
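The flush-and-compaction flow described above can be sketched as a tiny tiered structure. This is a conceptual sketch only; the thresholds, in-memory layer representations, and method names are illustrative assumptions, not OceanBase's actual implementation.

```python
# Conceptual sketch of a tablet's tiered structure; sizes and thresholds are illustrative.
MEMTABLE_LIMIT = 4          # rows before a flush (real systems use byte sizes)
L0_COMPACT_THRESHOLD = 2    # number of mini SSTables that triggers an L1 compaction

class Tablet:
    def __init__(self):
        self.memtable = {}      # key -> value, mutable in-memory layer
        self.l0 = []            # list of frozen mini SSTables
        self.l1 = {}            # minor SSTable
        self.major = {}         # baseline major SSTable

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # MemTable -> new L0 mini SSTable
        self.l0.append(dict(self.memtable))
        self.memtable.clear()
        if len(self.l0) >= L0_COMPACT_THRESHOLD:
            # merge mini SSTables into the L1 minor SSTable (newer values win)
            for sst in self.l0:
                self.l1.update(sst)
            self.l0.clear()

    def major_compact(self):
        # off-peak: merge everything, oldest layer first, into a new major SSTable
        self.major.update(self.l1)
        for sst in self.l0:
            self.major.update(sst)
        self.major.update(self.memtable)
        self.l0.clear(); self.l1.clear(); self.memtable.clear()

    def read(self, key):
        # newest layer first: MemTable, then L0 (newest mini first), L1, major
        for layer in [self.memtable, *reversed(self.l0), self.l1, self.major]:
            if key in layer:
                return layer[key]
        return None
```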
Each SSTable consists of several fixed-size macroblocks, each of which contains multiple variable-size microblocks.
The microblocks of a major SSTable are encoded during the merge process. Data in a microblock is encoded column by column, using techniques such as dictionary, run-length, constant, and differential encoding. After each column is encoded, cross-column techniques such as equal-value and substring encoding are further applied. Encoding significantly reduces data size, and the column-level features it captures can further accelerate subsequent queries.
After encoding, a user-specified general-purpose compression algorithm can be applied for lossless compression, further improving the compression ratio.
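To make the column encoding idea concrete, here is a toy sketch of dictionary plus run-length encoding on a single column. The functions and data are illustrative assumptions; OceanBase's actual encoded formats differ.

```python
# Toy column encodings to illustrate the idea; real on-disk formats differ.
def dictionary_encode(column):
    """Replace repeated values with small integer codes plus a dictionary."""
    dictionary = sorted(set(column))
    code = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code[v] for v in column]

def run_length_encode(codes):
    """Collapse runs of identical codes into (code, run_length) pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

city = ["Hangzhou", "Hangzhou", "Hangzhou", "Beijing", "Beijing", "Hangzhou"]
dictionary, codes = dictionary_encode(city)
print(dictionary)                 # ['Beijing', 'Hangzhou']
print(run_length_encode(codes))   # [[1, 3], [0, 2], [1, 1]]
```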
Replication layer
The replication layer uses log streams (LS) to synchronize state among replicas. Each tablet belongs to exactly one log stream, and a log stream can contain multiple tablets. The redo logs generated by DML operations on a tablet are persisted in its log stream. The replicas of a log stream are distributed across availability zones and kept consistent by a consensus protocol: one replica is elected as the leader, and the others serve as followers. DML operations and strong-consistency queries are executed only on the leader.
Usually, each tenant has only one log stream leader on each server but may have multiple log stream followers there. The total number of log streams of a tenant depends on its primary zone and locality configurations.
The leader uses an improved Paxos protocol to persist redo logs on the local server and send them over the network to the followers. Each follower responds to the leader after it has persisted the redo logs, and the leader confirms persistence once a majority of the replicas have successfully persisted the logs. Followers replay the redo logs in real time to keep their state consistent with the leader.
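The majority rule for confirming redo log persistence can be sketched in a few lines. This is an illustrative sketch only; the replica names and the function are assumptions for the example.

```python
# Minimal sketch of the majority rule for redo log persistence (illustrative only).
def log_is_confirmed(acked_replicas: set, all_replicas: set) -> bool:
    """A redo log entry is confirmed once a majority of the replicas
    (the leader's own persistence included) have persisted it."""
    assert acked_replicas <= all_replicas
    return len(acked_replicas) * 2 > len(all_replicas)

replicas = {"zone1", "zone2", "zone3"}
print(log_is_confirmed({"zone1"}, replicas))            # False: 1 of 3
print(log_is_confirmed({"zone1", "zone2"}, replicas))   # True: 2 of 3 is a majority
```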
The leader holds a lease granted through election. A healthy leader keeps renewing its lease through elections before it expires, and it performs leader duties only within the lease period. The lease mechanism makes the database robust in handling failures.
The replication layer automatically handles server failures to keep database services continuously available. If only a minority of the followers fail, that is, a majority of the replicas are still working, database services are not affected. If the server hosting the leader fails, its lease can no longer be renewed; after the lease expires, the remaining followers elect a new leader, grant it a new lease, and database services are then restored.
Load balancing layer
When you create a table or a partition, the system selects appropriate log streams for the new tablets so that the distribution stays balanced. When tenant attributes change, new server resources are added, or tablets become unevenly distributed after long-term use, the load balancing layer rebalances data and services across servers by splitting and merging log streams and by migrating log stream replicas.
When a tenant scales up and obtains more server resources, the load balancing layer splits the tenant's existing log streams, moves an appropriate number of tablets into the new log streams, and then migrates the new log streams to the newly added servers to fully utilize the expanded resources. When a tenant scales down, the load balancing layer migrates the log streams off the servers to be removed and merges them with the existing log streams on the remaining servers to reduce resource consumption.
As the database is used over time, tables are created and dropped and more data is written, so the original balance may be broken even if the server resources stay unchanged. A common scenario is that after a batch of tables is dropped, the servers that originally hosted those tables have fewer tablets than the others, and tablets on other servers need to be redistributed to these emptier servers to even out the distribution. The load balancing layer periodically generates a balancing plan: on servers with too many tablets, it splits out temporary log streams that carry the tablets to be moved, migrates these temporary log streams to the destination servers, and merges them with the log streams there to restore balance.
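The idea of generating a balancing plan can be sketched with a simple greedy redistribution. The real load balancing layer moves whole log streams (split, migrate, merge); this sketch, with hypothetical server names and a naive target, only shows how a target distribution might be chosen.

```python
# Illustrative greedy plan: move tablets from overloaded servers to underloaded ones.
def balance_plan(tablet_count: dict) -> list:
    moves = []
    counts = dict(tablet_count)
    target = sum(counts.values()) // len(counts)      # even share per server
    donors = sorted(counts, key=counts.get, reverse=True)
    receivers = sorted(counts, key=counts.get)
    for src in donors:
        for dst in receivers:
            while counts[src] > target and counts[dst] < target:
                counts[src] -= 1
                counts[dst] += 1
                moves.append((src, dst))              # one tablet moves src -> dst
    return moves

print(balance_plan({"obs1": 10, "obs2": 2, "obs3": 3}))
```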
Transaction layer
The transaction layer ensures the atomicity of DML operations in a single log stream or multiple log streams and the multiversion isolation between concurrent transactions.
Atomicity
For a transaction that modifies only one log stream, atomicity is guaranteed by the write-ahead logs of that log stream, even if the transaction involves multiple tablets. When a transaction modifies multiple log streams, each of those log streams generates and persists its own write-ahead logs, and the transaction layer guarantees atomicity of the whole transaction by using an optimized two-phase commit protocol.
When a transaction that modifies multiple log streams initiates a commit, one of these log streams is selected as the coordinator of the two-phase commit. The coordinator communicates with all log streams modified by the transaction to check whether their write-ahead logs have been persisted. Once all log streams have completed persistence, the transaction enters the commit state, and the coordinator drives all log streams to write the commit logs of the transaction, which record its final commit status. During log stream replay or after a database restart, the commit status of a committed transaction is determined from these commit logs.
If the database restarts after a crash before a transaction has committed, the transaction may have persisted write-ahead logs in some log streams without having written its commit logs. The write-ahead logs of the transaction in each log stream contain the list of all log streams involved in the transaction, so the system can determine the coordinator from the logs, restore its state, and re-run the two-phase commit protocol until the transaction reaches its final commit or abort status.
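A highly simplified sketch of the two-phase commit flow described above follows. It omits failure handling, timeouts, and the protocol's optimizations; the class and method names are illustrative assumptions.

```python
# Highly simplified two-phase commit over log streams (illustrative only).
class LogStream:
    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.committed = False

    def persist_write_ahead_log(self, txn_id) -> bool:
        self.prepared = True          # write-ahead log persisted
        return True

    def write_commit_log(self, txn_id):
        self.committed = True         # final commit status recorded

def two_phase_commit(txn_id, participants):
    coordinator = participants[0]     # one participant log stream acts as coordinator
    # Phase 1: every participant persists its write-ahead log.
    if not all(ls.persist_write_ahead_log(txn_id) for ls in participants):
        return "abort"
    # Phase 2: the coordinator drives all participants to write commit logs.
    for ls in participants:
        ls.write_commit_log(txn_id)
    return f"commit (coordinator={coordinator.name})"

streams = [LogStream("ls1"), LogStream("ls2"), LogStream("ls3")]
print(two_phase_commit("txn-42", streams))   # commit (coordinator=ls1)
```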
Isolation
The Global Timestamp Service (GTS) generates continuously increasing timestamps within a tenant. It stays available through multi-replica deployment, using the same mechanism as the log stream replica synchronization described in the preceding Replication layer section.
Each time a transaction is committed, it requests a timestamp from the GTS as the commit version number of the transaction and writes the timestamp into the write-ahead log of the log stream. Data modified in the transaction is marked with the commit version number.
Each time a statement is executed (under the Read Committed isolation level) or a transaction starts (under the Repeatable Read and Serializable isolation levels), the GTS provides a timestamp as the read version number of the statement or transaction. When reading data, the statement skips versions whose commit version number is greater than the read version number and selects the latest version whose commit version number is not greater than the read version number. This gives read operations a unified global data snapshot.
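The snapshot-read rule can be sketched as follows. The GTS stand-in, the class, and the version numbers are illustrative assumptions, not OceanBase's actual data structures.

```python
import itertools

gts = itertools.count(100)          # stand-in for GTS: monotonically increasing timestamps

class VersionedRow:
    def __init__(self):
        self.versions = []          # list of (commit_version, value), appended in order

    def commit(self, value):
        self.versions.append((next(gts), value))   # commit version taken from "GTS"

    def read(self, read_version):
        # Skip versions committed after the read version; take the newest of the rest.
        for commit_version, value in reversed(self.versions):
            if commit_version <= read_version:
                return value
        return None

row = VersionedRow()
row.commit("v1")                    # commit version 100
snapshot = next(gts)                # a statement takes its read version: 101
row.commit("v2")                    # commit version 102
print(row.read(snapshot))           # v1: the later version (102) is skipped
```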
SQL layer
The SQL layer converts an SQL request into data access operations on one or more tablets.
Components of the SQL layer
The execution of a request in the SQL layer involves the following components: Parser, Resolver, Transformer, Optimizer, Code Generator, and Executor. Specifically:
The Parser is responsible for lexical and syntax parsing. It breaks the SQL text into individual tokens and parses the whole request against predefined syntax rules to generate a syntax tree.
The Resolver is responsible for semantic parsing. It identifies the database objects (such as databases, tables, columns, and indexes) in the SQL request based on database metadata. The data structure generated by the Resolver, called a statement tree, is used to represent the semantic information of the SQL request.
The Transformer is responsible for logical rewriting. Using internal rules or cost information, it performs equivalent transformations on the statement tree, producing a new statement tree in an equivalent form that is handed to the Optimizer for further optimization.
The Optimizer is responsible for generating an optimal execution plan for an SQL request. It considers the semantics of the SQL request, the characteristics of the object data, and the physical distribution of the objects to select the access path, join order, join algorithm, and distributed execution plan. Finally, an execution plan is generated.
The Code Generator is responsible for converting an execution plan into executable code without making any optimization choices.
The Executor is responsible for initiating the execution process of an SQL request.
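The component pipeline above can be pictured as a chain of stages. The real components are modules inside the observer process, so the stubs below are placeholders that only mirror the order in which the stages run.

```python
# Placeholder pipeline mirroring the SQL-layer stages; all function bodies are stubs.
def parse(sql_text):        return {"syntax_tree": sql_text}
def resolve(syntax_tree):   return {"statement_tree": syntax_tree}
def transform(stmt_tree):   return stmt_tree            # equivalent rewrite
def optimize(stmt_tree):    return {"plan": stmt_tree}  # choose paths, join order, ...
def generate_code(plan):    return plan                 # plan -> executable form
def execute(executable):    return f"rows for {executable}"

def handle_request(sql_text):
    result = sql_text
    for stage in (parse, resolve, transform, optimize, generate_code, execute):
        result = stage(result)
    return result

print(handle_request("SELECT * FROM t1 WHERE c1 = 42"))
```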
In addition to the standard execution process described above, the SQL layer implements a plan cache that keeps execution plans in memory. If a subsequent request can reuse a cached plan, it is executed directly without going through the optimization process again, which avoids the overhead of repeated query optimization. The fast parser module parameterizes the SQL text through lexical analysis alone, producing a parameterized text plus the constant parameters, so that a request can hit the plan cache directly; this accelerates frequently executed SQL statements.
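The effect of fast-parser parameterization on plan-cache lookups can be illustrated with a toy example. The regular expression and the cache below are deliberately naive assumptions; the real fast parser handles far more cases (string literals, hints, and so on).

```python
import re

# Toy parameterization: replace constants with placeholders via lexical analysis only,
# so that textually different statements map to the same plan-cache key.
_CONSTANT = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")

def parameterize(sql_text):
    params = _CONSTANT.findall(sql_text)
    template = _CONSTANT.sub("?", sql_text)
    return template, params

plan_cache = {}

def get_plan(sql_text):
    template, params = parameterize(sql_text)
    if template not in plan_cache:                 # miss: optimize and cache the plan
        plan_cache[template] = f"PLAN[{template}]"
    return plan_cache[template], params

print(get_plan("SELECT * FROM t1 WHERE c1 = 42"))
print(get_plan("SELECT * FROM t1 WHERE c1 = 43"))  # hits the same cached plan
```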
Types of plans
The SQL layer supports local, remote, and distributed execution plans. A local execution plan accesses data only on the local server, a remote execution plan accesses data on a single remote server, and a distributed execution plan accesses data on more than one server and is decomposed into multiple sub-plans that are executed on those servers.
The SQL layer also supports parallel execution: an execution plan can be decomposed into multiple parts that are executed concurrently by multiple threads under proper scheduling. Parallel execution makes full use of the CPU and I/O resources of the servers and reduces query response time. It can be applied to both distributed and local execution plans.
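Parallel execution of independent sub-plans can be sketched with a thread pool. The sub-plan function, degree of parallelism, and result merging below are illustrative assumptions, not the actual executor.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: split a "plan" into independent parts, run them on a thread pool,
# then merge the partial results.
def scan_partition(part_id):
    return [f"row-{part_id}-{i}" for i in range(3)]   # stand-in for one sub-plan

def parallel_execute(num_parts, dop=4):
    with ThreadPoolExecutor(max_workers=dop) as pool:
        partial_results = list(pool.map(scan_partition, range(num_parts)))
    return [row for part in partial_results for row in part]

print(len(parallel_execute(num_parts=8)))   # 24 rows gathered from 8 sub-plans
```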
Access layer
OceanBase Database Proxy (ODP), also known as OBProxy, is a component of OceanBase Database that sits at the access layer and forwards user requests to an appropriate OceanBase instance for processing.
ODP is a separate process deployed independently of OceanBase Database instances. ODP listens on network ports and is compatible with MySQL network protocols. Applications can directly connect to OceanBase Database using a MySQL driver.
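Because ODP speaks the MySQL protocol, any MySQL client library can connect through it. The sketch below uses pymysql; the host, port, credentials, and database name are placeholders for your own deployment (2883 is ODP's commonly used default listening port), and the user@tenant#cluster form is the conventional way to identify a tenant when connecting through ODP.

```python
import pymysql  # any MySQL-protocol driver works; pymysql is just an example

# Placeholder connection parameters: adjust host, port, and credentials to your deployment.
conn = pymysql.connect(
    host="127.0.0.1",                     # the ODP address, not an observer address
    port=2883,                            # ODP's commonly used default listening port
    user="root@my_tenant#my_cluster",     # user@tenant#cluster when going through ODP
    password="******",
    database="test",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```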
ODP automatically discovers the tenants and data distribution of an OceanBase cluster. For each SQL statement it receives, ODP identifies the data to be accessed by the statement and forwards it to the corresponding OceanBase instance.
ODP supports two deployment modes. In the first mode, ODP is deployed on each application server that needs to access the database; applications connect to the local ODP, which forwards every request to an appropriate OceanBase server. In the second mode, ODP is deployed on the servers where OceanBase Database nodes reside; in this mode, a load balancer is typically placed in front of multiple ODPs to provide a unified entry point for applications.