OceanBase Database adopts a shared-nothing distributed architecture in which all nodes are peers. Each node has its own SQL engine, storage engine, and transaction engine, and the database runs on a cluster of commodity servers. This architecture offers high scalability, high availability, high performance, cost-effectiveness, and strong compatibility with mainstream databases. For more information, see Deployment architectures.

OceanBase Database uses commodity server hardware and relies on local storage. In a distributed deployment, the servers are likewise peers and require no special hardware, and the SQL execution engine has built-in distributed execution capabilities.
An observer process runs on each server as the running instance of OceanBase Database. The observer process stores data and transaction redo logs in local files.
When you deploy an OceanBase cluster, you must configure availability zones (zones). Each zone consists of multiple servers. A zone is a logical concept: it groups nodes in the cluster that are similar in terms of hardware availability, and its exact meaning depends on the deployment mode. For example, if the cluster is deployed within a single data center (IDC), the nodes in a zone may belong to the same rack or be connected to the same switch. If the cluster is deployed across IDCs, each zone can correspond to one IDC.
The data in a distributed cluster can be stored in multiple replicas for disaster recovery and read scaling. Within a tenant, each zone stores at most one replica of a given piece of data, so multiple zones together hold multiple replicas of the same data. The Paxos protocol ensures data consistency among the replicas.
OceanBase Database supports multitenancy. Each tenant is an independent database instance, and the distributed deployment mode can be configured separately for each tenant.
From the bottom layer to the top layer, the components of an OceanBase Database instance are the multi-tenant layer, storage layer, replication layer, load balancing layer, transaction layer, SQL layer, and access layer.
Multi-tenant layer
To simplify the management of multiple business databases in a large-scale deployment and to reduce resource costs, OceanBase Database offers its multi-tenant feature. In an OceanBase cluster, you can create many isolated database "instances", which are called tenants. From the perspective of an application, each tenant is a separate database. In addition, each tenant can work in MySQL-compatible or Oracle-compatible mode. After your application connects to a MySQL-compatible tenant, you can create users and databases in the tenant, and the experience is similar to using a standalone MySQL database. Similarly, after your application connects to an Oracle-compatible tenant, you can create schemas and manage roles in the tenant, and the experience is similar to using a standalone Oracle database. When a new cluster is initialized, a tenant named sys is created. The sys tenant is a MySQL-compatible tenant that stores the metadata of the cluster.
To isolate resources between tenants, each observer process can host multiple virtual containers that belong to different tenants. These containers are called resource units (units). The units of a tenant across multiple nodes form a resource pool, which defines the CPU and memory resources available to the tenant.
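As a rough mental model of this relationship, the following Python sketch groups a tenant's units across nodes into a resource pool. The class names and capacity fields are illustrative assumptions, not OceanBase's internal data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """A virtual container on one observer node holding part of a tenant's CPU and memory."""
    node: str
    cpu_cores: int
    memory_gb: int

@dataclass
class ResourcePool:
    """All units of one tenant across the nodes of the cluster."""
    tenant: str
    units: list = field(default_factory=list)

    def total_cpu(self) -> int:
        return sum(u.cpu_cores for u in self.units)

    def total_memory_gb(self) -> int:
        return sum(u.memory_gb for u in self.units)

# One unit per node for tenant_a, for example one unit in each of three zones.
pool = ResourcePool("tenant_a", [Unit("node1", 4, 16), Unit("node2", 4, 16), Unit("node3", 4, 16)])
print(pool.total_cpu(), pool.total_memory_gb())   # 12 48
```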
Storage layer
The storage layer stores data and provides access to it at the granularity of a table or a partition. Each partition of a partitioned table corresponds to a tablet (shard), and a non-partitioned table corresponds to a single tablet.
Internally, a tablet uses a tiered storage structure with four layers: MemTables, L0-level mini SSTables, L1-level minor SSTables, and major SSTables. DML operations such as insert, update, and delete write data to MemTables. When the size of a MemTable reaches a specified threshold, its data is flushed to disk as an L0-level mini SSTable. When the number of L0-level mini SSTables reaches a threshold, they are merged into an L1-level minor SSTable. During daily off-peak hours, the system merges the MemTables, L0-level mini SSTables, and L1-level minor SSTables into a major SSTable.
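The flow of writes through these four layers can be sketched as follows. This is a deliberately simplified, single-threaded Python sketch of the tiering and compaction logic only; the thresholds, class name, and method names are made up for illustration.

```python
class Tablet:
    """Simplified sketch of a tablet's tiered storage: MemTable -> mini -> minor -> major."""

    MEMTABLE_LIMIT = 4   # rows per MemTable before it is frozen and flushed (illustrative)
    MINI_LIMIT = 3       # number of mini SSTables that triggers a minor compaction (illustrative)

    def __init__(self):
        self.memtable = {}         # in-memory mutations: key -> value
        self.mini_sstables = []    # L0: flushed MemTables
        self.minor_sstables = []   # L1: merged mini SSTables
        self.major_sstable = {}    # baseline data produced by the daily merge

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # Freeze the MemTable and dump it as an L0-level mini SSTable.
        self.mini_sstables.append(dict(self.memtable))
        self.memtable = {}
        if len(self.mini_sstables) >= self.MINI_LIMIT:
            self._minor_compaction()

    def _minor_compaction(self):
        # Merge all L0-level mini SSTables into one L1-level minor SSTable.
        merged = {}
        for sst in self.mini_sstables:
            merged.update(sst)
        self.minor_sstables.append(merged)
        self.mini_sstables = []

    def major_compaction(self):
        # Daily merge: fold every layer, from oldest to newest, into a new major SSTable.
        merged = dict(self.major_sstable)
        for layer in self.minor_sstables + self.mini_sstables + [self.memtable]:
            merged.update(layer)
        self.major_sstable = merged
        self.memtable, self.mini_sstables, self.minor_sstables = {}, [], []

    def read(self, key):
        # The newest layer wins: MemTable first, then mini, minor, and major SSTables.
        for layer in [self.memtable, *reversed(self.mini_sstables),
                      *reversed(self.minor_sstables), self.major_sstable]:
            if key in layer:
                return layer[key]
        return None

t = Tablet()
for i in range(10):
    t.write(f"k{i}", i)
t.major_compaction()
print(t.read("k3"), len(t.major_sstable))   # 3 10
```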
Each SSTable consists of several fixed-size macroblocks, each of which contains multiple variable-size microblocks.
The microblocks of a major SSTable are encoded during the merge process. Data is first encoded within each column; the encoding rules include dictionary, run-length, constant, and differential encoding. Data can then also be encoded across columns, for example with inter-column equality or inter-column substring encoding. Encoding significantly reduces the data size, and the features extracted from the data during encoding can further accelerate subsequent queries.
After the data is encoded, a general-purpose lossless compression algorithm can be applied to further reduce the data size.
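As a rough illustration of two of the encoding rules mentioned above, the sketch below applies run-length and dictionary encoding to a single column. The real microblock formats, encoding selection, and cross-column encodings in OceanBase are considerably more sophisticated.

```python
def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

def dictionary_encode(column):
    """Replace each value with a small integer id into a per-column dictionary."""
    dictionary, ids = [], []
    index = {}
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids

city = ["Hangzhou", "Hangzhou", "Hangzhou", "Beijing", "Beijing", "Hangzhou"]
print(run_length_encode(city))    # [('Hangzhou', 3), ('Beijing', 2), ('Hangzhou', 1)]
print(dictionary_encode(city))    # (['Hangzhou', 'Beijing'], [0, 0, 0, 1, 1, 0])
```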
Replication layer
The replication layer uses log streams (LS) to synchronize state among replicas. Each tablet belongs to exactly one log stream, and one log stream can serve multiple tablets. The redo logs generated by DML operations on a tablet are persisted in its log stream. The replicas of a log stream are distributed across availability zones and kept consistent by a consensus protocol. One replica is elected as the leader, and the other replicas serve as followers. DML operations and strong-consistency queries are performed only on the leader.
In most cases, a tenant has only one log stream leader on each server, but it may host followers of multiple other log streams. The total number of log streams in a tenant depends on the primary zone and locality configurations.
Using an optimized Paxos protocol, the leader persists redo logs on the local server and, at the same time, sends them to the followers over the network. Each follower responds to the leader after it successfully persists the redo logs, and the leader considers the redo logs successfully persisted once a majority of the replicas have persisted them. Each follower also replays the redo logs in real time so that its state stays consistent with that of the leader.
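The majority-confirmation rule can be captured in a few lines. The toy function below only models the counting (the leader's local persistence counts as one vote); it deliberately ignores the actual Paxos log replication, acknowledgment, and recovery machinery.

```python
def is_committed(leader_persisted: bool, follower_acks: int, replica_count: int) -> bool:
    """Return True once a majority of all replicas (leader included) have persisted the log."""
    persisted = (1 if leader_persisted else 0) + follower_acks
    return persisted * 2 > replica_count

# With 3 replicas, the leader plus one follower acknowledgment already forms a majority.
print(is_committed(True, 0, 3))   # False: only the leader has persisted
print(is_committed(True, 1, 3))   # True: 2 of 3 replicas persisted
print(is_committed(True, 2, 5))   # True: 3 of 5 replicas persisted
```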
The leader is elected with a lease. A normally functioning leader keeps renewing its lease through election before the lease expires, and it performs the leader's work only while the lease is valid. The lease mechanism ensures robust failure handling in the database.
The replication layer automatically handles server failures to keep database services continuously available. If only a minority of the replicas of a log stream fail, that is, a majority of the replicas are still running, database services are not affected. If the server hosting the leader fails, the leader's lease cannot be renewed. After the lease expires, the remaining followers elect a new leader, and database services are restored.
Load balancing layer
When you create a table or a partition, the system selects appropriate log streams for the new tablets based on balancing principles. When tenant attributes change, new machine resources are added, or tablets become unevenly distributed across machines after long-term use, the load balancing layer rebalances data and services across servers by splitting and merging log streams and by migrating log stream replicas.
When a tenant is scaled out and obtains resources on more servers, the load balancing layer splits existing log streams in the tenant, selects an appropriate number of tablets to carry in the new log streams, and migrates the new log streams to the new servers to make full use of the added resources. When a tenant is scaled in, the load balancing layer migrates log streams from the servers being removed to other servers and merges them with the log streams already on those servers, so that the resources occupied on the removed servers can be released.
As the database is used over time, tables are created and dropped and more data is written, so even if the server resources remain unchanged, the original balance may be broken. For example, after a batch of tables is dropped, the servers on which their tablets were concentrated end up with fewer tablets than other servers, and the tablets should be redistributed evenly across the servers. The load balancing layer periodically generates a balancing plan: it splits temporary log streams out of the log streams on servers that host many tablets, uses them to carry the tablets to be moved, migrates the temporary log streams to the destination servers, and merges them with the log streams there to restore balance.
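A heavily reduced sketch of the tablet-count balancing idea follows: repeatedly move a tablet from the most loaded server to the least loaded one until the counts are even. In the real load balancing layer each such move is carried by a temporary log stream that is split out, migrated, and merged; the toy loop below only notes that in a comment.

```python
def balance_tablet_counts(servers):
    """servers: dict mapping server name -> list of tablet ids; rebalances in place."""
    while True:
        busiest = max(servers, key=lambda s: len(servers[s]))
        idlest = min(servers, key=lambda s: len(servers[s]))
        if len(servers[busiest]) - len(servers[idlest]) <= 1:
            return servers
        # In OceanBase, this move would be carried by a temporary log stream that is
        # split out on the source server, migrated, and merged on the destination server.
        servers[idlest].append(servers[busiest].pop())

layout = {"s1": list(range(8)), "s2": [8, 9], "s3": [10]}
print({s: len(t) for s, t in balance_tablet_counts(layout).items()})   # {'s1': 4, 's2': 4, 's3': 3}
```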
Transaction layer
The transaction layer guarantees the atomicity of DML operations that involve one or more log streams, as well as multi-version isolation among concurrent transactions.
Atomicity
For a transaction that modifies tablets within a single log stream, the write-ahead log of that log stream guarantees atomicity. When a transaction modifies multiple log streams, each log stream generates and persists its own write-ahead log, and the transaction layer ensures atomicity through an optimized two-phase commit protocol.
When a transaction that modifies multiple log streams initiates a commit, one of those log streams is selected as the coordinator. The coordinator communicates with all log streams modified by the transaction to check whether their write-ahead logs have been persisted. When all log streams have completed persistence, the transaction enters the commit state, and the coordinator drives all log streams to write the commit log of the transaction, which records the transaction's final commit status. During log replay or after a database restart, the commit logs determine the final status of the transaction's logs in each log stream.
If the database crashes and restarts while a transaction is still in progress, the transaction may have written its write-ahead logs but not its commit logs. The write-ahead log of each log stream records the list of all log streams involved in the transaction. Based on this information, the system can re-identify the coordinator, restore its state, and continue the two-phase commit protocol until the transaction reaches its final commit or abort status.
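A stripped-down sketch of this commit flow is shown below: one participant log stream acts as coordinator, phase one persists the write-ahead (prepare) logs together with the participant list, and phase two writes the commit logs. Persistence, retries, and crash recovery are all elided, and the class and method names are invented for illustration.

```python
class LogStreamParticipant:
    """One log stream modified by the transaction."""

    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.committed = False
        self.participant_list = []

    def persist_prepare_log(self, participants):
        # The write-ahead log also records the full participant list, so that the
        # coordinator can be re-identified and restored after a restart.
        self.participant_list = [p.name for p in participants]
        self.prepared = True
        return True

    def persist_commit_log(self):
        self.committed = True

def two_phase_commit(participants):
    # One of the involved log streams (here simply the first) acts as coordinator.
    # Phase 1: every participant persists its write-ahead (prepare) log.
    if not all(p.persist_prepare_log(participants) for p in participants):
        return "abort"
    # Phase 2: the coordinator drives every participant to write the commit log.
    for p in participants:
        p.persist_commit_log()
    return "commit"

streams = [LogStreamParticipant("ls1"), LogStreamParticipant("ls2"), LogStreamParticipant("ls3")]
print(two_phase_commit(streams))            # commit
print(all(s.committed for s in streams))    # True
```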
Isolation
Each tenant has a Global Timestamp Service (GTS), which generates continuously increasing timestamps for the tenant. The GTS ensures its own availability through multi-replica deployment, using the same mechanism as the log stream replica synchronization described in the preceding Replication layer section.
Each transaction obtains a timestamp from the GTS as the commit version number when it is committed, and the timestamp is persisted in the write-ahead log of the log stream. Data modified in the transaction is marked with the commit version number.
At the start of each statement (under the Read Committed isolation level) or each transaction (under the Repeatable Read and Serializable isolation levels), a timestamp is obtained from the GTS as the read version number. A read skips data whose commit version number is greater than the read version number and returns the latest version among the data whose commit version number is not greater than the read version number. In this way, read operations see a consistent global snapshot of the data.
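The version-selection rule for reads can be expressed in a few lines. The Python sketch below keeps a row's committed versions as (commit_version, value) pairs and only demonstrates the comparison against the read version number; it is not how OceanBase stores multi-version data.

```python
def snapshot_read(versions, read_version):
    """versions: list of (commit_version, value). Return the newest value visible at read_version."""
    visible = [(cv, v) for cv, v in versions if cv <= read_version]
    return max(visible)[1] if visible else None

row = [(100, "a"), (120, "b"), (150, "c")]   # three committed versions of one row
print(snapshot_read(row, 130))   # 'b': version 150 is skipped, 120 is the latest visible version
print(snapshot_read(row, 90))    # None: nothing was committed at or before version 90
```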
SQL layer
The SQL layer converts an SQL request into data access operations on one or more tablets.
Components of the SQL layer
The SQL layer processes a request through the following components, in order: the parser, resolver, transformer, optimizer, code generator, and executor. A minimal sketch of this pipeline follows the component descriptions below. Specifically:
The parser is responsible for lexical and syntax parsing. It splits the SQL request into individual tokens and parses the request against predefined syntax rules to generate a syntax tree.
The resolver is responsible for semantic parsing. It parses the SQL request based on database metadata to translate the tokens in the SQL request into corresponding objects (such as databases, tables, columns, and indexes). The data structure generated by the resolver is called a statement tree.
The transformer is responsible for logical rewriting. Based on internal rules or cost information, it performs equivalent transformations on the statement tree to produce a new, semantically equivalent statement tree, which is passed to the optimizer for further optimization.
The optimizer generates the optimal execution plan for the SQL request. It considers the semantics of the SQL request, the characteristics of the objects' data, and the physical distribution of the objects to select the access path, join order, join algorithm, and distributed execution plan, and ultimately generate an execution plan.
The code generator converts the execution plan into executable code without making any optimization choices.
The executor initiates the execution process of the SQL request.
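The following minimal sketch only wires placeholder stage functions together to show how each component consumes the previous one's output (SQL text, syntax tree, statement tree, rewritten statement tree, plan, executable code, result); none of the stages does any real work, and the stage names are illustrative.

```python
def parse(sql_text):      return {"stage": "syntax_tree", "input": sql_text}
def resolve(tree):        return {"stage": "statement_tree", "input": tree}
def transform(tree):      return {"stage": "rewritten_statement_tree", "input": tree}
def optimize(tree):       return {"stage": "execution_plan", "input": tree}
def generate_code(plan):  return {"stage": "executable_code", "input": plan}
def execute(code):        return {"stage": "result", "input": code}

def handle_request(sql_text):
    # parser -> resolver -> transformer -> optimizer -> code generator -> executor
    artifact = sql_text
    for stage in (parse, resolve, transform, optimize, generate_code, execute):
        artifact = stage(artifact)
    return artifact

print(handle_request("SELECT c1 FROM t1 WHERE c2 = 5")["stage"])   # result
```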
In addition to the standard processing flow, the SQL layer supports a plan cache that keeps execution plans in memory, so subsequent executions of the same statement can reuse a cached plan and skip query optimization. The fast parser module uses only lexical analysis to parameterize the SQL text, returning the parameterized statement together with its constant parameters, which allows an incoming SQL request to hit the plan cache directly and accelerates frequently executed SQL statements.
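The parameterization idea can be sketched as follows: constants in the SQL text are replaced with placeholders so that statements differing only in constant values map to the same cache key. The regular-expression normalization below is a crude stand-in for the real lexer-based fast parser, and the cached "plan" is just a string.

```python
import re

PLAN_CACHE = {}

def parameterize(sql_text):
    """Replace string and numeric literals with '?' and return (cache_key, params)."""
    params = []
    def take(match):
        params.append(match.group(0))
        return "?"
    key = re.sub(r"'[^']*'|\b\d+\b", take, sql_text)
    return key, params

def get_plan(sql_text):
    key, params = parameterize(sql_text)
    if key not in PLAN_CACHE:
        PLAN_CACHE[key] = f"optimized plan for: {key}"   # stands in for full query optimization
    return PLAN_CACHE[key], params

print(get_plan("SELECT * FROM t1 WHERE c1 = 5"))
print(get_plan("SELECT * FROM t1 WHERE c1 = 42"))   # hits the cached plan for the same key
print(len(PLAN_CACHE))                              # 1
```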
Types of plans
The SQL layer supports local, remote, and distributed execution plans. A local execution plan accesses data only on the local server, and a remote execution plan accesses data only on a single remote server. A distributed execution plan accesses data on more than one server; such a plan is decomposed into multiple sub-plans that are executed on multiple servers.
The SQL layer also supports parallel execution. An execution plan is decomposed into multiple parts that are executed by multiple threads, which cooperate through scheduling mechanisms to execute the plan in parallel. Parallel execution makes full use of the CPU and I/O resources of the servers and shortens query response time. Both distributed execution plans and local execution plans can be executed in parallel.
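As a loose analogy for parallel execution, the sketch below splits a scan over several partitions across worker threads and then merges the partial results. OceanBase's parallel execution framework schedules plan fragments across threads and servers; plain Python threads are used here purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

partitions = [list(range(i, 100, 4)) for i in range(4)]   # four partitions of a toy table

def scan_partition(rows):
    # Stand-in for one worker executing its part of the plan (here: a filtered count).
    return sum(1 for r in rows if r % 3 == 0)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(scan_partition, partitions))

print(partial_counts, sum(partial_counts))   # partial results and the merged final result
```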
Access layer
OceanBase Database Proxy (ODP), also known as OBProxy, is a component of the access layer of OceanBase Database. It forwards user requests to an appropriate OceanBase instance for processing.
ODP is a separate process deployed independently of OceanBase instances. ODP listens on network ports and is compatible with the MySQL network protocol. Applications can directly connect to OceanBase Database using a MySQL driver.
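For example, a Python application can connect through ODP with the third-party PyMySQL driver, as sketched below. The host, the port 2883 (ODP's commonly used listening port), and the user@tenant account format are deployment-specific placeholders; substitute the values for your own cluster.

```python
import pymysql   # pip install pymysql; any MySQL driver works because ODP speaks the MySQL protocol

# All connection details below are placeholders for an actual deployment.
conn = pymysql.connect(
    host="127.0.0.1",              # an ODP instance, or a load balancer in front of several ODPs
    port=2883,                     # ODP's commonly used listening port; adjust to your deployment
    user="app_user@my_tenant",     # OceanBase account names carry the tenant name
    password="******",
    database="test",
)
try:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())
finally:
    conn.close()
```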
ODP automatically discovers the tenant and data distribution information of the OceanBase cluster. For each proxied SQL statement, ODP identifies, as far as possible, the data that the statement will access and forwards the statement directly to the OceanBase instance on the server where that data resides.
ODP supports two deployment modes. It can be deployed on each application server or on the servers where OceanBase instances reside. In the first deployment mode, applications directly connect to ODP deployed on the same server. All requests are forwarded by ODP to an appropriate OceanBase server. In the second deployment mode, multiple ODPs need to be aggregated into one entry address for applications. This aggregation can be achieved by using a network load balancing service.