OceanBase Database adopts a shared-nothing distributed cluster architecture in which all nodes are equal. Each node has its own SQL engine, storage engine, and transaction engine, and the cluster runs on ordinary commodity servers. This architecture offers high scalability, high availability, high performance, cost-effectiveness, and strong compatibility with mainstream databases. For more information, see Deployment architectures.

OceanBase Database runs on commodity server hardware and relies on local storage. In a distributed deployment, all servers are equal and require no special hardware. OceanBase Database adopts the shared-nothing architecture, and its SQL execution engine supports distributed execution.
An observer process runs on each server as the operational instance of OceanBase Database. The observer process stores data and transaction redo logs in local files.
When you deploy an OceanBase cluster, you must configure availability zones (zones). Each zone consists of multiple servers. A zone is a logical concept that represents a group of nodes in the cluster with similar hardware availability, and its exact meaning depends on the deployment mode. For example, when the cluster is deployed within a single IDC, the nodes in a zone may belong to the same rack or the same switch; when the cluster is deployed across IDCs, each zone can correspond to one IDC.
Data in the distributed cluster can be stored in multiple replicas for disaster recovery and read scaling. Within a tenant, each zone stores at most one replica of a piece of data, and replicas of the same data are distributed across different zones. The Paxos protocol ensures data consistency among the replicas.
OceanBase Database has a built-in multi-tenant feature. Each tenant can be configured with its own distributed deployment mode, and resources such as CPU, memory, and I/O are isolated between tenants.
The database instances in an OceanBase cluster work together and are organized, from the bottom up, into the multi-tenant layer, storage layer, replication layer, balancing layer, transaction layer, SQL layer, and access layer.
Multi-tenant layer
To simplify the management of multiple business databases in large-scale deployments and reduce resource costs, OceanBase Database offers the unique multi-tenant feature. In an OceanBase cluster, you can create many isolated "instances" of databases, which are called tenants. From the perspective of applications, each tenant is a separate database. In addition, each tenant can be configured in MySQL- or Oracle-compatible mode. After your application is connected to a MySQL-compatible tenant, you can create users and databases in the tenant. The user experience is similar to that with a standalone MySQL database. In the same way, after your application is connected to an Oracle-compatible tenant, you can create schemas and manage roles in the tenant. The user experience is similar to that with a standalone Oracle database. After a new cluster is initialized, a tenant named sys is created. The sys tenant is a MySQL-compatible tenant that stores the metadata of the cluster.
To isolate tenant resources, each observer process can host multiple virtual containers, called resource units (UNITs), that belong to different tenants. A resource unit includes CPU and memory resources, and the resource units of a tenant across multiple nodes form the tenant's resource pool.
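The relationship between tenants, resource units, and resource pools can be pictured with a minimal Python sketch. The class and function names here (ResourceUnit, create_resource_pool, and so on) are illustrative stand-ins, not OceanBase APIs, and the model ignores I/O quotas and unit placement rules.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceUnit:
    tenant: str
    node: str
    cpu_cores: int    # CPU quota reserved for this tenant on this node
    memory_gb: int    # memory quota reserved for this tenant on this node

@dataclass
class Cluster:
    units: list = field(default_factory=list)

    def create_resource_pool(self, tenant, nodes, cpu_cores, memory_gb):
        """One unit per node; together the units form the tenant's resource pool."""
        pool = [ResourceUnit(tenant, n, cpu_cores, memory_gb) for n in nodes]
        self.units.extend(pool)
        return pool

    def pool_of(self, tenant):
        return [u for u in self.units if u.tenant == tenant]

cluster = Cluster()
cluster.create_resource_pool("tenant_a", ["node1", "node2", "node3"], cpu_cores=4, memory_gb=16)
cluster.create_resource_pool("tenant_b", ["node1", "node2", "node3"], cpu_cores=2, memory_gb=8)
print([(u.node, u.cpu_cores) for u in cluster.pool_of("tenant_a")])
```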
Storage layer
The storage layer provides data and metadata storage and access at the granularity of a table or a table partition. Each partition corresponds to a tablet (shard), and a non-partitioned table also corresponds to one tablet.
Internally, a tablet uses a tiered storage structure consisting of MemTables, L0-level mini SSTables, L1-level minor SSTables, and major SSTables. DML operations such as inserts, updates, and deletes first write data to a MemTable. When the size of the MemTable reaches a specified threshold, its data is flushed to disk as an L0-level mini SSTable. When the number of L0-level mini SSTables reaches a threshold, they are merged into one L1-level minor SSTable. During daily off-peak hours, the system merges all MemTables, L0-level mini SSTables, and L1-level minor SSTables into one major SSTable.
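The flush-and-merge behavior described above can be sketched as a toy tiered store. The thresholds, class names, and the single key-value data model are all made up for illustration; the real storage engine tracks sizes in bytes, handles deletes with tombstones, and runs compactions asynchronously.

```python
MEMTABLE_LIMIT = 4     # rows per MemTable before a flush (illustrative)
L0_COUNT_LIMIT = 3     # mini SSTables before a minor compaction (illustrative)

class Tablet:
    def __init__(self):
        self.memtable = {}         # in-memory updates: key -> value
        self.mini_sstables = []    # L0-level mini SSTables, newest last
        self.minor_sstables = []   # L1-level minor SSTables
        self.major_sstable = {}    # baseline data

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Freeze the MemTable and dump it as an L0-level mini SSTable.
        self.mini_sstables.append(self.memtable)
        self.memtable = {}
        if len(self.mini_sstables) >= L0_COUNT_LIMIT:
            # Minor compaction: merge L0 mini SSTables into one L1 minor SSTable.
            merged = {}
            for t in self.mini_sstables:   # oldest first, newer values overwrite
                merged.update(t)
            self.minor_sstables.append(merged)
            self.mini_sstables = []

    def major_compaction(self):
        # Merge everything into a single major SSTable (run during off-peak hours).
        for t in self.minor_sstables + self.mini_sstables + [self.memtable]:
            self.major_sstable.update(t)
        self.memtable, self.mini_sstables, self.minor_sstables = {}, [], []

    def read(self, key):
        # Newest data wins: MemTable, then L0 (newest first), then L1, then major.
        for t in [self.memtable] + self.mini_sstables[::-1] + self.minor_sstables[::-1]:
            if key in t:
                return t[key]
        return self.major_sstable.get(key)

tablet = Tablet()
for i in range(10):
    tablet.write(f"k{i}", i)
print(tablet.read("k0"), len(tablet.mini_sstables), len(tablet.minor_sstables))
```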
Each SSTable consists of several fixed-size macroblocks of 2 MB each. Each macroblock contains multiple variable-size microblocks.
The microblocks of a major SSTable are encoded during the merge. Data in a microblock is encoded column by column: the encoding rules include single-column methods such as dictionary, run-length, constant, and delta encoding, as well as cross-column methods such as equal-value and substring encoding. Encoding greatly reduces the data size, and the columnar features extracted during encoding can further accelerate subsequent queries.
After the data is encoded, a general lossless compression algorithm can be used to compress the data, further increasing the data compression ratio.
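A toy example of this layering, not the actual encoding algorithms: dictionary plus run-length encode one column, then run a general-purpose compressor (zlib here) over the encoded bytes. The point is only that encoding the column first shrinks repetitive data and makes the final compression more effective.

```python
import json, zlib

def dict_rle_encode(column):
    """Dictionary-encode the values, then run-length encode the id stream."""
    dictionary = sorted(set(column))
    ids = [dictionary.index(v) for v in column]
    runs = []
    for v in ids:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return {"dict": dictionary, "runs": runs}

column = ["HANGZHOU"] * 500 + ["BEIJING"] * 300 + ["SHANGHAI"] * 200
raw = json.dumps(column).encode()
encoded = json.dumps(dict_rle_encode(column)).encode()
# Size shrinks at each step: raw column, encoded column, encoded + compressed.
print(len(raw), len(encoded), len(zlib.compress(encoded)))
```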
Replication layer
The replication layer synchronizes state among replicas through log streams (LS). Each tablet belongs to exactly one log stream, and a log stream can contain multiple tablets. Redo logs generated by DML operations on the tablets are persisted in the corresponding log stream. The replicas of a log stream are distributed across different availability zones and maintain consensus through a consensus protocol: one replica is elected as the leader, and the other replicas serve as followers. DML operations and strong-consistency queries are performed only on the leader.
Usually, a tenant has only one log stream leader on each server but may have multiple followers. The total number of log streams of a tenant depends on its primary zone and locality configurations.
Through the Paxos protocol, the leader persists redo logs on the local server and sends them over the network to the followers. Each follower responds to the leader after it persists the redo logs. Once the leader confirms that the redo logs have been persisted on a majority of replicas, the redo logs are considered successfully persisted. Followers replay the redo logs in real time to keep their state consistent with the leader's.
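A minimal sketch of the majority rule described above, assuming a synchronous, failure-free round; it is not the actual Paxos implementation, and the class and function names are illustrative.

```python
class Replica:
    def __init__(self, zone):
        self.zone = zone
        self.persisted = []

    def persist(self, entry):
        self.persisted.append(entry)   # stand-in for a durable local write
        return True                    # ack sent back to the leader

def replicate(leader, followers, entry):
    acks = 1 if leader.persist(entry) else 0               # leader persists locally
    acks += sum(1 for f in followers if f.persist(entry))  # followers persist and ack
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority                                # confirmed only on a majority

leader = Replica("zone1")
followers = [Replica("zone2"), Replica("zone3")]
print(replicate(leader, followers, "redo#1"))   # True: all 3 replicas persisted the log
```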
The leader is produced by the election protocol. A normally working leader continuously renews its lease before it expires and performs leader duties only while the lease is valid. The lease mechanism ensures the robustness of exception handling in the database.
The replication layer automatically handles server failures to keep database services continuously available. If only a minority of the followers fail, that is, more than half of the replicas are still working, database services are not affected. If the server hosting the leader fails, the leader's lease can no longer be extended; after the lease expires, the remaining followers elect a new leader through the election protocol, the new leader is granted a new lease, and database services are restored.
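The lease behavior can be illustrated in a few lines (lease length and timing here are arbitrary): a leader serves only while its lease is valid, and once renewals stop, the lease expires and a new leader can safely be elected.

```python
import time

LEASE_SECONDS = 0.2   # arbitrary lease length for the demo

class Lease:
    def __init__(self):
        self.expires_at = 0.0

    def renew(self):
        # A healthy leader keeps calling this before the lease runs out.
        self.expires_at = time.monotonic() + LEASE_SECONDS

    def valid(self):
        return time.monotonic() < self.expires_at

lease = Lease()
lease.renew()
print("leader may serve:", lease.valid())    # True while the lease is being renewed
time.sleep(LEASE_SECONDS * 1.5)              # simulate a crashed leader: no renewals
print("leader may serve:", lease.valid())    # False: followers may now elect a new leader
```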
Balancing layer
When you create a table or a partition, the system selects appropriate log streams for the new tablets according to the balancing principle. When the tenant's attributes change, new server resources are added, or tablets become unevenly distributed after long-term use, the balancing layer rebalances data and services across servers by splitting and merging log streams and migrating log stream replicas.
When a tenant scales out and more server resources become available, the balancing layer splits a suitable number of tablets off existing log streams into new log streams, and then migrates the new log streams to the new servers to make full use of the expanded resources. When a tenant scales in and some servers are to be decommissioned, the balancing layer migrates the log streams on those servers to other servers and merges them with the log streams there to release the resources of the servers being removed.
As the database is used over time, tables are created and dropped and more data is written, so the original balance can be broken even if the server resources remain unchanged. A common scenario is that after a batch of tables is dropped, the servers where those tables were concentrated hold fewer tablets than the others, and tablets on the other servers should be redistributed to them to balance the load. The balancing layer periodically generates balancing plans: it splits log streams on servers with too many tablets into temporary log streams that carry the tablets to be moved, migrates the temporary log streams to the destination servers, and merges them with the log streams on those servers to balance the load.
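The idea of a balancing plan can be sketched with a deliberately naive policy: count tablets per server and move the surplus from overloaded servers to underloaded ones. The split-temporary-log-stream / migrate / merge steps appear only as comments; the real balancing algorithm also weighs data size, load, and locality.

```python
def balance_plan(tablet_counts):
    """tablet_counts: server name -> number of tablets; returns (donor, receiver, n) moves."""
    avg = sum(tablet_counts.values()) / len(tablet_counts)
    donors = {s: c - int(avg) for s, c in tablet_counts.items() if c > avg}
    receivers = [s for s, c in tablet_counts.items() if c < avg]
    plan = []
    for donor, surplus in donors.items():
        for receiver in receivers:
            already = sum(n for _, r, n in plan if r == receiver)
            need = int(avg) - tablet_counts[receiver] - already
            move = min(surplus, need)
            if move > 0:
                # 1) split a temporary log stream on `donor` carrying `move` tablets,
                # 2) migrate the temporary log stream to `receiver`,
                # 3) merge it into a log stream already on `receiver`.
                plan.append((donor, receiver, move))
                surplus -= move
            if surplus == 0:
                break
    return plan

print(balance_plan({"server1": 30, "server2": 10, "server3": 20}))
# [('server1', 'server2', 10)]
```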
Transaction layer
The transaction layer guarantees the atomicity of DML operations that span one or more log streams and provides multi-version isolation between concurrent transactions.
Atomicity
For a transaction that modifies multiple tablets in a single log stream, the write-ahead log of that log stream guarantees atomicity. When a transaction modifies tablets in multiple log streams, write-ahead logs are generated and persisted in each log stream, and the transaction layer ensures atomicity through an optimized two-phase commit protocol.
When a transaction that modifies multiple log streams initiates a commit, one of those log streams is chosen as the coordinator of the two-phase commit. The coordinator communicates with all log streams modified by the transaction to check whether their write-ahead logs have been persisted. Once all log streams have completed persistence, the transaction enters the commit state, and the coordinator drives all log streams to write the transaction's commit log, which records the final commit status. During log replay or a database restart, each log stream determines the final state of the transaction's logs based on the commit log.
If the database restarts after a crash and a transaction had not been committed before the crash, the transaction may have persisted write-ahead logs but no commit log. Based on the write-ahead logs persisted in each participating log stream, the system re-determines the coordinator, restores its state, and continues the two-phase commit protocol until the transaction reaches the final commit or abort state.
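A stripped-down sketch of the commit flow, assuming no failures and ignoring the optimizations of the real protocol; class names, states, and the choice of the first participant as coordinator are illustrative.

```python
class LogStream:
    def __init__(self, name):
        self.name = name
        self.wal = []            # persisted write-ahead log records
        self.state = "ACTIVE"

    def prepare(self, txn_id):
        # Persist the transaction's redo records; returns True once durable.
        self.wal.append(("prepare", txn_id))
        return True

    def commit(self, txn_id):
        # Persist the commit log record that fixes the final status.
        self.wal.append(("commit", txn_id))
        self.state = "COMMITTED"

def two_phase_commit(txn_id, participants):
    coordinator = participants[0]   # one participant log stream acts as coordinator
    # Phase 1: every participant must have its write-ahead logs persisted.
    if not all(p.prepare(txn_id) for p in participants):
        return "ABORT"
    # Phase 2: the transaction is now in the commit state; write commit logs everywhere.
    for p in participants:
        p.commit(txn_id)
    return f"COMMIT (coordinator={coordinator.name})"

print(two_phase_commit("txn_42", [LogStream("LS1"), LogStream("LS2"), LogStream("LS3")]))
```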
Isolation
The Global Timestamp Service (GTS) generates continuously increasing timestamps within a tenant. It ensures timestamp availability through multi-replica deployment, using the same mechanism as the log stream replica synchronization described in the preceding Replication layer section.
When a transaction is committed, it obtains a timestamp from the GTS as its commit version number, which is persisted in the write-ahead log of the log stream, and the data modified by the transaction is marked with this commit version number.
Regardless of the isolation level, a timestamp is also obtained from the GTS as the read version number when a statement is executed or a transaction begins. When data is read, versions whose commit version numbers are greater than the read version number are skipped, and the latest version with a commit version number smaller than the read version number is returned. This provides consistent global data snapshots for all read operations.
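A minimal multi-version read sketch of the rule above (the version numbers and data layout are invented, and boundary handling at exactly the read version is simplified):

```python
# key -> list of (commit_version, value), oldest first
versions = {
    "account_1": [(100, 900), (120, 850), (150, 700)],
}

def snapshot_read(key, read_version):
    """Return the newest value whose commit version does not exceed read_version."""
    visible = [value for cv, value in versions[key] if cv <= read_version]
    return visible[-1] if visible else None

print(snapshot_read("account_1", 130))   # 850: the version committed at 150 is skipped
print(snapshot_read("account_1", 200))   # 700: all committed versions are visible
```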
SQL layer
The SQL layer converts an SQL request into data access operations on one or more tablets.
Components of the SQL layer
To execute an SQL request, the SQL layer goes through the following steps: parsing, resolution, transformation, optimization, code generation, and execution, handled by the parser, resolver, transformer, optimizer, code generator, and executor respectively. These components work together to convert an SQL request into executable code. The components are described below; a simplified sketch of the pipeline follows the list.
The parser is responsible for lexical and syntactic parsing. It splits the SQL request into individual tokens and parses the request against predefined grammar rules to generate a syntax tree.
The resolver is responsible for semantic parsing. Based on database metadata, it maps the tokens in the SQL request to the corresponding database objects, such as databases, tables, columns, and indexes. The data structure generated by the resolver is called a statement tree.
The transformer is responsible for logical rewriting. Based on internal rules or cost information, it performs equivalent transformations on the statement tree to rewrite the SQL request into equivalent forms, generating a new statement tree that is passed to the optimizer for further optimization.
The optimizer selects the best execution plan for an SQL request. It considers multiple factors, such as the semantics of the request, the characteristics of the accessed objects' data, and the physical distribution of those objects, to determine the access paths, join order, join algorithms, and distributed plan, and ultimately generates an execution plan.
The code generator converts an execution plan into executable code without making any optimization choices.
The executor initiates the execution process of an SQL request.
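As a very rough sketch of how these stages chain together, the stubs below stand in for the real components; the data passed between stages is invented, and nothing resembling real parsing or optimization happens.

```python
def parser(sql):         return {"syntax_tree": sql.split()}                  # lexical + syntactic parsing
def resolver(tree):      return {"stmt_tree": tree, "objects": ["t1"]}        # bind names to schema objects
def transformer(stmt):   return {**stmt, "rewritten": True}                   # equivalent logical rewrite
def optimizer(stmt):     return {"plan": ("TABLE_SCAN", stmt["objects"][0])}  # choose access path, join order, ...
def code_generator(p):   return lambda: f"executing {p['plan']}"              # plan -> executable code
def executor(code):      return code()                                        # run the generated code

def execute_sql(sql):
    result = sql
    for stage in (parser, resolver, transformer, optimizer, code_generator, executor):
        result = stage(result)
    return result

print(execute_sql("SELECT c1 FROM t1"))
```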
In addition to the standard execution process, the SQL layer provides a plan cache that keeps execution plans in memory, so subsequent executions can reuse cached plans and avoid repeated query optimization. The fast parser module parameterizes the SQL text directly through lexical analysis and returns the parameterized text together with the constant parameters, allowing SQL requests to hit the plan cache directly and accelerating frequently executed statements.
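The plan cache and fast parser idea can be approximated as follows; the regex-based parameterization is a stand-in for the real lexer-based fast parser, and the cached "plan" is just a string.

```python
import re

plan_cache = {}
CONSTANT = re.compile(r"\b\d+\b|'[^']*'")   # numeric and string literals (simplified)

def fast_parse(sql):
    """Replace constants with placeholders; return the parameterized text and the constants."""
    params = CONSTANT.findall(sql)
    return CONSTANT.sub("?", sql), params

def get_plan(sql):
    key, params = fast_parse(sql)
    if key not in plan_cache:
        plan_cache[key] = f"optimized plan for: {key}"   # expensive optimization runs once
    return plan_cache[key], params

print(get_plan("SELECT * FROM orders WHERE id = 42"))
print(get_plan("SELECT * FROM orders WHERE id = 7"))     # hits the cached plan
print(len(plan_cache))                                    # 1
```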
Types of plans
The SQL layer supports local, remote, and distributed execution plans. A local execution plan accesses only data on the local server, a remote execution plan accesses data on a single remote server, and a distributed execution plan accesses data on more than one server and is decomposed into multiple subplans that are executed on those servers.
The SQL layer also supports parallel execution: an execution plan is decomposed into multiple parts that are executed in parallel by multiple threads. Parallel execution makes full use of the CPU and I/O resources of the servers to shorten query response time, and it can be applied to both distributed and local execution plans.
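Parallel execution within one server can be illustrated with a thread pool scanning per-tablet subtasks and merging the partial results; the tablet layout and the "filtered scan" are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Four pretend tablets, each holding a slice of rows.
tablets = {f"tablet_{i}": list(range(i * 10, i * 10 + 10)) for i in range(4)}

def scan(tablet_name):
    # Stand-in for a filtered scan of one tablet: count the even rows.
    return sum(1 for row in tablets[tablet_name] if row % 2 == 0)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(scan, tablets))   # subtasks run in parallel
print(sum(partial_counts))                           # merged result: 20
```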
Access layer
OceanBase Database Proxy (ODP), also known as OBProxy, is a component of OceanBase Database. ODP forwards user requests to an appropriate OceanBase instance for processing.
ODP is deployed as an independent process, separate from OceanBase instances. It listens on network ports and is compatible with the MySQL network protocol, so applications can connect to OceanBase Database directly with a MySQL driver.
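For example, a Python application might connect through ODP with an ordinary MySQL driver such as pymysql. The host, port, credentials, and the user@tenant login format below are placeholders to adapt to your deployment.

```python
import pymysql

conn = pymysql.connect(
    host="127.0.0.1",          # address of ODP, or of a load balancer in front of it
    port=2883,                 # port ODP listens on (placeholder)
    user="root@test_tenant",   # user and tenant to log in to (format may vary by deployment)
    password="******",
    database="test",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```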
ODP automatically discovers the tenants and data distribution of an OceanBase cluster. For each SQL statement it receives, ODP identifies the data that the statement accesses and forwards the statement to the OceanBase instance on the server where that data resides.
You can deploy ODP on each application server or on the servers where OceanBase instances run. In the first mode, an application connects to the ODP deployed on the same server, which then routes requests to the appropriate OceanBase server. In the second mode, you need a network load balancer to aggregate multiple ODP instances behind a single entry address for applications.