Overview|V4.3.5| docs|Distributed Database

Overview

Last Updated: 2025-06-25 08:02:52

Concept

Replicas are a fundamental concept within the storage engine of OceanBase Database. From a user perspective, replicas are copies of the same data on different OBServer nodes.

From a database perspective, a replica is a copy of a data partition within OceanBase Database. Each data partition has multiple replicas, as determined by the tenant locality, which provides high horizontal scalability and disaster recovery capabilities.

Data partitioning divides a table or an index into smaller, more manageable parts based on the rules specified when the table is created. Each partition is an independent object with its own name and optional storage characteristics.

Note

OceanBase Database is known for its multi-replica architecture that utilizes Paxos-based technology to ensure high availability. Replicas in this architecture consist of copies of the same data on different OBServer nodes. OceanBase Database stores data in containers of various dimensions, such as data partitions, log streams, units, and tenants. While replicas typically refer to data partition replicas, they may correspond to different database entities in various contexts.

Benefits

Replicas improve the availability and fault tolerance of OceanBase Database. Replicas can be distributed in different geographical locations to cope with network failures and data center failures.

OceanBase Database replicates data to multiple replicas through partition replication or log synchronization to prevent data loss. This way, OceanBase Database can provide lossless database services even if a minority of replicas fail.

Types

OceanBase Database adopts a hierarchical LSM-tree structure. Data is divided into baseline data and incremental data.

Baseline data is persisted as SSTables on disks and remains unchanged once generated.

Incremental data is written by users into MemTables and is stored in memory. Redo logs, also called commit logs (clogs), are used to guarantee transactional properties such as durability.

Multiple replicas of the data are distributed across OBServer nodes. For example, a deployment of three IDCs in the same region has three replicas, and a deployment of five IDCs across three regions has five replicas. When a transaction commits, its redo logs are synchronized across OBServer nodes by using the Paxos protocol. The commit succeeds only after a majority of replicas have persisted the logs, which keeps data consistent between replicas.

The current version of OceanBase Database supports the following types of replicas:

Full-featured replicas

Full-featured replicas are also called standard replicas. They are named FULL, or F for short. A full-featured replica has a full set of data types and features, including redo logs, a MemTable, and an SSTable.

Full-featured replicas act as leaders or followers, and the role is determined per data partition. A leader serves writes and strong-consistency reads, and can also serve weak-consistency reads. A follower serves weak-consistency reads. If the current leader fails, a follower can be quickly elected as the new leader and take over service.

For more information, see Full-featured replicas.

Read-only replicas

The name of read-only replicas is READONLY or R for short. Read-only replicas provide read capabilities and do not provide write capabilities. A read-only replica can serve only as a follower, and cannot participate in election or voting. In other words, a read-only replica cannot be elected as the leader of a log stream.

For more information, see Read-only replicas.

Columnstore replicas

Columnstore replicas are named COLUMNSTORE, or C for short. In a columnstore replica, the baseline data of all user tables in the same log stream is stored in columnar storage mode. Here, the user tables can be replicated tables, but not index tables, internal tables, or system tables. For example, if you create a columnstore table in a full-featured replica, the columnstore table will be stored in columnar storage mode on the OBServer node where a columnstore replica resides. Like a read-only replica, a columnstore replica cannot participate in election or voting. However, a columnstore replica has a full set of features, including an SSTable, clogs, and a MemTable.

For more information, see Columnstore replicas.
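
To see which replica types are actually deployed for the current tenant, a query along the following lines may help. This is a minimal sketch: it assumes that the oceanbase.DBA_OB_LS_LOCATIONS view exposes the ZONE and REPLICA_TYPE columns in your version, so check the view reference before relying on it.

```sql
-- Sketch only: ZONE and REPLICA_TYPE are assumed column names.
-- Counts log stream replicas by zone and replica type
-- (for example FULL, READONLY, or COLUMNSTORE).
SELECT ZONE, REPLICA_TYPE, COUNT(*) AS replica_count
FROM oceanbase.DBA_OB_LS_LOCATIONS
GROUP BY ZONE, REPLICA_TYPE
ORDER BY ZONE, REPLICA_TYPE;
```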

Log streams

Concept

A log stream is a collection of data that is automatically created and managed by OceanBase Database. A log stream includes data partitions, transaction operation logs, and transaction management structures. Redo logs are synchronized across replicas by using the Paxos protocol, which ensures data consistency and high availability. TxCtxMgr is a transaction management structure that allows modifications across all data partitions within a log stream to be committed atomically. A transaction that spans multiple log streams is committed atomically based on the optimized two-phase commit protocol of OceanBase Database, in which log streams are the participants.

Log stream

Log streams are a new concept introduced in OceanBase Database V4.0. A significant difference between OceanBase Database V4.0 and OceanBase Database V3.x lies in the basic unit for transaction commits.

In OceanBase Database V3.x, transactions are committed by partition. The write-ahead logging (WAL) mechanism ensures the atomicity of modifications within a partition. Each partition is a participant in two-phase commit, and the basic unit for transaction commit is a partition.
In OceanBase Database V4.x, transactions are committed by log stream. The WAL mechanism ensures the atomicity of modifications within a log stream. Each log stream is a participant in two-phase commit, and the basic unit for transaction commit is a log stream. This way, OceanBase Database is optimized in terms of resources, performance, and features.

Broadcast log streams

OceanBase Database introduces the concept of broadcast log streams in V4.2.0. When the first replicated table is created for a tenant, the system automatically creates a special log stream, which is called a broadcast log stream. Then, subsequent replicated tables of this tenant are all created in this broadcast log stream. A broadcast log stream differs from a general log stream in that the broadcast log stream automatically deploys a replica on each OBServer node of the tenant, to ensure that the replicated table can provide strong-consistency reads on any OBServer node in ideal conditions.

Generally, the more replicas that participate in Paxos voting, the longer it takes for a majority of replicas to reach a consensus. For a tenant with many OBServer nodes, it is impractical for the replicas on all OBServer nodes to participate in voting. Therefore, a broadcast log stream deploys a full-featured replica on each OBServer node whose replica needs to participate in voting, and a read-only replica on each OBServer node whose replica does not.

Differences between a broadcast log stream and a general log stream in terms of replicas are as follows:

A general log stream deploys only one replica in each zone, and the replica type must match the one specified in the locality.
In each zone, in addition to a replica of the type specified in the locality, a broadcast log stream also deploys a read-only replica on each OBServer node on which resources of the resource unit for the tenant are distributed. A broadcast log stream does not deploy any replica in a zone for which no replica type is specified in the locality.

Limitations of broadcast log streams are described as follows:

The sys tenant and meta tenants do not have broadcast log streams and do not support replicated tables.
A user tenant can have only one broadcast log stream.
Attribute conversion between a broadcast log stream and a general log stream is not supported.
A broadcast log stream cannot be separately deleted. At present, a broadcast log stream can only be deleted together with the corresponding tenant.
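
As an illustration of how a broadcast log stream comes into existence, the following sketch creates a replicated table in a user tenant. The table name dup_t1 and its columns are hypothetical, and the DUPLICATE_SCOPE = 'cluster' table option is assumed to be the replicated-table syntax of your version. Verify it against the CREATE TABLE reference before use.

```sql
-- Sketch only: dup_t1 and its columns are hypothetical.
-- DUPLICATE_SCOPE = 'cluster' is assumed to mark the table as replicated.
-- Run this in a user tenant; the sys tenant and meta tenants
-- do not support replicated tables.
CREATE TABLE dup_t1 (
  c1 INT PRIMARY KEY,
  c2 VARCHAR(32)
) DUPLICATE_SCOPE = 'cluster';
```

If this is the first replicated table of the tenant, the system is expected to create the broadcast log stream automatically, and later replicated tables are placed in that same log stream.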

View the basic information of log streams

You can query the DBA_OB_LS view for the basic information of all log streams in the current tenant, including the log synchronization status and progress. Here are some examples:

View the information of general log streams

You can view the basic information of log streams in the sys tenant or a user tenant. Execute the following statement in the sys tenant. The result shows that the sys tenant has only one log stream with the ID 1.

SELECT * FROM oceanbase.DBA_OB_LS LIMIT 10;

The return result is as follows:

+-------+--------+----------------------------------------+---------------+-------------+------------+----------+----------+--------------+
| LS_ID | STATUS | PRIMARY_ZONE                           | UNIT_GROUP_ID | LS_GROUP_ID | CREATE_SCN | DROP_SCN | SYNC_SCN | READABLE_SCN |
+-------+--------+----------------------------------------+---------------+-------------+------------+----------+----------+--------------+
|     1 | NORMAL | sa128_obv4_2;sa128_obv4_1,sa128_obv4_3 |             0 |           0 |       NULL |     NULL |     NULL |         NULL |
+-------+--------+----------------------------------------+---------------+-------------+------------+----------+----------+--------------+
1 row in set

View the information of a broadcast log stream

You can view the information of a broadcast log stream only in a user tenant. The sys tenant does not have a broadcast log stream. Execute the following statement in a user tenant. The result shows the information of the broadcast log stream of the user tenant. The replicated table is created in the broadcast log stream.

SELECT * FROM oceanbase.DBA_OB_LS WHERE FLAG LIKE '%DUPLICATE%';

The return result is as follows:

+-------+--------+--------------+---------------+-------------+---------------------+----------+---------------------+---------------------+-----------+
| LS_ID | STATUS | PRIMARY_ZONE | UNIT_GROUP_ID | LS_GROUP_ID | CREATE_SCN          | DROP_SCN | SYNC_SCN            | READABLE_SCN        | FLAG      |
+-------+--------+--------------+---------------+-------------+---------------------+----------+---------------------+---------------------+-----------+
|  1003 | NORMAL | z1;z2        |             0 |           0 | 1683267390195713284 |     NULL | 1683337744205408139 | 1683337744205408139 | DUPLICATE |
+-------+--------+--------------+---------------+-------------+---------------------+----------+---------------------+---------------------+-----------+

View the location information and role information of a log stream

The location information of a log stream records the OBServer nodes on which the log stream is distributed. In the oceanbase.DBA_OB_LS_LOCATIONS view, the MEMBER_LIST field shows the distribution of full-featured replicas of a log stream, and the LEARNER_LIST field shows the distribution of its read-only replicas. No independent location information is provided for data partitions. Instead, the location of a data partition is determined by the location of the log stream to which it belongs. You can migrate and replicate log streams across OBServer nodes for performance balancing and disaster recovery.

The role information of a log stream indicates whether a replica of the log stream is the leader or a follower. You can query the role information of log streams from the ROLE field in the oceanbase.DBA_OB_LS_LOCATIONS view. No independent role information is provided for data partitions. Instead, the role of a data partition is determined by the role of the log stream to which it belongs. The roles of log streams are determined based on the election protocol.

For more information about the oceanbase.DBA_OB_LS_LOCATIONS view, see DBA_OB_LS_LOCATIONS.
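
The location and role fields described above can be read together in one query. The following is a minimal sketch: ROLE, MEMBER_LIST, and LEARNER_LIST come from the description above, while LS_ID, ZONE, SVR_IP, and SVR_PORT are assumed column names that you should confirm in the view reference.

```sql
-- Sketch only: LS_ID, ZONE, SVR_IP, and SVR_PORT are assumed column names.
-- Lists each log stream replica with its node, its role, and the
-- voting (MEMBER_LIST) and non-voting (LEARNER_LIST) replica lists.
SELECT LS_ID, ZONE, SVR_IP, SVR_PORT, ROLE, MEMBER_LIST, LEARNER_LIST
FROM oceanbase.DBA_OB_LS_LOCATIONS
ORDER BY LS_ID, ROLE;
```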

View the mapping between data partitions and log streams

You can query the mapping between data partitions and log streams of a tenant from the DBA_OB_TABLE_LOCATIONS view. Each replica of a data partition records the basic information of the data partition and the information about the log stream to which the data partition belongs.

For more information about the oceanbase.DBA_OB_TABLE_LOCATIONS view, see DBA_OB_TABLE_LOCATIONS.
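
As a rough example of looking up this mapping, the following query finds the log stream that holds each partition of one table. The table name t1 is hypothetical, and the column names TABLE_NAME, TABLET_ID, LS_ID, ZONE, SVR_IP, SVR_PORT, and ROLE are assumptions to verify against the view reference.

```sql
-- Sketch only: t1 is a hypothetical table, and the column names are assumed.
-- Returns one row per replica of each data partition of the table,
-- together with the log stream (LS_ID) that the partition belongs to.
SELECT TABLE_NAME, TABLET_ID, LS_ID, ZONE, SVR_IP, SVR_PORT, ROLE
FROM oceanbase.DBA_OB_TABLE_LOCATIONS
WHERE TABLE_NAME = 't1'
ORDER BY TABLET_ID, ROLE;
```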