Concepts of Replica
In OceanBase Database, a replica is a copy of a data set in the storage engine. Here, the data refers to a user-level concept.
At the OceanBase Database level, data is partitioned, and each partition is redundantly stored in multiple regions based on the locality attributes of the tenant. This approach ensures high horizontal scalability and a higher level of disaster recovery capability.
Data partitioning divides a table or an index into smaller, more manageable pieces based on the rules specified when the table is created. Each data partition is an independent object with its own name and optional storage characteristics.
Note
OceanBase Database is well known for its multi-replica architecture. This architecture, which is based on the Paxos protocol, is the foundation of the high availability of OceanBase Database. A "replica" in a multi-replica architecture is a copy of the same data stored on a different node. OceanBase Database provides multiple types of storage containers for data, such as data partitions, log streams, units, and tenants. In general, "replica" refers to a data partition replica. However, note that "replica" may refer to different database entities in different contexts.
Purpose of a replica
Replicas improve the availability and fault tolerance of OceanBase Database. You can create multiple replicas of a table in different regions or zones to protect against network or data center failures.
OceanBase Database replicates data across multiple replicas using techniques like partition replication and log synchronization to prevent data loss and ensure continuous, lossless database services even if a minority of replicas fail.
Replica types
The storage engine of OceanBase Database is based on a hierarchical LSM-tree structure, and data is divided into baseline data and incremental data.
Baseline data is persisted to disk and is not modified after it is generated. It is stored in SSTables.
Incremental data is stored in memory in the MemTable, and user writes go to the MemTable first. Writes are also persisted through the RedoLog (also known as the CommitLog, or CLog) to ensure transactionality.
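The write path described above can be illustrated with a toy model. This is a minimal sketch of a generic LSM-style store, not the OceanBase storage engine: the class name `ToyLSMStore` and its methods are hypothetical, and the redo log is kept in a Python list rather than on disk.

```python
class ToyLSMStore:
    """Toy LSM-style store: redo log + MemTable + immutable SSTables.

    Hypothetical illustration only; all names are assumptions.
    """

    def __init__(self):
        self.redo_log = []   # append-only records of every write
        self.memtable = {}   # incremental data, held in memory
        self.sstables = []   # baseline data, oldest first, never modified

    def write(self, key, value):
        # Record the change in the redo log first so that it could be
        # replayed later, then apply it to the in-memory MemTable.
        self.redo_log.append((key, value))
        self.memtable[key] = value

    def freeze(self):
        # Turn the current MemTable into a new immutable, sorted SSTable.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Newest data wins: check the MemTable, then SSTables newest first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

In this sketch, a value written after a freeze shadows the older value in the SSTable, which mirrors how incremental data overlays baseline data on reads.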
The SSTables, MemTables, and RedoLogs are stored redundantly in multiple replicas (for example, three replicas in the same city, or five replicas across three regions). When a transaction is committed, the RedoLogs are synchronized across the nodes using the Paxos protocol to ensure consistency across the replicas.
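Because the RedoLogs are synchronized with a majority-based protocol, the number of voting replicas determines how many replica failures can be tolerated. The following sketch shows the standard majority arithmetic. The function names are hypothetical and are not part of any OceanBase API.

```python
def majority(paxos_replicas: int) -> int:
    """Smallest number of voting replicas that forms a majority."""
    if paxos_replicas < 1:
        raise ValueError("at least one voting replica is required")
    return paxos_replicas // 2 + 1


def tolerated_failures(paxos_replicas: int) -> int:
    """Number of voting replicas that can fail while a majority survives."""
    return paxos_replicas - majority(paxos_replicas)
```

With three voting replicas, a majority is two and one failure is tolerated. With five voting replicas, a majority is three and two failures are tolerated.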
OceanBase Database supports the following replica types in the current version:
Full-featured replica
A full-featured replica, also known as a standard replica, has the full name FULL and the short name F. It contains the complete data and functionality, including the RedoLog, MemTable, and SSTable.
For a full-featured replica, each data partition has one of two roles: leader or follower. The leader mainly provides write and strong-consistency read services. It can also provide weak-consistency read services. A follower provides weak-consistency read services. If the leader fails, a follower can be quickly promoted to leader to continue providing services.
The full-featured replica is a mandatory replica. The number of full-featured replicas for a tenant must be greater than or equal to 1. For more information about the full-featured replica, see Full-featured replica.
Read-only replica
The name of a read-only replica is READONLY, abbreviated as R. Unlike a full-featured replica, a read-only replica only provides read capabilities, without write capabilities. It can only serve as a follower replica in a log stream, without participating in elections or log voting. It cannot be elected as the leader replica for a log stream.
A read-only replica is an optional replica. You can choose to deploy one as needed. For more information about read-only replicas, see Read-only replica.
Columnstore replica
A columnstore replica is denoted by the name COLUMNSTORE, with C being its abbreviation. A columnstore replica refers to the scenario in which baseline data of all user tables within the same log stream is stored in columnar format. In this context, user tables refer to replicated tables only and exclude index tables, internal tables, and system tables. For example, if a user creates a rowstore table on a replica denoted by R, the same table will be stored in columnar format on the machine that hosts the replica denoted by C. Similar to a read-only replica, a columnstore replica does not participate in leader elections or log voting. Additionally, it possesses a complete set of SSTables, clogs, and MemTables.
A columnstore replica is an optional replica, usually used for analytic processing (AP) scenarios. Users can decide whether to deploy a columnstore replica based on actual business needs. For more information about columnstore replicas, see Columnstore replica.
For details about how to deploy a columnstore replica in AP scenarios, see Overview of OceanBase AP deployment.
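The differences among the three replica types can be summarized in a small data model. This is a sketch based only on the descriptions above. The class, field, and function names are hypothetical, and `check_replica_set` encodes just the rule that a tenant needs at least one full-featured replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaType:
    name: str
    short_name: str
    votes: bool              # takes part in elections and log voting
    can_be_leader: bool      # can serve writes and strong-consistency reads
    columnar_baseline: bool  # baseline data of user tables stored by column


FULL = ReplicaType("FULL", "F", True, True, False)
READONLY = ReplicaType("READONLY", "R", False, False, False)
COLUMNSTORE = ReplicaType("COLUMNSTORE", "C", False, False, True)

BY_SHORT_NAME = {t.short_name: t for t in (FULL, READONLY, COLUMNSTORE)}


def check_replica_set(short_names):
    """Return the number of voting replicas; reject a set without an F replica."""
    types = [BY_SHORT_NAME[s] for s in short_names]
    voters = sum(t.votes for t in types)
    if voters < 1:
        raise ValueError("at least one full-featured (F) replica is required")
    return voters
```

For example, a replica set of three F replicas, one R replica, and one C replica has three voting replicas. Only those three can be elected leader.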
Introduction to log streams
Overview
A log stream is an entity automatically created and managed by OceanBase Database. It represents a collection of data, including several data partitions, and the transaction logs and transaction management structures for these partitions. The redo log module is implemented based on the Paxos protocol, ensuring data consistency across replicas and achieving high availability. The transaction context manager (TxCtxMgr) is a transaction management structure. Modifications to all data partitions within a log stream can be atomically committed within the log stream. When a transaction spans multiple log streams, OceanBase Database uses its optimized two-phase commit protocol to ensure atomic commitment of the transaction. A log stream participates in distributed transactions.
The concept of a log stream was introduced in OceanBase Database V4.0. Compared with OceanBase Database V3.x, the most significant change in OceanBase Database V4.0 is the change in the basic unit of transaction commit, which brings significant value in terms of resources, performance, and functionality.
In OceanBase Database V3.x, transactions are committed at the partition level. Modifications within a partition are guaranteed to be atomic by the write-ahead logging (WAL) within the partition. Each partition participates in the two-phase commit protocol, and the basic unit of transaction commit is the partition.
In OceanBase Database V4.x, transactions are committed at the log stream level. Modifications within a log stream are guaranteed to be atomic by the WAL within the log stream. Each log stream participates in the two-phase commit protocol, and the basic unit of transaction commit is the log stream.
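The effect of changing the commit unit can be sketched as follows. This is an illustrative model, not OceanBase code: the function names, the `unit` parameter, and the example partition and log stream identifiers are all assumptions. A transaction needs the two-phase commit protocol only when it has more than one commit participant.

```python
def commit_participants(touched_partitions, partition_to_ls, unit="log_stream"):
    """Return the set of commit participants for one transaction.

    unit="partition" models V3.x, where each partition is a participant.
    unit="log_stream" models V4.x, where each log stream is a participant.
    """
    if unit == "partition":
        return set(touched_partitions)
    if unit == "log_stream":
        return {partition_to_ls[p] for p in touched_partitions}
    raise ValueError("unit must be 'partition' or 'log_stream'")


def needs_two_phase_commit(participants):
    """More than one participant requires two-phase commit."""
    return len(participants) > 1


# Hypothetical mapping: partitions p0 and p1 share log stream 1001.
partition_to_ls = {"p0": 1001, "p1": 1001, "p2": 1002}
```

In this model, a transaction that modifies p0 and p1 has two participants under partition-level commit but only one under log-stream-level commit, so it can be committed atomically within log stream 1001 without two-phase commit.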
Broadcast log stream
OceanBase Database V4.2.0 introduced the concept of a broadcast log stream. When the first replicated table is created in a tenant, a special log stream called a broadcast log stream is automatically created. Subsequent replicated tables are created in this broadcast log stream. The key difference between a broadcast log stream and a regular log stream is that a broadcast log stream automatically deploys a replica on each OBServer node within the tenant, so that, under ideal conditions, replicated tables can serve strong-consistency reads on any OBServer node.
Generally, when too many replicas participate in consensus voting, it takes longer to reach a majority decision. In a tenant with many OBServer nodes, it is impractical to have all replicas participate in voting. Therefore, a broadcast log stream deploys read-only replicas (R replicas) on OBServer nodes that do not need to participate in voting and full-featured replicas (F replicas) on nodes that do need to participate in voting.
The differences between a broadcast log stream and a regular log stream in terms of replicas are as follows:
For a regular log stream, each zone can have only one replica, and the replica type must match the one specified in the locality.
For a broadcast log stream, each zone has the replica type specified in the locality, plus an additional read-only replica on every other machine within the zone that has tenant units. Zones without a specified replica type in the locality do not have any replicas.
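The placement rules for regular and broadcast log streams can be sketched as a small function. This is an illustrative model only. The function name, the input shapes, and the choice of the first listed server as the host of the locality-specified replica are assumptions made for the example.

```python
def place_replicas(locality, servers_with_units, broadcast=False):
    """Return {zone: [(server, replica_type), ...]} for one log stream.

    locality: {zone: replica type short name}, for example {"z1": "F"}.
    servers_with_units: {zone: [servers that host tenant units]}.
    """
    placement = {}
    # Only zones named in the locality receive replicas.
    for zone, replica_type in locality.items():
        servers = servers_with_units.get(zone, [])
        if not servers:
            continue
        # Assumption: the locality-specified replica goes on the first server.
        replicas = [(servers[0], replica_type)]
        if broadcast:
            # A broadcast log stream adds an R replica on every other
            # server in the zone that has tenant units.
            replicas += [(server, "R") for server in servers[1:]]
        placement[zone] = replicas
    return placement
```

For a locality of F replicas in z1 and z2, with two unit-hosting servers in z1 and one in z2, a regular log stream gets one replica per zone, while a broadcast log stream gets an extra R replica on the second server in z1.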
The usage limitations of a broadcast log stream are as follows:
The sys tenant and all Meta tenants do not have broadcast log streams and do not support the creation of replicated tables.
Each user tenant can have at most one broadcast log stream.
Attribute conversion between a broadcast log stream and a regular log stream is not supported.
Manual deletion of a broadcast log stream is not supported. Currently, a broadcast log stream is deleted only when the corresponding tenant is deleted.
View basic information about log streams
You can query the DBA_OB_LS view to obtain basic information about all log streams in the current tenant, including the status and log progress. For example:
View information about a regular log stream
Both the sys tenant and user tenants can query basic information about their corresponding log streams. The following example is executed in the sys tenant, which has only one log stream, log stream 1.
obclient(root@sys)[oceanbase]> SELECT * FROM oceanbase.DBA_OB_LS limit 1;
The result is as follows.
+-------+--------+----------------------------------------+---------------+-------------+------------+----------+----------+--------------+-----------+-----------+
| LS_ID | STATUS | PRIMARY_ZONE                           | UNIT_GROUP_ID | LS_GROUP_ID | CREATE_SCN | DROP_SCN | SYNC_SCN | READABLE_SCN | FLAG      | UNIT_LIST |
+-------+--------+----------------------------------------+---------------+-------------+------------+----------+----------+--------------+-----------+-----------+
|     1 | NORMAL | sa128_obv4_2;sa128_obv4_1,sa128_obv4_3 |             0 |           0 |       NULL |     NULL |     NULL |         NULL |           |           |
+-------+--------+----------------------------------------+---------------+-------------+------------+----------+----------+--------------+-----------+-----------+
1 row in set
View information about a broadcast log stream
Only user tenants can query information about broadcast log streams. The sys tenant does not have any broadcast log streams. The following example is executed in a user tenant, which has a broadcast log stream. Replicated tables are created in this log stream.
obclient(root@mysql001)[oceanbase]> SELECT * FROM oceanbase.DBA_OB_LS WHERE flag LIKE "%DUPLICATE%";
The result is as follows.
+-------+--------+--------------+---------------+-------------+---------------------+----------+---------------------+---------------------+-----------+-----------+
| LS_ID | STATUS | PRIMARY_ZONE | UNIT_GROUP_ID | LS_GROUP_ID | CREATE_SCN          | DROP_SCN | SYNC_SCN            | READABLE_SCN        | FLAG      | UNIT_LIST |
+-------+--------+--------------+---------------+-------------+---------------------+----------+---------------------+---------------------+-----------+-----------+
|  1003 | NORMAL | z1;z2        |             0 |           0 | 1683267390195713284 |     NULL | 1683337744205408139 | 1683337744205408139 | DUPLICATE |           |
+-------+--------+--------------+---------------+-------------+---------------------+----------+---------------------+---------------------+-----------+-----------+
1 row in set
View the location and role information of a log stream
A log stream has location information, which records the nodes where it is distributed. You can query the MEMBER_LIST and LEARNER_LIST columns of the oceanbase.DBA_OB_LS_LOCATIONS view to obtain the distribution of full-featured replicas and read-only replicas of the log stream, respectively. Data partitions no longer have independent location information. Instead, their location is determined by the location of the log stream to which they belong. A log stream can be migrated or replicated between nodes to achieve performance balancing and disaster recovery.
A log stream has role information, which records whether it is a leader or a follower. You can query the ROLE column of the oceanbase.DBA_OB_LS_LOCATIONS view to obtain the role information of the log stream. Data partitions no longer have independent role information. Instead, their role is determined by the role of the log stream to which they belong. The role of a log stream is determined through an election protocol.
For more information about the oceanbase.DBA_OB_LS_LOCATIONS view, see DBA_OB_LS_LOCATIONS.
View the mapping between data partitions and log streams
You can query the DBA_OB_TABLE_LOCATIONS view to obtain the mapping between data partitions and log streams in the current tenant. Each record in the view corresponds to a replica of a data partition and contains the basic information of the data partition and the information about the log stream to which the data partition belongs.
For more information about the oceanbase.DBA_OB_TABLE_LOCATIONS view, see DBA_OB_TABLE_LOCATIONS.
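Because a data partition inherits its location and role from its log stream, the leader of a partition can be derived by combining the partition-to-log-stream mapping with the log stream locations. The sketch below works on plain Python dictionaries rather than on query results. The field names `ls_id`, `server`, and `role`, the role values, and the sample data are assumptions for illustration and are not the exact view schema.

```python
def partition_leaders(partition_to_ls, ls_replicas):
    """Map each partition to the server that hosts its log stream's leader.

    partition_to_ls: {partition name: log stream ID}.
    ls_replicas: one dict per log stream replica with the hypothetical
    keys "ls_id", "server", and "role".
    """
    leader_of_ls = {
        replica["ls_id"]: replica["server"]
        for replica in ls_replicas
        if replica["role"] == "LEADER"
    }
    # A partition has no role of its own; it follows its log stream.
    return {
        partition: leader_of_ls.get(ls_id)
        for partition, ls_id in partition_to_ls.items()
    }


# Hypothetical sample data.
ls_replicas = [
    {"ls_id": 1001, "server": "node1", "role": "LEADER"},
    {"ls_id": 1001, "server": "node2", "role": "FOLLOWER"},
    {"ls_id": 1002, "server": "node2", "role": "LEADER"},
]
partition_to_ls = {"t1.p0": 1001, "t1.p1": 1001, "t2.p0": 1002}
```

In this sample, both partitions of t1 resolve to node1 because they belong to log stream 1001. If the leader of that log stream moved to node2, both partitions would follow it without any per-partition change.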
