To ensure data security and highly available data services, each partition is stored as multiple physical copies. Each copy is called a replica of the partition. A replica may hold three main kinds of data: static data stored in the SSTable on disk, incremental data stored in the MemTable in memory, and logs that record transactions. Several replica types are available, distinguished by which of these kinds of data they store. This accommodates different business requirements for data security, performance scalability, availability, and cost.
Full-featured replica
A general-purpose replica that has the full set of data and features: transaction logs, a MemTable, and an SSTable. This replica can quickly take over as the leader to provide services for external applications.
Log replica
A replica that contains only logs. It has no MemTable or SSTable. It provides log services for external applications and participates in log voting. It can facilitate the recovery of other replicas, but it cannot become the leader to provide database services.
Read-only replica
A replica that contains complete logs, a MemTable, and an SSTable, but handles logs differently from the other types. It does not participate in log voting as a member of the Paxos group. Instead, it works as a listener that replicates logs from the Paxos group members and then replays them locally. When an application does not require strongly consistent reads, these replicas can provide read-only services. Because they are not part of the Paxos group, the set of voting members does not grow, so they do not increase the transaction commit latency.
The differences among the three replica types are detailed below:
| Type | LOG | MemTable | SSTable | Data security | Time for regaining leadership | Resource cost | Services | Name (abbreviation) |
|---|---|---|---|---|---|---|---|---|
| Full-featured replica | It has logs and participates in voting (SYNC_CLOG). | Yes (WITH_MEMSTORE) | Yes (WITH_SSSTORE) | High | Fast | High | Provides data reading and writing services as the leader and non-consistent reading services as a follower | FULL (F) |
| Log replica | It has logs and participates in voting (SYNC_CLOG). | No (WITHOUT_MEMSTORE) | No (WITHOUT_SSSTORE) | Low | Not supported | Low | Not readable or writable | LOGONLY (L) |
| Read-only replica | It has asynchronous logs. It is only a listener instead of a member of the Paxos group (ASYNC_CLOG). | Yes (WITH_MEMSTORE) | Yes (WITH_SSSTORE) | Medium | Not supported | High | Non-consistent reading | READONLY (R) |
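The table above can be sketched as a small data model. This is only an illustration: the class `ReplicaType`, its field names, and the helper `voting_members` are hypothetical names chosen for this example, not part of OceanBase Database. The rule used in `can_become_leader` is an assumption derived from the table (only the replica type that both votes and holds full data supports leadership).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaType:
    """Hypothetical model of one row of the replica-type table."""
    name: str
    abbreviation: str
    votes_in_paxos: bool  # True for SYNC_CLOG, False for ASYNC_CLOG
    has_memtable: bool    # WITH_MEMSTORE / WITHOUT_MEMSTORE
    has_sstable: bool     # WITH_SSSTORE / WITHOUT_SSSTORE

    @property
    def can_become_leader(self) -> bool:
        # Assumption taken from the table: leadership needs voting
        # logs plus a MemTable and an SSTable.
        return self.votes_in_paxos and self.has_memtable and self.has_sstable

    @property
    def can_serve_reads(self) -> bool:
        # A replica without a MemTable and SSTable has no data to read.
        return self.has_memtable and self.has_sstable


FULL = ReplicaType("FULL", "F", True, True, True)
LOGONLY = ReplicaType("LOGONLY", "L", True, False, False)
READONLY = ReplicaType("READONLY", "R", False, True, True)


def voting_members(replicas):
    """Return only the replicas that take part in log voting."""
    return [r for r in replicas if r.votes_in_paxos]
```

For example, in a hypothetical layout of two full-featured replicas, one log replica, and one read-only replica, `voting_members` returns three replicas: adding the read-only replica does not enlarge the voting group.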
Partitions created under a tenant are stored in resource units. These resource units determine the number and distribution of partition replicas.
In an OceanBase Database cluster, zones serve as the boundary for the replication of partitions. In general, each zone stores one replica of all data, and each partition has exactly one replica in any given zone. If a tenant's resource units span three zones in a single region, partitions created under this tenant will have three replicas, one in each zone. If a tenant's resource units span five zones across three regions, partitions created under this tenant will have five replicas.
Among a partition's many replicas, one is elected as the leader. The leader provides database services, enables data reading and writing, and, through the Paxos protocol, synchronizes data modifications among the replicas through redo logs. The follower replicas replay the redo logs synchronized to them to remain consistent with the leader.
Election of the leader is determined by the primary_zone property of the tenant. You can specify this property when you create a tenant and execute an ALTER statement to modify it.
Example: Creating a tenant and specifying its primary_zone
The following statement creates a tenant whose primary_zone is zone1. primary_zone='zone1' means that all partitions created in the tt1 tenant will have their leader located in zone1. Replicas in other zones can be elected as the leader only if zone1 fails.
create tenant tt1 replica_num = 1, primary_zone='zone1', resource_pool_list=('pool_large1');
You can specify multiple zones for primary_zone and specify their priority.
For example:
primary_zone='zone1,zone2' means that replicas in zone1 and zone2 have an equal chance to be elected as the leader.
primary_zone='zone1;zone2;zone3' means that the leader is elected from replicas in zone1. If zone1 fails, the leader is elected from replicas in zone2.
When you set the primary_zone parameter, use a comma between two zones to give them the same priority, and use a semicolon to give the zone before the semicolon a higher priority than the one after it. You can also combine commas and semicolons. For example, primary_zone='zone1,zone2;zone3,zone4;zone5' means that replicas in zone1 and zone2 have an equal chance to be elected as the leader. If zone1 fails, the leader is elected from replicas in zone2. If both zone1 and zone2 fail, replicas in zone3 and zone4 have an equal chance to be elected as the leader.
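The comma and semicolon rules can be illustrated with a minimal sketch. The functions `parse_primary_zone` and `leader_candidates` are hypothetical helpers written for this example and are not part of OceanBase Database. The sketch models only the priority rules described above, assuming that a failed zone is simply excluded; the actual election may take other factors into account.

```python
def parse_primary_zone(value):
    """Split a primary_zone string into priority tiers.

    Semicolons separate tiers (highest priority first);
    commas separate zones of equal priority within a tier.
    """
    tiers = []
    for group in value.split(";"):
        zones = [z.strip() for z in group.split(",") if z.strip()]
        if zones:
            tiers.append(zones)
    return tiers


def leader_candidates(value, failed=()):
    """Return the zones whose replicas may be elected leader.

    The result is the highest-priority tier that still has at least
    one zone that is not in `failed`; failed zones are filtered out.
    """
    failed = set(failed)
    for tier in parse_primary_zone(value):
        alive = [z for z in tier if z not in failed]
        if alive:
            return alive
    return []  # every listed zone has failed
```

With `'zone1,zone2;zone3,zone4;zone5'`, calling `leader_candidates` with no failed zones yields zone1 and zone2; with zone1 failed it yields only zone2; with both zone1 and zone2 failed it yields zone3 and zone4.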