Based on the concept of partitioned tables in conventional databases, OceanBase Database divides the data of a table into partitions. In a distributed environment, OceanBase Database copies the data of each partition to multiple servers to ensure high availability of data read and write services. Each copy of a partition's data on a server is called a replica. Strong consistency among the replicas of a partition is ensured by the Paxos protocol. A partition and its replicas constitute an independent Paxos group, in which one replica serves as the leader and the others serve as followers. The leader supports strong-consistency reads and writes, and the followers support weak-consistency reads.
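The leader/follower split described above can be illustrated with a toy model: the leader always sees the latest committed write, while a follower may lag and serve an earlier version. All class and attribute names here are invented for illustration and are not OceanBase internals.

```python
class Replica:
    def __init__(self, role):
        self.role = role                # "LEADER" or "FOLLOWER"
        self.committed_version = 0      # latest version replicated to this copy

class Partition:
    def __init__(self, replica_count=3):
        self.replicas = [Replica("LEADER")] + [
            Replica("FOLLOWER") for _ in range(replica_count - 1)
        ]

    def leader(self):
        return next(r for r in self.replicas if r.role == "LEADER")

    def write(self, version):
        # A write is driven by the leader; followers may still be catching up.
        self.leader().committed_version = version

    def strong_read(self):
        # Strong-consistency read: always served by the leader.
        return self.leader().committed_version

    def weak_read(self, replica_index):
        # Weak-consistency read: may return data of an earlier version.
        return self.replicas[replica_index].committed_version

p = Partition()
p.write(5)
print(p.strong_read())   # 5: the leader sees the write immediately
print(p.weak_read(1))    # 0: this follower has not caught up yet
```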
Location cache
OceanBase Database organizes user data by partition. Each partition has multiple replicas for disaster recovery. To execute an SQL statement, OceanBase Database must first obtain the partition locations. With the locations, it can identify the server that holds the target partition replica and read data from or write data to that replica. Each observer process provides a service that refreshes and caches the partition locations required by the local server. This service is called the location cache service.
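A location cache of this kind can be sketched as a small lookup-through cache: SQL execution asks the cache for a partition's replica locations, and on a miss the cache refreshes from a lookup source standing in for the meta tables. The names, addresses, and partition ID below are hypothetical.

```python
class LocationCache:
    def __init__(self, fetch_from_meta):
        self._fetch = fetch_from_meta   # callable: partition_id -> locations
        self._cache = {}

    def get_locations(self, partition_id):
        if partition_id not in self._cache:             # cache miss: refresh
            self._cache[partition_id] = self._fetch(partition_id)
        return self._cache[partition_id]                # cache hit

# Hypothetical meta-table lookup: partition 1001 has replicas on three servers.
meta = {1001: [("10.0.0.1", "LEADER"),
               ("10.0.0.2", "FOLLOWER"),
               ("10.0.0.3", "FOLLOWER")]}
cache = LocationCache(lambda pid: meta[pid])

locs = cache.get_locations(1001)
leader = next(addr for addr, role in locs if role == "LEADER")
print(leader)   # 10.0.0.1 — route strong-consistency reads and writes here
```

Subsequent lookups for the same partition are served from the local cache until it is refreshed.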
OceanBase Database persists the partition locations to built-in tables, which are called meta tables. To cope with cluster bootstrap, OceanBase Database hierarchically organizes and persists the locations of different types of tables to meta tables at different levels. These meta tables record different content:
__all_virtual_core_root_table: records the location of the __all_root_table table.
__all_root_table: records the locations of all built-in tables in the cluster.
__all_virtual_meta_table: records the locations of the partitions of all user tables under all tenants in the cluster.
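The three-level lookup implied by these meta tables can be sketched as a resolution chain: the core root table locates __all_root_table, which locates the other built-in tables including __all_virtual_meta_table, which in turn locates user-table partitions. The tenant, table, and server names below are invented for illustration.

```python
# Each level maps an entry to the server that stores it (illustrative data).
core_root = {"__all_root_table": "server-A"}
root_table = {"__all_virtual_meta_table": "server-B"}   # built-in tables
meta_table = {("tenant1", "orders", 0): "server-C"}     # user-table partitions

def locate_user_partition(tenant, table, part_id):
    # Step 1: the core root table tells us where __all_root_table lives.
    root_server = core_root["__all_root_table"]
    # Step 2: __all_root_table (read from root_server) locates the meta table.
    meta_server = root_table["__all_virtual_meta_table"]
    # Step 3: __all_virtual_meta_table locates the user partition itself.
    return meta_table[(tenant, table, part_id)], (root_server, meta_server)

location, lookup_path = locate_user_partition("tenant1", "orders", 0)
print(location)      # server-C
print(lookup_path)   # ('server-A', 'server-B')
```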
Replica types
In OceanBase Database, each partition is stored as multiple physical copies to ensure data security and high availability of data services. Each copy is called a replica of the partition. A replica may include three major types of data: static data stored in SSTables on disk, incremental data stored in MemTables in memory, and logs that record transactions. Several replica types are available depending on which of these data types they store, to support different business preferences in terms of data security, performance scalability, availability, and cost.
OceanBase Database supports the following four types of replicas:
Full-featured replica (FULL/F)
Log-only replica (LOGONLY/L)
Encrypted voting replica (ENCRYPTVOTE/E)
Read-only replica (READONLY/R)
The full-featured replica, log-only replica, and encrypted voting replica are Paxos replicas that can join a Paxos group. The read-only replica is a non-Paxos replica that cannot join a Paxos group.
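One practical consequence of this split is that only Paxos replicas count toward the majority needed for log commits; read-only replicas do not vote. The sketch below, with invented names, shows how quorum size would be derived from a member list.

```python
# Replica types that are Paxos members and therefore vote (per the list above).
PAXOS_TYPES = {"FULL", "LOGONLY", "ENCRYPTVOTE"}

def is_paxos_member(replica_type):
    return replica_type in PAXOS_TYPES

# A hypothetical partition with three Paxos replicas and one read-only replica.
members = ["FULL", "FULL", "LOGONLY", "READONLY"]
voters = [t for t in members if is_paxos_member(t)]
majority = len(voters) // 2 + 1

print(len(voters))   # 3: the read-only replica is excluded
print(majority)      # 2: majority is computed over Paxos members only
```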
Distributed consensus protocol
OceanBase Database synchronizes transaction logs among the replicas of a partition based on the Paxos protocol. A transaction log is committed only after it has been synchronized to a majority of the replicas. The leader ensures strong-consistency reads and writes by default. Followers support weak-consistency reads, which may return data of an earlier version.
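The majority rule can be stated as a one-line check: a log entry commits once the number of Paxos replicas that have persisted it exceeds half of the group. This is a minimal sketch of the quorum arithmetic, not OceanBase's actual log module.

```python
def is_committed(ack_count, paxos_replica_count):
    # Majority means strictly more than half of the Paxos group.
    return ack_count > paxos_replica_count // 2

# With 5 Paxos replicas, 3 acknowledgements form a majority; 2 do not.
print(is_committed(3, 5))   # True
print(is_committed(2, 5))   # False

# With 3 replicas, the group tolerates one failure: 2 acks still commit.
print(is_committed(2, 3))   # True
```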
Data balancing
OceanBase Database uses RootService to manage load balancing among the resource units of a tenant. Different types of replicas require different amounts of resources. RootService considers the CPU utilization, disk usage, memory usage, and IOPS of each resource unit when it manages partitions. To make full use of the resources available on each server, RootService balances the usage of each resource type across all servers.
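The multi-dimensional view matters because a server can be balanced on one resource but skewed on another. The toy scoring below uses a simple max-min spread per dimension to find the most unbalanced resource; this scoring function is an assumption for illustration, not RootService's actual algorithm.

```python
# Hypothetical per-server usage ratios for three resource dimensions.
servers = {
    "s1": {"cpu": 0.80, "mem": 0.40, "disk": 0.50},
    "s2": {"cpu": 0.30, "mem": 0.45, "disk": 0.55},
    "s3": {"cpu": 0.35, "mem": 0.50, "disk": 0.45},
}

def spread(dimension):
    # Max-min spread: 0 means perfectly even usage across servers.
    values = [usage[dimension] for usage in servers.values()]
    return max(values) - min(values)

# CPU is the most skewed dimension here, so migrating a CPU-heavy resource
# unit off s1 would be the natural first balancing move.
worst = max(["cpu", "mem", "disk"], key=spread)
print(worst)   # cpu
```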
RootService balances data in the following two ways:
Replica balancing
RootService adjusts the resource usage of a tenant on different servers by migrating resource units or replicas or creating replicas.
Leader balancing
After replica balancing is achieved, RootService further balances leaders among servers based on the primary zone of the tenant. Specifically, RootService aggregates leaders to the same server to reduce distributed transactions and shorten the response time (RT) of business requests. Alternatively, RootService distributes leaders on multiple servers to make full use of server resources and increase the system throughput.
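The two leader-placement strategies described above trade latency against throughput. The sketch below, with invented partition and server names, contrasts them: aggregation puts every leader on one server, while distribution spreads leaders round-robin.

```python
def place_leaders(partitions, servers, mode):
    # "aggregate": all leaders on one server (fewer distributed transactions,
    # lower response time for multi-partition business requests).
    if mode == "aggregate":
        return {p: servers[0] for p in partitions}
    # "distribute": leaders spread round-robin (uses every server's resources,
    # higher overall throughput).
    if mode == "distribute":
        return {p: servers[i % len(servers)] for i, p in enumerate(partitions)}
    raise ValueError(f"unknown mode: {mode}")

parts = ["p0", "p1", "p2", "p3"]
svrs = ["z1-observer1", "z1-observer2"]
print(place_leaders(parts, svrs, "aggregate"))
print(place_leaders(parts, svrs, "distribute"))
```

Which strategy wins depends on the workload: request patterns dominated by multi-partition transactions favor aggregation, while uniformly spread single-partition traffic favors distribution.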