OceanBase Database provides high availability to ensure system reliability. Its high availability design addresses the following two aspects:
- When a node in OceanBase Database fails or is disconnected, OceanBase Database automatically restores database availability. This shortens the impact on business and prevents service interruptions caused by node failures.
- When a minority of nodes fail, the data on those nodes can no longer be read. In this case, OceanBase Database ensures that no business data is lost.
Conventional databases rely on solutions such as primary/standby or one-primary-multiple-standby deployments. OceanBase Database, in contrast, runs on cost-effective servers of relatively low individual reliability. It stores the same data on more than half of the servers in a group of multiple (>= 3) servers, for example, on 2 out of 3 servers or on 3 out of 5 servers. A write transaction takes effect only after more than half of the servers have received it. As a result, OceanBase Database does not lose data when a minority of servers fail.
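As a minimal sketch of the majority rule described above (not OceanBase code), the following Python snippet shows why a write acknowledged by more than half of the servers survives the failure of any minority:

```python
# Minimal sketch: a write is durable once a majority of replicas has
# acknowledged it, so any surviving majority still holds at least one copy.

def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1

def write_is_durable(acks: int, n_replicas: int) -> bool:
    """A write commits only after a majority has acknowledged it."""
    return acks >= majority(n_replicas)

# 3 replicas: 2 acks commit the write; losing any 1 replica keeps a copy.
assert majority(3) == 2 and write_is_durable(2, 3)
# 5 replicas: 3 acks commit the write; losing any 2 replicas keeps a copy.
assert majority(5) == 3 and write_is_durable(3, 5)
```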

When the primary database fails, conventional databases usually need external tools or manual intervention to promote a standby database to primary. OceanBase Database, however, adopts the Paxos protocol at the underlying layer: when the primary fails, the remaining replicas automatically elect a new primary and continue to provide services.
Distributed election
OceanBase Database organizes Paxos protocol groups by partition. Each partition has multiple replicas, and OceanBase Database automatically creates the Paxos group for a partition by maintaining its member group relationship. Based on these partition-level Paxos groups, OceanBase Database also automatically selects the primary replica.
During normal operation, each replica in a Paxos protocol group plays one of two roles:
- Leader: the primary replica, which provides strongly consistent read and write services as well as other capabilities such as distributed transactions.
- Follower: a secondary replica, which votes on the logs synchronized from the leader and provides weak-consistency read services.
The distributed election protocol maintains the leader's authority through a lease that the leader issues periodically. When a failure occurs, such as the failure of the server hosting the leader replica or a network exception between the leader and the majority, the leases recorded on the leader and followers expire. After expiration, the connected follower replicas elect a new leader to provide read and write services.
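The following sketch illustrates the lease idea in simplified form; the class, method names, and lease duration are assumptions for illustration, not OceanBase internals:

```python
# Illustrative lease check: the leader renews its lease periodically; if
# followers stop seeing renewals before the lease expires, they consider
# the leader lost and start a new election.
import time

LEASE_DURATION = 10.0   # seconds; hypothetical value

class ReplicaView:
    def __init__(self):
        self.lease_expires_at = 0.0

    def on_lease_renewal(self, now: float) -> None:
        """Called when a lease message from the current leader arrives."""
        self.lease_expires_at = now + LEASE_DURATION

    def leader_is_alive(self, now: float) -> bool:
        """While the lease is valid, followers do not elect a new leader."""
        return now < self.lease_expires_at

view = ReplicaView()
view.on_lease_renewal(now=time.monotonic())
if not view.leader_is_alive(now=time.monotonic()):
    # Lease expired: the connected followers would start an election here.
    pass
```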
The distributed election protocol also supports leader switchover. You can change the leader through RootService or by running the "alter system" command.
The distributed election protocol provides multiple election priorities. The following priorities apply to both leader election and leader switchover (see the sketch after this list):
- The version number of the member group.
- The completeness of the replica's synchronized logs.
- The primary region information configured in Locality.
- Basic replica status, such as whether the disk is faulty.
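The sketch below shows, under assumed field names, how such priorities could be combined into a single comparison: candidates are ordered lexicographically, with the member group version compared first and basic health last. It is a conceptual illustration, not OceanBase's actual election code.

```python
# Illustrative only: field names and the exact ordering are assumptions,
# not OceanBase internals. Candidates are compared lexicographically.
from dataclasses import dataclass

@dataclass
class Candidate:
    member_list_version: int   # version number of the member group
    last_log_id: int           # completeness of the replica's synchronized logs
    in_primary_region: bool    # matches the primary region set in Locality
    disk_healthy: bool         # basic replica status

    def priority(self):
        # A higher tuple wins the election.
        return (self.member_list_version, self.last_log_id,
                self.in_primary_region, self.disk_healthy)

def pick_leader(candidates):
    return max(candidates, key=Candidate.priority)

a = Candidate(member_list_version=3, last_log_id=120, in_primary_region=True, disk_healthy=True)
b = Candidate(member_list_version=3, last_log_id=118, in_primary_region=True, disk_healthy=True)
assert pick_leader([a, b]) is a   # a has more complete logs
```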
Multi-replica log synchronization
Members of a Paxos group ensure data persistence by strongly synchronizing redo logs to a majority of the replicas.
When the leader of a partition needs to commit a log, it copies the log into its local memory buffer and asynchronously sends the log to the followers.
On each replica, a backend write thread flushes the log to disk and then marks it as persisted. A follower sends a confirmation message to the leader after it has written the log to disk.
Once the leader has collected disk-write confirmations from a majority of the replicas, counting itself, it considers the log strongly synchronized and does not wait for feedback from the remaining followers. At this point, the leader reports a successful commit to the upper layer.
If a follower replica fails and loses logs, it automatically recovers the missing logs from the leader replica after the replica is restored.
The whole log commit process only requires that a majority of the replicas be available.
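The commit flow described above can be condensed into the following self-contained sketch; the class and method names are illustrative, not OceanBase code:

```python
# Condensed sketch of the redo log commit flow (illustrative names only).

class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = []    # in-memory log buffer
        self.disk = []      # logs flushed to disk

    def receive(self, log):
        """Follower path: buffer the replicated log, flush it, then ack."""
        self.buffer.append(log)
        self.disk.append(log)   # backend write thread flushes the log to disk
        return True             # disk-write confirmation sent back to the leader

def commit(leader: Replica, followers: list, log: str) -> bool:
    # 1. Leader copies the log into its local memory buffer.
    leader.buffer.append(log)
    # 2. Log is shipped to followers (asynchronously in the real system).
    acks = sum(1 for f in followers if f.receive(log))
    # 3. Leader's own backend write thread flushes the log to disk.
    leader.disk.append(log)
    acks += 1
    # 4. Commit succeeds once a majority of all replicas has flushed the log;
    #    the leader does not wait for the remaining followers.
    return acks >= (len(followers) + 1) // 2 + 1

assert commit(Replica("leader"), [Replica("f1"), Replica("f2")], "redo-1")
```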
Multi-log stream system
As mentioned earlier, each partition is an independent Paxos group and each Paxos group is an independent log stream.
In a Paxos group, each Paxos instance corresponds to one log. Each log has a number that is unique within the group, called the LogID, which starts from 1. Such a sequence of logs is called a log stream.
Each OBServer process serves multiple partitions, and the log streams of these partitions together form a multi-log-stream system.
To reduce the disk write overhead, multiple log streams are written to the same log file.
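The following sketch illustrates the multi-log-stream idea; the data structures and names are assumptions, not the on-disk layout that OceanBase Database actually uses:

```python
# Illustrative multiplexing of several log streams into one shared file:
# every partition owns its own log stream with LogIDs starting at 1, and
# entries from many streams are appended to a single shared log file to
# reduce disk write overhead.
from collections import defaultdict

class SharedLogFile:
    def __init__(self):
        self.entries = []                          # one shared, append-only file
        self.next_log_id = defaultdict(lambda: 1)  # per-stream LogID counter

    def append(self, stream_id: int, payload: bytes) -> int:
        log_id = self.next_log_id[stream_id]       # LogID is unique within its stream
        self.next_log_id[stream_id] += 1
        self.entries.append((stream_id, log_id, payload))
        return log_id

f = SharedLogFile()
f.append(1001, b"insert ...")   # partition 1001, LogID 1
f.append(1002, b"update ...")   # partition 1002, LogID 1
f.append(1001, b"commit")       # partition 1001, LogID 2
```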
Solutions to node failures
When an OBServer node fails, the system responds based on the value of the server_permanent_offline_time parameter, which defaults to 3600s.
To avoid frequent data migration, OceanBase Database takes no action if the failure lasts less than the value of this parameter.
If the failure lasts longer than or equal to the configured value, OceanBase Database marks the node as permanently offline and re-creates replicas for all partitions that resided on the failed node. The new replicas still satisfy the locality constraint of the corresponding tenant.
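The decision rule described above can be summarized in a short sketch; the parameter name comes from the text, while the function itself is illustrative:

```python
# Illustrative decision rule for a failed node (not OceanBase code).
SERVER_PERMANENT_OFFLINE_TIME = 3600  # seconds, default value

def handle_node_failure(offline_seconds: float) -> str:
    if offline_seconds < SERVER_PERMANENT_OFFLINE_TIME:
        # Short outage: do nothing to avoid unnecessary data migration.
        return "wait"
    # Long outage: mark the node as permanently offline and re-create
    # replicas for its partitions elsewhere, still honoring the tenant's
    # locality constraints.
    return "permanently_offline_and_replicate"

assert handle_node_failure(600) == "wait"
assert handle_node_failure(7200) == "permanently_offline_and_replicate"
```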