Single point of failure (SPOF) is a major issue in the design of distributed systems that must be handled at the earliest opportunity. If you want the system to work as expected after a node fails, you must deploy multiple replicas on the node in leader/follower mode. The leader replica is elected among multiple replicas by using the election protocol. If the leader replica fails, the system automatically switches over to a follower replica by using the election protocol.
The leader replica is called a leader, and the follower replicas are called followers.
The election protocol for distributed systems must meet the following requirements:
Correctness
When a replica considers itself a leader, the other replicas should not consider themselves a leader at the same time. The situation where two replicas consider themselves a leader is called split-brain. The election mechanism of the Raft protocol elects only one leader for each term to prevent split-brain issues. However, the Native Raft protocol may elect multiple leaders at the same time. Leaders are assigned to different terms, and leaders of smaller terms may have expired without awareness. If the native Raft protocol is used, the system must read the content of the majority of replicas to ensure that the most recent data is read. OceanBase Database provides the Lease mechanism to avoid access to the majority of replicas. This ensures that only one replica considers itself a leader at a time.
Activeness
If the majority of replicas in the cluster are active even if the leader is down, a replica among the active ones should be elected as a leader in a limited period.
On the basis of correctness and activeness, the election protocol of OceanBase Database provides a priority mechanism and a switchover mechanism. The priority mechanism ensures that, if no leader exists, the replica with the highest priority among available replicas is selected as the leader. The switchover mechanism ensures that the leader can be switched to a specified replica.
Lease period
When the lease expires, the leader cannot be connected to the majority of replicas, and a new round of elections will be initiated to elect the next leader.
The default lease period is 10s. You can modify the value as needed.
Notice
A longer lease period means higher tolerance for message delay and jitter, and better robustness, but will also result in a longer downtime recovery time.
A shorter lease period means lower tolerance for message delay and jitter, and poorer robustness, but will have a shorter downtime recovery time.
Priority
During the election process, the leader can get the priorities of the followers upon lease renewal. When the priority of a follower is higher than that of the leader, the leader will actively switch over the leader role to the follower.
The priority from high to low includes:
The user-specified leader.
The leader zone.
Leader switch
You can use the ALTER SYSTEM SWITCH REPLICA statement to control the distribution of log stream leaders. The leader role is switched in a smooth way internally, so that the external is unaware that the position of the leader has changed.
You can also change the primary zone. In that case, the leader will be aware of the change of the primary zone during the election and will switch the leader role over to the target follower.