Single point of failure (SPOF) is a major issue in the design of distributed systems and must be addressed as early as possible. For the system to keep working as expected after a node fails, you must deploy multiple replicas on different nodes in primary/standby mode. The primary replica is elected from among the replicas by the election protocol. If the primary replica fails, the election protocol automatically switches service over to a standby replica.
The primary replica is called a leader, and the standby replicas are called followers.
The election protocol for distributed systems must meet the following requirements:
Correctness
When a replica considers itself the leader, no other replica should consider itself the leader at the same time. The situation where two replicas simultaneously consider themselves the leader is called split-brain. The election mechanism of the Raft protocol elects at most one leader per term, which prevents split-brain within a term. However, under the native Raft protocol, multiple leaders may still be alive at the same time: each belongs to a different term, and a leader of a smaller term may have been superseded without being aware of it. For this reason, a system that uses the native Raft protocol must read from a majority of replicas to ensure that the most recent data is returned. OceanBase Database provides a lease mechanism to avoid reading from a majority of replicas: the lease ensures that at most one replica considers itself the leader at any point in time.
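The following minimal sketch illustrates how such a lease check might work. All names and constants here (Replica, grant_lease, the lease shrink formula) are assumptions for illustration, not OceanBase's actual implementation.

    import time

    MAX_TST = 0.200    # assumed upper bound on one-way network latency, in seconds
    MAX_DELAY = 0.100  # assumed upper bound on clock deviation from the NTP server

    class Replica:
        """Illustrative replica that acts as leader only while its lease is valid."""

        def __init__(self):
            self.lease_expire_ts = 0.0  # local time after which leadership lapses

        def grant_lease(self, vote_ts, lease_length):
            # A lease granted by a majority at vote_ts is shrunk by the worst-case
            # clock deviation and message delay, so that two replicas can never
            # hold overlapping valid leases.
            self.lease_expire_ts = vote_ts + lease_length - 2 * MAX_DELAY - MAX_TST

        def is_leader(self):
            # Leadership is claimed only while the conservatively shrunk lease
            # has not expired on the local clock.
            return time.time() < self.lease_expire_ts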
Activeness
As long as a majority of replicas in the cluster remain active, even if the leader fails, one of the active replicas must be elected as the new leader within a bounded period of time.
Building on correctness and activeness, the election protocol of OceanBase Database provides a priority mechanism and a switchover mechanism. The priority mechanism ensures that, when no leader exists, the available replica with the highest priority is elected as the leader. The switchover mechanism allows the leader role to be handed over to a specified replica.
Assumptions and error tolerance
OceanBase Database assumes that the physical environment where it is running meets the following requirements:
The one-way network latency between any two OBServers is less than a specified upper limit, called MAX_TST.
The clocks of all OBServers in the cluster are synchronized with the Network Time Protocol (NTP) server, and the clock deviation between each OBServer and the NTP server is less than a specified maximum, called MAX_DELAY.
In OceanBase Database, MAX_TST is set to 200 ms and MAX_DELAY is set to 100 ms.
Because the clock deviation between each OBServer and the NTP server does not exceed 100 ms, the triangle inequality guarantees that the clock deviation between any two OBServers does not exceed 200 ms.
If the environment requirements on clock deviation and network latency are not met, a leader may fail to be elected, that is, activeness is lost. Correctness, however, is never lost: the check on the message receive window ensures that at most one leader exists regardless of whether the environment requirements are met.
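As an illustration, the receive-window check can be sketched as follows: a replica processes an election message only if the message's send timestamp is plausible given its own clock, the worst-case clock deviation, and the worst-case network latency. The function and constants below are assumptions for the sketch, not the actual implementation.

    MAX_TST_MS = 200    # assumed max one-way network latency (ms)
    MAX_DELAY_MS = 100  # assumed max clock deviation from the NTP server (ms)

    def in_receive_window(send_ts_ms, local_ts_ms):
        """Accept an election message only if its send timestamp is plausible.

        With every clock within MAX_DELAY of the NTP server, two local clocks
        differ by at most 2 * MAX_DELAY, and delivery takes between 0 and
        MAX_TST. A message whose timestamp falls outside the resulting window
        is discarded, preserving correctness even when the environment
        assumptions are violated.
        """
        earliest = local_ts_ms - 2 * MAX_DELAY_MS - MAX_TST_MS
        latest = local_ts_ms + 2 * MAX_DELAY_MS
        return earliest <= send_ts_ms <= latest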
Priorities
During the first phase of a leader election, all replicas broadcast their priorities at the same time. Each replica can therefore compare all candidates with full information and vote for the same replica. This avoids vote splitting and ensures that a leader is elected predictably and as expected.
The priority of a replica is determined by comparing the following fields in order; the note in parentheses after each field describes how its values are ranked. A sketch of this comparison follows the list.
Version number of a member list (a higher version number indicates a higher priority)
Attribute of the region where the replica belongs (primary region > non-primary region)
Memory status (not full > full)
Application initialization status (initialized > uninitialized)
Election blocklist status (not blocklisted > blocklisted)
Rebuild status (required > not required)
Data disk status (without errors > with errors)
Tenant memory status (not full > full)
Clog disk status (without errors > with errors)
Replica type (F replica > non-F replica)
Clog disk status (not full > full)
Clog count (a larger count indicates a higher priority)
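Because the fields are compared in sequence, the overall comparison is lexicographic. The following sketch expresses the priority as a Python sort key; the field names are illustrative assumptions, and each boolean is oriented so that True is the preferred state listed above.

    from dataclasses import dataclass

    @dataclass
    class ReplicaPriority:
        # Field names are illustrative; each bool is True in the preferred state.
        member_list_version: int   # higher is better
        in_primary_region: bool
        memory_not_full: bool
        app_initialized: bool
        not_blocklisted: bool
        rebuild_required: bool     # per the list above: required > not required
        data_disk_ok: bool
        tenant_memory_not_full: bool
        clog_disk_ok: bool
        is_full_replica: bool      # F replica > non-F replica
        clog_disk_not_full: bool
        clog_count: int            # higher is better

        def sort_key(self):
            # Python compares tuples lexicographically, which matches the
            # field-by-field comparison described above.
            return (self.member_list_version, self.in_primary_region,
                    self.memory_not_full, self.app_initialized,
                    self.not_blocklisted, self.rebuild_required,
                    self.data_disk_ok, self.tenant_memory_not_full,
                    self.clog_disk_ok, self.is_full_replica,
                    self.clog_disk_not_full, self.clog_count)

    # The replica with the greatest key wins the vote:
    # leader = max(replicas, key=ReplicaPriority.sort_key)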
Leader switchover
You can specify a leader for a partition; RootService then sends a message to the partition to switch its leader to the specified replica. This switchover method is efficient.
You can also change the primary region; RootService then notifies the affected partitions to switch their leaders to replicas in the new primary region.
After you enable the enable_auto_leader_switch parameter, RootService uses this switchover mechanism to evenly distribute leaders across the primary zone.
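As a rough illustration of even distribution, the sketch below assigns partition leaders round-robin across the servers of the primary zone. The function and names are hypothetical; RootService's actual balancing strategy is more involved.

    from itertools import cycle

    def balance_leaders(partitions, primary_zone_servers):
        # Hypothetical sketch: spread partition leaders evenly by assigning
        # them round-robin to the servers of the primary zone.
        targets = cycle(primary_zone_servers)
        return {p: next(targets) for p in partitions}

    # Example: six partitions over three servers yields two leaders per server.
    plan = balance_leaders(range(6), ["observer1", "observer2", "observer3"])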