High availability of RootService
In an OceanBase cluster, RootService provides various management services for the cluster. RootService uses the Paxos protocol to ensure high availability. You can specify the number of RootService replicas in the cluster configuration. The replicas elect a leader based on the Paxos protocol, and the leader provides RootService for the cluster. If the current leader fails, a new leader is elected from the remaining replicas to continue to provide RootService. Note that RootService replicas are not independent processes. RootService is a service that is started on specific nodes.
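As a rough illustration of why the replica count matters, the sketch below works out how many replica failures a majority-based election can survive. It relies on the general property that Paxos needs a majority of replicas to agree. The text above does not state this for RootService specifically, and the function names are hypothetical.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return replica_count // 2 + 1


def tolerated_failures(replica_count: int) -> int:
    """Replicas that can fail while a majority is still reachable."""
    return replica_count - majority(replica_count)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"replicas={n} majority={majority(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under this assumption, an even replica count tolerates no more failures than the odd count just below it, which is why odd counts are the usual choice.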
Node status monitoring
As the central control service of the cluster, RootService is responsible for node management in the cluster. Each node regularly (every 2 seconds) reports its process status to RootService in the form of a heartbeat data packet. RootService monitors the heartbeat data packets to obtain the status of the observer process.
Parameters related to node heartbeat status
lease_time: When RootService does not receive any heartbeat data packets from a node for a cumulative period longer than lease_time, RootService considers that the observer process is temporarily disconnected. In this case, RootService marks the heartbeat status of the node as lease_expired.
server_permanent_offline_time: When RootService does not receive any heartbeat data packets from a node for a cumulative period longer than server_permanent_offline_time, RootService considers that the observer process is disconnected. In this case, RootService marks the heartbeat status of the node as permanent_offline.
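The two thresholds can be read as a simple classification over the time since the last heartbeat. The following is a minimal sketch of that reading, not RootService's actual implementation. The function name, the ALIVE label, and the threshold values in the usage lines are assumptions made for illustration. Only lease_expired and permanent_offline are named in the text.

```python
from enum import Enum


class HeartbeatStatus(Enum):
    ALIVE = "alive"  # hypothetical label for "heartbeats are arriving"
    LEASE_EXPIRED = "lease_expired"
    PERMANENT_OFFLINE = "permanent_offline"


def classify_heartbeat(seconds_since_last_heartbeat: float,
                       lease_time: float,
                       server_permanent_offline_time: float) -> HeartbeatStatus:
    """Map the silence duration of a node to a heartbeat status."""
    # Check the longer threshold first: a permanently offline node has
    # necessarily exceeded lease_time as well.
    if seconds_since_last_heartbeat > server_permanent_offline_time:
        return HeartbeatStatus.PERMANENT_OFFLINE
    if seconds_since_last_heartbeat > lease_time:
        return HeartbeatStatus.LEASE_EXPIRED
    return HeartbeatStatus.ALIVE


if __name__ == "__main__":
    # Illustrative thresholds in seconds, not values taken from the text.
    for silence in (2, 30, 4000):
        print(silence, classify_heartbeat(silence, 10, 3600).value)
```

The comparison is strict ("longer than"), so a silence exactly equal to a threshold does not yet change the status in this sketch.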
Handling of RootService for OBServer node failures
RootService can determine the status of a node based on the heartbeat data packet received from the node:
A heartbeat data packet is received, and the disk status in the packet is normal. In this case, RootService considers that the node is working normally.
A heartbeat data packet is received, but the disk status in the packet is abnormal. In this case, RootService considers that the observer process is still running but the node disk has failed, and tries to move all leader replicas off the node.
No heartbeat data packet is received, but the loss has lasted only a short time and the heartbeat status of the node is lease_expired. In this case, RootService only sets the status of the node to inactive and does not perform other processing actions.
No heartbeat data packet is received, and the loss has lasted longer than server_permanent_offline_time, so the heartbeat status of the node is permanent_offline. In this case, RootService deletes the data replicas on the node and adds the data replicas to other available nodes to ensure the completeness of the data replica Paxos member group.
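The four cases above amount to a small decision table. The sketch below encodes that table under stated assumptions: the enum, its member names, and the function signature are hypothetical, and the NO_ACTION_YET branch (heartbeats missing but lease_time not yet exceeded) is an assumption, because the text does not describe that window.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE_NODE_HEALTHY = "node is working normally"
    MOVE_LEADERS_AWAY = "try to move all leader replicas off the node"
    MARK_INACTIVE = "set node status to inactive only"
    REBUILD_REPLICAS = "delete replicas on the node and add them to other nodes"
    NO_ACTION_YET = "heartbeat missing but lease not expired (assumed case)"


def choose_action(heartbeat_received: bool,
                  disk_ok: bool = True,
                  heartbeat_status: Optional[str] = None) -> Action:
    """Pick the handling described in the text for one node."""
    if heartbeat_received:
        # The observer process is running; only the disk status matters.
        return Action.NONE_NODE_HEALTHY if disk_ok else Action.MOVE_LEADERS_AWAY
    if heartbeat_status == "permanent_offline":
        return Action.REBUILD_REPLICAS
    if heartbeat_status == "lease_expired":
        return Action.MARK_INACTIVE
    return Action.NO_ACTION_YET


if __name__ == "__main__":
    print(choose_action(True, disk_ok=False).value)
    print(choose_action(False, heartbeat_status="lease_expired").value)
```

Keeping the lease_expired case free of replica changes reflects the text: a short outage only marks the node inactive, and replicas are rebuilt elsewhere only after the node is permanent_offline.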
Recovery of a failed node
A failed node in the cluster can be in one of the following conditions:
The node can be restarted. In this case, regardless of its heartbeat status, the node can provide services again once it has restarted and RootService receives heartbeat data packets from it.
The node cannot be restarted. In this case, after the database administrator confirms that the node cannot be restarted, the node is deleted from the cluster. You can delete the node by performing the alter system delete server operation.