RootService HA
In an OceanBase cluster, RootService provides various management services for the cluster. RootService implements high availability (HA) based on the Paxos protocol. You can specify the number of replicas of RootService when you configure a cluster. The replicas of RootService elect a leader based on the Paxos protocol. The leader replica of RootService then provides RootService services for the cluster. If the current leader of RootService stops acting as the leader due to a failure, the remaining RootService replicas elect a new leader. The new leader provides RootService services to achieve RootService HA. A RootService replica is not an independent process, but a service started on specific OBServer nodes.
Monitoring of the OBServer node status
As the central management service of a cluster, RootService manages the OBServer nodes in the cluster. The OBServer nodes report their process status to RootService every two seconds by using heartbeat packets. RootService monitors the heartbeat packets of the OBServer nodes to obtain the status of the current observer processes.
Parameters for managing the heartbeat status of OBServer nodes
lease_time: If RootService does not receive a heartbeat packet from an OBServer node within the period specified by
lease_time, RootService considers the corresponding observer process temporarily disconnected and marks its heartbeat status aslease_expired.server_permanent_offline_time: If RootService does not receive a heartbeat packet from an OBServer node within the period specified by
server_permanent_offline_time, RootService considers the corresponding observer process disconnected and marks its heartbeat status aspermanent_offline.
OBServer node failure handling by RootService
RootService can obtain the status information of an OBServer node in the following scenarios based on heartbeat packets:
The OBServer node properly sends heartbeat packets, and the disk status is normal in the heartbeat packets. In this case, RootService considers that the OBServer node is working properly.
The OBServer node properly sends heartbeat packets, but the disk status is abnormal in the heartbeat packets. In this case, RootService considers that the observer process still exists, but the disk of the OBServer node fails. In addition, RootService switches all leaders on this OBServer node to another OBServer node.
No heartbeat packet is received from the OBServer node within the period specified by
lease_time, and the heartbeat status of the OBServer node is marked aslease_expired. In this case, RootService sets the status of the OBServer node to inactive and does not take other measures.No heartbeat packet is received from the OBServer node within the period specified by
server_permanent_offline_time, and the heartbeat status of the OBServer node is marked aspermanent_offline. In this case, RootService removes the data replicas on this OBServer node from the Paxos member group, and adds the removed data replicas to another active OBServer node to ensure that the Paxos member group of data replicas is complete.
Handling of failed OBServer nodes
The following list describes two cases in which an OBServer node fails and how the failure can be handled:
The failed OBServer node can be restarted. After you restart the OBServer node, it can resume services when the heartbeat connection to RootService is restored, regardless of the previous heartbeat status of the OBServer node.
The failed OBServer node is damaged and cannot be restarted. In this case, you need to contact a database administrator (DBA) to remove the OBServer node from the cluster. For more information about the removal procedure, see the description of the operation to delete a server.