RootService HA
In an OceanBase cluster, RootService hosts various cluster management services. RootService achieves high availability (HA) by using the Paxos protocol. The number of RootService replicas is specified in the cluster configuration. These replicas elect a leader based on Paxos, and the elected leader provides RootService services for the cluster. If the current leader stops acting as the leader because of a fault, the remaining RootService replicas elect a new leader, which takes over RootService services. This is how RootService HA is achieved. A RootService replica is not an independent process, but a service started on specific OBServers.
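The relationship between the replica count and fault tolerance can be sketched with the general majority rule of Paxos. The following Python snippet is only an illustration of that rule, not OceanBase code. The function names are hypothetical.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a Paxos majority."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return replica_count // 2 + 1


def can_elect_leader(replica_count: int, healthy_replicas: int) -> bool:
    """A new leader can be elected only if a majority of replicas is healthy."""
    return healthy_replicas >= majority(replica_count)


def tolerated_failures(replica_count: int) -> int:
    """Number of replicas that can fail while a majority remains."""
    return replica_count - majority(replica_count)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, "replicas tolerate", tolerated_failures(n), "failure(s)")
```

Under this rule, an even replica count tolerates no more failures than the odd count just below it, which is why odd replica counts are the usual choice.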
OBServer status monitoring
As the central management service of a cluster, RootService manages the OBServers in the cluster. Each OBServer reports its process status to RootService every 2 seconds by using heartbeat packets. RootService monitors these heartbeat packets to obtain the current status of each observer process.
Parameters for OBServer heartbeat status management
lease_time: If RootService does not receive a heartbeat packet from an OBServer within the period specified by lease_time, RootService considers the observer process temporarily disconnected and marks its heartbeat status as lease_expired.
server_permanent_offline_time: If RootService does not receive a heartbeat packet from an OBServer within the period specified by server_permanent_offline_time, RootService considers the observer process disconnected and marks its heartbeat status as permanent_offline.
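The way the two parameters determine the heartbeat status can be sketched as follows. This is a minimal Python illustration, not the RootService implementation. The function name, the `alive` status name, and the sample values passed in the usage lines are assumptions made for the example. Treating the exact boundary value as already expired is also an assumption.

```python
from enum import Enum


class HeartbeatStatus(Enum):
    ALIVE = "alive"  # hypothetical name for the normal state
    LEASE_EXPIRED = "lease_expired"
    PERMANENT_OFFLINE = "permanent_offline"


def classify_heartbeat(seconds_since_last_heartbeat: float,
                       lease_time: float,
                       server_permanent_offline_time: float) -> HeartbeatStatus:
    """Map the time since the last heartbeat packet to a heartbeat status."""
    # Check the longer threshold first so that it takes precedence.
    if seconds_since_last_heartbeat >= server_permanent_offline_time:
        return HeartbeatStatus.PERMANENT_OFFLINE
    if seconds_since_last_heartbeat >= lease_time:
        return HeartbeatStatus.LEASE_EXPIRED
    return HeartbeatStatus.ALIVE


if __name__ == "__main__":
    # Illustrative values only: lease_time = 10 s, server_permanent_offline_time = 3600 s.
    for elapsed in (2, 30, 7200):
        print(elapsed, classify_heartbeat(elapsed, 10, 3600).value)
```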
OBServer node fault handling by RootService
Based on heartbeat packets, RootService determines that an OBServer is in one of the following scenarios and handles it accordingly:
The OBServer properly sends heartbeat packets, and the disk status is normal in the heartbeat packets. In this case, RootService considers the OBServer to be working properly.
The OBServer properly sends heartbeat packets, but the disk status is abnormal in the heartbeat packets. In this case, RootService considers that the observer process still exists, but the disk of the OBServer is faulty. RootService therefore switches all leaders on this OBServer over to another OBServer.
No heartbeat packet is received from the OBServer within the period specified by lease_time, and the heartbeat status of the OBServer is marked as lease_expired. In this case, RootService sets the status of the OBServer to inactive and does not take other measures.
No heartbeat packet is received from the OBServer within the period specified by server_permanent_offline_time, and the heartbeat status of the OBServer is marked as permanent_offline. In this case, RootService removes the data replicas on this OBServer from the Paxos member group, and adds the removed data replicas to another active OBServer to ensure that the Paxos member group of data replicas is complete.
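The four scenarios above amount to a small decision table. The following Python sketch restates it for illustration only. The function name, the `alive` status string, and the action labels are hypothetical and do not correspond to actual RootService identifiers.

```python
def plan_fault_action(heartbeat_status: str, disk_ok: bool = True) -> str:
    """Return the action described for each OBServer scenario.

    heartbeat_status is "alive" (assumed name for a server that sends
    heartbeats normally), "lease_expired", or "permanent_offline".
    disk_ok is only meaningful while heartbeat packets are still arriving.
    """
    if heartbeat_status == "permanent_offline":
        # Remove the replicas from the Paxos member group and
        # add them to another active OBServer.
        return "remove_replicas_and_replenish"
    if heartbeat_status == "lease_expired":
        # Mark the OBServer as inactive and take no other measures.
        return "mark_inactive"
    if heartbeat_status == "alive":
        if disk_ok:
            return "none"
        # The process exists but the disk is faulty.
        return "switch_leaders_to_another_observer"
    raise ValueError(f"unknown heartbeat status: {heartbeat_status!r}")


if __name__ == "__main__":
    print(plan_fault_action("alive", disk_ok=False))
    print(plan_fault_action("lease_expired"))
```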
Recovery of faulty OBServers
The following list describes two cases in which an OBServer is faulty:
The faulty OBServer can be restarted. After you restart the OBServer, it can resume services when the heartbeat connection to RootService is restored, regardless of the previous heartbeat status of the OBServer.
The faulty OBServer is damaged and cannot be restarted. In this case, you need to contact a database administrator (DBA) to remove the OBServer from the cluster. For more information about the removal procedure, see the description of the operation to delete a server.