This topic describes how to handle node failures in emergency situations.
Emergency handling procedure
For a known hardware failure, replace the node directly.
Before the replacement, make sure the cluster has enough healthy OBServer nodes to take over the workload of the failed node, as in the pre-check sketch below.
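The following is a minimal sketch of such a pre-check, assuming an OceanBase 4.x cluster that is reachable over the MySQL protocol and queried with the pymysql client. The connection settings are placeholders, and the layout of the oceanbase.DBA_OB_SERVERS view (including STOP_TIME being NULL for nodes that have not been stopped) is an assumption about the deployed version.

```python
# Minimal pre-check sketch: list OBServer nodes and count healthy nodes per
# zone. Connection settings and the DBA_OB_SERVERS layout are assumptions.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=2881,
                       user="root@sys", password="", database="oceanbase")
try:
    with conn.cursor() as cur:
        # List every OBServer node with its zone and status.
        cur.execute(
            "SELECT SVR_IP, SVR_PORT, ZONE, STATUS, STOP_TIME "
            "FROM oceanbase.DBA_OB_SERVERS"
        )
        servers = cur.fetchall()

    # Count healthy nodes per zone; each zone should keep enough active
    # nodes to take over the workload of the failed server.
    active_per_zone = {}
    for ip, port, zone, status, stop_time in servers:
        print(f"{ip}:{port} zone={zone} status={status}")
        if status == "ACTIVE" and stop_time is None:  # assumes NULL = not stopped
            active_per_zone[zone] = active_per_zone.get(zone, 0) + 1

    for zone, count in active_per_zone.items():
        print(f"zone {zone}: {count} active node(s)")
finally:
    conn.close()
```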
If the observer process exits because of a non-hardware failure or for an unknown reason, for example after a core dump, the emergency handling approach is to try to recover by restarting the process.
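The sketch below illustrates one way to check for and restart the observer process on the affected node. The install path /home/admin/oceanbase, the practice of starting the binary from that directory without extra flags, and the use of pgrep are assumptions about a typical Linux deployment, not a prescribed procedure.

```python
# Sketch: restart the observer process if it is no longer running.
# The install path and start command are assumptions about the deployment.
import subprocess

INSTALL_DIR = "/home/admin/oceanbase"   # assumed install directory


def observer_running() -> bool:
    """Return True if a process named exactly 'observer' is found by pgrep."""
    result = subprocess.run(["pgrep", "-x", "observer"],
                            capture_output=True, text=True)
    return result.returncode == 0


if observer_running():
    print("observer process is alive; no restart needed")
else:
    # A previously configured node typically reloads its settings from its
    # local configuration files, so no startup flags are passed here.
    print("observer process is gone; restarting")
    subprocess.run(["./bin/observer"], cwd=INSTALL_DIR, check=True)
```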
If you cannot immediately confirm whether the node failure is caused by a hardware issue (the cause may be network jitter, a power outage, or something else), you can choose not to replace the node right away. Instead, resolve the other issues first and then try to restart the observer process. In this case, pay close attention to the setting of the server_permanent_offline_time parameter.
The server_permanent_offline_time parameter specifies how long an OBServer node can remain unavailable before OceanBase marks it as permanently offline and removes it from the member lists. For example, suppose server_permanent_offline_time is set to 2 hours in a cluster. If a failed node recovers within 2 hours, the data on the node is still valid, and only the incremental data generated during the downtime needs to be synchronized. If the downtime exceeds 2 hours, RootService removes the node from the Paxos member lists and all replicas on it are cleared, so the node must then restore all of its data from other replicas.
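The sketch below shows how an operator might read the current value of server_permanent_offline_time before deciding whether the node can be recovered within the window. It again assumes an OceanBase 4.x cluster, the pymysql client, and the availability of the oceanbase.GV$OB_PARAMETERS view; the ALTER SYSTEM statement that enlarges the window is left commented out as an optional, version-dependent adjustment.

```python
# Sketch: read the current permanent-offline window from GV$OB_PARAMETERS.
# Connection settings and the view name are assumptions about a 4.x cluster.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=2881,
                       user="root@sys", password="", database="oceanbase")
try:
    with conn.cursor() as cur:
        # Read the current permanent-offline window (e.g. '3600s' or '2h').
        cur.execute(
            "SELECT DISTINCT VALUE FROM oceanbase.GV$OB_PARAMETERS "
            "WHERE NAME = 'server_permanent_offline_time'"
        )
        row = cur.fetchone()
        print(f"server_permanent_offline_time = {row[0] if row else 'unknown'}")

        # If the node is expected to stay down longer than this window, the
        # window can be enlarged so the replicas on it are not cleared.
        # Uncomment to apply (assumed to take effect without a restart):
        # cur.execute("ALTER SYSTEM SET server_permanent_offline_time = '4h'")
finally:
    conn.close()
```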