Node failures

2023-10-27 09:57:43  Updated

If an OBServer node fails and the database service becomes unavailable due to hardware faults, you must take emergency measures on the faulty OBServer node. This topic describes the emergency procedure to handle node failures.

Emergency procedure

  • The OBServer node has confirmed hardware faults.

    Replace the faulty OBServer node. Before you do so, make sure that the cluster has enough redundant OBServer nodes to provide services in place of the faulty node. For more information, see Replace an OBServer node.

  • The observer process exits due to causes other than hardware faults.

    For example, if a core dump occurs, try to restore the process. If the process cannot be restored, see Unexpected observer process exit for troubleshooting.

If you cannot confirm whether the OBServer node failed due to hardware faults, you can try to fix other exceptions, such as network jitter or power loss, and then restart the observer process. In this case, you need to set the server_permanent_offline_time parameter, which controls how long an OBServer node can remain unavailable before it is removed from the cluster.

Assume that you set the server_permanent_offline_time parameter to 2 hours. If an OBServer node fails and is restored within 2 hours, the data on the OBServer node remains usable and only the incremental data generated during the downtime needs to be synchronized. However, if the OBServer node is restored after more than 2 hours, all replicas on the OBServer node are cleared because RootService has removed the node from the Paxos group. In this case, the OBServer node must obtain all data from other replicas.
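The effect of server_permanent_offline_time on recovery can be illustrated with a small Python model. This is only a sketch of the rule described above, not OceanBase code: the function name recovery_mode and the labels it returns are hypothetical, and treating a downtime exactly equal to the threshold as a full rebuild is an assumption.

```python
from datetime import timedelta


def recovery_mode(downtime: timedelta, permanent_offline_time: timedelta) -> str:
    """Classify how a restored node catches up, per the simplified rule in the text.

    Hypothetical helper for illustration only; it does not query a cluster.
    """
    if downtime < permanent_offline_time:
        # Node returned in time: its replicas are kept and only the data
        # written during the downtime has to be synchronized.
        return "incremental_sync"
    # Node was removed from the Paxos group: its replicas are cleared and
    # all data must be copied again from other replicas.
    return "full_rebuild"


# server_permanent_offline_time set to 2 hours, as in the example above.
threshold = timedelta(hours=2)

print(recovery_mode(timedelta(minutes=45), threshold))  # restored within 2 hours
print(recovery_mode(timedelta(hours=3), threshold))     # restored after 2 hours
```

A larger value gives you more time to repair a node before a full data copy becomes necessary, but the cluster also runs with a missing replica for longer before the node's replicas are replenished elsewhere.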
