Exceptions in the network, server, and disk can be divided into the following three types based on the impact scope:
Server exceptions
IDC exceptions
Regional exceptions
OceanBase Database provides several architectures to meet high availability requirements at different levels. Server-level high availability is ensured by a three-replica architecture. IDC-level high availability is ensured by deploying zones in different IDCs. Regional high availability is ensured by deploying five IDCs in three regions or by deploying primary and secondary OceanBase databases. The exception handling logic differs for each high availability level.
Server exceptions
Most exceptions are server exceptions. Server exceptions can be divided into hardware exceptions and software exceptions. Common hardware exceptions include failures of the CPU, storage hardware, and network interface card (NIC). Common software exceptions include unexpected exits of the operating system and failures of services that OceanBase Database depends on.
OceanBase Database adopts a three-replica architecture. This enables you to handle exceptions based on the number of faulty nodes.
Single-node failure: You can run the STOP SERVER command to stop the faulty server. The server then stops providing external services, and you can diagnose, replace, or repair it as needed. This command is very helpful for operation, administration, and management (OAM) in the online environment. Note that the STOP SERVER command does not stop the server processes or deactivate the server. For more information, see Manage OBServer node status.
Multi-node failure: This failure is rare and usually occurs only when hardware from the same batch fails. To troubleshoot exceptions on multiple nodes, you must first locate the faulty nodes. You can query the __all_server table of the sys tenant and use the IP addresses of the faulty nodes to check whether they belong to the same zone. If the faulty nodes belong to the same zone, run the STOP Zone command to stop the zone. The zone then stops providing external services. This command is effective in emergencies. Similarly, the STOP Zone command does not stop the processes in the zone. Only the stop time is recorded in the __all_server table. For more information, see Start or stop a zone. If the faulty nodes belong to different zones, check the number of active replicas in the OceanBase cluster. Based on the result, you can determine specific actions, for example, replace the faulty nodes, scale out the cluster, or restore the cluster from a backup.
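The triage described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not an OceanBase tool: the Action enum, the plan_action function, and the zone_by_ip dictionary are hypothetical names, and zone_by_ip stands in for the IP-to-zone mapping that you would read from the __all_server table.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the three responses described in this section.
    STOP_SERVER = "stop the single faulty server"
    STOP_ZONE = "stop the zone that contains all faulty servers"
    CHECK_REPLICAS = "check the number of active replicas before acting"


def plan_action(faulty_ips, zone_by_ip):
    """Suggest a first response for a set of faulty servers.

    zone_by_ip is an assumed stand-in for the IP-to-zone mapping
    obtained from the __all_server table of the sys tenant.
    Returns the suggested action and the set of affected zones.
    """
    faulty = set(faulty_ips)
    if not faulty:
        raise ValueError("no faulty servers given")
    # An unknown IP raises KeyError, which signals a stale mapping.
    zones = {zone_by_ip[ip] for ip in faulty}
    if len(faulty) == 1:
        return Action.STOP_SERVER, zones
    if len(zones) == 1:
        return Action.STOP_ZONE, zones
    return Action.CHECK_REPLICAS, zones
```

For example, with two faulty servers that map to the same zone, plan_action returns Action.STOP_ZONE together with that zone. With faulty servers spread over two zones, it returns Action.CHECK_REPLICAS, and the follow-up (replace nodes, scale out, or restore from a backup) remains a manual decision.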
IDC exceptions
In an OceanBase cluster that supports disaster recovery at the IDC level, you can deploy zones in different IDCs. If an exception occurs in an IDC, you can switch traffic by using the zone management features. This method is similar to that for handling exceptions on multiple nodes. You can deploy three IDCs in a region with three or five replicas to implement zone disaster recovery. In a three-IDC architecture, you can deploy one zone in each IDC. You can also deploy two zones in each of two IDCs and one zone in the remaining IDC. Each zone has a replica. In this architecture, OceanBase clusters support zone disaster recovery. The traffic loss during a switchover between IDCs is roughly equivalent to the latency between these IDCs.
You can also deploy two IDCs in a region with three or five replicas to implement zone disaster recovery. In a two-IDC architecture, one IDC hosts a majority of the replicas, while the other IDC hosts a minority of the replicas and serves as the secondary IDC. In this architecture, OceanBase clusters also support zone disaster recovery. The traffic loss during a switchover between IDCs is roughly equivalent to the latency between these IDCs. Traffic loss occurs only when an exception occurs in a replica of the majority.
When an exception occurs in an IDC, you can run the STOP Zone command to stop zones in the faulty IDC from providing external services. After the IDC recovers, you can run the START Zone command to restore the zones in the IDC. For more information, see Start or stop a zone.
If exceptions occur in multiple IDCs and the faulty replicas make up a majority of all replicas, you must stop the cluster and restore it from backups.
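The majority condition behind the three-IDC and two-IDC layouts can be checked with simple arithmetic. The following Python sketch is an illustration only: the majority_survives function, the IDC names, and the replica counts are assumptions chosen to mirror the layouts described in this section, not values read from a real cluster.

```python
def majority_survives(replicas_per_idc, failed_idcs):
    """Return True if the replicas outside the failed IDCs still form a majority.

    replicas_per_idc maps an IDC name to the number of replicas it hosts.
    Both arguments are hypothetical inputs used for illustration.
    """
    total = sum(replicas_per_idc.values())
    alive = sum(
        count for idc, count in replicas_per_idc.items() if idc not in failed_idcs
    )
    # A strict majority requires more than half of all replicas.
    return 2 * alive > total


# Three IDCs with one zone (one replica) in each IDC.
three_idc = {"idc1": 1, "idc2": 1, "idc3": 1}

# Two IDCs: one hosts the majority, the secondary hosts the minority.
two_idc = {"primary_idc": 2, "secondary_idc": 1}
```

With these assumed layouts, majority_survives(three_idc, {"idc1"}) is True, so stopping the zones in the faulty IDC is enough. majority_survives(two_idc, {"primary_idc"}) is False, which corresponds to the case where the cluster must be stopped and restored from backups.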
Regional exceptions
OceanBase Database allows you to deploy five IDCs in three regions or deploy primary and secondary databases across regions to implement cross-region disaster recovery. If you deploy five IDCs in three regions and a regional exception occurs, you can switch the primary zone to a region where the service is running normally. You can use OceanBase Migration Service (OMS) to synchronize OceanBase clusters of the primary and standby links. If you deploy primary and secondary databases across regions and a cluster exception occurs in a region, you can resume your service by switching the databases and applications to a healthy region.
In general, OceanBase cluster exceptions are handled by relying on the high availability that multiple replicas provide. Inbound database traffic is switched to a healthy replica to implement disaster recovery. This gives administrators, OAM engineers, and IDC engineers sufficient time to handle exceptions at different levels and ensures service continuity.