Preparations for a failover |V2.2.77|OceanBase Database| docs|Distributed Database

Preparations for a failover

Last Updated: 2023-08-18 09:26:34

This topic describes the requirements for a failover, the impact of a failover, and the preparations to make beforehand, including selecting a failover command and obtaining cluster information.

When the primary cluster is unavailable, you can switch a standby cluster to the primary role to provide services. You can select an appropriate standby cluster and failover command based on the protection mode, protection level, and synchronization status of each standby cluster.

Select a failover command

Query the PROTECTION_MODE and PROTECTION_LEVEL fields in the V$OB_CLUSTER view of each standby cluster. The failover command to use depends on which of the following two cases applies:

Both PROTECTION_LEVEL and PROTECTION_MODE are MAXIMUM AVAILABILITY or MAXIMUM PROTECTION.

In this case, the primary and standby clusters were in SYNC mode before the primary cluster became unavailable, and the data in all partitions of the standby cluster is consistent and has no voids. You can therefore run the lossless failover command to directly switch the standby cluster to the primary role without data loss.

The lossless failover command is as follows:
```
obclient> ALTER SYSTEM FAILOVER TO cluster_name CLUSTER_ID = cluster_id [FORCE];
```
You can also add the FORCE keyword to skip the protection mode and protection level check of the standby cluster.

PROTECTION_LEVEL and PROTECTION_MODE are set to other values.

In this case, it is impossible to determine whether the data in all partitions of the standby cluster is consistent or whether any partition has data voids. You can run the lossy failover command to switch the standby cluster to the primary role. During the failover, the data in all partitions is restored to a consistent snapshot, which ensures the integrity of the partition data up to the snapshot point. After the failover is completed, the data in the system is consistent but not necessarily lossless.

The lossy failover command is as follows:
```
obclient> ALTER SYSTEM ACTIVATE PHYSICAL STANDBY CLUSTER [FORCE];
```
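The choice between the two commands above can be sketched as a small decision helper. This is only an illustration in Python, not part of OceanBase: the function name `choose_failover_command` and its parameters are hypothetical, and the sketch assumes that "both" means each of the two fields holds one of the two SYNC values. The optional FORCE keyword is deliberately left out.

```python
# Illustrative helper, not part of OceanBase: maps the two V$OB_CLUSTER
# fields described above to the failover statement shown in this topic.
SYNC_VALUES = {"MAXIMUM AVAILABILITY", "MAXIMUM PROTECTION"}


def choose_failover_command(protection_mode, protection_level,
                            cluster_name, cluster_id):
    """Return (kind, statement) for the standby cluster to be promoted."""
    mode = protection_mode.strip().upper()
    level = protection_level.strip().upper()
    # Assumption: "both" means each field holds one of the two SYNC values.
    if mode in SYNC_VALUES and level in SYNC_VALUES:
        stmt = (f"ALTER SYSTEM FAILOVER TO {cluster_name} "
                f"CLUSTER_ID = {cluster_id};")
        return "lossless", stmt
    # Any other combination falls back to the lossy command.
    return "lossy", "ALTER SYSTEM ACTIVATE PHYSICAL STANDBY CLUSTER;"


kind, stmt = choose_failover_command(
    "MAXIMUM AVAILABILITY", "MAXIMUM AVAILABILITY", "obcluster", 2)
print(kind, "->", stmt)
```

The helper only builds the statement text. Running it against a cluster, and deciding whether FORCE is appropriate, remains a manual step.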

Requirements and impact

The requirements and impact of a failover are as follows:

Requirements on the cluster status

Before you perform a failover, make sure that the cluster status meets the following requirements:
- All OBServers in the primary cluster are unavailable, to avoid simultaneous writes to the original and new primary clusters and enable the original primary cluster to connect to the new primary cluster after a lossless failover.
- All OBServers in the target standby cluster for which a lossy failover is to be performed are in the ACTIVE state, to ensure successful execution of the failover command.

Impact on the protection mode

After the failover, the clusters involved in the failover are automatically set to the maximum performance mode, regardless of the original protection mode. You need to reconfigure the protection mode.

Impact on indexes

During the failover, OceanBase Database automatically deletes invalid and redundant indexes. During a lossless failover, some indexes that were valid in the original primary cluster may be deleted from the new primary cluster. This is because indexes in the standby cluster are created asynchronously and may therefore be inconsistent with those in the primary cluster. As a result, even a lossless failover cannot avoid index losses.

Impact on replicas

The failover does not apply to the replicas being created. When a partition is created for a standby cluster, metadata is synchronized from the primary cluster. If the primary cluster is unavailable, the replica being created for the standby cluster may remain in the standby restore state.

If a replica is being created, the failover may get stuck until timeout. To ensure a successful failover, we recommend that you run the following SQL command to query partition replicas in the standby restore state.

Notice

Before the query, make sure that the primary cluster is unavailable. Otherwise, the query result may be inaccurate.
```
obclient> SELECT * FROM __ALL_VIRTUAL_META_TABLE WHERE IS_RESTORE = 100;
```
Valid values of the IS_RESTORE field:
- 0: indicates a normal replica.
- 100: indicates a replica in the standby restore state.

Check the query result:
- If all replicas of a partition are in the standby restore state, the failover process skips this partition without requiring your intervention, but you need to handle this partition manually after the failover is completed.
- If some replicas, including the leader, of a partition have been created and others are in the standby restore state, we recommend that you wait until the replicas in the standby restore state are created and then perform the failover.
- If only a minority of the replicas of a partition have been restored and the majority are in the standby restore state, the failover may get stuck. In this case, you need to forcibly delete all replicas of the partition so that the failover process skips this partition.
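
The three cases above can be sketched as a small classifier over the replicas of one partition. This is a minimal illustration in Python, not an OceanBase API: the function name `classify_partition`, the dictionary keys `is_restore` and `is_leader`, and the returned labels are all hypothetical. The order in which the cases are checked, and the `"unknown"` label for combinations this topic does not describe, are assumptions of the sketch.

```python
# Illustrative sketch, not an OceanBase API. Each replica is a dict with
# hypothetical keys: "is_restore" (0 or 100, as listed above) and "is_leader".
STANDBY_RESTORE = 100


def classify_partition(replicas):
    """Map one partition's replicas to the action suggested in this topic."""
    total = len(replicas)
    created = [r for r in replicas if r["is_restore"] != STANDBY_RESTORE]
    restoring = total - len(created)
    if restoring == 0:
        return "ready"    # no replica is in the standby restore state
    if restoring == total:
        return "skip"     # failover skips the partition; handle it afterwards
    if restoring * 2 > total:
        return "delete"   # majority restoring: failover may get stuck
    if any(r.get("is_leader") for r in created):
        return "wait"     # leader exists: wait for the remaining replicas
    return "unknown"      # combination not described in this topic


partition = [
    {"is_restore": 0, "is_leader": True},
    {"is_restore": 0, "is_leader": False},
    {"is_restore": 100, "is_leader": False},
]
print(classify_partition(partition))
```

The input would typically be built from the rows returned by the query above, grouped by partition. How the rows are grouped and how the leader is identified depend on the columns of the view and are not covered here.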

Obtain cluster information

Assume that one primary cluster and one standby cluster exist. The following output is only an example; the values in your environment may differ.

Obtain information about the primary cluster

```
obclient> SELECT CLUSTER_ID, CLUSTER_NAME, CLUSTER_ROLE FROM V$OB_CLUSTER;
+------------+--------------+--------------+
| CLUSTER_ID | CLUSTER_NAME | CLUSTER_ROLE |
+------------+--------------+--------------+
|          1 | obcluster    | PRIMARY      |
+------------+--------------+--------------+
1 row in set
```

Obtain information about the standby cluster

```
obclient> SELECT CLUSTER_ID, CLUSTER_NAME, CLUSTER_ROLE FROM V$OB_CLUSTER;
+------------+--------------+------------------+
| CLUSTER_ID | CLUSTER_NAME | CLUSTER_ROLE     |
+------------+--------------+------------------+
|          2 | obcluster    | PHYSICAL STANDBY |
+------------+--------------+------------------+
1 row in set
```