Arbitration-based high availability solution|V4.3.3| docs|Distributed Database

Arbitration-based high availability solution

Last Updated：2024-11-11 07:37:14 Updated

Applicability

This topic applies only to OceanBase Database Enterprise Edition. OceanBase Database Community Edition does not support the arbitration service.

About arbitration

Arbitration is a process where the parties in dispute select an independent and impartial third party (the arbitral tribunal) to resolve the dispute.

In a distributed database, when half of the data replicas are abnormal (the replicas fail or are isolated from the other half of replicas on the network), the arbitration service deployed outside the cluster can participate in decision-making on changes (leader election or member group changes) to recover database services. OceanBase Database V4.1.0 and later support the arbitration service, which is an independent process that provides arbitration for OceanBase clusters. It is a new high availability solution provided on the basis of Paxos-based multi-replica disaster recovery.

The arbitration service is independent of OceanBase clusters. It is a lightweight observer process that starts in special mode and hosts arbitration members at the log stream level. An arbitration member has the following characteristics:

An arbitration member votes in leader election, Paxos Prepare, and member group change, but does not participate in Paxos Accept.
An arbitration member does not store logs and has no MemTables or SSTables. Therefore, it consumes minimal bandwidth, memory, disk, and CPU resources.
An arbitration member cannot be elected as a leader to provide services.

Advantages of the arbitration service

The arbitration-based high availability solution has the following advantages:

Arbitration members can replace log replicas. This reduces deployment costs while ensuring service availability.
If half of the full-featured replicas fail, the arbitration service automatically performs a downgrade to recover the services. The downgrade does not affect the business response time because logs do not need to be synchronized across regions.
Arbitration members do not need to synchronize or replay logs or store redo logs or baseline data. Therefore, the CPU, memory, disk, and bandwidth usage of the arbitration service is significantly reduced.
Compared with the primary/standby cluster solution, the arbitration service implements lossless active-active disaster recovery based on two full-featured replicas with very low resource overheads. It supports automatic upgrade and downgrade without data loss (RPO = 0).

Scenarios

The arbitration-based high availability solution is applicable to the following scenarios:

Two IDCs are deployed in the same region, and the arbitration service can be deployed independently across regions.
Cross-region bandwidth resources are limited.
Costs must be controlled, or the budget is limited.
Lossless active-active disaster recovery is expected, and availability slightly lower than that with three or five full-featured replicas is acceptable.

Compared with the deployment solution with all full-featured replicas, the arbitration-based high availability solution is more cost-effective but provides slightly lower availability. For example, assume that two full-featured replicas and one arbitration service are deployed. If a downgrade is triggered after one full-featured replica fails, only the other full-featured replica is available for log synchronization. If the other full-featured replica also fails and cannot be recovered, data loss occurs even if the first full-featured replica is subsequently recovered. This is because arbitration members do not store redo logs. However, this problem will not occur when three full-featured replicas are deployed.

The arbitration-based high availability solution is not applicable to the following scenarios:

Only two IDCs are deployed, and no additional servers or IDCs are available for deploying the arbitration service. In this scenario, you can use the primary/standby cluster solution.
Maximum system availability is expected, and costs of full-featured replicas are acceptable. In this scenario, you can use the Paxos-based multi-replica disaster recovery solution.

For more information about the comparison between disaster recovery solutions, see Deployment solutions for disaster recovery.

Configure the arbitration service

Prerequisites

The tenant has two or four full-featured replicas.

Deploy the arbitration service

The arbitration service is deployed in standalone mode. You must prepare a server whose resource specifications meet the deployment requirements. Then you can start the observer process by using the startup parameters specific to the arbitration mode. For more information, see Deploy a two-replica OceanBase cluster with the arbitration service by using the CLI.

Example

Use the arbitration service

After the arbitration service is deployed, add the arbitration service to an OceanBase cluster. This way, tenants in the cluster can use the arbitration service. One arbitration service can serve multiple OceanBase clusters.

Enable or disable the arbitration service

You can enable or disable the arbitration service for a tenant in an OceanBase cluster.

Automatic upgrade or downgrade

After the arbitration service is enabled for a tenant, if more than half of the full-featured replicas of the tenant fail, which means that logs on the majority of replicas cannot be synchronized, the arbitration service automatically triggers a downgrade to recover database services, with a recovery point objective (RPO) of 0. After the failed replicas are recovered, the cluster automatically upgrades and restores the original replica list.

The following figure shows the principle of automatic upgrade and downgrade. During a downgrade, a failed full-featured replica is converted into a learner. During an upgrade, a learner is converted into a full-featured replica.

Principle of upgrade and downgrade

A downgrade can be triggered in the following scenarios:

The server is disconnected or down.
Replica rebuild is triggered.
The STOP SERVER, STOP ZONE, or ISOLATE SERVER statement is executed.
The kill -9, kill -15, or kill -19 command is executed for the observer process.
The log disk for the tenant on the server is full or hangs. The log disk is the disk where the clog directory resides.

Manage the arbitration service

After you enable the arbitration service in an OceanBase cluster, you can replace or remove the arbitration service.