High availability of OceanBase Database|V4.3.3| docs|Distributed Database

In addition to the high availability of ODP services, ODP also helps OceanBase Database achieve high availability. ODP is an important part of the high availability system of OceanBase Database. When an issue occurs in OceanBase Database, the database system must be able to recover services in a timely manner. On the other hand, ODP needs to perceive OBServer node failures and service changes.

ODP adopts fault detection, the blocklist mechanism, and SQL retries to implement the preceding features. Fault detection identifies faulty nodes. The blocklist mechanism influences ODP routing. SQL retries ensure that SQL statements are successfully executed as much as possible.

Fault detection and blocklist mechanism

Fault detection

For better understanding of fault detection, we need to first learn about fault categories and difficulties in detection

ODP faults can be divided into the following two categories:

Server and process failures, including server hardware issues and process issues such as server breakdowns, network failures, and process core dumps.
Business logic issues, including service unavailability caused by logic issues of OceanBase Database, for example, leader service interruption caused by Paxos election failures.

Failures caused by business logic are closely related to the database implementation, and therefore are difficult to define and solve. For example, service unavailability caused dead loop of business logic cannot be solved by using ODP.

Therefore, in addition to OBServer node and process failures, ODP deals with some business logic issues based on some known symptoms, for example, specific error codes returned by an OBServer node (such as those indicating insufficient machine memory), and no response from an OBServer node for a long time.

In a distributed system, fault detection results can be success, failure, or timeout. Among them, timeout is the trickiest one. For example, if an SQL statement has been executed for 100s and no result is returned, it is difficult to determine whether the OBServer node executing the SQL statement is normal. The OBServer node may execute SQL statements slowly, or the OBServer node may be faulty.

On the other hand, detection tasks are generally periodically scheduled. If the status of an OBServer node changes during the detection interval, ODP cannot detect the change in real time and may have to make routing choices based on expired information. This may cause slow SQL execution or SQL execution failures.

Fault detection is based on node status definitions. For example, two states are defined: healthy and faulty. However, OBServer node status is more complex, because more refined status definition helps achieve more high availability features and offer better user experience. ODP defines the following eight OBServer node states:

ACTIVE: The OBServer node is normal and can provide services.
INACTIVE: The OBServer node is abnormal and cannot provide services.
UPGRADE: The OBServer node is in the progress of database version upgrade.
REPLAY: The OBServer node is in the progress of database log playback.
DELETING: The OBServer node is being deleted. In this case, operations such as data migration may be in progress.
DELETED: The node has been deleted and no longer belongs to the cluster.
DETECT_ALIVE: ODP detects the OBServer node and considers the OBServer node normal.
DETECT_DEAD: ODP fails to detect the OBServer node and considers the OBServer node abnormal.

Though OBServer nodes have many states, these states essentially reflect the corresponding faults or operations. For example, INACTIVE indicates a core dump, DELETING or DELETED indicates that a node is being deleted or has been deleted, REPLAY indicates that a node is added, UPGRADE indicates a version upgrade, and DETECT_DEAD indicates that the process is hanging.

ODP can get the first six states by accessing an OBServer node view. ODP periodically obtains server status information of the cluster from the DBA_OB_SERVERS and DBA_OB_ZONES views at an interval of 20s. ODP determines the OBServer node status based on the status, start_service_time, and stop_time fields in the query result. ODP obtains the DETECT_ALIVE and DETECT_DEAD states by using the detection mechanism.

The query result is as follows:

MySQL [oceanbase]> select * from DBA_OB_SERVERS\G
*************************** 1. row ***************************
               SVR_IP: 10.10.10.2
             SVR_PORT: 2882
                   ID: 1
                 ZONE: zone1
             SQL_PORT: 2881
      WITH_ROOTSERVER: YES
               STATUS: ACTIVE
   START_SERVICE_TIME: 2022-10-31 11:48:38.677315
            STOP_TIME: NULL
BLOCK_MIGRATE_IN_TIME: NULL
          CREATE_TIME: 2022-10-31 11:48:24.684250
          MODIFY_TIME: 2022-10-31 11:48:39.682642
        BUILD_VERSION: 4.0.0.0_100000252022102910-df01cef074936b9c9f177697500fad1dc304056f(Oct 29 2022 10:27:50)

MySQL [oceanbase]> select * from DBA_OB_ZONES\G
*************************** 1. row ***************************
       ZONE: zone1
CREATE_TIME: 2022-10-31 11:48:29.040552
MODIFY_TIME: 2022-10-31 11:48:29.041609
     STATUS: ACTIVE
        IDC:
     REGION: default_region
       TYPE: ReadWrite

Blocklist mechanism

After the status of an OBServer node changes, ODP triggers the logic related to high availability based on the status change result. High availability is closely related to the blocklist mechanism.

Specifically, after ODP detects the status of an OBServer node, ODP modifies the blocklist and then adds the node to or removes the node from the blocklist. Based on the detection mechanisms, ODP implements three blocklists: status blocklist, detection blocklist, and alive-but-unavailable blocklist.

Status blocklist

The status blocklist depends on the status changes of OBServer nodes. Due to historical reasons, the DETECT_ALIVE and DETECT_DEAD states are not related to the status blocklist. The changes of other states can be obtained from views of OBServer nodes.

When ODP obtains the latest status of an OBServer node by using a scheduled task, ODP performs the corresponding operation based on the status of the OBServer node.

ACTIVE: ODP removes the OBServer node from the status blocklist.
INACTIVE/REPLAY: ODP adds the OBServer node to the status blocklist.
DELETED/DELETING: ODP updates the OBServer node list in the memory and no longer forwards SQL requests to this OBServer node.
UPGRADE: ODP neither adds the OBServer node to the status blocklist nor forwards SQL requests to the OBServer node. This is equivalent to adding the OBServer node to the status blocklist.

Detection blocklist

For the status blocklist, ODP obtains information from RootService of OceanBase Database. Such information sometimes may be unable to reflect the conditions between ODP and OBServer nodes. As shown in the following figure, ODP learns from RootService that OBServer1 is in the ACTIVE state, but the network between ODP and OBServer1 is disconnected.

Example

Therefore, ODP implements the detection blocklist based on the status blocklist. ODP sends a detection SQL statement to an OBServer node to determine its status. ODP sends a detection SQL statement select 'detect server alive' from dual to the sys tenant of the OBServer node and sets the timeout period to 5s. If no result is returned within the timeout period, ODP sends a detection statement again. If the detection fails for three consecutive times, ODP sets the status of this OBServer node to DETECT_DEAD. If a result is returned, ODP clears the number of detection failures and sets the status of this OBServer node to DETECT_ALIVE. After ODP detects a status change, ODP performs the corresponding operation.

DETECT_ALIVE: ODP removes the OBServer node from the detection blocklist.
DETECT_DEAD: ODP adds the OBServer node to the detection blocklist and closes all connections to this OBServer node.

Note

If ODP adds an OBServer node to the detection blocklist but does not close the connections to the OBServer node, the connections are always occupied and no new SQL requests can be sent. The performance data shows that the transactions per second (TPS) of this OBServer node is 0 during this period. After the connections to the OBServer node are closed, ODP will route subsequent requests based on the blocklist and will not forward requests to this OBServer node. Then, the TPS can resume.

Alive-but-unavailable blocklist

ODP can properly handle server and process faults based on the status blocklist and detection blocklist. However, ODP wants to further perceive the cause of each failed SQL statement execution so as to handle complex situations. The alive-but-unavailable blocklist is implemented for this purpose.

In the alive-but-unavailable blocklist mechanism, ODP defines the status of an OBServer node based on the execution results of business SQL statements and adds the OBServer node to or removes it from the blocklist based on the status. ODP is more cautious about adding an OBServer node to or removing an OBServer node from the alive-but-unavailable blocklist than with the status blocklist and the detection blocklist. This aims to avoid misjudgments that are made based on a single SQL execution result. The operation performed by ODP on an OBServer node vary with the actual situation:

ODP will record a failure event in any of the following cases:
- ODP sends a detection SQL statement to the OBServer node and does not receive any response after the timeout period specified by the ob_query_timeout parameter expires.
- The OBServer node returns the OB_SERVER_IS_INIT, OB_SERVER_IS_STOPPING, OB_PACKET_CHECKSUM_ERROR, or OB_ALLOCATE_MEMORY_FAILED error code.
- ODP fails to connect to the OBServer node, parse packets, or transmit data.
If the OBServer node has five alive-but-unavailable records within 120s, ODP will add it to the alive-but-unavailable blocklist.
ODP will attempt to resend a detection SQL statement to the OBServer node 20s later after it adds the OBServer node to the alive-but-unavailable blocklist.
If the detection SQL statement is successfully executed, ODP will remove the OBServer node from the blocklist.

These three blocklists can help resolve issues from different dimensions. Exceptions may fail to be handled with the absence of any of these blocklists. These blocklists have no precedence over one another. An OBServer node in any blocklist cannot receive requests. In other words, an OBServer node can receive requests only if it is not in any blocklist.

Regular proxies only implement a detection mechanism to check the health status of backend servers. This type of detection can resolve only simple issues. ODP provides various features based on the features of OceanBase Database, such as perception of OceanBase Database version upgrade, log replay, and SQL execution errors. This far exceeds regular proxies.

View blocklists

You can perform the following steps to view blocklists:

Log on to the sys tenant of the OceanBase cluster through ODP as the root user from a client.
```
obclient -h10.10.10.1 -uroot@sys#obdemo -P2883 -p -c
```
In the preceding example, the IP address of the ODP server is 10.10.10.1, the cluster name is obdemo, and the connection port is 2883. If you connect to the cluster by using OBClient, you need to modify the parameters based on the actual situation.
Execute the following statement to view blocklists:
```
SHOW PROXYCONGESTION [all] [clustername];
```
where
- If no option is specified, the blocklists of all clusters are retrieved.
- If only the all option is specified, OBServer nodes of all clusters, including OBServer nodes in and those not in the blocklists, are retrieved.
- If only the clustername option is specified, the blocklists of the specified cluster are retrieved. The clustername value must be enclosed in single or double quotation marks.
- If both the all and clustername options are specified, the OBServer nodes of the specified cluster are retrieved.

Here is an example of viewing the blocklist of the single-node cluster obcluster:

show proxycongestion 'obcluster'\G

The output is as follows:

*************************** 1. row ***************************
         cluster_name: obcluster
            zone_name: zone1
          region_name: sys_region
           zone_state: ACTIVE
            server_ip: xxx.xxx.xxx.xxx:2881
           cr_version: 5
         server_state: ACTIVE
      alive_congested: 0
 last_alive_congested: 0
       dead_congested: 0
  last_dead_congested: 0
  stat_alive_failures: 0
   stat_conn_failures: 0
  conn_last_fail_time: 0
  conn_failure_events: 0
 alive_last_fail_time: 0
 alive_failure_events: 0
            ref_count: 2
     detect_congested: 0
last_detect_congested: 0

Note

dead_congested and alive_congested indicate whether an OBServer node is in a blocklist and an alive-but-unavailable blocklist respectively. If the value of either field is 1, this OBServer node has been added to a blocklist.
If an OBServer node is added to an alive-but-unavailable blocklist, ODP will not forward requests to this OBServer node within the period specified by congestion_retry_interval, and will initiate a retry after the period specified by congestion_retry_interval. In addition, this OBServer node cannot be removed from the blocklist within the period specified by min_keep_congestion_interval.

SQL retries

The SQL retry feature can be implemented based on blocklists. SQL retries aim to ensure the execution of SQL statements instead of simply throwing an error to the client.

The prerequisite for retrying an SQL statement is that unexpected behavior must not happen. For example, if the OBServer node does not respond a long time after you execute the insert into t1 values (c1 + 1, 1) statement, ODP cannot forward this statement to another OBServer node. Otherwise, c1 may change to c1 + 2. An incorrect retry is more harmful.

ODP may perform a retry before or after forwarding to the OBServer node. The following sections describe the two types of retries.

Retries before forwarding

This section describes retries before forwarding based on the ODP architecture diagram. In the following figure, the part in orange indicates the position where a retry before forwarding occurs.

Retry before forwarding

In the following example, a regular SQL statement select * from t is used to describe the logic of a retry before forwarding.

Assume that t is a partitioned table and the leader is located on OBServer1. After RootService detects a core dump in the observer process of OBServer1, RootService sets the status of OBServer1 to INACTIVE.

ODP selects OBServer1 in the data routing phase, and then performs a disaster recovery check and finds that OBServer1 is in the status blocklist. In this case, a retry is required. ODP uses intra-tenant routing to select another OBServer node in the server list of the tenant and then performs a disaster recovery check. If the selected OBServer node passes the check, ODP forwards the request to the node.

Earlier detection of an issue facilitates faster rectification of the issue. A retry before forwarding occurs only within ODP. Therefore, rapidly rectified issues are imperceptible to business systems.

Retries after forwarding

A retry before forwarding is efficient but depends on the accuracy of blocklists. In a distributed system, ODP cannot perceive the server status in real time. Therefore, ODP may not know that an OBServer node is faulty before forwarding. In this case, a retry can be performed based on the forwarding result.

Two forwarding results are commonly seen:

The TCP connection fails because the OBServer node is faulty.

In this case, ODP can select another OBServer node if no special restrictions, such as transaction routing, are involved.
The OBServer node encounters an execution failure and returns an error code.

For more information about error codes, see the open-source code ObMysqlTransact::handle_oceanbase_server_resp_error of ODP. For example, if the OBServer node returns the error code REQUEST_SERVER_INIT_ERROR, the OBServer node initialization fails. In this case, reforwarding can be performed and the logic is similar to the logic of a retry before forwarding.

With retries before forwarding and retries after forwarding, ODP can handle most OBServer node issues. The entire process is transparent to users and users may not perceive the faults. A retry before forwarding is efficient but depends on the blocklist mechanism. A retry after forwarding is a supplement to the retry before forwarding and aims to shield backend exceptions as much as possible.

In short, fault detection, blocklists, and SQL retries are three important features to ensure high availability for ODP. They are also mutually related. Both the retry after forwarding and the alive-but-unavailable blocklist depend on the feedback from OBServer nodes. Retries before forwarding depend on the blocklist mechanism.

With these three features, ODP can handle most OBServer node exceptions such as network failures, observer process exceptions, and server breakdowns. Moreover, ODP can identify logic exceptions in OceanBase Database such as version upgrade and log replay, to improve the usability of OceanBase Database.

Correlation with the high availability of OceanBase Database

Unlike ODP, OceanBase Database uses the Paxos algorithm to ensure high availability. When the majority of nodes in OceanBase Database are normal, services can still be provided properly and exceptions to a minority of nodes can be tolerated. However, the high availability feature of OceanBase Database may cause the leader role to be switched from Node A to Node B. Therefore, ODP must find the normal service node more efficiently and accurately. In this way, the high availability feature of ODP and that of OceanBase Database work together to provide stable services.

For more information about the high availability feature of OceanBase Database, see High availability.