This topic describes the troubleshooting procedure when a minority of nodes fail.
Background information
OceanBase Database is a distributed database that is typically built on a multi-replica architecture. This topic describes node failures that do not affect the majority of replicas in any log stream. For example, in a three-replica deployment architecture, the failure of any single node does not affect the majority; in a five-replica deployment architecture, the failure of any two nodes does not affect the majority. However, if some log streams already lack replicas, the failure of even a minority of nodes may affect the majority. For more information about that scenario, see Failure of the majority of nodes.
Common node failures include the following:
A node fails due to a hardware exception.
A node encounters a hardware exception but does not fail. However, the observer process on this node exits abnormally. For example, a memory failure on the host causes the observer process to crash.
A node encounters a hardware exception but does not fail. However, the status of the observer process on this node becomes abnormal. For example, an I/O exception causes the read/write requests of the observer process to time out.
A node encounters a hardware exception but does not fail. However, errors occur in the data that the observer process on the node writes to the SSTable or the redo log, and the errors are detected in subsequent checksum verification.
A bug in OceanBase Database causes an exception in the observer process, such as stuck threads, memory overruns, or process crashes.
Procedure
If a bug in OceanBase Database causes the observer process to become abnormal, you must first isolate the corresponding node. For more information about how to isolate a node, see Isolate a node. If services are not affected, you can contact OceanBase Technical Support for help without taking any emergency actions. If services are affected, restart the node or pull up the observer process (see the sketch below), and then contact OceanBase Technical Support for help.
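If you need to pull up the observer process manually, the following is a minimal sketch. The installation directory (/home/admin/oceanbase) and the admin OS user are assumptions based on a typical deployment; adapt them to your environment.

# Check whether the observer process is still alive.
ps -ef | grep observer

# Pull up the process from the installation directory as the deployment user.
# On a restart, the observer typically reads its persisted configuration from
# the etc directory, so no startup options are passed here.
su - admin
cd /home/admin/oceanbase && ./bin/observer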
If a node becomes abnormal due to a hardware exception, or if a suspected hardware exception is detected, you must perform the following steps to isolate and replace this node.
Log in to the sys tenant of the cluster as the root user.

Note that you must specify the corresponding fields in the following sample code based on your actual database configurations.

obclient -h10.xx.xx.xx -P2883 -uroot@sys -p***** -A

For more information about how to connect to a database, see Connection methods (MySQL mode) or Connection methods (Oracle mode).
Perform the Stop Server or Force Stop Server operation on the abnormal node as soon as possible, and make sure that the operation succeeds, to avoid affecting business traffic.

The Stop Server operation uses the following syntax:

obclient [(none)]> ALTER SYSTEM STOP SERVER 'svr_ip:svr_port';

Here is an example:

obclient [(none)]> ALTER SYSTEM STOP SERVER '172.xx.xx.xx:2882';

The Force Stop Server operation uses the following syntax:

obclient [(none)]> ALTER SYSTEM FORCE STOP SERVER 'svr_ip:svr_port';

Here is an example:

obclient [(none)]> ALTER SYSTEM FORCE STOP SERVER '172.xx.xx.xx:2882';

For more information about the internal execution procedures of the Stop Server and Force Stop Server operations, see Isolate a node.

Add a node to the zone where the abnormal node is located.
For more information about how to add a node, see Add a node.
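A minimal sketch of adding a node by using SQL in the sys tenant follows. The server address (172.xx.xx.yy:2882) and the zone name (zone1) are placeholders, and the new node must already be running an observer process:

obclient [(none)]> ALTER SYSTEM ADD SERVER '172.xx.xx.yy:2882' ZONE 'zone1';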
Replicate the data of replicas on the abnormal node to the new node.
The following three cases exist:
The observer process has exited.
You can initiate a resource unit migration task, or wait for RootService to automatically trigger one. When the offline duration of the observer process reaches the permanent offline time specified by the server_permanent_offline_time parameter, RootService triggers a resource unit migration task. Make sure that the enable_rereplication parameter is enabled.
For more information about how to manually initiate a resource unit migration task, see Migrate units.
Data errors occur on the abnormal node, or the abnormal status of the observer process causes the resource unit migration task to get stuck or causes a global exception (such as a stuck major compaction).
You can kill the observer process and then take the actions described in the first case, where the observer process has exited.
Potential hardware risks exist, or the exception will not cause the cluster to become globally abnormal.
You do not need to kill the observer process. Killing the observer process may cause the risks to spread. You can initiate a resource unit migration task to migrate all resource units from the abnormal node to a new node.
For more information about how to manually initiate a resource unit migration task, see Migrate units; a sketch follows the note below.
Note
We recommend that you initiate the resource unit migration task yourself, for the following reasons:
- If you initiate a resource unit migration task, you can migrate all resource units from the abnormal node to a new node while retaining the original unit topology, thereby avoiding an impact on load balancing among the nodes.
- RootService automatically triggers a resource unit migration task only after the observer process reaches the permanent offline time, which gives the risks of the node exception time to spread.
- By initiating a resource unit migration task yourself, you can check the resource usage of the migration traffic, thereby avoiding an impact on the application traffic.
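As referenced in the cases above, the following is a minimal sketch of checking the related parameters and manually initiating a resource unit migration task in the sys tenant. The unit ID (1006) and the destination address are placeholders, and you should verify the exact MIGRATE UNIT syntax in Migrate units:

obclient [(none)]> SHOW PARAMETERS LIKE 'server_permanent_offline_time';
obclient [(none)]> SHOW PARAMETERS LIKE 'enable_rereplication';
obclient [(none)]> ALTER SYSTEM MIGRATE UNIT = 1006 DESTINATION = '172.xx.xx.yy:2882';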
Query the oceanbase.DBA_OB_UNIT_JOBS view for the data replication progress.

obclient [(none)]> SELECT * FROM oceanbase.DBA_OB_UNIT_JOBS WHERE JOB_TYPE = 'MIGRATE_UNIT';
+--------+--------------+------------+-------------+----------+----------------------------+----------------------------+-----------+---------+----------+------------+--------------+-------------+
| JOB_ID | JOB_TYPE     | JOB_STATUS | RESULT_CODE | PROGRESS | START_TIME                 | MODIFY_TIME                | TENANT_ID | UNIT_ID | SQL_TEXT | EXTRA_INFO | RS_SVR_IP    | RS_SVR_PORT |
+--------+--------------+------------+-------------+----------+----------------------------+----------------------------+-----------+---------+----------+------------+--------------+-------------+
|      4 | MIGRATE_UNIT | INPROGRESS | NULL        |        0 | 2023-01-04 17:22:02.208219 | 2023-01-04 17:22:02.208219 |      1004 |    1006 | NULL     | NULL       | xx.xx.xx.238 |        2882 |
+--------+--------------+------------+-------------+----------+----------------------------+----------------------------+-----------+---------+----------+------------+--------------+-------------+

If the query result is empty, the resource units have been migrated and data replication has succeeded.
Delete the abnormal node.
For more information about how to delete a node, see Delete a node.
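A minimal sketch of deleting the node by using SQL in the sys tenant follows; the server address is a placeholder:

obclient [(none)]> ALTER SYSTEM DELETE SERVER '172.xx.xx.xx:2882';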
If the observer process on a node has exited due to an exception and has reached the permanent offline time, and RootService has already triggered a resource unit migration task, you must check the topology of the migrated resource units and the resource usage of the migration traffic, and provide this information to the business personnel.
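To check the topology of the migrated resource units, you can query the oceanbase.DBA_OB_UNITS view in the sys tenant. The following is a minimal sketch; verify the column names against your database version:

obclient [(none)]> SELECT UNIT_ID, TENANT_ID, ZONE, SVR_IP, SVR_PORT, STATUS FROM oceanbase.DBA_OB_UNITS;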