This topic describes the troubleshooting procedure when a minority of nodes fail.
Background information
OceanBase Database is a distributed database that is typically deployed in a multi-replica architecture. This topic describes node failures that do not affect the majority of replicas in any log stream. For n replicas, a majority is ⌊n/2⌋ + 1 replicas, so in a three-replica deployment the failure of any single node does not affect the majority, and in a five-replica deployment the failure of any two nodes does not affect the majority. However, if some log streams already lack replicas, the failure of even a minority of nodes may affect the majority. For more information about that scenario, see Failure of the majority of nodes.
Common node failures include the following:
A node fails due to a hardware exception.
A node encounters a hardware exception but does not fail. However, the observer process on this node exits abnormally. For example, a memory failure on the host causes the observer process to crash.
A node encounters a hardware exception but does not fail. However, the status of the observer process on this node becomes abnormal. For example, an I/O exception causes the read/write requests of the observer process to time out.
A node encounters a hardware exception but does not fail. However, the data that the observer process on the node writes to SSTables or redo logs is corrupted, and the corruption is detected in subsequent checksum verification.
A bug in OceanBase Database causes an exception in the observer process, such as stuck threads, memory overrun, and process crashes.
Procedure
If a bug in OceanBase Database causes the observer process to become abnormal, isolate the node as soon as possible. For more information about isolating a node, see Isolate a node. If services are not affected, preserve the scene and contact OceanBase Technical Support for assistance. If services are affected, restart the node or pull up the process as soon as possible, and then contact OceanBase Technical Support for assistance.
If a node becomes abnormal due to a hardware exception or a suspicious hardware exception is detected, you must perform the following steps to isolate and replace this node.
Log in to the `sys` tenant of the cluster as the `root` user. Note that you must specify the corresponding fields in the following sample code based on your actual database configurations.

```shell
obclient -h10.xx.xx.xx -P2883 -uroot@sys#obdemo -p***** -A
```

For more information about how to connect to a database, see Connection methods overview (MySQL-compatible mode) and Connection methods overview (Oracle-compatible mode).
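Before you stop the abnormal node, it can help to confirm how the cluster currently sees it. The following query is a minimal sketch, assuming the `oceanbase.DBA_OB_SERVERS` view of the `sys` tenant; the IP address is a placeholder for the abnormal node:

```sql
obclient [(none)]> SELECT SVR_IP, SVR_PORT, ZONE, STATUS, STOP_TIME
    FROM oceanbase.DBA_OB_SERVERS
    WHERE SVR_IP = '172.xx.xx.xx';
```

A `STATUS` of `INACTIVE`, or a non-NULL `STOP_TIME`, indicates that the cluster has already marked the node as unreachable or stopped.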
Perform the Stop Server or Force Stop Server operation on the abnormal node as soon as possible, and make sure that the operation succeeds, to avoid affecting business traffic.

The Stop Server statement is as follows:

```sql
obclient [(none)]> ALTER SYSTEM STOP SERVER 'svr_ip:svr_port';
```

Example:

```sql
obclient [(none)]> ALTER SYSTEM STOP SERVER '172.xx.xx.xx:2882';
```

The Force Stop Server statement is as follows:

```sql
obclient [(none)]> ALTER SYSTEM FORCE STOP SERVER 'svr_ip:svr_port';
```

Example:

```sql
obclient [(none)]> ALTER SYSTEM FORCE STOP SERVER '172.xx.xx.xx:2882';
```

For more information about the internal execution procedures of the Stop Server and Force Stop Server operations, see Isolate a node.

Add a node to the zone where the abnormal node is located.
For more information about how to add a node, see Add a node.
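If you add the node with SQL rather than through a management console, the statement below is a sketch; it assumes the new machine already runs an observer process and that `zone1` is the zone of the abnormal node (replace both placeholders with your actual address and zone name):

```sql
obclient [(none)]> ALTER SYSTEM ADD SERVER '172.xx.xx.yy:2882' ZONE 'zone1';
```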
Replenish the replicas that were on the abnormal node onto the new node.
The following three cases exist:
The observer process has exited.
You can initiate a unit migration, or wait for RootService to initiate one automatically once the observer process has been offline for longer than the permanent offline time, which is controlled by the `server_permanent_offline_time` parameter. Make sure that the `enable_rereplication` parameter is enabled. For more information about how to manually initiate a unit migration, see Unit migration.
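You can check both settings from the `sys` tenant before deciding whether to wait for RootService. A minimal sketch using the standard `SHOW PARAMETERS` statement:

```sql
obclient [(none)]> SHOW PARAMETERS LIKE 'server_permanent_offline_time';
obclient [(none)]> SHOW PARAMETERS LIKE 'enable_rereplication';
```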
The abnormal node has data errors, the unit migration is stuck because the observer process is in an abnormal state, or the abnormal observer process has caused the global status of the cluster to become abnormal (for example, a stuck compaction).
You can kill the observer process and then refer to the first case (observer process has exited) for handling.
The node has hardware risks, or the exception will not cause the global status of the cluster to become abnormal.
You do not need to kill the observer process (killing it would increase the risk exposure of the cluster). Initiate a unit migration to migrate all units from the abnormal node to the new node.
For more information about how to manually initiate a unit migration, see Unit migration.
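To migrate units manually, you first need the IDs of the units hosted on the abnormal node. The following sketch assumes the `oceanbase.DBA_OB_UNITS` view and the `ALTER SYSTEM MIGRATE UNIT` statement; the unit ID and the addresses are placeholders to replace with your own values:

```sql
-- List the units hosted on the abnormal node.
obclient [(none)]> SELECT UNIT_ID, TENANT_ID, ZONE
    FROM oceanbase.DBA_OB_UNITS
    WHERE SVR_IP = '172.xx.xx.xx' AND SVR_PORT = 2882;

-- Migrate one unit to the new node; repeat for each unit listed above.
obclient [(none)]> ALTER SYSTEM MIGRATE UNIT = 1006 DESTINATION = '172.xx.xx.yy:2882';
```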
Note
We recommend that you always initiate a unit migration:
- Initiating the migration yourself lets you migrate all units from the abnormal node to the new node in a controlled way instead of relying on RootService to auto-initiate it, thereby preserving the original unit topology and the load balance among nodes.
- RootService automatically initiates a unit migration only when the process has been offline for longer than the permanent offline time, which increases the risk exposure of the node exception.
- By initiating a unit migration, you can monitor the impact of migration traffic on cluster resources in real time and avoid affecting application traffic.
If the node exception is detected after the process has been offline for longer than the permanent offline time and RootService has already automatically initiated a unit migration, monitor the topology after migration, monitor the impact of migration traffic on cluster resources, and keep business personnel informed.
After the operation, you can view the data replenishment progress through the `oceanbase.DBA_OB_UNIT_JOBS` view.

```sql
obclient [(none)]> SELECT * FROM oceanbase.DBA_OB_UNIT_JOBS WHERE JOB_TYPE = 'MIGRATE_UNIT';
+--------+--------------+------------+-------------+----------+----------------------------+----------------------------+-----------+---------+----------+------------+--------------+-------------+
| JOB_ID | JOB_TYPE     | JOB_STATUS | RESULT_CODE | PROGRESS | START_TIME                 | MODIFY_TIME                | TENANT_ID | UNIT_ID | SQL_TEXT | EXTRA_INFO | RS_SVR_IP    | RS_SVR_PORT |
+--------+--------------+------------+-------------+----------+----------------------------+----------------------------+-----------+---------+----------+------------+--------------+-------------+
|      4 | MIGRATE_UNIT | INPROGRESS | NULL        | 0        | 2023-01-04 17:22:02.208219 | 2023-01-04 17:22:02.208219 |      1004 |    1006 | NULL     | NULL       | xx.xx.xx.238 | 2882        |
+--------+--------------+------------+-------------+----------+----------------------------+----------------------------+-----------+---------+----------+------------+--------------+-------------+
```

If the query result is empty, the unit migration is complete and data replenishment has succeeded.
After the units on the abnormal node have been replenished on the new node, delete the abnormal node.
For more information about how to delete a node, see Delete a node.
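If you delete the node with SQL, the statement below is a sketch; the address and zone name are placeholders. Deletion is expected to succeed only after all units have been migrated off the node:

```sql
obclient [(none)]> ALTER SYSTEM DELETE SERVER '172.xx.xx.xx:2882' ZONE 'zone1';
```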