Isolate a node|V4.3.3| docs|Distributed Database

Isolate a node

Last Updated：2024-11-11 07:37:14 Updated

You can isolate a node for fault isolation or O&M changes. After a node is isolated, new read and write requests will not be routed to it. In this way, the fault is isolated or you can perform lossless O&M changes.

Fault isolation: When an exception occurs on a node, you must isolate this node from the business traffic. For example, if there is a switch malfunction in an IDC, some nodes may encounter packet loss, retransmissions, and even elections without a leader. In this scenario, you can isolate these faulty nodes for quick recovery. After the nodes resume normal, you can start them to restore the node traffic.
O&M changes: When a node requires O&M changes, make sure that the O&M changes are transparent to the business traffic. For example, when a node must be restarted, to avoid affecting the business traffic, you can isolate the node and switch the business traffic from this node to other nodes. After the observer process on the node is restarted, you can start the node.

Isolating a node simply involves diverting the business traffic away from that node without changing the number of voting members in the Paxos group. Different isolation commands provide different levels of security guarantees, and it is necessary to carefully consider the subsequent actions that can be safely executed based on the specific isolation command being executed. Isolation commands are suitable for scenarios where short-term isolation is required but continuous monitoring is necessary. It is important to have contingency plans for secondary failures in advance. If the recovery cannot be achieved within a short period, the node can be completely removed by replacing it. For detailed operations on replacing a node, see Replace a node.

Three isolation commands with different different levels of security guarantees are provided:

STOP SERVER
FORCE STOP SERVER
ISOLATE SERVER

After you isolate a node by using any of the preceding commands, if the value of the STATUS field is still ACTIVE but the value of the STOP_TIME field changes from NULL to the point in time when the node is isolated, the node is in the Stopped state.

STOP SERVER

You can use STOP SERVER to isolate multiple server nodes at a time. The syntax is as follows:

obclient [(none)]> ALTER SYSTEM STOP SERVER 'svr_ip1:svr_port1', 'svr_ip2:svr_port2', ...;

STOP SERVER has the highest security level. The STOP SERVER command can switch the business traffic away from the target node to isolate the node from the business traffic. It can also ensure that other nodes are still the majority in the Paxos group. In this way, you can securely perform all actions on the isolated node. For example, you can run the pstack command, adjust the log level, or stop the observer process. When you need to isolate a fault or stop a node for maintenance, you can use the STOP SERVER command, which is the best choice in this case.

The precheck for the STOP SERVER command is strict to ensure that any actions can be securely performed on the isolated node. Various internal status checks are performed to ensure that after the observer process on the target node is stopped, the replicas on the remaining available nodes in the cluster are still the majority in the Paxos group, thereby ensuring the continuity of database services.

Limitations of the STOP SERVER command are as follows:

You cannot stop nodes in multiple zones at the same time. If you stop nodes in multiple zones at the same time or a node in another zone is already stopped, the error -4660, cannot stop server or stop zone in multiple zones is returned. In this case, you can query the DBA_OB_SERVERS and DBA_OB_ZONES views to check whether nodes in multiple zones are stopped.
The STOP SERVER command will check the Paxos member lists of all log streams in all tenants to verify whether the replicas on the remaining nodes, except for the target node and other unavailable nodes, are still the majority in the Paxos group. If the replicas on the remaining nodes are not the majority, you cannot perform the Stop Server operation.
1. Nodes in the INACTIVE, Stopped (in the ACTIVE state and the value of the stop_time field is greater than 0), or DELETING state are unavailable nodes.
2. To check the Paxos member list of log streams, make sure that all log streams have a leader. If any log stream does not have a leader, the error code -4179 is returned and the information about this log stream is displayed.
  
  For example, assume that log stream 1 in the 1001 tenant does not have a leader. The error message Tenant(1001) LS(1) has no leader, stop server not allowed is returned. In this case, you can query the GV$OB_LOG_STAT view to check information about the leader of the log stream and troubleshoot the issue.
3. If the locality of the tenant is modified when you run the STOP SERVER command, a misjudgment may be made.
  
  For example, assume that the locality of the 1001 tenant is being changed to remove a replica and the member list of the log stream is changing from A, B, and C to A and B. If you run the STOP SERVER command to stop node B, the system may misjudge that node B can be stopped, which conflicts with the replica removal operation. To avoid misjudgments, the STOP SERVER command returns the error code -4179 and displays the modification information about the tenant Tenant(1001) locality is changing, stop server not allowed. In this case, you can query the DBA_OB_TENANTS view to check whether the locality of the tenant is modified.
4. After you check the target node and other unavailable nodes, if the replicas on the remaining nodes are not the majority, the -4179 error code is returned and the information about the abnormal log stream is displayed.
  
  For example, after you stop a node in log stream 1 of the 1001 tenant, if the replicas on the remaining nodes are not the majority, the error message Tenant(1001) LS(1) has no enough valid paxos member after stop server, stop server not allowed is returned. In this case, you can query the GV$OB_LOG_STAT view to check the member list and replica information of the log stream to verify whether the replicas on the remaining nodes are the majority.
After you run the STOP SERVER command, you can perform all invasive actions on the isolated node. Make sure that the replicas on the remaining available nodes are the majority and the logs of these replicas are synchronized. In this way, after the observer process on the isolated node is stopped, the log replicas on the remaining nodes are still the majority.
1. To determine whether the logs are synchronized, view the latest log checkpoints of the followers and leader in the log stream. If the difference between the checkpoints is within 5s, the logs are synchronized. You can query the END_SCN field in the GV$OB_LOG_STAT view for the latest log checkpoint of each replica in the specified log stream.
2. If the difference exceeds 5s, the error code -4179 is returned and the information about the abnormal log stream is displayed.
  
  For example, assume that logs of replicas in log stream 1 of the 1001 tenant are out of synchronization. The error message Tenant(1001) LS(1) log not sync, stop server not allowed is returned. In this case, you can query the END_SCN field in the GV$OB_LOG_STAT view to check the synchronization of logs among replicas in the log stream.
```
# Obtain the latest log checkpoints of replicas in a specified log stream and convert the checkpoints into timestamps.
obclient [(none)]> SELECT SCN_TO_TIMESTAMP(END_SCN) FROM GV$OB_LOG_STAT WHERE TENANT_ID = 1001 and LS_ID = 1;
```

STOP SERVER is the most secure isolation command and has the strictest prerequisites. In actual application scenarios especially in the case of emergency response to faults, the preceding prerequisites may not be fully met. In this case, the STOP SERVER command will fail. Therefore, OceanBase Database provides another two node isolation commands: FORCE STOP SERVER and ISOLATE SERVER. The two commands have less strict prerequisites and limits and apply to more scenarios. However, after a node is isolated by using one of the two commands, the actions that can be performed are limited. The STOP SERVER command is the best choice in all scenarios. If STOP SERVER is inapplicable, you can choose the other two isolation commands.

FORCE STOP SERVER

Unlike STOP SERVER, the FORCE STOP SERVER command skips the log synchronization check. After a node is isolated by using the FORCE STOP SERVER command, you can switch the business traffic away from the isolated node but you may not securely stop the observer process on the isolated node. If you forcibly stop the observer process on a node, the remaining nodes in the Paxos group may not be the majority, thereby affecting the continuity of database services.

Here is a typical application scenario: assume that you have three nodes A, B, and C in an OceanBase cluster, replica logs on node B have a high latency, for example, 1 hour, and node C encounters a hardware exception and must be stopped for maintenance. To ensure that the observer process can be securely stopped on the isolated node, the STOP SERVER command is the best choice to isolate the node. However, due to the long latency in replica logs on node B, the STOP SERVER command cannot be successfully run. In this case, you can only use the FORCE STOP SERVER command to isolate the node, preventing the hardware exception from affecting the business traffic first, and then identifying the cause of the latency in the logs of the replica on node B. After the logs of the replica on node B are synchronized, you can try to run the STOP SERVER command again. If the command is successfully executed, you can securely stop the node for maintenance.

obclient [(none)]> ALTER SYSTEM FORCE STOP SERVER 'svr_ip1:svr_port1', 'svr_ip2:svr_port2', ...;

Warning

The FORCE STOP SERVER command applies only when a latency exists between logs on other nodes except for the isolated node. In this case, you can isolate the node only for fault isolation. You must not perform any invasive action that will cause the observer process to stop. Otherwise, the replicas on the remaining nodes are not the majority in the Paxos group. As a result, no leader exists in the tenant and the continuity of services is affected.

ISOLATE SERVER

The precheck for the ISOLATE SERVER command is looser than those for STOP SERVER and FORCE STOP SERVER, contributing to the highest response speed. However, the ISOLATE SERVER command does not provide any security guarantee. That is, it does not guarantee that after the target node is removed, the replicas on the remaining nodes in the cluster are still the majority in the Paxos group to continue to provide services.

obclient [(none)]> ALTER SYSTEM ISOLATE SERVER 'svr_ip1:svr_port1', 'svr_ip2:svr_port2', ...;

The ISOLATE SERVER command loosens the precheck based on the following logic:

The ISOLATE SERVER command has no restrictions on the isolation of nodes in multiple zones. If a node in another zone is already isolated by using any isolation command, you can still run the ISOLATE SERVER command.

Here is a typical scenario: a cluster has three zones, Z1, Z2, and Z3. Z1 and Z3 each have a faulty node to be isolated. After the faulty node in Z1 is isolated, if you use STOP SERVER or FORCE STOP SERVER to isolate the faulty node in Z3, an error will be returned, indicating that you cannot isolate nodes in multiple zones at the same time. In this case, you can only use the ISOLATE SERVER command to switch the business traffic away from the faulty node. After the faulty node is isolated, You must not perform any invasive action that will cause the observer process to stop.
The ISOLATE SERVER command does not check whether the replicas on the remaining nodes in the log stream are still the majority, or whether the logs of these replicas are synchronized. Even if the required conditions are not met, the ISOLATE SERVER command can still be executed.

Here is a typical scenario: the initial Paxos member list of a log stream contains three nodes, A, B, and C. The server of node C fails and becomes permanently offline. Then, the member list of the log stream contains only nodes A and B and lacks replicas. Assume that node B encounters a hardware fault at this time and must be isolated. If you run STOP SERVER or FORCE STOP SERVER to isolate the node, the command will fail and an error message will be displayed, indicating that the remaining replicas are not the majority. In this case, you can only use the ISOLATE SERVER command to switch the business traffic away from the faulty node. After the faulty node is isolated, You must not perform any invasive action that will cause the observer process to stop.

Limitations of the ISOLATE SERVER command are as follows:

When nodes in multiple zones are isolated, the ISOLATE SERVER command will check the primary zone of each tenant in compliance with the principle that you cannot isolate all nodes in all zones in the region where the primary zone of a tenant is located. This prevents the read and write services of the tenant from being switched over to another region and avoids a sharp increase in the response time when business requests are routed across regions.

You can query the DBA_OB_TENANTS view for the primary zone of the tenant, and verify whether the preceding restrictions are violated based on the information about isolated zones and nodes queried in the DBA_OB_ZONES and DBA_OB_SERVERS views.

Warning

The ISOLATE SERVER command applies only when another node except for the isolated node becomes abnormal. In this case, you can isolate the node only for fault isolation. You must not perform any invasive action that will cause the observer process to stop. Otherwise, the replicas on the remaining nodes are not the majority in the Paxos group. As a result, no leader exists in the tenant and the continuity of services is affected.

Summary

The STOP SERVER command is highly recommended in most scenarios. After you run this command, the fault is isolated and you can securely perform all emergency actions and O&M actions on the isolated node. If the STOP SERVER command is inapplicable, you can choose one of the other two isolation commands. However, after any of the two isolation commands is executed, you cannot securely perform invasive actions.

The FORCE STOP SERVER command applies when a latency exists in the logs on other nodes except for the isolated node. The ISOLATE SERVER command applies when another node except for the isolated node is abnormal. In this case, you can isolate the node only for fault isolation. You must not perform any invasive action that will cause the observer process to stop. Otherwise, the replicas on the remaining nodes are not the majority in the Paxos group and the tenant has no leader, which will affect the continuity of database services.

References

For more node-related O&M operations, see the following topics: