Restart a node

2025-12-15 07:29:00  Updated

Restarting a node is a common O&M operation. It suits brief server maintenance, or applying a modified system configuration item that takes effect only after a restart. If the node stays offline longer than the time specified by the server_permanent_offline_time parameter, it is considered permanently offline. For long-term server maintenance, follow the node replacement procedure instead. For more information, see Replace a node.

Note

The cluster-level parameter server_permanent_offline_time specifies how long heartbeats can be interrupted before a node is considered permanently offline. The system then automatically supplements the data replicas that resided on the permanently offline node. The default value is 3600s. For more information about this parameter, see server_permanent_offline_time.
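As a rough planning aid, the following shell sketch compares a planned downtime with the configured threshold. The helper names to_seconds and offline_risk are hypothetical, and the sketch handles only values with an h, m, or s suffix (or a bare number of seconds). It assumes that you have already read the current value, for example by running SHOW PARAMETERS LIKE 'server_permanent_offline_time'; in the sys tenant.

```shell
# Convert a duration such as 3600s, 60m, or 1h to seconds.
# Hypothetical helper: only the h, m, and s suffixes are handled here.
to_seconds() {
  case $1 in
    *ms|*us) echo "unsupported unit: $1" >&2; return 1 ;;
    *h) echo $(( ${1%h} * 3600 )) ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *s) echo "${1%s}" ;;
    *)  echo "$1" ;;
  esac
}

# Print "safe" if the planned downtime ($1) is shorter than the
# server_permanent_offline_time value ($2), otherwise print "risky".
offline_risk() {
  _planned=$(to_seconds "$1") || return 1
  _limit=$(to_seconds "$2") || return 1
  if [ "$_planned" -lt "$_limit" ]; then
    echo "safe"
  else
    echo "risky"
  fi
}

offline_risk 30m 3600s   # prints: safe
offline_risk 2h 3600s    # prints: risky
```

A "risky" result suggests either shortening the maintenance window or following the node replacement procedure instead.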

Background information

As a distributed database, OceanBase Database is typically deployed with multiple replicas (for example, three replicas in a deployment with three IDCs in the same region, or five replicas in a deployment with five IDCs across three regions). When a transaction is committed, the Paxos protocol is used to achieve majority consensus among the replicas to maintain data consistency. If a minority of replicas fail, the service level agreement (SLA) of RPO=0 can still be met.
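To make the majority arithmetic concrete, the following shell sketch computes the quorum size for a given replica count and the number of replicas that can fail while a majority remains. The function names are hypothetical and serve only as an illustration.

```shell
# Majority (quorum) size for n voting replicas: floor(n / 2) + 1.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# Number of replicas that can be unavailable while a majority still remains.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

majority 3             # prints: 2
tolerated_failures 3   # prints: 1
majority 5             # prints: 3
tolerated_failures 5   # prints: 2
```

With three replicas, restarting one node still leaves two replicas, which is exactly the majority. A second simultaneous failure would break it.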

The STOP SERVER command can restart a server without data loss in a multi-replica architecture. The STOP SERVER command performs the following operations:

  1. Switch all leaders away from the server to be restarted and ensure that a majority of replicas remains on other servers.

  2. Mark the server to be restarted as stopped in the Root Service (the node status remains ACTIVE and the stop_time field becomes greater than 0). Once clients recognize this status, they no longer route business requests to the server.

After the server is successfully stopped, restarting it does not trigger leader elections or client errors, so the restart is transparent to business traffic. If the STOP SERVER command fails, halt the restart and investigate the cause. Possible causes include insufficient replicas, redo log latency, and too few voting members remaining to form a majority.

Procedure

The major steps to restart a node are: stop services, perform a minor compaction, shut down the process, start the process, and start services.

This topic provides the procedure to restart a node in a cluster. If you want to restart multiple nodes, you can perform the same operation multiple times.
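The SQL statements used in the steps below follow a fixed order, so they can be generated for review before you run anything. The following shell sketch only prints the statements and does not execute them. The function name restart_statements and the sample addresses are hypothetical.

```shell
# Print the ALTER SYSTEM statements for restarting one node, in order.
# $1 = svr_ip, $2 = svr_port (RPC port; defaults to 2882 as in this topic).
restart_statements() {
  _addr="$1:${2:-2882}"
  echo "ALTER SYSTEM STOP SERVER '$_addr';"
  echo "ALTER SYSTEM MINOR FREEZE SERVER = ('$_addr');"
  echo "-- stop and then start the observer process on the host here"
  echo "ALTER SYSTEM START SERVER '$_addr';"
}

# For several nodes, repeat the whole sequence one node at a time.
# The addresses below are placeholders.
for node in 172.xx.xx.1 172.xx.xx.2; do
  restart_statements "$node"
done
```

Printing the plan first makes it easier to confirm that each node is isolated, restarted, and returned to service before the next node is touched.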

  1. Log in to the sys tenant of the cluster as the root user.

    Note that you must specify the corresponding parameters in the following sample code based on your database connection information.

    obclient -h10.xx.xx.xx -P2883 -uroot@sys#obdemo -p***** -A
    

    For more information about how to connect to a database, see Overview (MySQL mode) and Overview (Oracle mode).

  2. Run the following command to isolate the node.

    Restarting a node can interrupt services. For example, if a cluster has only one or two nodes, or the data of a tenant is distributed on only two nodes, the system may be unable to provide services while a node restarts. The Stop Server operation is designed to preserve service continuity: before it takes effect, the system performs a safety check to ensure that services are not interrupted while the node restarts. After the node is successfully isolated, it no longer provides services. If the Stop Server operation fails, resolve the issues reported in the error message, and then retry the operation or isolate the node manually. If you can tolerate a service interruption, you can skip this step.

    obclient [(none)]> ALTER SYSTEM STOP SERVER 'svr_ip:svr_port';
    

    The parameters are described as follows:

    • svr_ip: the IP address of the node to be stopped.

    • svr_port: the RPC port of the node to be stopped. The default value is 2882.

    Here is an example:

    obclient [(none)]> ALTER SYSTEM STOP SERVER '172.xx.xx.xx:2882';
    

    After the command succeeds, query the STATUS and STOP_TIME columns of the oceanbase.DBA_OB_SERVERS view for the server that you want to restart. If the value of the STATUS column remains ACTIVE and the value of the STOP_TIME column changes from NULL to the time when the server was stopped, the server has been isolated.

    For more information about how to query the oceanbase.DBA_OB_SERVERS view, see View a node.

  3. Run the following command to perform a minor compaction on the node to shorten the time required for redo log replay after the node is restarted and accelerate the restart process.

    obclient [(none)]> ALTER SYSTEM MINOR FREEZE SERVER = ('svr_ip:svr_port');
    

    The parameters are described as follows:

    • svr_ip: the IP address of the node to be restarted.

    • svr_port: the RPC port of the node to be restarted. The default value is 2882.

    Here is an example:

    obclient [(none)]> ALTER SYSTEM MINOR FREEZE SERVER = ('172.xx.xx.xx:2882');
    

    After you initiate the minor compaction, wait until its status changes to Completed before you proceed. For more information, see View the minor compaction status.

  4. Stop the observer process.

    1. Log in to the server where the observer process resides as the admin user.

    2. Use the command-line tool to navigate to the /home/admin/oceanbase directory.

      [admin@xxx /]$ cd /home/admin/oceanbase
      
    3. Run the following command to view and obtain the process ID of the node.

      [admin@xxx oceanbase]$ ps -ef | grep observer | grep -v grep
      admin    103364      1 99  2022 ?        51-17:24:41 /home/admin/oceanbase/bin/observer
      

      In this example, 103364 is the process ID of the node.

    4. Stop the observer process.

      The sample command is as follows:

      [admin@xxx oceanbase]$ kill -9 pid
      

      Here, pid is the observer process ID of the node to be stopped.

      Here is an example:

      [admin@xxx oceanbase]$ kill -9 103364
      

      Notice

      You can stop only one observer process in a deployment directory. If you want to stop observer processes on multiple nodes, you need to log in to each server in sequence.

    5. Run the following command to check whether the observer process has stopped.

      [admin@xxx oceanbase]$ ps aux | grep observer | grep -v grep
      

      If no information is returned after the command execution, the observer process has been stopped successfully.

  5. (Optional) If you want to perform maintenance on the server, perform the maintenance in this step.

  6. Start the observer process.

    1. Log in to the server where the observer process resides as the admin user.

    2. Start the observer process.

      [admin@xxx oceanbase]$ cd /home/admin/oceanbase && ./bin/observer
      

      Notice

      You can start only one observer process in a deployment directory. If you want to start observer processes on multiple nodes, you need to log in to each server in sequence.

      After the execution is successful, query the START_SERVICE_TIME column of the oceanbase.DBA_OB_SERVERS view. If this column contains a value other than NULL, the observer process has been started successfully.

  7. Run the following command to start the services of the node.

    obclient [(none)]> ALTER SYSTEM START SERVER 'svr_ip:svr_port';
    

    The parameters are described as follows:

    • svr_ip: the IP address of the node to be started.

    • svr_port: the RPC port of the node to be started. The default value is 2882.

    Here is an example:

    obclient [(none)]> ALTER SYSTEM START SERVER '172.xx.xx.xx:2882';
    

    After the execution is successful, query the STOP_TIME column of the oceanbase.DBA_OB_SERVERS view. If this column contains NULL, the node has started services and is ready to provide services.

    For more information about how to query the oceanbase.DBA_OB_SERVERS view, see View a node.
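The checks in steps 2, 6, and 7 all read the same view. As an illustration, the following shell sketch classifies a node from its STATUS, START_SERVICE_TIME, and STOP_TIME values. The function names and state labels are hypothetical, and the query in the comment assumes that the view exposes SVR_IP and SVR_PORT columns. Verify this against your version.

```shell
# The three values could come from a query such as the following
# (SVR_IP and SVR_PORT are assumed column names):
#   SELECT STATUS, START_SERVICE_TIME, STOP_TIME
#   FROM oceanbase.DBA_OB_SERVERS
#   WHERE SVR_IP = '172.xx.xx.xx' AND SVR_PORT = 2882;

# True if the value is empty or the literal string NULL.
is_null() {
  [ -z "$1" ] || [ "$1" = "NULL" ]
}

# $1 = STATUS, $2 = START_SERVICE_TIME, $3 = STOP_TIME
node_state() {
  if [ "$1" != "ACTIVE" ]; then
    echo "not-active"        # any status other than ACTIVE
  elif ! is_null "$3"; then
    echo "isolated"          # STOP_TIME is set: STOP SERVER is in effect
  elif is_null "$2"; then
    echo "not-in-service"    # START_SERVICE_TIME is still NULL
  else
    echo "serving"           # STOP_TIME is NULL and the service has started
  fi
}

node_state ACTIVE "2025-01-01 00:00:00" "2025-01-02 00:00:00"   # prints: isolated
node_state ACTIVE "2025-01-01 00:00:00" NULL                    # prints: serving
```

After step 2 the expected result is "isolated", and after step 7 it is "serving".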
