This topic describes how to troubleshoot host issues in an OceanBase cluster.
Scenarios
OceanBase Database uses a multi-replica architecture that keeps services available when some nodes fail. For example, with three replicas, the healthy OBServer nodes still constitute the majority of replicas when any one node fails, and with five replicas, they still do so when any two nodes fail. However, you must still repair a failed host promptly. If more nodes fail before it is fixed, the remaining replicas may no longer constitute a majority.
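The majority arithmetic behind the three-replica and five-replica examples can be written out as a minimal sketch. The function names are illustrative and not part of OceanBase or OCP. The sketch assumes that a majority means more than half of the replicas.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that is more than half of the total."""
    if replicas < 1:
        raise ValueError("replicas must be a positive integer")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Number of replicas that can fail while the rest still form a majority."""
    return replicas - majority(replicas)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"{n} replicas: majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

With three replicas, one failure is tolerated. With five replicas, two failures are tolerated. This matches the cases described above.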
The method that you can use to fix a failed host varies based on the following scenarios:
- A buffer host is available.
- No buffer host is available.
  - You need to troubleshoot only CPU or memory issues. Data rebuilding is not required.
  - You need to troubleshoot disk issues. Data rebuilding is required.
This topic describes the troubleshooting operations for each of these scenarios.
Prerequisites
A host has failed in an OceanBase cluster. When an OBServer node becomes unavailable and the host goes offline, it is likely that the host is down due to a fault.
Procedure
A buffer host is available
Add the buffer host to OceanBase Cloud Platform (OCP).
In the OBServers list of the Cluster page, find the failed OBServer node and click Replace in the Actions column.
Notice
- The replacement first adds the new OBServer node and then removes the failed OBServer node. The failed OBServer node does not have to be in the Online state.
- When you replace an OBServer node, you are prompted to determine whether to skip the host O&M operations. In this step, select Yes.
- Before you replace an OBServer node, make sure that the clock source of all the hosts that you want to add is the same as that of the OceanBase cluster, and the time difference does not exceed 10 ms. Use the following method to check the time difference: On the host whose RootService role is Leader in the OceanBase cluster, run the clockdiff command by using the IP addresses of the hosts that you want to add.
- The resource specifications of the new OBServer node must be greater than or equal to those of the OBServer node to be replaced. This ensures that all resource units and data can be migrated. If you have special requirements, perform a separate assessment.
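The two numeric checks in the notice above, the 10 ms clock limit and the resource specification comparison, can be expressed as a small pre-check. This is a sketch under stated assumptions. `HostSpec`, its fields, and the function names are hypothetical. The time difference is assumed to have been measured already, for example with clockdiff, and converted to milliseconds.

```python
from dataclasses import dataclass

# Limit stated in the notice: the time difference must not exceed 10 ms.
MAX_CLOCK_DIFF_MS = 10


@dataclass(frozen=True)
class HostSpec:
    """Hypothetical summary of a host's resources (not an OCP data type)."""
    cpu_cores: int
    memory_gb: int
    disk_gb: int


def clock_ok(delta_ms: float) -> bool:
    """True if the measured time difference is within the 10 ms limit."""
    return abs(delta_ms) <= MAX_CLOCK_DIFF_MS


def spec_ok(new: HostSpec, old: HostSpec) -> bool:
    """True if every resource of the new host is >= that of the old host."""
    return (new.cpu_cores >= old.cpu_cores
            and new.memory_gb >= old.memory_gb
            and new.disk_gb >= old.disk_gb)


if __name__ == "__main__":
    failed = HostSpec(cpu_cores=32, memory_gb=128, disk_gb=2000)
    buffer_host = HostSpec(cpu_cores=32, memory_gb=256, disk_gb=2000)
    print("clock ok:", clock_ok(3.0))
    print("spec ok:", spec_ok(buffer_host, failed))
```

A host that fails the specification check is not necessarily unusable. As the notice says, special cases call for a separate assessment.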
If the task of removing an OBServer node stays in progress for a long time, it is possible that data migration is affected by automatic data balancing. You can modify the resource_soft_limit parameter to disable unit balancing. Here is a sample command. For more information about the parameter, see resource_soft_limit.

# Query the resource_soft_limit parameter and record the parameter value for rollback later.
show parameters like "%resource_soft_limit%";
# Disable unit balancing.
alter system set resource_soft_limit=100;

If data replication is slow and the network bandwidth is sufficient, you can set the parameters that specify the data migration concurrency to larger values.

-- In OceanBase Database V4.0 and earlier, you can use the following parameters to specify the replica migration concurrency:
alter system set data_copy_concurrency = 100;
alter system set server_data_copy_out_concurrency = 40;
alter system set server_data_copy_in_concurrency = 40;
-- In versions later than OceanBase Database V4.0, you can use the following parameters to specify the replica migration concurrency for a tenant:
alter system set ha_high_thread_score = 80;
alter system set ha_mid_thread_score = 80;

For more information about the parameters, see the following topics:
OceanBase Database V4.0 and earlier: data_copy_concurrency, server_data_copy_out_concurrency, and server_data_copy_in_concurrency
Versions later than OceanBase Database V4.0: ha_high_thread_score and ha_mid_thread_score
Notice
To avoid affecting cluster performance, reset the parameters to their original values immediately after the task succeeds.
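Resetting the parameters is easier if the original values were written down before any change was made. The following sketch turns such a record into rollback statements. It uses the same `alter system set` form as the sample commands in this topic. The helper name and the recorded values are hypothetical. Use the values that your own `show parameters` queries returned.

```python
import re

# Parameter names are expected to look like plain identifiers,
# for example resource_soft_limit.
_NAME_PATTERN = re.compile(r"[a-z_][a-z0-9_]*")


def rollback_statements(recorded: dict[str, str]) -> list[str]:
    """Build one 'alter system set' statement per recorded parameter value."""
    statements = []
    for name, value in recorded.items():
        if not _NAME_PATTERN.fullmatch(name):
            raise ValueError(f"unexpected parameter name: {name!r}")
        statements.append(f"alter system set {name} = {value};")
    return statements


if __name__ == "__main__":
    # Hypothetical values noted before the change, not documented defaults.
    recorded = {
        "resource_soft_limit": "50",
        "data_copy_concurrency": "20",
    }
    for statement in rollback_statements(recorded):
        print(statement)
```

The sketch only generates text. Review the statements and run them in the same way as the original sample commands.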
In the Hosts list, remove the offline failed host.
No buffer host is available, and you need to troubleshoot only CPU or memory issues without data rebuilding
In the OBServers list of the Cluster page, find the failed OBServer node.
Click More in the Actions column and select Stop Process. The OBServer node stops working. This operation allows OCP to recognize that the OBServer node is stopped for maintenance.
Shut down the failed host for maintenance. After the maintenance is completed, start the host.
Check the host status. If the host is in the Offline state, reinstall OCP Agent.
After OCP Agent is reinstalled, the host should be in the Online state.
In the OBServers list of the Cluster page, find the repaired OBServer node and click Restart in the Actions column.
After the OBServer node is started, it should be in the Running state.
No buffer host is available and data rebuilding is required
In the OBServers list of the Cluster page, find the failed OBServer node and click Delete in the Actions column.
Notice
In the following scenarios, you cannot remove the failed OBServer node and must add a buffer host:
- Only one OBServer node exists in each zone, and the failed OBServer node contains data replicas or resource units.
- The remaining OBServer nodes cannot accommodate the data replicas or resource units of all tenants.
- The remaining healthy OBServer nodes cannot constitute the majority of replicas.
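The three conditions above can be combined into a single yes-or-no check. This is a rough sketch only. The function and argument names are hypothetical, and the inputs are assumed to come from your own inspection of the cluster. Any one condition being true means that the failed node cannot simply be removed.

```python
def must_add_buffer_host(single_node_zones_and_node_has_data: bool,
                         remaining_nodes_can_hold_all_tenants: bool,
                         healthy_replicas: int,
                         total_replicas: int) -> bool:
    """Return True if any of the three blocking conditions holds."""
    # A majority is more than half of all replicas.
    has_majority = healthy_replicas >= total_replicas // 2 + 1
    return (single_node_zones_and_node_has_data
            or not remaining_nodes_can_hold_all_tenants
            or not has_majority)


if __name__ == "__main__":
    # Hypothetical case: several nodes per zone, enough spare capacity,
    # and 2 of 3 replicas healthy.
    print(must_add_buffer_host(False, True, healthy_replicas=2, total_replicas=3))
```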
If the task of removing an OBServer node stays in progress for a long time, it is possible that data migration is affected by automatic data balancing. You can modify the resource_soft_limit parameter to disable unit balancing. Here is a sample command. For more information about the parameter, see resource_soft_limit.

# Query the resource_soft_limit parameter and record the parameter value for rollback later.
show parameters like "%resource_soft_limit%";
# Disable unit balancing.
alter system set resource_soft_limit=100;

If data replication is slow and the network bandwidth is sufficient, you can set the parameters that specify the data migration concurrency to larger values.

# Query the parameters that specify the data migration concurrency in the cluster and record the parameter values for rollback later.
show parameters like "%data_copy_concurrency%";
show parameters like "%server_data_copy_out_concurrency%";
show parameters like "%server_data_copy_in_concurrency%";
# Set the parameters to larger values.
alter system set data_copy_concurrency=100;
alter system set server_data_copy_out_concurrency=40;
alter system set server_data_copy_in_concurrency=40;

In the Hosts list, remove the offline failed host.
Shut down the failed host for maintenance. After the maintenance is completed, start the host.
Add the repaired host to OCP.
Add the OBServer node to the OceanBase cluster again.