This topic describes how to troubleshoot and restore OceanBase Migration Service (OMS) when a server fails.
Symptom
After you choose OPS & Monitoring > Servers in the OMS console, the page that appears shows a server in the Down state. Alternatively, the error message "no enough host resource for XXX" is returned while a data migration or synchronization project is running.
Possible causes
The config.yaml configuration file of OMS is incorrect, or the deployment is not completed. 90% of failures occur because of incomplete deployment of OMS.
A control component of OMS exits unexpectedly.
The MetaDB on which the OMS control components depend cannot be connected or is unavailable.
Troubleshooting procedure
Notice:
Before troubleshooting, make sure that:
You have carefully checked the content of the config.yaml configuration file, especially the cm_url and cm_nodes parameters.
You have initialized the container by executing bash /root/docker_init.sh in it.
You have logged on to the databases specified by the drc_cm_db and drc_cm_heartbeat_db parameters in the config.yaml configuration file.
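As a rough aid for the first prerequisite, the following shell sketch reports which of the parameters named in this topic do not appear in a configuration file. It assumes that each parameter is written as a top-level key at the start of a line, which may not match your config.yaml, and the function name check_oms_config is hypothetical. It checks only that the keys are present, not that their values are correct.

```shell
# Report which of the parameters named in this topic are missing from a file.
# Assumption: each parameter appears as a top-level "name:" key at the start
# of a line. Returns 0 if all keys are present and 1 otherwise.
check_oms_config() {
  config_file="$1"
  missing=0
  for key in cm_url cm_nodes drc_cm_db drc_cm_heartbeat_db; do
    if ! grep -Eq "^${key}:" "$config_file"; then
      echo "missing: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, run check_oms_config /path/to/config.yaml, replacing the path with the location of your own file.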
Log on to the OMS console. In the left-side navigation pane, choose OPS & Monitoring > Servers to check the server status.
If the status is Online, check the resource usage of the server and stop projects that have not been used for a long time to release the resources on the server.
If the status is Down, proceed with the following steps to continue the troubleshooting procedure.
Query the statuses of the OMS components in the container.
Log on to the OMS container and run the supervisorctl status command to query the component statuses. If an OMS component is not in the RUNNING state, run the supervisorctl restart XXX command to restart the component. If all OMS components are in the RUNNING state, proceed with the following steps to continue the troubleshooting procedure.
    # Query the statuses of OMS components.
    supervisorctl status
    # Restart an OMS component. Replace XXX with the component name, such as oms_console, oms_drc_cm, oms_drc_supervisor, or nginx.
    supervisorctl restart XXX
Check the IP address of the server.
Log on to the OMS container and run the hostname -i command to verify whether the IP address is consistent with the actual IP address of the server. If not, run env | grep OMS_HOST_IP to check whether the OMS_HOST_IP parameter was specified when you started the OMS container. If the parameter is specified, proceed with the following steps to continue the troubleshooting procedure.
Notice:
If the multi-node deployment mode is used, make sure that the value of the OMS_HOST_IP parameter that you specify when you start each OMS container is the actual IP address of the current server. Do not use the same OMS_HOST_IP value for all OMS containers.
    # Query the IP address of the server.
    hostname -i
    # Query the OMS_HOST_IP environment variable.
    env | grep OMS_HOST_IP
Log on to the database specified by the drc_cm_db parameter in the config.yaml configuration file, and query the host table for the server information.
    -- Query the IP address and status of the OMS server.
    SELECT ip,host_status FROM host;
    -- Example
    MySQL [drc_cm_db]> SELECT ip,host_status FROM host;
    +---------------+-------------+
    | ip            | host_status |
    +---------------+-------------+
    | 100.XX.XX.107 | ONLINE      |
    +---------------+-------------+
Check whether the IP address in the table is the same as the actual IP address of the server and whether its status is ONLINE. If not, check whether the cm_nodes parameter in the config.yaml configuration file is correctly configured. If the parameter is correctly configured, proceed with the following steps to continue the troubleshooting procedure.
Log on to the database specified by the drc_cm_heartbeat_db parameter in the config.yaml configuration file, and query the heart_beat table for the heartbeat information of the server.
    -- Query the heartbeat database for the heartbeat information of the server.
    SELECT task_type,host_ip,gmt_created,gmt_modified FROM heart_beat WHERE task_name='supervisor';
    -- Example
    MySQL [drc_cm_heartbeat_db]> SELECT task_type,host_ip,gmt_created,gmt_modified FROM heart_beat WHERE task_name='supervisor';
    +------------+---------------+---------------------+---------------------+
    | task_type  | host_ip       | gmt_created         | gmt_modified        |
    +------------+---------------+---------------------+---------------------+
    | supervisor | 100.XX.XX.107 | 2022-04-11 17:23:43 | 2022-04-28 21:39:51 |
    +------------+---------------+---------------------+---------------------+
Check whether the IP address in the table is the same as the actual IP address of the server and whether its gmt_modified value is up to date with the current time. If not, check whether the actual IP address of the current server was specified for the -e OMS_HOST_IP parameter when you started the current OMS container.
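The component and IP address checks in this procedure can be partly scripted. The following is a minimal sketch, not an official OMS tool: the function names list_not_running and has_ip are hypothetical, and the first function assumes that supervisorctl status prints the component name in the first column and the state in the second.

```shell
# Print the name of every component that is not in the RUNNING state.
# Reads `supervisorctl status` output on stdin.
# Assumption: column 1 is the component name and column 2 is its state.
list_not_running() {
  awk 'NF >= 2 && $2 != "RUNNING" { print $1 }'
}

# Succeed only if the expected IP address is among the addresses on stdin.
# `hostname -i` may print several space-separated addresses, so each
# address is placed on its own line and compared as a whole line.
has_ip() {
  expected="$1"
  tr -s '[:space:]' '\n' | grep -Fxq "$expected"
}
```

Inside the OMS container, supervisorctl status | list_not_running prints the components to restart, and hostname -i | has_ip "$OMS_HOST_IP" succeeds only when the container address matches the configured value.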
If the server remains down, submit the preceding query results to the OMS service engineers for troubleshooting.
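To make the command output easier to hand over, it can be gathered into a single file. The sketch below is one possible approach, assuming it runs inside the OMS container; the function names and the report layout are hypothetical. It covers only the shell commands from this topic, so the results of the two SQL queries still need to be added separately.

```shell
# Run one command and append a labeled section with its output to the report.
# A failing or missing command is recorded instead of stopping the collection.
run_section() {
  report="$1"; shift
  {
    echo "== $* =="
    "$@" 2>&1 || echo "(command returned a non-zero status)"
    echo
  } >> "$report"
}

# Collect the output of the shell commands used in this topic into one file.
collect_oms_diagnostics() {
  report="$1"
  : > "$report"   # start with an empty report file
  run_section "$report" supervisorctl status
  run_section "$report" hostname -i
  run_section "$report" sh -c 'env | grep OMS_HOST_IP'
}
```

For example, collect_oms_diagnostics /tmp/oms_report.txt writes the three sections to /tmp/oms_report.txt, a file name chosen here only for illustration.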