How do I troubleshoot the OMS server down issue?

2024-12-13 06:57:09  Updated

This topic describes how to troubleshoot and restore OceanBase Migration Service (OMS) when a server fails.

Symptom

The state of a server is Down on the page appears after you choose OPS & Monitoring > Servers in the OMS console, or no enough host resource for XXX is returned when a data migration or synchronization task is being executed

Possible causes

  1. The config.yaml configuration file of OMS is incorrect, or the deployment is not completed. 90% of failures occur because of incomplete deployment of OMS.

  2. A control component of OMS exits unexpectedly.

  3. The MetaDB on which the OMS control components depend cannot be connected or is unavailable.

Troubleshooting procedure

Notice:

Before troubleshooting, make sure that:

  • You have carefully checked the content of the config.yaml configuration file, especially the cm_url and cm_nodes parameters.

  • You have initialized the container by executing bash /root/docker_init.sh in it.

  • You have logged on to the databases specified by the drc_cm_db and drc_cm_heartbeat_db parameters in the config.yaml configuration file.

  1. Log on to the OMS console. In the left-side navigation pane, choose OPS & Monitoring > Servers to check the server status.

    If the status is Online, check the resource usage of the server and stop tasks that have not been used for a long time to release the resources on the server.

    If the status is Down, proceed with the following steps to continue the troubleshooting procedure.

  2. Query the statuses of the OMS components in the container.

    Log on to the OMS container and run the supervisorctl status command to query the component statuses. If an OMS component is not in the RUNNING state, run the supervisorctl restart XXX command to restart the component. If all OMS components are in the RUNNING state, proceed with the following steps to continue the troubleshooting procedure.

    //Query the statuses of OMS components.
    supervisorctl status
    
    //Restart an OMS component. You replace XXX with the component name, such as oms_console, oms_drc_cm, oms_drc_supervisor, or nginx.
    supervisorctl restart XXX
    
  3. Check the IP address of the server.

    Log on to the OMS container and run the hostname -i command to verify whether the IP address is consistent with the actual IP address of the server. If not, run env | grep OMS_HOST_IP to check whether the OMS_HOST_IP parameter is specified when you start the OMS container. If the parameter is specified, proceed with the following steps to continue the troubleshooting procedure.

    Notice:

    If the multi-node deployment mode is used, make sure that the value of the OMS_HOST_IP parameter specified is the actual IP address of the current server when you start each OMS container. Do not use the same OMS_HOST_IP value for all OMS containers.

    //Query the IP address of the server.
    hostname -i
    
    //Query the OMS_HOST_IP environment variable.
    env | grep OMS_HOST_IP
    
  4. Log on to the database specified by the drc_cm_db parameter in the config.yaml configuration file, and query the host table for the server information.

    //Query the IP address and status of the OMS server.
    SELECT ip,host_status FROM host;
    
    //Example
    MySQL [drc_cm_db]> SELECT ip,host_status FROM  host;
    +---------------+-------------+
    | ip            | host_status |
    +---------------+-------------+
    | 100.XX.XX.107 | ONLINE      |
    +---------------+-------------+
    

    Check whether the IP address in the table is the same as the actual IP address of the server, and its status is ONLINE. If not, check whether the cm_nodes parameter in the config.yaml is correctly configured. If the parameter is correctly configured, proceed with the following steps to continue the troubleshooting procedure.

  5. Log on to the database specified by the drc_cm_heartbeat_db parameter in the config.yaml configuration file, and query the heart_beat table for the heartbeat information of the server.

    //Query the heartbeat database for the heartbeat information of the server.
    SELECT task_type,host_ip,gmt_created,gmt_modified FROM heart_beat where task_name='supervisor';
    
    //Example
    MySQL [drc_cm_heartbeat_db]> SELECT task_type,host_ip,gmt_created,gmt_modified FROM heart_beat where task_name='supervisor';
    +------------+---------------+---------------------+---------------------+
    | task_type  | host_ip       | gmt_created         | gmt_modified        |
    +------------+---------------+---------------------+---------------------+
    | supervisor | 100.XX.XX.107 | 2022-04-11 17:23:43 | 2022-04-28 21:39:51 |
    +------------+---------------+---------------------+---------------------+
    

    Check whether the IP address in the table is the same as the actual IP address of the server, and its gmt_modified value is the current time. If not, check whether the actual IP address of the current server is specified for the -e OMS_HOST_IP parameter when you start the current OMS container.

If the server remains down, submit the preceding query results to the OMS service engineers for troubleshooting.

Contact Us