host_unavailable

2024-03-22 09:34:34  Updated

Description

OceanBase Cloud Platform (OCP) sends keep-alive packets to OCP-Agent on a server every 60 seconds. If OCP-Agent does not reply within 1 second, this alert is triggered. While the host is unavailable, OCP cannot collect performance information from the host or control the host.

Principle

OCP-Server schedules a timed task that runs a command on the host every minute. If the command fails, OCP-Server checks how much time has passed since the host was last available. When this duration exceeds the threshold, this alert is triggered. The default threshold is one minute.

Note

The threshold is the value you set for the OCP system parameter ocp.host.check.unavailable-time-threshold. This parameter specifies the maximum allowed offline time of OCP-Agent, in milliseconds. When the offline duration of the host exceeds the threshold, this alert is triggered.
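The threshold comparison described above can be sketched as follows. This is a minimal illustration, not OCP's implementation: the function name host_unavailable and the constant DEFAULT_THRESHOLD_MS are hypothetical, and only the 60,000 ms default and the "exceeds the threshold" rule come from this page.

```python
from datetime import datetime, timedelta

# Default value of ocp.host.check.unavailable-time-threshold, in milliseconds.
DEFAULT_THRESHOLD_MS = 60_000


def host_unavailable(last_available: datetime, now: datetime,
                     threshold_ms: int = DEFAULT_THRESHOLD_MS) -> bool:
    """Return True when the offline duration exceeds the threshold.

    Hypothetical helper: it mirrors the rule stated in the text,
    not the actual OCP-Server code.
    """
    offline_ms = (now - last_available) / timedelta(milliseconds=1)
    return offline_ms > threshold_ms


last_ok = datetime(2024, 3, 22, 9, 30, 0)
print(host_unavailable(last_ok, last_ok + timedelta(seconds=45)))  # False
print(host_unavailable(last_ok, last_ok + timedelta(seconds=90)))  # True
```

The comparison is strict, so a host that has been offline for exactly the threshold does not yet trigger the alert in this sketch.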

Alert rule

Metric  Default threshold (unit: ms)  Duration   Detection cycle  Time before clearance
None    60,000                        0 seconds  60 seconds       5 minutes

Alert information

Trigger method     Alert level  Scope
Timed task of OCP  Stopped      Host

Alert templates

  • Overview: ${alarm_target} ${alarm_name}

  • Details: Host: ${host}, Alert: ${alarm_name}. Please check if the ${host} is reachable, or whether the OCP Agent process is normal.

  • Overview example: service=OCP:svr_ip=xxx.xxx.xxx.xxx. The host is unavailable.

  • Details example: Host: xxx.xxx.xxx.xxx, Alert: The host is unavailable. Please check if the xxx.xxx.xxx.xxx is reachable, or whether the OCP Agent process is normal.

${alarm_target} follows the ob_cluster=xxxxxxx:svr_ip=xxxxxx format. svr_ip indicates the IP address of the OBServer of the cluster that generated the alert.
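To see how the placeholders expand, the templates can be filled in with Python's standard string.Template class, which uses the same ${name} syntax. This is only an illustration: OCP's own rendering engine is not described on this page, and the sample values below are placeholders rather than real alert data.

```python
from string import Template

# Template text copied from the "Alert templates" list.
OVERVIEW = Template("${alarm_target} ${alarm_name}")
DETAILS = Template(
    "Host: ${host}, Alert: ${alarm_name}. Please check if the ${host} is "
    "reachable, or whether the OCP Agent process is normal."
)

# Placeholder values, assumed for illustration only.
values = {
    "alarm_target": "ob_cluster=xxxxxxx:svr_ip=xxx.xxx.xxx.xxx",
    "alarm_name": "The host is unavailable",
    "host": "xxx.xxx.xxx.xxx",
}

print(OVERVIEW.substitute(values))
print(DETAILS.substitute(values))
```

Note that ${host} appears twice in the details template, so both occurrences receive the same value.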

Impact on the system

OCP cannot send remote commands to the target host. Therefore, the O&M features for the host are unavailable.

Possible causes

This problem is commonly found in the following scenarios:

  • The target host is down due to an exception.

  • Communication between OCP and the host failed.

  • OCP-Agent quits unexpectedly.

Suggested solutions

  1. Check whether the host is down.

    Try to log on to the target host or run the ping xxx.xxx.xxx.xxx command on another host in the same CIDR block. xxx.xxx.xxx.xxx indicates the IP address of the target host.

    • If the logon fails or the ping command receives no replies from the target host, it is very likely that the target host is down.

      Contact the O&M engineers to solve the issue.

    • If the logon succeeds and the host is reachable, the host is normal. It is likely that the alert was triggered by another issue.

  2. Check whether the network connection between OCP and the host is disconnected. If yes, contact a network engineer to resolve the issue.

    Ping the faulty host from the OCP host, and then ping the OCP host from the faulty host.

    • If they are mutually reachable, the network connection is normal. It is likely that the alert was triggered by another issue.

    • If they are not mutually reachable, the network connection is faulty. We recommend that you contact a network engineer to resolve the issue.

      If no network engineer is available, you can troubleshoot and resolve the issue on your own. For more information, see Network troubleshooting.

  3. Check whether the OCP-Agent-related processes are running.

    Go to the Hosts list on the Host Overview page of the OCP console. Select the faulty host to go to the page of the host, and then check for any abnormal process on the OCP Agent tab.

    • If a process is abnormal, click Restart next to this process in the Processes list. You can also run the OCP-Agent O&M script to restart the OCP-Agent and restore the OCP-Agent service. For more information, see OCP-Agent O&M script.

    • If no process is abnormal but the alert persists, go to the next step.

  4. Contact OCP Technical Support to locate the issue.
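The reachability checks in steps 1 and 2 can be scripted. The sketch below is a rough aid under stated assumptions, not an OCP tool: ping and triage are hypothetical helper names, the script relies on the operating system's ping command (-c on Linux and macOS, -n on Windows), and the returned messages only paraphrase the steps above.

```python
import platform
import subprocess


def ping(ip: str, count: int = 3, timeout_s: int = 10) -> bool:
    """Return True if the system ping command exits successfully for ip."""
    # The packet-count flag is -c on Linux/macOS and -n on Windows.
    flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", flag, str(count), ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Treat a hung or missing ping command as "not reachable".
        return False
    return result.returncode == 0


def triage(peer_to_host: bool, ocp_to_host: bool, host_to_ocp: bool) -> str:
    """Map the ping results to the step that deserves attention.

    peer_to_host: ping from another host in the same CIDR block (step 1).
    ocp_to_host / host_to_ocp: pings in both directions (step 2).
    """
    if not peer_to_host:
        return "step 1: the host is likely down; contact the O&M engineers"
    if not (ocp_to_host and host_to_ocp):
        return "step 2: the network is faulty; contact a network engineer"
    return "step 3: check the OCP-Agent processes in the OCP console"
```

Each ping must be run from the machine named in the corresponding step, for example calling ping with the target host's IP address on the OCP host to obtain ocp_to_host, and the three results are then passed to triage.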
