Description
OceanBase Cloud Platform (OCP) sends a keep-alive packet to OCP-Agent on each host every 60 seconds. If OCP-Agent does not reply within 1 second, this alert is triggered. While the alert is active, OCP cannot collect performance information from the host or control the host.
Principle
OCP-Server runs a timed task every minute that executes a command on the host. If the command fails, OCP-Server calculates the time elapsed since the host was last available. When this duration exceeds the threshold, this alert is triggered. The default threshold is one minute.
Note
The threshold is the value you set for the OCP system parameter ocp.host.check.unavailable-time-threshold. This parameter specifies the maximum allowed offline time of OCP-Agent, in milliseconds. When the offline duration of the host exceeds the threshold, this alert is triggered.
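The check described above can be sketched as follows. This is a minimal illustration, not OCP-Server's actual code: the `HostState` record and the `should_alert` function are hypothetical names, and the sketch assumes that "exceeds" means strictly greater than the threshold.

```python
from dataclasses import dataclass

# Default of ocp.host.check.unavailable-time-threshold as stated in the text (milliseconds).
DEFAULT_THRESHOLD_MS = 60_000


@dataclass
class HostState:
    """Hypothetical record of what the timed task remembers about one host."""
    last_available_ms: int  # timestamp of the last successful check


def should_alert(state, now_ms, check_succeeded, threshold_ms=DEFAULT_THRESHOLD_MS):
    """Return True when the host has been unavailable for longer than the threshold."""
    if check_succeeded:
        # A successful check refreshes the last-available timestamp.
        state.last_available_ms = now_ms
        return False
    # The command failed: compare the offline duration with the threshold.
    return now_ms - state.last_available_ms > threshold_ms
```

With the default threshold and a 60-second detection cycle, a single failed check right at the 60,000 ms mark does not exceed the threshold in this sketch; the alert fires on the next failed cycle.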
Alert rule
| Metric | Default threshold (unit: ms) | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| None | 60,000 | 0 seconds | 60 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Timed task of OCP | Stopped | Host |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: Host: ${host}, Alert: ${alarm_name}. Please check if the ${host} is reachable, or whether the OCP Agent process is normal.
Overview example: service=OCP:svr_ip=xxx.xxx.xxx.xxx. The host is unavailable.
Details example: Host: xxx.xxx.xxx.xxx, Alert: The host is unavailable. Please check if the host-1 is reachable, or whether the OCP Agent process is normal.
${alarm_target} follows the ob_cluster=xxxxxxx:svr_ip=xxxxxx format. svr_ip indicates the IP address of the OBServer of the cluster that generated the alert.
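To see how the placeholders expand, the templates can be rendered with Python's `string.Template`, which happens to use the same `${name}` syntax. This is only an illustration: the engine OCP uses to render templates is not stated here, and `parse_alarm_target` and `render_alert` are hypothetical helpers. The cluster name and IP address are placeholder values.

```python
from string import Template

# Templates copied from the text above.
OVERVIEW = Template("${alarm_target} ${alarm_name}")
DETAILS = Template(
    "Host: ${host}, Alert: ${alarm_name}. Please check if the ${host} is reachable, "
    "or whether the OCP Agent process is normal."
)


def parse_alarm_target(target):
    """Split 'key=value:key=value' into a dict.

    Assumes values contain no ':' (so an IPv6 address would need different handling).
    """
    return dict(part.split("=", 1) for part in target.split(":"))


def render_alert(alarm_target, alarm_name, host):
    """Return the (overview, details) strings for one alert."""
    overview = OVERVIEW.substitute(alarm_target=alarm_target, alarm_name=alarm_name)
    details = DETAILS.substitute(host=host, alarm_name=alarm_name)
    return overview, details


if __name__ == "__main__":
    target = "ob_cluster=demo:svr_ip=192.0.2.10"  # placeholder values
    print(parse_alarm_target(target))
    for line in render_alert(target, "The host is unavailable", "192.0.2.10"):
        print(line)
```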
Impact on the system
OCP cannot send remote commands to the target host. Therefore, the O&M features for the host are unavailable.
Possible causes
This problem is commonly found in the following scenarios:
The target host is down due to an exception.
Communication between OCP and the host failed.
OCP-Agent quits unexpectedly.
Suggested solutions
Check whether the host is down.
Try to log on to the target host or run the ping xxx.xxx.xxx.xxx command on another host in the same CIDR block. xxx.xxx.xxx.xxx indicates the IP address of the target host.
If the logon fails or the ping command receives no reply, it is very likely that the target host is down. Contact the O&M engineers to solve the issue.
If the logon succeeds and the host is reachable, the host is normal. It is likely that the alert was triggered by another issue.
Check whether the network connection between OCP and the host is disconnected. If yes, contact a network engineer to resolve the issue.
Ping the faulty host from the OCP host, and then ping the OCP host from the faulty host.
If they are mutually reachable, the network connection is normal. It is likely that the alert was triggered by another issue.
If they are not mutually reachable, the network connection is faulty. We recommend that you contact a network engineer to resolve the issue.
If no network engineer is available, you can troubleshoot and resolve the issue on your own. For more information, see Network troubleshooting.
Check whether the OCP-Agent-related processes are running.
Go to the Hosts list on the Host Overview page of the OCP console. Select the faulty host to go to the page of the host, and then check for any abnormal process on the OCP Agent tab.
If a process is abnormal, click Restart next to this process in the Processes list. You can also run the OCP-Agent O&M script to restart the OCP-Agent and restore the OCP-Agent service. For more information, see OCP-Agent O&M script.
If no process is abnormal but the alert persists, go to the next step.
Contact OCP Technical Support to locate the issue.
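The order of the checks above can be summarized as a small decision function. This is a sketch for illustration only: `NextStep` and `triage` are hypothetical names, and the four boolean inputs stand for the results of the manual checks (logon or ping, mutual ping between the two hosts, and the process status on the OCP Agent tab).

```python
from enum import Enum


class NextStep(Enum):
    """Hypothetical labels for the actions the steps above recommend."""
    CONTACT_OM_ENGINEER = "host appears to be down: contact the O&M engineers"
    CONTACT_NETWORK_ENGINEER = "network is faulty: contact a network engineer"
    RESTART_AGENT = "OCP-Agent process is abnormal: restart it"
    CONTACT_OCP_SUPPORT = "no cause found: contact OCP Technical Support"


def triage(host_reachable, ocp_to_host_ok, host_to_ocp_ok, agent_processes_ok):
    """Walk the checks in the documented order and return the first action that applies."""
    # Step 1: the host itself (logon or ping from the same CIDR block).
    if not host_reachable:
        return NextStep.CONTACT_OM_ENGINEER
    # Step 2: the OCP host and the faulty host must reach each other in both directions.
    if not (ocp_to_host_ok and host_to_ocp_ok):
        return NextStep.CONTACT_NETWORK_ENGINEER
    # Step 3: OCP-Agent processes shown on the OCP Agent tab.
    if not agent_processes_ok:
        return NextStep.RESTART_AGENT
    # Step 4: everything looks normal but the alert persists.
    return NextStep.CONTACT_OCP_SUPPORT
```

For example, `triage(True, True, False, True)` returns `NextStep.CONTACT_NETWORK_ENGINEER`, because one direction of the mutual ping failed.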