Alert description
When an OCP node goes offline, the other OCP nodes in the same OCP cluster trigger this alert.
Alert principle
The following table lists the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring Metrics | ocp_distributed_server_deleted_count: When an OCP node goes offline, the other OCP nodes in the same cluster trigger this alert. |
| Monitoring Expression | sum(ocp_distributed_server_change{type="DELETE", @LABELS}) by (@GBLABELS) |
| Metric Collection | ocp_distributed_server_change |
| Metric Source | Collection SQL: `SELECT * FROM DISTRIBUTED_SERVER WHERE TIMESTAMPADD(SECOND, ?, UPDATE_TIME) < CURRENT_TIMESTAMP`. The `distributed_server` table stores OCP node data. When an OCP node goes offline, the other OCP nodes clean up the expired node. The `?` placeholder defaults to 60 seconds and can be changed with the `ocp.distributed.server.expire.seconds` parameter. |
| Collection Cycle | None. It is triggered immediately after the OCP node goes offline. |
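The expiry condition in the collection SQL can be sketched in Python as follows. This only illustrates the comparison `TIMESTAMPADD(SECOND, ?, UPDATE_TIME) < CURRENT_TIMESTAMP`; the function name `is_expired` and the sample timestamps are hypothetical and not part of OCP.

```python
from datetime import datetime, timedelta

# Default of ocp.distributed.server.expire.seconds, per the table above.
DEFAULT_EXPIRE_SECONDS = 60


def is_expired(update_time: datetime, now: datetime,
               expire_seconds: int = DEFAULT_EXPIRE_SECONDS) -> bool:
    """Mirror TIMESTAMPADD(SECOND, ?, UPDATE_TIME) < CURRENT_TIMESTAMP."""
    return update_time + timedelta(seconds=expire_seconds) < now


# Hypothetical reference time used only for this illustration.
now = datetime(2024, 1, 1, 12, 0, 0)
print(is_expired(now - timedelta(seconds=30), now))  # row updated 30 s ago
print(is_expired(now - timedelta(seconds=90), now))  # row updated 90 s ago
```

With the default of 60 seconds, a row last updated 30 seconds ago is kept, while one last updated 90 seconds ago matches the query and is treated as expired.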
Rule information
| Monitoring Metrics | Default Threshold (Unit: Count) | Duration | Detection Cycle | Elimination Cycle |
|---|---|---|---|---|
| ocp_distributed_server_deleted_count | 0 | 0 Seconds | 10 Seconds | 5 Minutes |
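As a rough illustration of how the default rule might be evaluated, the sketch below sums the `DELETE` change counts per group, as the monitoring expression does, and flags any group whose total exceeds the threshold of 0. The comparison operator (`>`), the sample data layout, the `OTHER` type value, and the names `sum_by` and `firing_groups` are assumptions made for this example, not OCP internals.

```python
from collections import defaultdict

THRESHOLD = 0  # default threshold from the rule table (unit: count)


def sum_by(samples, group_label):
    """Sum sample values per group label, keeping only type="DELETE" samples."""
    totals = defaultdict(int)
    for labels, value in samples:
        if labels.get("type") == "DELETE":
            totals[labels[group_label]] += value
    return dict(totals)


def firing_groups(samples, group_label, threshold=THRESHOLD):
    """Return the groups whose summed value is above the threshold (assumed '>')."""
    totals = sum_by(samples, group_label)
    return sorted(group for group, total in totals.items() if total > threshold)


# Hypothetical samples in the form (labels, value).
samples = [
    ({"type": "DELETE", "target_server": "10.0.0.1:8080"}, 1),
    ({"type": "OTHER", "target_server": "10.0.0.2:8080"}, 1),  # not a DELETE change
]
print(firing_groups(samples, "target_server"))
```

Because the threshold is 0 and the duration is 0 seconds, a single deleted node is enough to raise the alert under these assumptions.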
Alert information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Based on monitoring metric expressions | Critical | service |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:svr_ip=xx.xx.xx.xx:svr_port=8080:target_server=xx.xx.xx.xx:8080 OCP node offline
Alert details
- Template: Alert: ${alarm_name}, Node: ${target_server}.
- Example: Alert: OCP node offline, Node: xx.xx.xx.xx:8080.
Alert recovery
- Template: Alert: ${alarm_name}, OCP Node Offline: ${value_shown}
- Example: Alert: OCP node offline, OCP Node Offline: 1
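The `${...}` placeholders use the same syntax as Python's `string.Template`, so the substitution can be illustrated with a short sketch. How OCP renders templates internally is not described here; the placeholder values are the sample values from the examples above.

```python
from string import Template

details = Template("Alert: ${alarm_name}, Node: ${target_server}.")
recovery = Template("Alert: ${alarm_name}, OCP Node Offline: ${value_shown}")

# Sample values for illustration only.
values = {
    "alarm_name": "OCP node offline",
    "target_server": "xx.xx.xx.xx:8080",
    "value_shown": "1",
}

# substitute() raises KeyError if a placeholder has no value; extra keys are ignored.
details_text = details.substitute(values)
recovery_text = recovery.substitute(values)

print(details_text)
print(recovery_text)
```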
Impact on the system
Routine O&M operations can no longer be performed on the offline OCP node.
Possible causes
- The OCP process is killed.
- The OCP process restarts slowly.
Solution
Restart the OCP process, then check whether the Spring Boot startup is recorded in the ocp.log file in the /home/admin/ocp-server/log directory.
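A minimal sketch of the log check, assuming the log contains Spring Boot's default startup message of the form `Started <ApplicationName> in <N> seconds`. The exact wording in ocp.log may differ, so treat the pattern and the helper name `find_startup_lines` as assumptions to adjust.

```python
import re
from pathlib import Path

# Log location given in the solution above.
OCP_LOG = Path("/home/admin/ocp-server/log/ocp.log")

# Assumed pattern: Spring Boot's default "Started <App> in <N> seconds" message.
STARTUP_PATTERN = re.compile(r"Started \S+ in [\d.]+ seconds")


def find_startup_lines(lines):
    """Return the lines that look like a Spring Boot startup record."""
    return [line.rstrip("\n") for line in lines if STARTUP_PATTERN.search(line)]


if __name__ == "__main__":
    if OCP_LOG.exists():
        with OCP_LOG.open(encoding="utf-8", errors="replace") as fh:
            for line in find_startup_lines(fh):
                print(line)
    else:
        print(f"{OCP_LOG} not found")
```

If no matching line appears after the restart, the process may not have finished starting, or the log may use a different startup message.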
