Alert description
This alert monitors whether an OBServer node in the OceanBase cluster is in the stopped state. The alert is triggered when the OBServer's stop_time is not null (i.e., a STOP SERVER operation has been performed), and the downtime exceeds 0 seconds.
Alert principle
| Parameter | Value |
|---|---|
| Monitoring Metrics | ob_server_stopped_duration_seconds |
| Monitoring Expression | max(ob_server_stopped_duration_seconds{@LABELS}) by (@GBLABELS) |
| Metric Collection | ob_server_stopped_duration_seconds |
| Metric Source | Internal Views Collected by OCP-Agent from OBServer |
| Collection Cycle | N/A |
OCP-Agent periodically queries the service status of OBServer nodes via SQL. The specific principle is as follows:
- For OceanBase Database versions earlier than V4.0, OCP-Agent queries the __all_server view and calculates the downtime from the stop_time field: if stop_time is 0, the metric value is 0; otherwise, the metric value is the number of seconds elapsed between stop_time and the current time.
- For OceanBase Database V4.0 and later, OCP-Agent queries the DBA_OB_SERVERS view and calculates the downtime from the STOP_TIME field: if STOP_TIME is NULL, the metric value is 0; otherwise, the metric value is the difference in seconds between the current time and STOP_TIME.
- The status field is also collected as a label to reflect the current operating status of the OBServer.
When the metric value ob_server_stopped_duration_seconds > 0, the trigger condition is met and an alert is generated.
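The calculation described above can be sketched in a few lines of Python. This is only an illustration, not OCP-Agent's actual code: the function names are hypothetical, and the sketch assumes that the pre-V4.0 stop_time is a Unix timestamp in microseconds and that the V4.0+ STOP_TIME arrives as a datetime (None for NULL).

```python
from datetime import datetime, timezone
from typing import Optional


def stopped_seconds_pre_v4(stop_time_us: int, now_us: int) -> float:
    """Pre-V4.0 rule: stop_time == 0 means the server is not stopped.

    Assumption: stop_time is a Unix timestamp in microseconds.
    """
    if stop_time_us == 0:
        return 0.0
    return (now_us - stop_time_us) / 1_000_000


def stopped_seconds_v4(stop_time: Optional[datetime], now: datetime) -> float:
    """V4.0+ rule: STOP_TIME IS NULL means the server is not stopped."""
    if stop_time is None:
        return 0.0
    return (now - stop_time).total_seconds()


def should_alert(metric_value: float) -> bool:
    """Trigger condition from the text: metric value > 0."""
    return metric_value > 0
```

A server whose STOP_TIME lies two minutes in the past yields a metric value of 120 under this sketch, which satisfies the trigger condition.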
Rule information
| Monitoring Metrics | Default Threshold | Duration | Detection Cycle | Elimination Cycle |
|---|---|---|---|---|
| ob_server_stopped_duration_seconds | 0 | 60 Seconds | 10 Seconds | 300 Seconds |
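To illustrate how the four rule parameters might interact, the following sketch replays a series of metric samples, one per detection cycle. It rests on assumptions that this page does not spell out: that Duration means the value must stay above the threshold continuously for that long before the alert fires, and that Elimination Cycle means the alert clears once the firing condition has not held for that long. The Rule class and evaluate function are hypothetical names, not part of OCP.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rule:
    threshold: float = 0.0    # Default Threshold
    duration_s: int = 60      # Duration
    detect_cycle_s: int = 10  # Detection Cycle
    eliminate_s: int = 300    # Elimination Cycle


def evaluate(samples: List[float], rule: Rule) -> List[Tuple[int, str]]:
    """Replay one sample per detection cycle; return (time_s, state) changes."""
    events = []
    breach_start = None  # time the value first exceeded the threshold
    last_breach = None   # most recent time the firing condition held
    firing = False
    for i, value in enumerate(samples):
        t = i * rule.detect_cycle_s
        if value > rule.threshold:
            if breach_start is None:
                breach_start = t
            # Assumed semantics: fire only after a continuous breach.
            if t - breach_start >= rule.duration_s:
                last_breach = t
                if not firing:
                    firing = True
                    events.append((t, "firing"))
        else:
            breach_start = None
        # Assumed semantics: clear after a quiet elimination cycle.
        if firing and t - last_breach >= rule.eliminate_s:
            firing = False
            events.append((t, "cleared"))
    return events
```

With the default values, a server that is stopped from t=10s to t=100s fires at t=70s and clears at t=400s in this model, while a breach shorter than 60 seconds produces no alert.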
Alert information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Based on monitoring metric expressions | Downtime | Host |
Alert template
Alert Overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster1:host=xxx.xxx.xxx.xxx OceanBase server stopped service
Alert details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, OceanBase Server Status: ${status}, Alert: ${alarm_name}, Service Disruption Duration: ${value_shown}.
- Example: Cluster: obcluster1, Host: xxx.xxx.xxx.xxx, OceanBase Server Status: stopped, Alert: OceanBase server service stopped, Service Disruption Duration: 120s.
Alert recovery
- Template: Alert: ${alarm_name}, OBServer Downtime: ${value_shown}
- Example: Alert: Server service stopped, OBServer Downtime: 0
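The ${...} placeholders in these templates happen to use the same syntax as Python's string.Template, so the substitution can be tried locally. This sketch only illustrates how the alert details template expands. It does not describe OCP's own rendering engine, and render_details is a hypothetical helper.

```python
from string import Template

# Alert details template, copied from the section above.
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, Host: ${host}, "
    "OceanBase Server Status: ${status}, Alert: ${alarm_name}, "
    "Service Disruption Duration: ${value_shown}."
)


def render_details(fields: dict) -> str:
    """Fill in the template; substitute() raises KeyError on a missing field."""
    return DETAILS_TEMPLATE.substitute(fields)


message = render_details({
    "ob_cluster_name": "obcluster1",
    "host": "xxx.xxx.xxx.xxx",
    "status": "stopped",
    "alarm_name": "OceanBase server service stopped",
    "value_shown": "120s",
})
print(message)
```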
Impact on the system
A stopped OBServer node no longer serves database requests. This may have the following impacts:
- Service interruption: A new leader must be elected for each leader replica hosted on this node, and the affected partitions are temporarily unavailable during the switchover.
- Reduced availability: The number of available nodes in the cluster decreases. If multiple nodes fail simultaneously, a Paxos majority may no longer be achievable, rendering the entire cluster unavailable.
- Load imbalance: Traffic from the failed node is diverted to other nodes, potentially increasing their load.
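The majority condition behind the reduced-availability risk can be checked with simple arithmetic. The helper below is a hypothetical illustration that treats each unavailable node as one lost Paxos replica. Real deployments vary by replica placement, so treat it as a rough sketch.

```python
def has_majority(total_replicas: int, unavailable: int) -> bool:
    """True if the remaining replicas still form a strict majority."""
    if not 0 <= unavailable <= total_replicas:
        raise ValueError("unavailable must be between 0 and total_replicas")
    alive = total_replicas - unavailable
    return alive > total_replicas // 2
```

For example, a three-replica deployment tolerates one stopped node but not two, and a five-replica deployment tolerates two but not three.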
Possible causes
- The O&M engineer has actively executed the ALTER SYSTEM STOP SERVER operation (such as for scheduled maintenance or node isolation).
- The node has been stopped through the OCP console.
- The status of an OBServer process is marked as stopped after it exits abnormally.
Solution
Confirm the cause of service interruption: Check whether the node was stopped for scheduled maintenance. If it is a planned operation, you can ignore this alert; the service recovers after the maintenance is complete.
Check node status: Query the status of each node in the cluster by executing the following SQL statement:
SELECT SVR_IP, SVR_PORT, ZONE, STATUS, START_SERVICE_TIME, STOP_TIME FROM oceanbase.DBA_OB_SERVERS;
Restore node service:
Method 1 (recommended): Start the service for the node through the OCP console.
Method 2: Start the node by using an SQL command:
ALTER SYSTEM START SERVER 'xxx.xxx.xxx.xxx:2882';
Check node health: Before restoring service, ensure the observer process on the node is running normally, network connectivity is intact, and log synchronization is not delayed:
ps -ef | grep observer
SELECT SVR_IP, ROLE, SCN_TO_TIMESTAMP(END_SCN) FROM oceanbase.GV$OB_LOG_STAT WHERE TENANT_ID = 1 ORDER BY LS_ID, ROLE;
Verify the fix: After the service is restored, confirm that the node STATUS has changed to ACTIVE. The alert clears automatically within the elimination cycle (300 seconds).
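The verification step can be scripted once the rows of the DBA_OB_SERVERS query are available. The sketch below assumes the rows have already been fetched with any MySQL-compatible client and converted to dictionaries keyed by column name; the fetching itself is not shown. The function name and the sample addresses are hypothetical.

```python
from typing import Any, Dict, List


def servers_needing_attention(rows: List[Dict[str, Any]]) -> List[str]:
    """Return 'ip:port' for servers that are stopped or not ACTIVE."""
    flagged = []
    for row in rows:
        stopped = row.get("STOP_TIME") is not None  # NULL maps to None
        inactive = (row.get("STATUS") or "").upper() != "ACTIVE"
        if stopped or inactive:
            flagged.append(f"{row['SVR_IP']}:{row['SVR_PORT']}")
    return flagged


# Hypothetical sample rows shaped like the DBA_OB_SERVERS query result.
sample_rows = [
    {"SVR_IP": "10.0.0.1", "SVR_PORT": 2882, "STATUS": "ACTIVE", "STOP_TIME": None},
    {"SVR_IP": "10.0.0.2", "SVR_PORT": 2882, "STATUS": "ACTIVE",
     "STOP_TIME": "2024-01-01 12:00:00"},
    {"SVR_IP": "10.0.0.3", "SVR_PORT": 2882, "STATUS": "INACTIVE", "STOP_TIME": None},
]
print(servers_needing_attention(sample_rows))
```

An empty result suggests that every node is ACTIVE with no STOP_TIME set, which matches the state expected after a successful restore.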
