Alert description
This alert monitors the OceanBase cluster for inactive OBServer nodes and is triggered if any are found.
Alert principle
The following table lists the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring metric | ob_cluster_inactive_server_count |
| Data source | SQL statement: select group_concat(svr_ip SEPARATOR ',') as servers, status, count(1) as count from __all_server group by status; The value of inactive_server_count is obtained from the count field of the row whose status is inactive. |
| Metric to be collected | inactive_server_count |
| Monitoring metric expression | max(server_count{metric_group="all_server",status="inactive",@LABELS}) by (@GBLABELS) |
| Collection interval | 60 seconds |
OceanBase uses a heartbeat mechanism to determine the status of each OBServer node and updates the status in real time to the __all_server table. Each OBServer node can obtain information from this table.
The value of the monitoring metric ob_cluster_inactive_server_count indicates the number of inactive OBServer nodes in the OceanBase cluster. If this value is greater than 0, an alert is triggered.
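The trigger condition can be illustrated with a minimal Python sketch. It assumes the result of the SQL statement above has already been fetched as (servers, status, count) tuples. The function names and the sample rows are hypothetical and are not part of OCP.

```python
from typing import Iterable, Tuple

# One row of the query result: (servers, status, count).
Row = Tuple[str, str, int]


def inactive_server_count(rows: Iterable[Row]) -> int:
    """Return the count of the row whose status is 'inactive', or 0 if there is none."""
    for _servers, status, count in rows:
        # Compare case-insensitively; the exact casing of status is an assumption.
        if status.lower() == "inactive":
            return int(count)
    return 0


def should_alert(rows: Iterable[Row], threshold: int = 0) -> bool:
    """The alert condition: the metric value is greater than the threshold."""
    return inactive_server_count(rows) > threshold


# Hypothetical sample: two active nodes and one inactive node.
sample_rows = [
    ("10.0.0.1,10.0.0.2", "active", 2),
    ("10.0.0.3", "inactive", 1),
]
print(inactive_server_count(sample_rows), should_alert(sample_rows))
```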
Alert rule
| Monitoring metric | Default threshold | Duration | Detection interval | Elimination interval |
|---|---|---|---|---|
| ob_cluster_inactive_server_count | 0 | 0 seconds | 10 seconds | 5 minutes |
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Based on the monitoring metric expression | Down | Cluster |
Alert template
Alert summary
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-1 OceanBase cluster has inactive OBServer nodes
Alert details
- Template: Cluster: ${ob_cluster_name}, Alert: ${alarm_name}, Number of inactive OBServer nodes: ${value}, Inactive OBServer nodes: ${server_ips}.
- Example: Cluster: obcluster-1, Alert: OceanBase cluster has inactive OBServer nodes, Number of inactive OBServer nodes: 2.0, Inactive OBServer nodes: xxx.xxx.xxx.1,xxx.xxx.xxx.2.
Alert recovery
- Template: Alert: ${alarm_name}, Number of inactive OBServer nodes in OceanBase cluster: ${value}
- Example: Alert: OceanBase cluster has inactive OBServer nodes, Number of inactive OBServer nodes in OceanBase cluster: 0
Here, ${alarm_target} indicates the object that triggered the alert. The format is ob_cluster=xxxxxxx, where xxxxxxx is the name of the cluster that triggered the alert.
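To see how the placeholders expand, the following sketch renders the summary and details templates with the example values from this section. It uses Python's string.Template only because it happens to share the ${name} placeholder syntax; how OCP renders templates internally is not described here, so treat this as an illustration rather than OCP's implementation.

```python
from string import Template

# Templates copied from this section.
SUMMARY_TEMPLATE = Template("${alarm_target} ${alarm_name}")
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, Alert: ${alarm_name}, "
    "Number of inactive OBServer nodes: ${value}, "
    "Inactive OBServer nodes: ${server_ips}."
)

# Example values taken from the examples in this section.
example_values = {
    "alarm_target": "ob_cluster=obcluster-1",
    "alarm_name": "OceanBase cluster has inactive OBServer nodes",
    "ob_cluster_name": "obcluster-1",
    "value": "2.0",
    "server_ips": "xxx.xxx.xxx.1,xxx.xxx.xxx.2",
}

# substitute() raises KeyError if a placeholder has no value; extra keys are ignored.
summary = SUMMARY_TEMPLATE.substitute(example_values)
details = DETAILS_TEMPLATE.substitute(example_values)
print(summary)
print(details)
```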
Impact on the system
The number of available OBServer nodes decreases, leading to a drop in cluster availability.
For example, if one node in a three-node cluster stops working, the three replicas are reduced to two. If another node stops working, the single remaining replica cannot form a majority, and the cluster becomes unavailable.
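The three-node example comes down to majority arithmetic, which the short sketch below works through. It assumes one replica per node and that a strict majority of replicas must be alive for the cluster to remain available. The function name is hypothetical.

```python
def has_majority(total_replicas: int, alive_replicas: int) -> bool:
    """Return True if the alive replicas form a strict majority of all replicas."""
    return alive_replicas > total_replicas // 2


# Three-node cluster: walk through zero, one, and two failed nodes.
for failed in range(3):
    alive = 3 - failed
    print(f"failed={failed} alive={alive} available={has_majority(3, alive)}")
```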
Possible causes
This issue commonly occurs in the following situations:
- Network communication failures.
- OBServer node failures or abnormal process termination.
- The observer process is alive but unresponsive, and heartbeats are not reported.
Troubleshooting procedure
Determine whether the OBServer node is still needed.
If yes, proceed to the next step to troubleshoot the issue.
If no, you can directly delete the OBServer node.
Check for network communication issues.
Run the following commands to check for network issues between the OBServer node and the leader OBServer node:
```shell
# Ping the leader OBServer node from the OBServer node. xxx.xxx.xxx.1 is an example IP address of the leader OBServer node.
ping xxx.xxx.xxx.1
# Ping the OBServer node from the leader OBServer node. xxx.xxx.xxx.2 is an example IP address of the OBServer node.
ping xxx.xxx.xxx.2
```
Note: You can run the following statement in the sys tenant of the OceanBase cluster to obtain the IP address of the leader OBServer node.
```sql
select svr_ip,with_rootserver from oceanbase.__all_server;
```
If the ping command returns continuous data transmission information, the network is functioning properly, and the alert is likely caused by another issue.
If the ping command does not return continuous data transmission information, the network is down. Contact the network administrator to troubleshoot the specific network issue. If there is no network administrator, refer to Network troubleshooting.
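If the connectivity check needs to be repeated across several nodes, it can be scripted. The sketch below wraps the ping command in Python and treats exit code 0 as reachable. It assumes a Linux host with iputils ping (the -c and -W flags). The function names are hypothetical, and the run parameter exists only so that the check can be exercised without sending real packets.

```python
import subprocess
from typing import Callable, List


def ping_command(host: str, count: int = 3, timeout_s: int = 2) -> List[str]:
    """Build a bounded ping command (Linux iputils flags assumed).

    -c: number of echo requests to send
    -W: seconds to wait for each reply
    """
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host: str, run: Callable = subprocess.run) -> bool:
    """Return True if ping exits with code 0, that is, at least one reply was received."""
    result = run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Usage on the OBServer node, replacing the placeholder with the leader's IP address:
#   is_reachable("xxx.xxx.xxx.1")
```

Run the check in both directions, as in the manual procedure, because a one-way failure can also prevent heartbeats from arriving.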
Check if the OBServer node is faulty or if the process has unexpectedly stopped.
Check whether the alert "ob_cannot_connected OB server cannot be connected" is being reported.
If yes, the OBServer node is faulty or the process has unexpectedly stopped. Refer to "ob_cannot_connected OB server cannot be connected" for resolution.
If no, it may be caused by another issue. Proceed to the next step for further troubleshooting.
Check if the OBServer process is running but the node is not reporting heartbeats.
Log in to the corresponding OBServer node through the Hosts page in OCP and check if the OCP Agent tab shows that the process is running.
If the process is running but the OBServer node still cannot be connected, the node may not be reporting heartbeats. Check whether disk space and memory are sufficient.
```shell
# Check the remaining memory.
free -m
# Check the remaining disk space in the OBServer node directory (which is usually /home/admin by default).
df -h
```
Insufficient disk space or memory can prevent the OBServer node from reporting heartbeats.
If the disk is full, expand the disk or clear out logs. You can also reduce the data retention period based on business needs.
If the memory is insufficient, upgrade the memory or reduce the tenant specifications based on business needs.
If the disk and memory are sufficient, proceed to the next step.
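The disk and memory checks in this step can also be scripted, for example to run them on several nodes. The sketch below is a rough Python equivalent of df -h and free -m. It assumes a Linux host where /proc/meminfo exposes a MemAvailable line. The function names are hypothetical, and this page does not define what counts as sufficient, so any threshold applied to the results is the operator's choice.

```python
import shutil


def disk_free_ratio(path: str) -> float:
    """Return the fraction of free space on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def mem_available_mib(meminfo_text: str) -> int:
    """Parse the MemAvailable line of /proc/meminfo and return the value in MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The line looks like "MemAvailable:  2048000 kB".
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found in meminfo text")


# Usage on the OBServer node (the path is the default directory mentioned above):
#   print(disk_free_ratio("/home/admin"))
#   with open("/proc/meminfo") as f:
#       print(mem_available_mib(f.read()))
```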
Contact OCP technical support to locate the issue.