Description
This alert is triggered when an OBServer is inactive in the OceanBase cluster.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_cluster_inactive_server_count |
| Source | SQL: unknow select group_concat(svr_ip SEPARATOR ',') as servers, status, count(1) as count from __all_server group by status; Note The value of the count field is used as the value of the inactive_server_count metric. |
| Collected metric | inactive_server_count |
| Metric expression | max(server_count{metric_group="all_server",status="inactive",@LABELS}) by (@GBLABELS) |
| Collection cycle | 60 seconds |
OceanBase Database determines the status of each OBServer based on its internal heartbeat mechanism, and updates the status in the __all_server table in real time. The table is accessible to all OBServers.
The value of the metric ob_cluster_inactive_server_count indicates the number of the inactive OBServers in the OceanBase cluster. When this value is greater than 0, this alert is triggered.
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_cluster_inactive_server_count | 0 | 0 seconds | 10 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Cluster |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}. The number of inactive OBServers is ${value}. IP addresses of the inactive OBServers are ${servers}.
Overview example: ob_cluster=C1-1000. The OceanBase cluster has inactive OBServers.
Details example: ob_cluster=C1-1000. The OceanBase cluster has inactive OBServers. The number of inactive OBServers is 2.0. IP addresses of the inactive OBServers are 192.168.1.1 and 192.168.1.2.
${alarm_target} indicates the object that generated the alert, in the ob_cluster=xxxxxxx format. ob_cluster indicates the name of the cluster that generated the alert.
Impact on the system
When the number of available OBServers reduces, the cluster availability decreases.
For example, if one OBServer of a three-node cluster stops working, the original three-replica cluster becomes a two-replica cluster. If one more node stops working, the cluster becomes unavailable.
Possible causes
This problem is commonly found in the following scenarios:
A network communication error occurs.
The inactive OBServers failed or their observer processes unexpectedly stopped.
The inactive OBServers are alive but are unresponsive, without reporting their heartbeats.
Suggested solutions
Check whether an inactive OBServer is still needed.
If it is needed, go to the next step.
Otherwise, you can delete it.
Check the network connection.
Run the following command to check whether a network error exists between the inactive OBServer and the leader OBServer.
# Ping the leader OBServer from the inactive OBServer. The sample IP address of the leader OBServer is 192.168.0.1. ping 192.168.0.1 # Ping the inactive OBServer from the leader OBServer. Use the IP address of the OCP-Server that manages this inactive OBServer. The sample IP address of the OCP-Server is 192.168.0.2. ping 192.168.0.2Note
You can execute the following statement under the sys tenant of the OceanBase cluster to obtain the IP address of the leader OBServer.
select svr_ip,with_rootserver from oceanbase.__all_server;If the system consecutively returns data transmission messages, the network connection is normal and the alert may have been triggered by other issues.
Otherwise, the network connection is disconnected.
Contact your network administrator for troubleshooting or troubleshoot and fix the network failure on your own. For more information, see Network troubleshooting.
Check whether the OBServer fails or the process unexpectedly stops.
Check for the ob_cannot_connected alert.
If you find the alert, solve the problem first. For more information, see ob_cannot_connected.
Otherwise, go to the next step.
Check the observer process and heartbeat.
In the Hosts list on the Host Overview page of the OCP console, click the IP address of the target OBServer, and then check the process status on the OCP Agent tab.
If the process is normal but the OBServer cannot be connected, it is likely that its heartbeats are not reported. We recommend that you check whether the disk space or memory is sufficient.
# Check the available memory size. free -m # Check the available disk space in the OBServer home directory. The default home directory is /home/admin. df -hThe OBServer heartbeat is not reported due to insufficient disk space or memory.
If the disk space is insufficient, increase the disk capacity, delete unnecessary logs, or reduce the data retention period based on business conditions.
If the memory is insufficient, increase the memory capacity or reduce the resource specifications of tenants that use less resources than those allocated to them.
If the disk space and memory are sufficient, go to the next step.
Contact OCP Technical Support for help.