Alert description
Note
This alert takes effect only for OceanBase clusters of version V4.0.0.0 or later.
This alert identifies tenants in which the number of log stream (LS) replicas is insufficient. A log stream is in a replica deficit state when its actual number of replicas is less than the expected number derived from the tenant's locality. The alert is triggered whenever any log stream in the tenant is in a replica deficit state.
Alert principle
The following table lists the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring Metrics | tenant_log_stream_replica_absent_count: The number of log streams with missing replicas in the tenant. An alert is triggered when this value exceeds the threshold. |
| Monitoring Expression | max(log_stream_replica_absent_count{@LABELS}) by (@GBLABELS) |
| Metric Collection | log_stream_replica_absent_count |
| Metric Source | The OCP Agent executes SQL on the RootService node to count the number of log streams with missing replicas for each tenant (log_stream_replica_absent_count). The collection SQL is listed after this table. |
| Collection Cycle | 60 Seconds |

Collection SQL:

```sql
SELECT /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(5000000) */
  t.tenant_id,
  t.tenant_name,
  COUNT(DISTINCT CASE
    WHEN (r.replica_count IS NULL OR r.replica_count < t.expected_replica_count) THEN ls.ls_id
    ELSE NULL
  END) as cnt
FROM (
  SELECT tenant_id, tenant_name,
    LENGTH(locality) - LENGTH(REPLACE(locality, '@', '')) as expected_replica_count
  FROM DBA_OB_TENANTS
  WHERE tenant_type IN ('SYS', 'USER') AND locality IS NOT NULL
) t
INNER JOIN CDB_OB_LS ls
  ON t.tenant_id = ls.tenant_id AND ls.status NOT IN ('creating', 'create_abort')
LEFT JOIN (
  SELECT tenant_id, ls_id,
    COUNT(DISTINCT CONCAT(zone, ':', replica_type)) as replica_count
  FROM CDB_OB_LS_LOCATIONS
  WHERE zone IS NOT NULL AND replica_type IS NOT NULL
  GROUP BY tenant_id, ls_id
) r
  ON ls.tenant_id = r.tenant_id AND ls.ls_id = r.ls_id
GROUP BY t.tenant_id, t.tenant_name
```
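The counting logic of the collection SQL can be easier to follow when written out step by step. The sketch below mirrors it in Python for a single tenant: the expected replica count is the number of `@` characters in the locality string, and a log stream is counted when its distinct (zone, replica_type) pairs fall short of that number. The function names and the input tuple shapes are hypothetical and only illustrate the logic; they are not part of OCP or OceanBase. The status comparison is done case-insensitively, which is an assumption about how the SQL comparison behaves.

```python
from collections import defaultdict


def expected_replica_count(locality: str) -> int:
    # Mirrors LENGTH(locality) - LENGTH(REPLACE(locality, '@', '')):
    # each replica entry in the locality contributes one '@'.
    return locality.count("@")


def absent_ls_count(locality, log_streams, locations):
    """Count log streams in replica deficit for one tenant.

    log_streams: iterable of (ls_id, status) rows (hypothetical shape).
    locations:   iterable of (ls_id, zone, replica_type) rows (hypothetical shape).
    """
    expected = expected_replica_count(locality)

    # Distinct (zone, replica_type) pairs per log stream, skipping NULLs.
    replicas = defaultdict(set)
    for ls_id, zone, replica_type in locations:
        if zone is not None and replica_type is not None:
            replicas[ls_id].add((zone, replica_type))

    deficit = set()
    for ls_id, status in log_streams:
        # Log streams still being created (or aborted) are excluded.
        if status.lower() in ("creating", "create_abort"):
            continue
        # No location rows at all corresponds to replica_count IS NULL.
        if ls_id not in replicas or len(replicas[ls_id]) < expected:
            deficit.add(ls_id)
    return len(deficit)


# Illustrative data: a three-zone locality, one healthy log stream,
# one log stream with only two replicas, and one still being created.
locality = "FULL{1}@zone1, FULL{1}@zone2, FULL{1}@zone3"
log_streams = [(1, "NORMAL"), (1001, "NORMAL"), (1002, "CREATING")]
locations = [
    (1, "zone1", "FULL"), (1, "zone2", "FULL"), (1, "zone3", "FULL"),
    (1001, "zone1", "FULL"), (1001, "zone2", "FULL"),
]
result = absent_ls_count(locality, log_streams, locations)
```

With this illustrative data, only log stream 1001 is in deficit, so the metric value for the tenant would be 1, which exceeds the default threshold of 0.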
Rule information
| Monitoring Metrics | Default Threshold (Unit: Count) | Duration | Detection Cycle | Elimination Cycle |
|---|---|---|---|---|
| tenant_log_stream_replica_absent_count | 0 | 180 Seconds | 60 Seconds | 5 Minutes |
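One way to read the threshold, duration, and detection cycle together is sketched below. It assumes that "Duration" means the metric must exceed the threshold on every detection within the last 180 seconds, that is, on three consecutive 60-second checks. This interpretation, the function name `breach_sustained`, and the sample lists are assumptions for illustration; OCP's actual rule evaluation may differ, and the elimination cycle is not modeled here.

```python
import math


def breach_sustained(samples, threshold=0, duration_s=180, detection_cycle_s=60):
    """Return True when the most recent samples all exceed the threshold.

    samples: metric values, one per detection cycle, oldest first.
    Assumption: the rule needs ceil(duration / detection cycle)
    consecutive breaching samples before the alert fires.
    """
    needed = math.ceil(duration_s / detection_cycle_s)
    if len(samples) < needed:
        return False  # not enough history to cover the duration yet
    return all(value > threshold for value in samples[-needed:])


fires = breach_sustained([0, 1, 1, 1])      # three consecutive breaches
recovers = breach_sustained([1, 1, 0, 1])   # a zero sample breaks the run
too_short = breach_sustained([1, 1])        # fewer samples than required
```

Under this reading, a log stream that briefly loses a replica during a migration and recovers within one or two collection cycles would not raise the alert.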
Alert information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Based on monitoring metric expression | Critical | Tenant |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=1:ob_cluster=xxx:tenant_name=tenant_a OceanBase tenant log stream replica missing
Alert details
- Template: Cluster: ${ob_cluster_name}, Tenant: ${tenant_name}, Alert: ${alarm_name}.
- Example: cluster: ob_cluster_x, tenant: tenant_a, alert: OceanBase tenant log stream replica missing.
Alert recovery
- Template: Alert: ${alarm_name}, Tenant Log Stream Replica Missing: ${value_shown}
- Example: Alert: OceanBase tenant log stream replica missing, Tenant log stream replica missing: 0
Impact on the system
Impact on the OceanBase tenant: Insufficient log stream replicas reduce data redundancy and high availability, and may amplify service risk in fault scenarios.
Impact on OCP: OCP will continuously generate risk alerts at the tenant level, prompting the operations team to complete replica recovery as soon as possible.
Impact on business: The business may still be available in the short term, but the disaster recovery capability will be reduced in scenarios such as node or data center failures, posing further availability risks.
Possible causes
Some replica nodes are abnormally offline or unreachable.
The replica completion or migration task is not finished.
The tenant locality configuration does not match the current replica distribution.
Zone-level resource constraints prevent replicas from being created as expected.
Solution
Check whether the tenant locality and target replica strategy are correct.
Check the replica distribution of abnormal log streams in CDB_OB_LS_LOCATIONS to locate the zone or node where replicas are missing.
After restoring the abnormal node or replenishing resources, perform replica completion or balancing (such as rebalance) to restore the log stream replica count to the expected value.
Continuously monitor the metric tenant_log_stream_replica_absent_count. The alert clears automatically once this metric value returns to 0.
