Alert description
This alert monitors the number of partitions with insufficient replicas in a tenant. It triggers when the number of such partitions exceeds the specified threshold.
Alert mechanism
| Parameter | Value |
|---|---|
| Monitoring metric | tenant_partition_replica_absent_count |
| Metric source | Query the virtual table: SELECT /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(20000000) */ tenant.tenant_id, tenant.tenant_name, IFNULL(stat.cnt, 0) cnt FROM __all_tenant tenant LEFT JOIN (SELECT table_id>>40 AS tenant_id, COUNT(1) cnt FROM __all_virtual_election_info WHERE member_list NOT LIKE CONCAT(replica_num,'{%') AND SUBSTR(member_list, 1, 1) != '0' GROUP BY tenant_id) stat ON tenant.tenant_id=stat.tenant_id where stat.tenant_id not in (select tenant_id from __all_rootservice_job where job_type='ALTER_TENANT_LOCALITY' and job_status='INPROGRESS') |
| Sampling metric | partition_replica_absent_count |
| Monitoring expression | max(partition_replica_absent_count{}) by (ob_cluster_name,ob_cluster_id,tenant_name,ob_tenant_id) |
| Sampling interval | 60 seconds |
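The filter in the metric source query can be easier to follow outside SQL. The following is a minimal Python sketch of the same predicate and grouping, not the monitoring agent's implementation. The function names and the sample `member_list` strings are hypothetical placeholders (the source does not show the real `member_list` format), and the exclusion of tenants with an in-progress ALTER_TENANT_LOCALITY job is omitted.

```python
from collections import Counter


def tenant_id_of(table_id: int) -> int:
    # Mirrors `table_id >> 40` in the query: the tenant ID sits above bit 40.
    return table_id >> 40


def replica_absent(member_list: str, replica_num: int) -> bool:
    # Mirrors: member_list NOT LIKE CONCAT(replica_num, '{%')
    #          AND SUBSTR(member_list, 1, 1) != '0'
    has_expected_prefix = member_list.startswith(f"{replica_num}{{")
    starts_with_zero = member_list[:1] == "0"
    return not has_expected_prefix and not starts_with_zero


def absent_count_by_tenant(rows):
    # rows: iterable of (table_id, member_list, replica_num) tuples,
    # standing in for rows of __all_virtual_election_info.
    counts = Counter()
    for table_id, member_list, replica_num in rows:
        if replica_absent(member_list, replica_num):
            counts[tenant_id_of(table_id)] += 1
    return dict(counts)


# Hypothetical sample rows; the member_list contents are made up.
sample_rows = [
    ((1001 << 40) | 50001, "3{a,b,c}", 3),  # prefix matches replica_num: not counted
    ((1001 << 40) | 50002, "2{a,b}", 3),    # prefix differs: counted
    ((1002 << 40) | 50001, "0{}", 3),       # starts with '0': not counted
    ((1002 << 40) | 50002, "1{a}", 3),      # prefix differs: counted
]
```

Tenants with no matching rows are absent from the returned dictionary, whereas the original query reports them with a count of 0 through the LEFT JOIN and IFNULL.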
Rule information
| Monitoring expression | Meaning of the monitoring metric | Default threshold | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| tenant_partition_replica_absent_count > 100 | Number of partitions with insufficient replicas in a tenant | 100 | 20 seconds | 5 minutes |
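As an illustration of how the monitoring expression and the default threshold combine, the sketch below takes the maximum sample value per label group and compares it with 100. It is an assumption-laden model rather than the alerting engine itself: the function names and the extra `svr_ip` label are hypothetical, and the detection and elimination cycles are not modeled.

```python
GROUP_LABELS = ("ob_cluster_name", "ob_cluster_id", "tenant_name", "ob_tenant_id")
DEFAULT_THRESHOLD = 100


def max_by_group(samples):
    # samples: iterable of (labels_dict, value) pairs.
    # Keeps the largest value seen for each combination of GROUP_LABELS,
    # like `max(...) by (...)` in the monitoring expression.
    grouped = {}
    for labels, value in samples:
        key = tuple(labels[name] for name in GROUP_LABELS)
        grouped[key] = max(value, grouped.get(key, value))
    return grouped


def tenants_over_threshold(samples, threshold=DEFAULT_THRESHOLD):
    # The rule is a strict comparison: a value equal to the threshold does not fire.
    return [key for key, value in max_by_group(samples).items() if value > threshold]


# Hypothetical samples; `svr_ip` is an illustrative label that is aggregated away.
t1 = {"ob_cluster_name": "TEST", "ob_cluster_id": 1, "tenant_name": "t1", "ob_tenant_id": 1001}
t2 = {"ob_cluster_name": "TEST", "ob_cluster_id": 1, "tenant_name": "t2", "ob_tenant_id": 1002}
samples = [
    ({**t1, "svr_ip": "10.0.0.1"}, 85),
    ({**t1, "svr_ip": "10.0.0.2"}, 120),
    ({**t2, "svr_ip": "10.0.0.1"}, 100),
]
```

With these samples, only the group for tenant t1 exceeds the threshold, because its maximum is 120 while t2 sits exactly at 100.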
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Expression based on monitoring metrics | Severe | Tenant |
Alert template
Alert summary
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=TEST-1:tenant_name=t1 OceanBase Tenant Partition Replica Absent
Alert details
- Template: Cluster: ${ob_cluster_name}, Tenant: ${tenant_name}, Alert: ${alarm_name}.
- Example: Cluster: TEST, Tenant: t1, Alert: OceanBase Tenant Partition Replica Absent.
Alert recovery
- Template: Alert: ${alarm_name}, Number of partitions with insufficient replicas in a tenant: ${recover_value}
- Example: Alert: OceanBase Tenant Partition Replica Absent, Number of partitions with insufficient replicas in a tenant: 85
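To see how the placeholders in the templates above expand into the example messages, here is a small sketch using Python's `string.Template`, which happens to share the `${name}` syntax. How the product actually renders templates is not described in the source, so this only approximates the substitution; the `values` dictionary is filled from the examples above.

```python
from string import Template

# Template strings copied from the alert template section.
summary = Template("${alarm_target} ${alarm_name}")
details = Template(
    "Cluster: ${ob_cluster_name}, Tenant: ${tenant_name}, Alert: ${alarm_name}."
)
recovery = Template(
    "Alert: ${alarm_name}, Number of partitions with insufficient replicas "
    "in a tenant: ${recover_value}"
)

# Values taken from the documented examples.
values = {
    "alarm_target": "alarm_template_id=0:ob_cluster=TEST-1:tenant_name=t1",
    "alarm_name": "OceanBase Tenant Partition Replica Absent",
    "ob_cluster_name": "TEST",
    "tenant_name": "t1",
    "recover_value": 85,
}

# substitute() ignores unused keys and raises KeyError if a placeholder is missing.
summary_text = summary.substitute(values)
details_text = details.substitute(values)
recovery_text = recovery.substitute(values)
```

Each rendered string matches the corresponding example line in the template section.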
Impact on the system
If a partition has insufficient replicas, the last merged version cannot be advanced, which prevents a major compaction.
Possible causes
The target partition has insufficient replicas because the server has permanently gone offline.
Procedure
After a member goes permanently offline, the target partition is removed from the member list, but the source partition does not kick out any members. Once the target partition is backed up, you can perform migration replication to make up for the missing replicas.
Check whether the current cluster has partitions with missing replicas:
select * from __all_virtual_replica_task;
If there are entries for the corresponding zone-related server, it indicates that the related partition needs to initiate a load balancing task. If no load balancing task is currently running, assuming the target version to be merged is 25, you can check the meta table for replicas with data_version != 25 to narrow down the troubleshooting scope. Here, the meta table refers to a set of tables. Generally, you can check the last-level meta table (__all_meta_table or __all_virtual_meta_table) for replicas whose data versions have not been pushed. If none are found, you can check the next-level meta table.
- For versions earlier than 2.0, the meta tables are __all_virtual_core_meta_table, __all_virtual_core_root_table, __all_root_table, and __all_meta_table.
- For versions 2.0 and later, the meta tables are __all_virtual_core_meta_table, __all_virtual_core_root_table, __all_root_table, and __all_virtual_meta_table.
- Query the __all_virtual_partition_table for replicas with data_version != 25.
- Run the command grep "replica not merged to version" rootservice.l on the RS server.
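The level-by-level check of the meta tables can be sketched as a simple loop. The Python below is a hedged illustration only: it assumes the meta tables are listed with the last-level table at the end and that "next level" means the preceding entry, it takes rows as already-fetched dictionaries rather than querying the database, and the function name and the `partition_id` key are hypothetical.

```python
META_TABLES_BEFORE_2_0 = [
    "__all_virtual_core_meta_table",
    "__all_virtual_core_root_table",
    "__all_root_table",
    "__all_meta_table",
]
META_TABLES_2_0_AND_LATER = [
    "__all_virtual_core_meta_table",
    "__all_virtual_core_root_table",
    "__all_root_table",
    "__all_virtual_meta_table",
]


def first_level_with_lagging_replicas(tables, rows_by_table, target_version):
    # Start at the last-level meta table and move to the preceding table
    # only when no replica with data_version != target_version is found.
    for name in reversed(tables):
        lagging = [
            row for row in rows_by_table.get(name, [])
            if row["data_version"] != target_version
        ]
        if lagging:
            return name, lagging
    return None, []


# Hypothetical rows, as if fetched from each meta table beforehand.
rows_by_table = {
    "__all_virtual_meta_table": [{"partition_id": 0, "data_version": 25}],
    "__all_root_table": [{"partition_id": 0, "data_version": 24}],
}
table, lagging = first_level_with_lagging_replicas(
    META_TABLES_2_0_AND_LATER, rows_by_table, target_version=25
)
```

In this example the last-level table has no lagging replica, so the search stops at `__all_root_table`, where one replica is still at version 24.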