Description
This alert is triggered when the number of partitions on the OBServer exceeds the threshold. All partitions are counted, including the partitions of built-in tenants such as the sys tenant.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_host_partition_count |
| Source | SQL: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(100000000) */ tenant_id, 1 as role, case when cnt is null then 0 else cnt end as cnt from (select tenant_id, count(*) as cnt from __all_virtual_partition_info where svr_ip = ? and svr_port = ? group by tenant_id) |
| Collected metric | partition_count |
| Metric expression | sum(ob_partition_num{@LABELS}) by (@GBLABELS) |
| Collection cycle | 1 second |
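The metric expression sums the per-tenant counts returned by the collection SQL into one value per server. The following is a minimal Python sketch of that aggregation, not the monitoring agent's actual implementation. The keys `svr_ip`, `tenant_id`, and `cnt` mirror the SQL columns. The function name `host_partition_count`, the dictionary layout, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def host_partition_count(samples):
    """Sum per-tenant partition counts into one total per server (hypothetical helper)."""
    totals = defaultdict(int)
    for s in samples:
        # A missing count is treated as 0, like the CASE expression in the collection SQL.
        totals[s["svr_ip"]] += s.get("cnt") or 0
    return dict(totals)

# Illustrative samples only; the addresses and counts are made up.
samples = [
    {"svr_ip": "10.0.0.1", "tenant_id": 1, "cnt": 1200},
    {"svr_ip": "10.0.0.1", "tenant_id": 1001, "cnt": 29000},
    {"svr_ip": "10.0.0.2", "tenant_id": 1, "cnt": None},
]
print(host_partition_count(samples))  # {'10.0.0.1': 30200, '10.0.0.2': 0}
```

In this example the first server totals 30200 partitions across the sys tenant and one user tenant, which would exceed the default threshold.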
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_host_partition_count | 30000 | 0 seconds | 30 seconds | 15 minutes |
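The rule above can be read as a small state machine: with a duration of 0 seconds, a single detection cycle above the threshold raises the alert, and the alert clears only after the value has stayed at or below the threshold for 15 minutes. The following Python sketch illustrates that reading. The clearance semantics are an assumption about how the monitoring system interprets "Time before clearance", and the names `AlertState` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 30000            # default threshold from the alert rule table
CLEARANCE_SECONDS = 15 * 60  # time before clearance

@dataclass
class AlertState:
    firing: bool = False
    below_since: Optional[float] = None  # when the value first dropped back

def evaluate(state, value, now):
    """Run one detection cycle; `now` is a timestamp in seconds."""
    if value > THRESHOLD:
        # Duration is 0 seconds, so a single breach fires the alert.
        state.firing = True
        state.below_since = None
    elif state.firing:
        if state.below_since is None:
            state.below_since = now
        # Assumed behavior: clear once the value has stayed low long enough.
        if now - state.below_since >= CLEARANCE_SECONDS:
            state.firing = False
            state.below_since = None
    return state.firing

state = AlertState()
print(evaluate(state, 30001, now=0))    # True: breach fires immediately
print(evaluate(state, 29000, now=30))   # True: still within the clearance window
print(evaluate(state, 29000, now=930))  # False: 15 minutes below the threshold
```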
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Server |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: Cluster:${ob_cluster_name}, Host:${host}, Alert:${alarm_name}. The partition number is ${value}, exceeding the threshold of ${alarm_threshold}.
Overview example: ob_cluster=C1-1000:svr_ip=xxx.xxx.xxx.xxx. The partition number of the OBServer exceeds the threshold.
Details example: Cluster:ob_cluster=C1-1000, Host:xxx.xxx.xxx.xxx, Alert:The partition number of the OBServer exceeds the threshold, The partition number is 30001.0, exceeding the threshold of 30000.0.
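To see how the placeholders expand into the details example, the template can be rendered with Python's `string.Template`, which happens to use the same `${name}` syntax. This is only an illustration: the template engine that the alerting system actually uses is not stated here, and the values are taken from the example above.

```python
from string import Template

# The details template from this section, split across two string literals.
DETAILS = Template(
    "Cluster:${ob_cluster_name}, Host:${host}, Alert:${alarm_name}. "
    "The partition number is ${value}, exceeding the threshold of ${alarm_threshold}."
)

message = DETAILS.substitute(
    ob_cluster_name="ob_cluster=C1-1000",
    host="xxx.xxx.xxx.xxx",
    alarm_name="The partition number of the OBServer exceeds the threshold",
    value=30001.0,
    alarm_threshold=30000.0,
)
print(message)
```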
Impact on the system
The heartbeat Remote Procedure Calls (RPCs) between replicas consume network resources.
When the partition number exceeds the threshold, users cannot create tables or add partitions, and the internal partition balance will be affected.
Possible causes
This problem is commonly found in the following scenarios:
You have created a large number of tables.
The application uses many partitioned tables. The tables are partitioned by time, so the number of partitions keeps increasing.
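For time-partitioned tables, a rough estimate of how soon the threshold will be crossed can help with planning. The sketch below assumes a constant number of new partitions per day, which is a simplification. The function name and the sample figures are hypothetical.

```python
def days_until_threshold(current, daily_new, threshold=30000):
    """Estimate whole days until the partition count exceeds the threshold."""
    if current > threshold:
        return 0
    if daily_new <= 0:
        return None  # no growth, so the threshold is never crossed
    # Smallest d such that current + d * daily_new > threshold.
    return (threshold - current) // daily_new + 1

# 24000 partitions today, 200 new partitions per day.
print(days_until_threshold(24000, 200))  # 31
```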
Suggested solutions
You can select one of the following solutions based on the actual situation.
Delete unwanted tenants, databases, and tables, empty the recycle bin, and perform two rounds of major compaction to reduce the number of partition replicas.
Find the tenant with the most partition replicas.
-- Run the following command to query the top 10 tenants with the most replicas.
SELECT t2.tenant_name, t1.replica_count
FROM (SELECT tenant_id, COUNT(*) AS replica_count
      FROM __all_virtual_partition_info
      GROUP BY tenant_id
      ORDER BY replica_count DESC
      LIMIT 10) t1
JOIN (SELECT tenant_id, tenant_name FROM __all_tenant) t2
  ON t1.tenant_id = t2.tenant_id
ORDER BY replica_count DESC;
Run the following commands to delete data:
-- Drop a tenant.
-- Make sure that the tenants to be dropped are unwanted.
DROP TENANT IF EXISTS `your tenant name`;
-- Drop the database.
DROP DATABASE IF EXISTS `your database name`;
-- Drop a table.
DROP TABLE IF EXISTS `your table name`;
-- Purge the specified database from the recycle bin.
PURGE DATABASE `object_name`;
-- Purge the specified table from the recycle bin.
PURGE TABLE `object_name`;
-- Purge the entire recycle bin.
PURGE RECYCLEBIN;
-- Start major compaction.
ALTER SYSTEM MAJOR FREEZE;
Move units to another OBServer. If no other OBServer is available, add one to scale out the cluster. For more information about cluster scale-out, see Add an OBServer in User Guide.