ob_host_partition_count_over_threshold

2025-07-02 07:30:55  Updated

Description

This alert is triggered when the number of partitions on the OBServer exceeds the threshold. All partitions are counted, including the partitions of built-in tenants such as the sys tenant.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter Value
Metric ob_host_partition_count
Note
The value indicates the number of OBServer partitions. By default, the alert is triggered when the number of partitions exceeds 30,000.
Source SQL: select /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) QUERY_TIMEOUT(100000000) */ tenant_id, 1 as role, case when cnt is null then 0 else cnt end as cnt from (select tenant_id, count(*) as cnt from __all_virtual_partition_info where svr_ip = ? and svr_port = ? group by tenant_id)
Collected metric partition_count
Metric expression sum(ob_partition_num{@LABELS}) by (@GBLABELS)
Collection cycle 1 second

Alert rule

Metric Default threshold Duration Detection cycle Time before clearance
ob_host_partition_count 30000 0 seconds 30 seconds 15 minutes

Alert information

Trigger method Alert level Scope
Metric expression Critical Server

Alert templates

  • Overview: ${alarm_target} ${alarm_name}

  • Details: Cluster:${ob_cluster_name}, Host:${host}, Alert:${alarm_name}. The partition number is ${value}, exceeding the threshold of ${alarm_threshold}.

  • Overview example: ob_cluster=C1-1000:svr_ip=192.168.1.1. The partition number of the OBServer exceeds the threshold.

  • Details example: Cluster:ob_cluster=C1-1000, Host:192.168.1.1, Alert:The partition number of the OBServer exceeds the threshold, The partition number is 30001.0, exceeding the threshold of 30000.0.

Impact on the system

  • The heartbeat Remote Procedure Calls (RPCs) between replicas consume network resources.

  • When the partition number exceeds the threshold, users cannot create tables or add partitions, and the internal partition balance will be affected.

Possible causes

This problem is commonly found in the following scenarios:

  • You have created a large number of tables.

  • The application uses many partition tables. The tables are partitioned by time, causing the partition number to constantly increase.

Suggested solutions

You can select one of the following solutions based on the actual situation.

  • Delete unwanted tenants, databases, and tables, empty the recycle bin, and perform two rounds of major compaction to reduce the number of partition replicas.

    1. Find the tenant with the most partition replicas.

      -- Run the following command to query the top 10 tenants with the most replicas.
      SELECT t2.tenant_name, t1.replica_count
      FROM 
       (SELECT tenant_id, COUNT(*) AS replica_count
        FROM __all_virtual_partition_info
        GROUP BY tenant_id
        ORDER BY replica_count DESC
        LIMIT 10) t1
      JOIN
       (SELECT tenant_id, tenant_name
        FROM __all_tenant) t2
      ON t1.tenant_id=t2.tenant_id
      ORDER BY replica_count DESC;
      
    2. Run the following commands to delete data:

      -- Drop a tenant.
      -- Make sure that the tenants to be dropped are unwanted. 
      DROP TENANT IF EXISTS `your tenant name`;
      
      -- Drop the database.
      DROP DATABASE IF EXISTS `your database name`;
      
      -- Drop a table.
      DROP TABLE IF EXISTS `your table name`;
      
      -- Purge the specified database from the recycle bin.
      PURGE DATABASE `object_name`;
      
      -- Purge the specified table from the recycle bin.
      PURGE TABLE `object_name`;
      
      -- Purge the entire recycle bin.
      PURGE RECYCLEBIN;
      
      -- Start major compaction.
      ALTER SYSTEM MAJOR FREEZE;
      
  • Move units to another OBServer. If no other OBServer is available, add one to scale out the cluster. For more information about cluster scale-out, see Add an OBServer in User Guide.

Contact Us