ob_cluster_no_merge

2023-08-22 02:51:01  Updated

Description

This alert is triggered when the time interval between two major compaction tasks of the OceanBase cluster managed by OceanBase Cloud Platform (OCP) exceeds the threshold. The default threshold is 108000s.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter Value
Metric ob_cluster_no_merge_seconds
Source SQL: select zone, name, value, time_to_usec(now()) from __all_zone;
Note:
  • The value of the value field at merge_start_time is used as the value of the zone_value metric.
  • The value of the time_to_usec(now()) field at merge_start_time is used as the value of the current_timestamp metric.
  • Collected metrics (unit: us) current_timestamp and zone_value
    Metric expression (max(current_timestamp{metric_group="all_zone",name="merge_start_time",@LABELS}) by (@GBLABELS) - max(zone_value{metric_group="all_zone",name="merge_start_time",@LABELS}) by (@GBLABELS)) / 1000000
    Collection cycle 1 second

    The value of the metric ob_cluster_no_merge_seconds indicates the time interval between two major compaction tasks of the OceanBase cluster. When this value is greater than the threshold, this alert is triggered. The default threshold is 108000s.

    Alert rule

    Metric Default threshold (unit: s) Duration Detection cycle Time before clearance
    ob_cluster_no_merge_seconds 108000 0 seconds 60 seconds 5 minutes

    Alert information

    Trigger method Alert level Scope
    Metric expression Critical Cluster

    Alert templates

    • Overview: ${alarm_target} ${alarm_name}

    • Details: ${alarm_target} ${alarm_name}. The last major compaction was ${value}s ago, exceeding the threshold of ${alarm_threshold}s.

    • Overview example: ob_cluster=C1-1000. OceanBase cluster compaction detection failed.

    • Details example: ob_cluster=C1-1000. OceanBase cluster compaction detection failed. The last major compaction was 108001.0s ago, exceeding the threshold of 108000.0s.

    Impact on the system

    RootService of the OceanBase cluster freezes data regularly. If RootService does not freeze or compact the data for a long time, the disk usage increases, causing the disk space to be fully used and affecting writes of the application.

    Possible causes

    This problem is commonly found in cases where the RootService is suspended because RootService is leaderless or the process is suspended due to an exception.

    Suggested solution

    The solution is the same as that for the ob_cluster_no_frozen alert.

    Contact Us