ob_cluster_no_frozen

2023-08-15 11:20:56  Updated

Description

This alert is triggered when the time interval between two freeze tasks of the OceanBase cluster managed by OceanBase Cloud Platform (OCP) exceeds the threshold. The default threshold is 90000s.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter Value
Metric ob_cluster_no_frozen_seconds
Source SQL: javascript select zone, name, value, time_to_usec(now()) from __all_zone; Note * The value of the value field at frozen_time is used as the value of the zone_value metric. * The value of the time_to_usec(now()) field at frozen_time is used as the value of the current_timestamp metric.
Collected metrics (unit: us) current_timestamp and zone_value
Metric expression (max(current_timestamp{metric_group="all_zone",name="frozen_time",@LABELS}) by (@GBLABELS) - max(zone_value{metric_group="all_zone",name="frozen_time",@LABELS}) by (@GBLABELS)) / 1000000
Collection cycle 1 second

The value of the metric ob_cluster_no_frozen_seconds indicates the time interval between two freeze tasks of the OceanBase cluster. When this value is greater than the threshold, this alert is triggered. The default threshold is 90000s.

Alert rule

Metric Default threshold (unit: s) Duration Detection cycle Time before clearance
ob_cluster_no_frozen_seconds 90000 0 seconds 60 seconds 5 minutes

Alert information

Trigger method Alert level Scope
Metric expression Critical Cluster

Alert templates

  • Overview: ${alarm_target} ${alarm_name}

  • Details: ${alarm_target} ${alarm_name}. The last freeze was ${value}s ago, exceeding the threshold of ${alarm_threshold}s.

  • Overview example: ob_cluster=C1-1000. OceanBase cluster freeze detection failed.

  • Details example: ob_cluster=C1-1000. OceanBase cluster freeze detection failed. The last freeze was 90001.0s ago, exceeding the threshold of 90000.0s.

Impact on the system

If RootService does not freeze or compact the data for a long time, the disk usage increases, causing the disk space to be fully used and affecting writes of the application.

Possible causes

  • The daily major compaction feature is disabled or the upgrade mode parameter is enabled.

  • RootService is suspended because RootService is leaderless or the process is suspended due to an exception.

Suggested solutions

  1. Check the daily major compaction feature.In the OCP console, choose Cluster Overview > Compaction Management > Configuration for Major Freeze . In the Major Freeze Strategy section, check whether the Daily Major Freeze Time is specified.

    • If the daily major freeze time is not specified, automatic major freeze will not be triggered. You need to manually enable the major freeze.

    • If the daily major freeze time is specified, go to the next step.

  2. Check whether the RootService service problem is caused by a process exception.

    An observer process exception also triggers the ob_cluster_exists_inactive_server alert at the same time. We recommend that you solve the problem that caused that alert first and then check whether the ob_cluster_no_frozen alert recurs 5 minutes later.

  3. Check whether RootService is leaderless.

    Execute the following statement to query the __all_virtual_core_meta_table table:

    -- Connect to the sys tenant to run the query
    SELECT svr_ip, zone, role, member_list FROM __all_virtual_core_meta_table;
    
    • If the sys tenant is normally connected but the query fails, RootService is leaderless.

      Restart OBServer where RootService is deployed or restart all OBServers on the rootservice_list.

      1. Run the following commands to obtain the IP address of the corresponding OBServer:

        -- Connect to the sys tenant to run the query
        
        -- Query the OBServer where RootService is deployed
        SELECT zone, svr_ip, svr_port
        FROM __all_server WHERE with_rootserver=1;
        
        -- Query rootservice_list
        -- OBServers on the rootservice_list are separated by semicolons (;). Each part represents an OBServer.
        SELECT DISTINCT `value` AS rootservice_list
        FROM __all_virtual_sys_parameter_stat
        WHERE `name` = 'rootservice_list';
        
      2. Find the target OBServer in the OBServers list on the Overview page of the cluster, and then click Restart .

        Note

        If you select Force Restart, the OBServer will be restarted by ending the observer process. When no response is received from the majority of nodes, you need to select this option to restart the OBServer. If you restart the OBServer when no response from the majority of nodes is received, some replicas may be leaderless. You will need to wait for 15 minutes before you can connect to the sys tenant. The same applies to the business tenants.

      3. If OBServer is not restored 15 minutes after the restart or force restart, go to Step 3.

    • If the leader is available, go to Step 4.

  4. Collect the information of the OBServer logs and contact OCP technical support for help.

Contact Us