ob_tenant_task_timeout

2023-08-22 02:46:05  Updated

Description

Some operations in OceanBase Database, such as minor compactions and major compactions, are performed as tasks. This alert is triggered if the task execution duration exceeds the threshold, which is 3 hours by default.

Principle

The following table describes the key parameters that are involved in the monitoring and alerting logic.

Parameter Value
Metric ob_sys_task_max_duration_seconds
Source SELECT /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK)*/ __all_tenant.tenant_id,__all_tenant.tenant_name, sys_task.task_id, sys_task.task_type, sys_task.max_sys_task_duration_seconds FROM (SELECT tenant_id, task_type, task_id, max(TIMESTAMPDIFF(SECOND, sys_task_status.start_time, CURRENT_TIMESTAMP ) ) max_sys_task_duration_seconds FROM __all_virtual_sys_task_status sys_task_status WHERE svr_ip = ? AND svr_port = ? GROUP BY tenant_id,task_type,task_id ) sys_task LEFT JOIN__all_tenant ON __all_tenant.tenant_id = sys_task.tenant_id
Collected metric ob_sys_task_max_duration_seconds
Metric expression max(ob_sys_task_max_duration_seconds{@LABELS}) by (@GBLABELS)
Collection cycle 60 seconds

Alert rule

Metric Default threshold (unit: s) Source Detection cycle Time before clearance
ob_sys_task_max_duration_seconds 10800 SQL collection 60 seconds 5 minutes

Alert information

Trigger method Alert level Scope
Based on the expression of the metric Critical Tenant

Alert templates

  • Alert overview

    • Template: ${alarm_target} ${alarm_name}

    • Example: ob_cluster=obcluster:1001,tenant_name=tenant1,svr_ip=xxx.xxx.xxx.xxx,task_type=sstable A system task has timed out in the OceanBase Database tenant.

  • Alert details

    • Template: Cluster: ${ob_cluster_name}; Tenant: ${tenant_name}; Host: ${svr_ip} (zone: ${obzone}); Alert: A system task has timed out in the OceanBase Database tenant. Task Type: ${task_type}; Execution Time: ${value_shown}, which exceeds the threshold of ${alarm_threshold} seconds.

    • Example: Cluster: obcluster; Tenant: tenant1; Host: xxx.xxx.xxx.xxx (zone: zone1); Alert: A system task has timed out in the OceanBase Database tenant. Task Type: sstable major merge; Execution Time: 3 hours, which exceeds the threshold of 1800 seconds.

Impact on the system

System tasks that have been running for a long time and remain in the unfinished state fall short of business expectations and must be diagnosed. For example, a minor compaction or major compaction task that remain unfinished for a long time prevents the memory from being released, and the occupied memory space continuously increases. This drains the memory and affects the system stability.

Possible causes

Check for the following possible causes:

  1. If a minor compaction task remains unfinished for a long time, check whether the write speed is higher than the compaction speed.

  2. High system loads affect task scheduling.

Solutions

  1. Check whether the task type and execution duration are as expected.

    For example, a major compaction task may last for several hours, which exceeds the alert threshold and triggers an alert. In this case, you can adjust the threshold. If a major compaction task remains unfinished for a long time, identify the cause and perform troubleshooting accordingly.

  2. Identify the cause and adopt a solution based on the task type.

    You can view the system load, SQL diagnostics, and transaction diagnostics to determine the cause.

Contact Us