Description
Some operations in OceanBase Database, such as minor compactions and major compactions, are performed as tasks. This alert is triggered if the task execution duration exceeds the threshold, which is 3 hours by default.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_sys_task_max_duration_seconds |
| Source | sql SELECT /*+ MONITOR_AGENT READ_CONSISTENCY(WEAK) */ __all_tenant.tenant_id, __all_tenant.tenant_name, sys_task.task_id, sys_task.task_type, sys_task.max_sys_task_duration_seconds FROM (SELECT tenant_id, task_type, task_id, max(TIMESTAMPDIFF(SECOND, sys_task_status.start_time, CURRENT_TIMESTAMP ) ) max_sys_task_duration_seconds FROM __all_virtual_sys_task_status sys_task_status WHERE svr_ip = ? AND svr_port = ? GROUP BY tenant_id,task_type,task_id ) sys_task LEFT JOIN __all_tenant ON __all_tenant.tenant_id = sys_task.tenant_id |
| Collected metric | ob_sys_task_max_duration_seconds |
| Metric expression | max(ob_sys_task_max_duration_seconds{@LABELS}) by (@GBLABELS) |
| Collection cycle | 60 seconds |
Alert rule
| Metric | Default threshold (unit: s) | Source | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_sys_task_max_duration_seconds | 10800 | SQL collection | 60 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Tenant |
Alert templates
Alert overview
Template: ${alarm_target} ${alarm_name}
Example: ob_cluster=obcluster:1001,tenant_name=tenant1,svr_ip=192.168.1.1,task_type=sstable A system task has timed out in the OceanBase Database tenant.
Alert details
Template: Cluster: ${ob_cluster_name}; Tenant: ${tenant_name}; Host: ${svr_ip} (zone: ${obzone}); Alert: A system task has timed out in the OceanBase Database tenant. Task Type: ${task_type}; Execution Time: ${value_shown}, which exceeds the threshold of ${alarm_threshold} seconds.
Example: Cluster: obcluster; Tenant: tenant1; Host: 192.168.1.1 (zone: zone1); Alert: A system task has timed out in the OceanBase Database tenant. Task Type: sstable major merge; Execution Time: 3 hours, which exceeds the threshold of 1800 seconds.
Impact on the system
System tasks that have been running for a long time and remain in the unfinished state fall short of business expectations and must be diagnosed. For example, a minor compaction or major compaction task that remain unfinished for a long time prevents the memory from being released, and the occupied memory space continuously increases. This drains the memory and affects the system stability.
Possible causes
Check for the following possible causes:
If a minor compaction task remains unfinished for a long time, check whether the write speed is higher than the compaction speed.
High system loads affect task scheduling.
Solutions
Check whether the task type and execution duration are as expected.
For example, a major compaction task may last for several hours, which exceeds the alert threshold and triggers an alert. In this case, you can adjust the threshold. If a major compaction task remains unfinished for a long time, identify the cause and perform troubleshooting accordingly.
Identify the cause and adopt a solution based on the task type.
You can view the system load, SQL diagnostics, and transaction diagnostics to determine the cause.