Alert description
This alert monitors whether the major compaction of each tenant in the OceanBase cluster is suspended. When OCP detects that the major compaction of a tenant is suspended, it immediately triggers this alert.
Major compaction is a core background operation in OceanBase Database. It merges incremental data with baseline data to free up storage space and optimize query performance. Tenant-level major compaction management is supported starting from OceanBase Database V4.0.
Alert principle
The following table lists the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring Metrics | tenant_compaction_status |
| Monitoring Expression | max(tenant_compaction_status{@LABELS}) by (@GBLABELS) |
| Metric Collection | tenant_compaction_status |
| Metric Source | The OCP scheduled task CheckSuspendTenantCompactionTask queries the OceanBase cluster system view CDB_OB_MAJOR_COMPACTION in real time and reads the is_suspended field to determine whether tenant major compaction is suspended. |
| Collection Cycle | 60 Seconds |
OCP runs the scheduled task TenantSchedules.checkSuspendTenantCompactionStatus() every 60 seconds to collect data. The specific process is as follows:
1. Select the clusters that run OceanBase Database V4.0 or later, are in the running state, and were not registered in black screen mode.
2. For each cluster, execute the following SQL statement in real time by using the CompactionOperator.listObTenantCompaction() method to query the OceanBase cluster:

       SELECT tenant_id, global_broadcast_scn AS broadcast_scn, is_error AS error, status, frozen_scn, last_scn, is_suspended AS suspend, info, start_time, last_finish_time FROM CDB_OB_MAJOR_COMPACTION

3. For each tenant, determine the major compaction status based on the is_suspended field in the query results:
   - If is_suspended is YES (suspend = true), the reported metric value is 1.
   - If is_suspended is NO (suspend = false), the reported metric value is 0.
When the value of the monitoring metric tenant_compaction_status is 1, it indicates that the major compaction for the tenant is paused, triggering an alert.
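The mapping from query results to metric values can be sketched as follows. This is a minimal illustration in Python, not OCP's actual implementation. The function name to_metric_samples and the dictionary layout of each row are assumptions made for this example.

```python
from typing import Dict, Iterable, List


def to_metric_samples(rows: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Map CDB_OB_MAJOR_COMPACTION rows to tenant_compaction_status samples.

    Each row is assumed to be a dict with at least the keys 'tenant_id' and
    'is_suspended' ('YES' or 'NO'), mirroring the columns of the view.
    """
    samples = []
    for row in rows:
        suspended = str(row["is_suspended"]).strip().upper() == "YES"
        samples.append({
            "metric": "tenant_compaction_status",
            "tenant_id": row["tenant_id"],
            "value": 1 if suspended else 0,  # 1 = suspended, 0 = not suspended
        })
    return samples


# Hypothetical rows, for illustration only.
rows = [
    {"tenant_id": 1002, "is_suspended": "YES"},
    {"tenant_id": 1004, "is_suspended": "NO"},
]
samples = to_metric_samples(rows)
for sample in samples:
    print(sample["tenant_id"], sample["value"])
```

In this sketch, only the tenant whose is_suspended value is YES produces a sample with value 1, which is the value that triggers the alert.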
Rule information
| Monitoring Metrics | Default Threshold | Duration | Detection Cycle | Elimination Cycle |
|---|---|---|---|---|
| tenant_compaction_status | 1 | N/A | 60 Seconds | 5 Minutes |
A duration of N/A indicates that the alert is triggered as soon as the metric value equals 1. The condition does not need to persist for any period of time.
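The rule can be sketched as a small evaluation function. This is an illustration only: the document does not state how the @LABELS and @GBLABELS placeholders expand, so grouping by cluster name and tenant name is an assumption, as are the names evaluate and Sample.

```python
from typing import Dict, Iterable, Tuple

THRESHOLD = 1  # default threshold from the rule table

# (cluster name, tenant name, metric value); the grouping labels are assumed.
Sample = Tuple[str, str, int]


def evaluate(samples: Iterable[Sample]) -> Dict[Tuple[str, str], bool]:
    """Return, per (cluster, tenant) group, whether the alert condition holds."""
    grouped: Dict[Tuple[str, str], int] = {}
    for cluster, tenant, value in samples:
        key = (cluster, tenant)
        # Equivalent of max(tenant_compaction_status) by (group labels).
        grouped[key] = max(grouped.get(key, value), value)
    # Duration is N/A, so a single evaluation with value == 1 is enough to fire.
    return {key: value == THRESHOLD for key, value in grouped.items()}


firing = evaluate([
    ("obcluster-1", "mytenant", 1),
    ("obcluster-1", "othertenant", 0),
])
print(firing)
```

Because no duration is configured, the sketch keeps no history between evaluations: each detection cycle is judged on its own.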
Alert information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Based on monitoring metric expressions | Critical | Tenant |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-1:tenant_name=mytenant Tenant Major Compaction Suspended
Alert details
- Template: The major compaction status of tenant ${tenant_name} in OceanBase cluster ${ob_cluster_name} is paused. It is recommended to continue the major compaction.
- Example: The major compaction status of tenant mytenant in OceanBase cluster obcluster-1 is paused. It is recommended to continue the major compaction.
Alert recovery
- Template: Alert: ${alarm_name}, Is tenant major compaction status paused: ${value_shown}
- Example: Alert: Tenant Major Compaction Suspended, Is tenant major compaction status paused: 0
Here, ${alarm_target} indicates the object that triggered the alert, in the format ob_cluster=xxxxxxx:tenant_name=xxxxxxx.
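To see how the placeholders expand, the overview and details templates can be rendered with Python's string.Template, which happens to use the same ${name} syntax. This only illustrates the substitution; the document does not describe OCP's own rendering engine, and the values below are taken from the examples above.

```python
from string import Template

overview = Template("${alarm_target} ${alarm_name}")
details = Template(
    "The major compaction status of tenant ${tenant_name} in OceanBase cluster "
    "${ob_cluster_name} is paused. It is recommended to continue the major compaction."
)

# Example values copied from the alert examples in this section.
values = {
    "alarm_target": "ob_cluster=obcluster-1:tenant_name=mytenant",
    "alarm_name": "Tenant Major Compaction Suspended",
    "tenant_name": "mytenant",
    "ob_cluster_name": "obcluster-1",
}

overview_text = overview.substitute(values)
details_text = details.substitute(values)
print(overview_text)
print(details_text)
```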
Impact on the system
A long-term suspension of tenant major compaction may cause the following issues:
- Continuous storage space consumption: Incremental data cannot be merged with baseline data, causing redundant data in the MemTable and SSTable to accumulate. This leads to a continuous increase in disk usage, which may trigger a low disk space alert in severe cases.
- Query performance degradation: As unmerged incremental data layers accumulate, read operations must scan more data levels, which increases query latency.
- Increased minor compaction pressure: After major compaction is suspended, the number of frozen MemTables continues to increase, which may trigger the tenant_active_memstore_percent_over_threshold alert (excessive MemTable memory usage). In severe cases, this can lead to write throttling.
- Impact on subsequent major compactions: If no major compaction is performed for an extended period, subsequent major compactions may take longer and have a greater impact on system performance while they run.
Possible causes
This is commonly seen in the following scenarios:
- Manually initiated suspension: The administrator manually suspends the tenant's major compaction through the OCP interface or an SQL command (ALTER SYSTEM SUSPEND MERGE). This is typically done to avoid the impact of major compaction on business performance during specific periods, such as promotional campaigns or data migrations.
- Suspension after a major compaction error: An error occurs during the tenant's major compaction, and the system or administrator suspends the compaction to prevent the error from escalating. In this scenario, the ob_tenant_compaction_error (Tenant Major Compaction Error) alert is typically triggered as well.
- Major compaction not resumed after O&M operations: Major compaction was suspended during O&M operations such as cluster upgrade, node replacement, or data migration, and the operator forgot to resume it after the operations were completed.
- Third-party tool or script operation: An automated O&M script or a third-party management tool suspended major compaction while executing a specific operation but did not automatically resume it after the operation completed.
Solution
1. Confirm the tenant information involved in the alert. Obtain the cluster name (ob_cluster_name) and tenant name (tenant_name) from the OCP alert details to locate the specific tenant.
2. Confirm whether the suspension is expected. Contact the cluster administrator to verify whether the major compaction was manually suspended. If it is a planned operation (for example, an intentional suspension during a promotion), you can temporarily ignore the alert, but make sure that the major compaction is resumed promptly after the operation ends.
3. Check for major compaction errors. Verify whether the ob_tenant_compaction_error alert is also present. If the suspension is caused by a major compaction error, you must first troubleshoot and resolve the error. You can execute the following SQL statement under the sys tenant of the OceanBase cluster to view the major compaction details:

       SELECT tenant_id, status, is_error, is_suspended, info FROM CDB_OB_MAJOR_COMPACTION;

   If is_error is YES, the info field contains specific error information. You can execute the following SQL statement to clear the error status:

       ALTER SYSTEM CLEAR MERGE ERROR TENANT = '<tenant_name>';

4. Resume tenant major compaction. After confirming that it is safe to resume, execute the following SQL statement:

       ALTER SYSTEM RESUME MERGE TENANT = '<tenant_name>';

   You can also click the Resume Major Compaction button on the tenant's major compaction management page in OCP.
5. Verify whether the major compaction has resumed. After it resumes, query again to confirm the major compaction status:

       SELECT tenant_id, status, is_suspended FROM CDB_OB_MAJOR_COMPACTION WHERE tenant_id = <tenant_id>;

   After is_suspended is confirmed to be NO, OCP updates the metric value to 0 in the next collection cycle (60 seconds), and the alert automatically clears within a maximum of 5 minutes (resolve_timeout_seconds: 300).

If none of the above methods resolves the issue, contact OCP technical support for troubleshooting.
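The decision logic in the steps above can be summarized in a short sketch. The function name next_actions and the dictionary row layout are hypothetical. The function only suggests what to do next for one row of CDB_OB_MAJOR_COMPACTION and does not connect to a database or execute any statement.

```python
from typing import Dict, List


def next_actions(row: Dict[str, str]) -> List[str]:
    """Suggest the next troubleshooting steps for one CDB_OB_MAJOR_COMPACTION row."""
    is_error = row.get("is_error", "NO").upper() == "YES"
    is_suspended = row.get("is_suspended", "NO").upper() == "YES"

    actions = []
    if is_error:
        # Errors must be handled before the compaction is resumed.
        actions.append("troubleshoot the error in the info field, then clear the error status")
    if is_suspended:
        if is_error:
            actions.append("resume major compaction after the error is resolved")
        else:
            actions.append("confirm the suspension is not planned, then resume major compaction")
    if not actions:
        actions.append("no action needed; wait for the alert to clear")
    return actions


# Hypothetical row, for illustration only.
example = next_actions({"is_error": "YES", "is_suspended": "YES"})
for step in example:
    print(step)
```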
