Description
This alert is triggered when the difference between the major freeze version number and the baseline version number exceeds the threshold.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | ob_cluster_frozen_version_delta |
| Source | SQL: javascript select zone, name, value, time_to_usec(now()) from __all_zone; Note * The value of the time_to_usec(now()) field is used as the value of the current_timestamp metric. * The value of the value field is used as the value of the zone_value metric. |
| Collected metrics | current_timestamp and zone_value |
| Metric expression | max(zone_value{metric_group="all_zone",name="frozen_version",@LABELS}) by (@GBLABELS) - min(zone_value{metric_group="all_zone",name="last_merged_version",@LABELS}) by (@GBLABELS) |
| Collection cycle | 1 second |
The value of the metric ob_cluster_frozen_version_delta indicates the difference between the major freeze version number and the baseline version number of the OceanBase cluster. When the difference exceeds the threshold, this alert is triggered. The default threshold is 1.
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| ob_cluster_frozen_version_delta | 1 | 0 seconds | 60 seconds | 5 minutes |
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Critical | Cluster |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}. The version difference is ${value}, exceeding the threshold of ${alarm_threshold}.
Overview example: ob_cluster=C1-1000. The difference between the major freeze version number and the baseline version number of the OceanBase cluster exceeds the threshold.
Details example: ob_cluster=C1-1000. The difference between the major freeze version number and the baseline version number of the OceanBase cluster is 2.0, exceeding the threshold of 1.0.
Impact on the system
This problem affects the statement response time. In extreme cases, the OBServer memory is overused, the application stops writing, and the clog disk is full.
Possible causes
This problem is commonly found in the following scenarios:
You start a major compaction task when an OceanBase cluster is undergoing an automatic compaction.
You continuously start several major compaction tasks.
You start a new major compaction task before the last major compaction is finished.
Suggested solutions
You can check the following information to determine the causes that triggered the alert.
Choose Compaction Management > Details of Major Freeze in the left-side navigation pane of a cluster and view the information of the latest major compaction.
Choose System Management > Tasks in the left-side navigation pane and view the major compaction tasks.
Perform the following steps based on the causes.
Ignore the alert caused by the manual major compaction task. Wait for the task to complete.
Note: Start a major compaction task only when it is necessary.
A slow or failed major compaction task also triggers other alerts at the same time, such as the ob_cluster_merge_error and the ob_cluster_merge_timeout alerts.
You can solve the problem of those alerts before starting a major compaction task and check whether the alert is triggered again.
For other exceptions, go to the next step.
Try to troubleshoot the problem on your own. For more information, see Exception handling for OceanBase cluster compaction.
If you cannot locate the problem or do not know how to solve it, contact OCP Technical Support for help.