Alert description
Note
This alert is only applicable to clusters of OceanBase Database versions earlier than 4.0.0.0.
Redo logs are synchronized between the primary and standby databases to ensure data consistency between the primary and standby clusters. In asynchronous log transmission mode, log transmission between the primary and standby clusters is not real-time, and there may be a certain delay in log synchronization to the standby cluster. Normally, this delay does not exceed 10 minutes. If the delay exceeds 10 minutes, the alert is triggered.
Alert principle
The following table describes the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring metric | sync_delay_time |
| Data source | OCP-Server collects the value of the CURRENT_SCN field in the internal table v$OB_CLUSTER. The value is collected after the OceanBase cluster has existed for 1 hour and a user tenant exists in the primary database, but a system tenant does not. CURRENT_SCN indicates the point in time up to which the primary and standby databases are synchronized. The synchronization delay is calculated by subtracting the value of this field from the current time. |
| Metric collection unit (unit: seconds) | sync_delay_time |
| Monitoring expression | max(sync_delay_time{@LABELS}) by (@GBLABELS) |
| Metric collection cycle | 60 seconds |
The value of the monitoring metric sync_delay_time indicates the delay time between the primary and standby clusters of OceanBase Database. If the delay time exceeds the threshold (600 seconds by default), the alert is triggered.
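As a rough illustration of the calculation described above, the following Python sketch derives the delay from a CURRENT_SCN value and compares it with the default threshold. It assumes that CURRENT_SCN is a timestamp in microseconds, as the division by 1000000 in the troubleshooting query suggests. The function names are hypothetical and are not part of OCP.

```python
import time

DEFAULT_THRESHOLD_SECONDS = 600  # default alert threshold stated in this document


def sync_delay_seconds(current_scn_usec: int, now_usec: int) -> float:
    """Return the delay in seconds; both arguments are assumed to be microseconds."""
    return (now_usec - current_scn_usec) / 1_000_000


def should_alert(delay_seconds: float, threshold: float = DEFAULT_THRESHOLD_SECONDS) -> bool:
    """The alert fires only when the delay exceeds the threshold."""
    return delay_seconds > threshold


if __name__ == "__main__":
    now_usec = int(time.time() * 1_000_000)
    # Hypothetical CURRENT_SCN that lags 15 minutes behind the current time.
    current_scn = now_usec - 15 * 60 * 1_000_000
    delay = sync_delay_seconds(current_scn, now_usec)
    print(delay, should_alert(delay))
```

A delay of exactly 600 seconds does not trigger the alert in this sketch, because the text states that the delay must exceed the threshold.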
Alarm rules
| Monitoring metric | Default threshold (unit: seconds) | Duration | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| sync_delay_time | 600 seconds | 0 seconds | 60 seconds | 5 minutes |
Alarm information
| Alarm trigger method | Alarm level | Scope |
|---|---|---|
| Based on the expression of the monitoring metric | Warning | OB cluster |
Alarm template
Overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster-1 OceanBase Cluster Synchronization Delay Exceeds
Details
- Template: Cluster: ${ob_cluster_name}, Alarm: ${alarm_name} Log transmission delay ${value_shown} exceeds ${alarm_threshold} seconds.
- Example: Cluster: obcluster-1, Alarm: OceanBase Cluster Synchronization Delay Exceeds, Log transmission delay 3994.293 seconds exceeds 600.0 seconds.
Restoration
- Template: Alarm: ${alarm_name}, OceanBase Primary and Standby Cluster Transmission Delay: ${value_shown}
- Example: Alarm: OceanBase Cluster Synchronization Delay Exceeds, OceanBase Primary and Standby Cluster Transmission Delay: 566.0 seconds
In these templates, ${alarm_target} is in the format ob_cluster=xxxxxxx, where xxxxxxx is the name of the cluster that generated the alarm.
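The placeholder substitution in the templates above can be sketched with Python's `string.Template`, which happens to use the same `${name}` syntax. This only illustrates how the example messages follow from the templates. How OCP renders templates internally is not described here, and the values are taken from the examples in this section.

```python
from string import Template

# Overview and restoration templates copied from this section.
OVERVIEW = Template("${alarm_target} ${alarm_name}")
RESTORATION = Template(
    "Alarm: ${alarm_name}, OceanBase Primary and Standby Cluster "
    "Transmission Delay: ${value_shown}"
)

# Example values from this section.
values = {
    "alarm_target": "ob_cluster=obcluster-1",
    "alarm_name": "OceanBase Cluster Synchronization Delay Exceeds",
    "value_shown": "566.0 seconds",
}

overview_text = OVERVIEW.substitute(values)
restoration_text = RESTORATION.substitute(values)
print(overview_text)
print(restoration_text)
```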
Impact on the system
A large synchronization delay in the standby cluster may cause data inconsistency between the primary and standby clusters, which in turn may prevent a smooth switchover.
Possible causes
- A network failure between the primary and standby clusters.
- High load or insufficient resources in the standby cluster.
- Abnormal servers in the primary or standby cluster.
- Unavailability of the primary cluster.
- Tenant synchronization tasks that are abnormally suspended in the standby cluster.
Procedure
Follow the steps below to determine the cause of the alert and resolve the issue.
Run the following command to check whether the synchronization latency between the primary and standby clusters remains high.
use oceanbase; select (time_to_usec(now(6)) - current_scn)/1000000 from v$ob_cluster;
If the query result is less than 600, the latency is normal, and you can manually clear the alert.
If the query result is greater than 600, the latency remains high. Proceed to the next step.
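The check above can also be scripted. The following is a minimal sketch that assumes you already have a Python DB-API cursor connected to the cluster, for example from a MySQL-compatible driver, and that rows are returned as tuples. The function name `check_sync_delay` is hypothetical. The SQL text is the query from this step.

```python
DELAY_QUERY = (
    "select (time_to_usec(now(6)) - current_scn)/1000000 from v$ob_cluster"
)
THRESHOLD_SECONDS = 600  # default threshold from this document


def check_sync_delay(cursor, threshold=THRESHOLD_SECONDS):
    """Run the delay query on a DB-API cursor.

    Returns (delay_in_seconds, still_high).
    """
    cursor.execute("use oceanbase")
    cursor.execute(DELAY_QUERY)
    row = cursor.fetchone()
    if row is None or row[0] is None:
        raise RuntimeError("v$ob_cluster returned no delay value")
    delay = float(row[0])  # some drivers return Decimal or str for the division
    return delay, delay > threshold
```

If `still_high` is true, continue with the next step. Otherwise the latency is back to normal and the alert can be cleared manually.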
Identify the cause of the alert and clear the fault.
Run the ping command to check whether the network between the primary and standby clusters is faulty.
Check whether the standby cluster is overloaded or resource-deficient.
If the cluster is overloaded or resource-deficient, other alerts will be triggered. You can refer to the corresponding alert to troubleshoot the issue.
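When many servers are involved, the ping check in this step can be automated. The sketch below assumes a Linux-style `ping` that accepts `-c <count>` and returns exit status 0 on success. The helper names are hypothetical, and the host addresses you pass in are your own.

```python
import subprocess


def is_reachable(host, count=3, timeout_s=10, runner=subprocess.run):
    """Return True if `ping -c <count> <host>` exits with status 0."""
    try:
        result = runner(
            ["ping", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # No reply within the timeout, or ping is not installed.
        return False
    return result.returncode == 0


def unreachable_hosts(hosts, **kwargs):
    """Return the hosts that did not answer ping, in input order."""
    return [h for h in hosts if not is_reachable(h, **kwargs)]
```

Run it from a server in the primary cluster against the standby servers, and the other way around, because a network fault can affect only one direction. A successful ping shows basic reachability only and does not rule out packet loss or bandwidth problems.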
Check whether the servers in the primary and standby clusters are running normally.
Log in to the primary and standby clusters, and run the following command to check whether any servers are not started.
SELECT * FROM __all_server WHERE start_service_time <=0;
If any servers are not started, restart them.
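Because the query runs on both clusters, it can help to merge the two result sets into one restart list. The following sketch assumes each row is available as a dictionary and that the `svr_ip` and `svr_port` columns identify a server. Treat those column names and the function name as assumptions to verify against your version.

```python
def restart_candidates(rows_by_cluster):
    """Collect servers whose start_service_time is 0 or negative.

    rows_by_cluster maps a cluster label (for example "primary") to the rows
    fetched from __all_server on that cluster.
    """
    candidates = []
    for cluster, rows in rows_by_cluster.items():
        for row in rows:
            if row["start_service_time"] <= 0:
                candidates.append((cluster, row["svr_ip"], row["svr_port"]))
    return sorted(candidates)
```

An empty list means that every server in both clusters has started its service.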
Check whether the primary cluster can be connected to and whether DDL and DML statements can be executed normally.
If not, contact technical support to locate the cause of the issue and restore the cluster.
Check whether the standby cluster has synchronization errors for a normal tenant.
Log in to the standby cluster, and run the following command to check whether any tenants have abnormal status.
SELECT TENANT_ID,USEC_TO_TIME(REFRESHED_SCHEMA_VERSION),DDL_LAG,USEC_TO_TIME(MIN_SYS_TABLE_SCN),USEC_TO_TIME(MIN_USER_TABLE_SCN) FROM V$OB_CLUSTER_STATS;
If any tenants have abnormal status, locate the cause of the task suspension for these tenants.
Check whether the standby cluster has synchronization errors for the sys tenant.
Log in to the primary and standby clusters, and run the following SQL statement to compare the tenants in the primary and standby clusters and record these tenants.
select * from __all_tenant;
Log in to the primary cluster, and run the following SQL statement to check the execution time of DDL statements for the tenant and record the time.
select * from __all_ddl_operation where ddl_stmt_str !='';
Run the following command to check whether a freeze operation was performed during this period. A freeze operation can cause a cyclic dependency.
select * from __all_core_table where table_name like "%freeze%";
If a freeze operation was performed, delete and rebuild the standby cluster.
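The tenant comparison in the sys tenant check can be done by diffing the two `__all_tenant` result sets. The sketch below assumes that each row is a dictionary with `tenant_id` and `tenant_name` keys. The function name is hypothetical. It only reports differences and does not decide whether the standby cluster must be rebuilt.

```python
def compare_tenants(primary_rows, standby_rows):
    """Return the tenant IDs that exist in only one of the two clusters."""
    primary = {row["tenant_id"]: row["tenant_name"] for row in primary_rows}
    standby = {row["tenant_id"]: row["tenant_name"] for row in standby_rows}
    return {
        "missing_on_standby": sorted(set(primary) - set(standby)),
        "extra_on_standby": sorted(set(standby) - set(primary)),
    }
```

Tenants listed under `missing_on_standby` are the ones to record before you check the DDL execution times on the primary cluster.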