Description
redo-log logs are synchronized between the primary and standby databases to maintain data consistency. In asynchronous log transmission mode, log transmission between the primary and standby clusters is not in real time. The logs of the standby cluster may have some delay. Under normal circumstances, the delay does not exceed 10 minutes. This alert is triggered when the delay exceeds 10 minutes.
Alert information
| Trigger method | Alert level | Scope |
| Based on the expression of the metric | Warning | OceanBase cluster |
Alert rule
| Metric | Default threshold | Source | Duration | Detection cycle | Elimination cycle |
| sync_delay_time | 600s | The CURRENT_SCN field in the OceanBase internal table v$OB_CLUSTER. | 0s | 60s | 5 min |
Note
The CURRENT_SCN field indicates the time point when logs are synchronous between the primary and secondary databases. The delay is the difference between the CURRENT_SCN value and the current time.
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: OBProxy Cluster: ${obproxy_cluster}, Alert: ${alarm_name}. The log transmission delay is ${value} seconds, exceeding the ${alarm_threshold} seconds threshold.
Overview: ob_cluster=cluster-76. The OceanBase cluster synchronization delay is too long.
Details example: OBProxy Cluster: cluster-76, Alert: The OceanBase cluster synchronization delay is too long. The log transmission delay is 3994.293 seconds, exceeding the threshold of 600.0 seconds.
${alarm_target} follows the ob_cluster=xxxxxxx format, where ob_cluster indicates the object cluster of the alert.
Impact on the system
If the synchronization delay of the standby cluster is too high, data may be inconsistent between the primary cluster and standby cluster, resulting in a switchover failure between the primary and standby clusters.
Possible causes
The network between the primary and standby clusters is faulty.
The load on the standby cluster is too high or the resources on it are insufficient.
The active or standby cluster has an abnormal server.
The primary cluster is unavailable.
The standby cluster has abnormally suspended synchronization tasks.
Solutions
Take the following steps to identify the root cause of the alert. Solve the problem accordingly and clear the alert.
Run the following command to check whether the synchronization delay between the primary and standby clusters is still high:
use oceanbase; select (time_to_usec(now(6)) - current_scn)/1000000 from v$ob_cluster;If the return result is less than 600, the delay is normal. You can clear the alert manually.
If the return result is greater than 600, the delay is too high. Go to the next step.
Identify the cause of the alert and clear the fault.
Run the
pingcommand to check the network between the primary and standby clusters.Check whether the load on the standby cluster is too high and whether the resources on it are sufficient.
If the load is too high or resources are insufficient, other alerts are triggered along with this alert. You can view the corresponding alerts for troubleshooting.
Check whether the servers in the primary and standby clusters are normal.
Log on to the primary cluster and the standby cluster, and run the following command to check whether all servers are started:
SELECT * FROM __all_server WHERE start_service_time <=0;Restart servers that are not started.
Check whether the primary cluster can be connected and whether DDL and DML statements can be properly executed.
If the primary cluster is abnormal, contact OceanBase technical support to troubleshoot and recover the primary cluster.
Check whether the standby cluster has a common tenant with a synchronization exception.
Log on to the standby cluster and run the following command to check whether any tenants are in the abnormal state:
SELECT TENANT_ID,USEC_TO_TIME(REFRESHED_SCHEMA_VERSION),DDL_LAG,USEC_TO_TIME(MIN_SYS_TABLE_SCN),USEC_TO_TIME(MIN_USER_TABLE_SCN) FROM V$OB_CLUSTER_STATS;If an abnormal tenant exists in the standby cluster, identify the cause for the suspension of synchronization tasks.
Check whether the standby cluster has a system tenant with a synchronization exception.
Log on to the primary and standby clusters respectively. Execute the following SQL statements, compare the differences between the primary and standby clusters, and record the information of these tenants.
select * from __all_tenant;Log on to the primary cluster. Execute the following SQL statement to check the DDL execution time related to the tenant and record the time:
select * from __all_ddl_operation where ddl_stmt_str !='';Run the following command to check whether the freeze operation is performed in the specified period. The freeze operation causes the circular dependency problem.
select * from __all_core_table where table_name like "%freeze%"If the freeze operation is performed, delete and recreate the standby cluster.