Description
The primary and standby clusters synchronize redo logs to keep their data consistent. In asynchronous log transmission mode, logs reach the standby cluster with some delay. The delay normally does not exceed 10 minutes (600 seconds); if it does, this alert is triggered.
Principle
The following table describes the key parameters that are involved in the monitoring and alerting logic.
| Parameter | Value |
|---|---|
| Metric | sync_delay_time |
| Source | OCP-Server derives the metric from the CURRENT_SCN field of the internal table v$OB_CLUSTER. The value of the CURRENT_SCN field indicates the time point up to which the primary and standby clusters have reached data consistency. The synchronization delay is the difference between the current time and this value. Note: OCP-Server starts to collect the data only after the OceanBase cluster has existed for one hour and the primary cluster contains at least one user tenant in addition to the system tenant. |
| Collected metric (unit: s) | sync_delay_time |
| Metric expression | max(sync_delay_time{@LABELS}) by (@GBLABELS) |
| Collection cycle | 60 seconds |
The value of the metric sync_delay_time indicates the synchronization delay between the primary and standby OceanBase clusters. When this value is greater than the threshold, this alert is triggered. The default threshold is 600s.
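The delay calculation and the threshold check described above can be sketched in Python as follows. This is a minimal illustration, not the OCP-Server implementation. It assumes that CURRENT_SCN is a timestamp in microseconds (consistent with the troubleshooting query in this topic, which divides by 1000000), and the function names and the `samples` structure are hypothetical.

```python
from typing import Dict, Iterable, Tuple

DEFAULT_THRESHOLD_S = 600.0  # default alert threshold stated in this topic


def sync_delay_seconds(current_scn_us: int, now_us: int) -> float:
    """Delay = current time minus CURRENT_SCN, converted from microseconds to seconds."""
    return (now_us - current_scn_us) / 1_000_000


def max_delay_by_cluster(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mimic max(sync_delay_time) grouped by a cluster label (assumed grouping key)."""
    result: Dict[str, float] = {}
    for cluster, delay in samples:
        result[cluster] = max(delay, result.get(cluster, delay))
    return result


def exceeds_threshold(delay_s: float, threshold_s: float = DEFAULT_THRESHOLD_S) -> bool:
    """The alert condition holds only when the delay is strictly greater than the threshold."""
    return delay_s > threshold_s
```

A delay of exactly 600 seconds does not satisfy the condition, because the rule triggers only on values greater than the threshold.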
Alert rule
| Metric | Default threshold (unit: s) | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| sync_delay_time | 600 seconds | 0 seconds | 60 seconds | 5 minutes |
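
One way to read the rule in the preceding table is as a small state machine that is evaluated once per detection cycle. The sketch below is an interpretation, not OCP's actual logic: it assumes that a duration of 0 seconds means the alert fires on the first evaluation above the threshold, and that "time before clearance" means the metric must stay at or below the threshold for 5 minutes before the alert is cleared. `AlertState` and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD_S = 600.0      # default threshold
DURATION_S = 0           # fire as soon as the threshold is exceeded
CLEAR_AFTER_S = 5 * 60   # assumed: time the metric must stay normal before clearing


@dataclass
class AlertState:
    firing: bool = False
    breach_since: Optional[int] = None  # time the value first exceeded the threshold
    normal_since: Optional[int] = None  # time the value first returned to normal


def evaluate(state: AlertState, delay_s: float, now_s: int) -> AlertState:
    """Update the alert state for one detection cycle (called every 60 seconds)."""
    if delay_s > THRESHOLD_S:
        state.normal_since = None
        if state.breach_since is None:
            state.breach_since = now_s
        if now_s - state.breach_since >= DURATION_S:
            state.firing = True
    else:
        state.breach_since = None
        if state.firing:
            if state.normal_since is None:
                state.normal_since = now_s
            if now_s - state.normal_since >= CLEAR_AFTER_S:
                state.firing = False
                state.normal_since = None
    return state
```

Under these assumptions, an alert raised at time 0 that sees normal values from second 60 onward is cleared at the evaluation at second 360.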
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Metric expression | Warning | Cluster |
Alert templates
Overview: ${alarm_target} ${alarm_name}
Details: ${alarm_target} ${alarm_name}. The log transmission latency is ${value}s, exceeding the threshold of ${alarm_threshold}s.
Overview example: ob_cluster=cluster-76. The latency of Oceanbase clusters synchronization is too long.
Details example: ob_cluster=cluster-76. The latency of Oceanbase clusters synchronization is too long. The log transmission latency is 3994.293s, exceeding the threshold of 600.0s.
${alarm_target} follows the ob_cluster=xxxxxxx format. ob_cluster indicates the name of the cluster that generated the alert.
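The placeholder substitution in the templates above can be illustrated with Python's `string.Template`, which happens to use the same `${name}` syntax. This is only a sketch of how the placeholders are filled in; OCP's own rendering engine is not described in this topic, and the alarm name passed in the test values is a hypothetical string.

```python
from string import Template

# Details template copied from this topic.
DETAILS_TEMPLATE = Template(
    "${alarm_target} ${alarm_name}. The log transmission latency is ${value}s, "
    "exceeding the threshold of ${alarm_threshold}s."
)


def render_details(alarm_target: str, alarm_name: str, value: float, threshold: float) -> str:
    """Fill the details template with concrete alert values."""
    return DETAILS_TEMPLATE.substitute(
        alarm_target=alarm_target,
        alarm_name=alarm_name,
        value=value,
        alarm_threshold=threshold,
    )


def cluster_from_target(alarm_target: str) -> str:
    """Extract the cluster name from an 'ob_cluster=<name>' target string."""
    key, sep, name = alarm_target.partition("=")
    if key != "ob_cluster" or not sep or not name:
        raise ValueError(f"unexpected alarm target: {alarm_target!r}")
    return name
```

For example, `cluster_from_target("ob_cluster=cluster-76")` returns the cluster name `cluster-76`.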
Impact on the system
A long synchronization delay on the standby cluster may lead to data inconsistency and can interrupt a primary/standby switchover.
Possible causes
The network connection between the primary and standby clusters is disconnected.
The standby cluster is overloaded or has insufficient resources.
Abnormal servers exist in the primary or standby cluster.
The primary cluster is unavailable.
The standby cluster has a suspended tenant synchronization task.
Suggested solutions
You can perform the following steps to identify the causes that triggered the alert and solve the problems.
Run the following command to check the current synchronization delay.
use oceanbase; select (time_to_usec(now(6)) - current_scn)/1000000 from v$ob_cluster;
If the result is less than 600, the delay is normal and you can manually clear the alert.
Otherwise, go to the next step.
Identify the triggering cause and solve the problem.
Run the ping command to check the network connection between the primary and standby clusters.
Check the load and resources of the standby cluster.
An excessively high load or insufficient resources on the standby cluster also trigger other alerts at the same time. See the topics for those alerts for troubleshooting.
Check the servers in the primary and standby clusters.
Log on to the primary and standby clusters respectively, and run the following command to check for any inactive servers.
SELECT * FROM __all_server WHERE start_service_time <= 0;
Restart inactive servers, if any.
Check the execution of DDL and DML statements in the primary cluster.
If the execution fails, contact an OCP technical support engineer to locate the problem and recover the primary cluster.
Check the synchronization status of user tenants in the standby cluster.
Log on to the standby cluster and run the following command:
SELECT TENANT_ID, USEC_TO_TIME(REFRESHED_SCHEMA_VERSION), DDL_LAG, USEC_TO_TIME(MIN_SYS_TABLE_SCN), USEC_TO_TIME(MIN_USER_TABLE_SCN) FROM V$OB_CLUSTER_STATS;
For user tenants with suspended synchronization tasks, you need to identify the causes.
Check the synchronization with the system tenant.
Log on to the primary and standby clusters respectively, run the following SQL statement, compare the tenants of the primary and standby clusters, and record the tenants that differ.
select * from __all_tenant;
Log on to the primary cluster and run the following SQL statement to check and record the time of any DDL operations that are related to those tenants.
select * from __all_ddl_operation where ddl_stmt_str != '';
Run the following statement to check for freeze operations performed during that period. Freeze operations can cause a circular dependency problem.
select * from __all_core_table where table_name like "%freeze%";
If you find a record of a freeze operation, delete the standby cluster and create a new one.
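The tenant comparison in the system tenant check can be done by hand, but for clusters with many tenants a small script may help. The sketch below assumes the rows returned by `select * from __all_tenant` on each cluster have already been reduced to a mapping of tenant ID to tenant name; the `tenant_id` and `tenant_name` keys and the `diff_tenants` helper are assumptions made for illustration.

```python
from typing import Dict, List


def diff_tenants(primary: Dict[int, str], standby: Dict[int, str]) -> Dict[str, List[int]]:
    """Compare tenant_id -> tenant_name maps taken from the primary and standby clusters."""
    shared = set(primary) & set(standby)
    return {
        # Tenants that exist on the primary cluster but not on the standby cluster.
        "missing_on_standby": sorted(set(primary) - set(standby)),
        # Tenants that exist only on the standby cluster.
        "extra_on_standby": sorted(set(standby) - set(primary)),
        # Tenants present on both clusters whose names differ.
        "name_mismatch": sorted(tid for tid in shared if primary[tid] != standby[tid]),
    }
```

Any tenant ID that appears in one of the three lists is a candidate to record before checking the related DDL operations on the primary cluster.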