The high available (HA) module of OceanBase Migration Service (OMS) ensures the stability and continuity of data transmission. When a server on which OMS is deployed or a task process fails, the HA module can promptly detect the failure and then take steps to restore the service.
Disaster recovery levels supported by the HA module
| Disaster recovery level | Supported? | Deployment requirement | Description |
|---|---|---|---|
| City-level disaster recovery | No | - | This level is not supported. |
| IDC-level disaster recovery | Yes | Multiple nodes in a single region Multiple nodes in multiple regions |
OMS must be deployed on multiple servers across multiple IDCs in the same region. When an IDC fails, the capacity of other IDCs must be sufficient to undertake all tasks of the failed IDC. |
| Server-level disaster recovery | Yes | Multiple nodes in a single region Multiple nodes in multiple regions |
OMS must be deployed on multiple nodes. If you deploy OMS in multiple regions, you must configure at least two servers in each region. When a server fails, the capacity of other servers in the same region must be sufficient to undertake all tasks of the failed server. |
| Component-level disaster recovery | Yes | A single node in a single region Multiple nodes in a single region Multiple nodes in multiple regions |
No requirements for OMS deployment. An abnormal component is restarted or recreated. |
Notice
Only resources in the same region can be scheduled for disaster recovery. In other words, cross-IDC resource scheduling is allowed, but cross-region resource scheduling is not allowed.
HA capabilities of components
The following table describes the HA capabilities of four OMS components.
| Component Name | Purpose | Providing HA capabilities? |
|---|---|---|
| Store | Collects incremental logs of a database. | Yes |
| Incr-Sync | Synchronizes the incremental data of a database. | Yes |
| Full-Import | Migrates all data of a database. | No |
| Full-Verification | Verifies all migrated data of a database. | No |
HA triggering scenarios
From the perspective of HA, IDC-level disaster recovery is the same as server-level disaster recovery. An IDC-level disaster recovery involves the disaster recovery of multiple servers at the same time. The following table describes the two scenarios in which HA is triggered:
| Store | Incr-Sync | Full-Import | Full-Verification | |
|---|---|---|---|---|
| Server failure | ✅ | ✅ | ❌ | ❌ |
| Component exception | ✅ | ✅ | ❌ | ❌ |
Server failure
Each OMS server runs an oms_drc_supervisor component, which works like an agent and periodically reports heartbeats to the OMS MetaDB. If a server fails to report a heartbeat in a period longer than the interval specified by the checkHostDownIntervalSec parameter during a background OMS inspection, OMS marks the server as failed.
HA behaviors after a server failure
- OMS queries Incr-Sync tasks on the failed server and classifies the Incr-Sync threads into two categories, Running and Other, based on their status.
- OMS attempts to terminate Incr-Sync tasks in the Running state. Usually, this operation cannot succeed. This is because a failed server cannot receive commands.
- To ensure the global uniqueness of the Incr-Sync registration information in the MetaDB, regardless of whether the Incr-Sync tasks on the failed server were running, OMS recreates all these tasks on a healthy server where OMS is deployed in the same region and deletes the registration information about these tasks on the failed server.
- For Incr-Sync tasks that were running on the failed server, OMS starts them on the new server. The status of Incr-Sync tasks that were not running remains unchanged after they are migrated to the new server.
Considerations
- When the HA module migrates an Incr-Sync task, all configurations of the task are copied.
- If the failed server is restored, you will have two Incr-Sync tasks performing incremental synchronization. When you query Incr-Sync processes on the restored server in the OMS console, you cannot get any result because the Incr-Sync registration information was deleted when HA recreated the Incr-Sync tasks on a new server.
Component exceptions
| Component | HA supported? | Exception criteria | HA behavioral logic |
|---|---|---|---|
| Store | ✅ | See HA logic of the Store component. | See HA logic of the Store component. |
| Incr-Sync | ✅ | The component status is abnormal. The component is normal if it is in the Starting, Running, Stopped, or Migrating state. |
OMS attempts to restart the Incr-Sync process. The HA module does not operate on the component within 10 minutes after the restart. |
| Full-Import | ❌ | - | - |
| Full-Verification | ❌ | - | - |
HA logic of the Store component
Multiple Store tasks may exist at the same data source to support downstream consumption. This section describes the complex HA logic, based on which the Store component runs.
The HA module checks the number and status of Store tasks at each data source and performs the following operations:
- If no Store tasks exist, the HA module does not create one.
- If at least one Store task exists and you manually stopped all Store tasks, the HA module does not create one. Scenarios 1 and 2 are usually the results of manual operations by an O&M engineer. To ensure the continuity of Store operations, the HA module does not create a Store task in those scenarios. To perform a large-scale O&M operation, we recommend that you disable the HA module of OMS in advance and enable it again after the O&M operation is completed.
- If the number of Store tasks reaches the maximum value that is specified by the subtopicStoreNumberThreshold parameter, the HA module does not create more Store tasks. An excessive number of Store tasks is one of the most common causes of HA failure. This is to prevent the HA module from continuously creating Store tasks in some extreme cases, which depletes the server resources.
- If no Store task is in the Running state (works normally and reports heartbeats at the specified interval), the HA module will attempt to create a Store task.
- If the data held by running Store tasks cannot meet downstream consumption requirements and you have set the perceiveStoreClientCheckpoint parameter to
true, the HA module will attempt to create a new Store task. For example, when a Store task holds only data of the past 8 hours, but the downstream consumers need data of the past 10 hours, the HA module attempts to create a new Store task.
When the HA module creates a Store task, it must determine the start timestamp, which is a unix timestamp, of the Store task. The start timestamp is the point in time from which the new Store task starts to pull the database logs, The start timestamp is determined based on the following rules:
- If you set the
perceiveStoreClientCheckpointparameter tofalse, the HA module does not detect the downstream consumption timestamp. In this case, the start timestamp =${Current time} - refetchStoreIntervalMin. - If you set the
perceiveStoreClientCheckpointparameter totrue, the HA module checks the earliest consumption timestamp of all downstream consumers. In this case, the start timestamp =${The earliest downstream consumption timestamp} - refetchStoreIntervalMin.
Considerations
- OMS does not support Store sharing between migration projects. If you have created multiple migration projects based on the same source database ID, the HA module creates, manages, and maintains a Store task for each project.
- OMS supports Store sharing between synchronization projects. If you have created multiple synchronization projects based on the same source database ID and enabled Store sharing, the HA module creates and maintains the shared Store task for all those synchronization projects.
- If you have set the
perceiveStoreClientCheckpointparameter totrue, but the HA module does not detect a downstream consumer, the HA module cannot obtain a downstream consumption timestamp. In this case, the HA module does nothing.
HA parameters
The following table describes the system parameters that control the behavior of the HA module. To view and edit the parameters, log on to the OMS console, choose System Management > System Parameters, and then search for ha.config on the page that appears.
| Scope | Parameter | Type | Default value | Description |
|---|---|---|---|---|
| Global control | enable | Boolean | false | Specifies whether to enable the HA module. |
| Global control | checkRequestIntervalSec | Integer | 600 | The minimum time interval between HA operations on the same object. This parameter prevents the HA module from performing a large number of repeated disaster recovery operations on the same object in a short period. |
| Server failure | enableHost | Boolean | false | Specifies whether to enable the HA module on a failed server. |
| Server failure | checkHostDownIntervalSec | Integer | 540 | The maximum time interval, in seconds, based on which a server failure is determined. If a server does not report a heartbeat in a period longer than the specified value, the HA module determines that the server fails. |
| Component exception (Store) | enableStore | Boolean | true | Specifies whether to enable the HA module for the Store component. |
| Component exception (Store) | checkModuleExceptionIntervalSec | Integer | 240 | The maximum heartbeat reporting interval for determining whether a Store task is abnormal, in seconds. If a Store task does not report a heartbeat in a period longer than the specified value, the HA module considers it abnormal. |
| Component exception (Store) | subtopicStoreNumberThreshold | Integer | 5 | The maximum number of Store tasks that can be created for a project-data source pair. When the number of Store tasks reaches the specified value, the HA module does not create a new one. |
| Component exception (Store) | refetchStoreIntervalMin | Integer | 30 | The number of minutes between the start timestamp of the Store task created by the HA module and the current time. Notice: If the latency of downstream consumers is large, the new Store task may not meet the requirements of downstream consumption. For example, when the HA module detects a Store exception at 10:00, it creates a Store task to pull the incremental logs of the database from 9:30. In other words, the start timestamp is 30 minutes before the current time, which is 10:00. |
| Component exception (Store) | perceiveStoreClientCheckpoint | Boolean | false | If this parameter is set to true, the HA module creates a Store task to pull the incremental logs from the point in time that is a number of minutes before the earliest consumption timestamp of all downstream consumers. This number of minutes is specified by the refetchStoreIntervalMin parameter. This ensures that the new Store task meets the requirements of all downstream consumers. For example, the when HA module detects a Store exception at 10:00, it creates a Store task. At this moment, the earliest incremental logs consumed by downstream consumers were generated at 9:40. Therefore, the new Store task pulls the incremental logs from 9:10. In other words, the start timestamp is 30 minutes before 9:40, the earliest consumption timestamp of all downstream consumers. |
| Component exception (Store) | clearAbnormalResourceHours | Integer | 72 | The number of hours after which an abnormal Store task has not reported a heartbeat and thus can be cleared by the HA module to release the disk space and reduce the HA restrictions caused by the excessive number of Store tasks. |
| Component exception (Incr-Sync) | enableConnector | Boolean | true | Specifies whether to enable the HA module for the Incr-Sync component. |