This topic describes the data verification feature of OceanBase Migration Service (OMS). OMS identifies and processes inconsistent data by means of data sharding, caching, joining, and verification.
Terms
| Term | Description | Remarks | |
|---|---|---|---|
| Sharding | A database optimization strategy that splits a large table into smaller parts to improve the efficiency of concurrent queries. For example, a table contains 300 records and has a primary key id whose value ranges from 1 to 300. Assume that the table is split into three shards:Shard 1 with the id value less than 100Shard 2 with the id value ranging from 101 to 200Shard 3 with the id value greater than 200Each shard is a subset of the table. Sharding is suitable for large-scale database systems where a full table query takes a long time. It splits large tables to improve database performance and scalability. |
The maximum amount of data contained in a shard is specified by the limitator.select.batch.max parameter. |
|
| Data read | An operation to query data from the source and target databases based on specific conditions by using the JDBC driver. | ||
| Cache | An operation to join data read from the source and target databases based on primary keys or unique keys. | ||
| Verification | A step where OMS fetches the data joined by primary keys or unique keys from the cache, and compares full fields of data. | ||
| Reverification | An operation to query records for inconsistent data found in verification from the source and target databases and compare the records. | The maximum number of reverification rounds is specified by the limitator.reviewer.rounds.max parameter. |
|
| Correction | An operation to generate an INSERT, UPDATE, or DELETE statement based on inconsistent data found in verification as well as the primary key or unique key of the record, and execute the statement to the target database. This ensures consistency between the record corresponding to the primary key or unique key and the data in the source database. |
The correction type is specified by the filter.verify.rectify.type parameter. |
|
| Sharding thread | A thread for splitting tables. You can use multiple sharding threads to concurrently split multiple tables. | The number of sharding threads is specified by the limitator.platform.split.threads.number parameter. |
|
| Data source read thread | A thread for reading data from the data source. You can use a source read thread and a target read thread to concurrently read data from the source and target databases. | The number of data source read threads is specified by the limitator.platform.threads.number parameter. |
|
| Verification thread | A thread for verifying data. You can use multiple verification threads to concurrently verify data. | The number of verification threads is specified by the limitator.verify.thread.size parameter. |
|
| Reverification thread | A thread for reverifying data. You can use multiple reverification threads to concurrently reverify data. | The number of reverification threads is specified by the limitator.review.thread.size parameter. |
|
| Correction thread | A thread for correcting data to the target database. You can use multiple correction threads to concurrently correct data. | The number of correction threads is specified by the limitator.rectify.thread.size parameter. |
Data verification procedure
OMS performs full data verification in the following procedure:
The sharding thread splits the source table and places all shards into queue 1.
The data source read threads in the source and target databases fetch shards from queue 1 and concurrently read data from the source and target databases based on the shards.
The system delivers the data read from the source and target databases for the same shard to the same cache. In the cache, the system joins data based on their primary keys or unique keys. This ensures that source data and target data corresponding to the same shard are stored to the same cache and each shard corresponds to one cache. If the join operation is successful, source data and target data corresponding to the specified primary key or unique key are ready. Then, the system places joined data into queue 2.
The verification thread fetches joined data from queue 2 for verification. If the data corresponding to a primary key or unique key is inconsistent, the system sends the primary key or unique key of the record to the reverification scheduler.
The reverification scheduler uses a specific algorithm to determine the primary keys or unique keys of inconsistent records that meet the reverification condition, and adds them to queue 3.
The reverification thread fetches the primary keys or unique keys of inconsistent records from queue 3 and reverifies the records.
Technical mechanism of data verification
OMS supports regular verification and IN mode verification. The two verification modes differ in the method for reading data from the target table.
Read data from the source table
OMS splits the source table into multiple shards and reads data by shard. For example, a table contains 300 records and has a primary key id whose value ranges from 1 to 300. Assume that the table is split into three shards:
Shard 1 with the
idvalue less than 100Shard 2 with the
idvalue ranging from 101 to 200Shard 3 with the
idvalue greater than 200
OMS reads data from the source table by shard regardless of the verification mode.
// Read data of shard 1 from the source table.
SELECT * FROM t WHERE id <= 100
// Read data of shard 2 from the source table.
SELECT * FROM t WHERE id > 101 AND id <= 200
Read data from the target table
OMS reads data from the target table in different ways based on the verification mode. In regular verification, OMS reads data from the target table by shard the same way it reads data from the source table.
// The table name suffixed with `_map` indicates the target table corresponding to the source table.
// Read data of shard 1 from the target table.
SELECT * FROM t_map WHERE id <= 100
// Read data of shard 2 from the target table.
SELECT * FROM t_map WHERE id > 101 AND id <= 200
In IN mode verification, OMS does not read data from the target table by shard. Instead, it calculates primary key values based on the data read from a shard of the source table, and then reads data from the target table based on the primary key values.
// Read data of shard 1 from the source table.
SELECT * FROM t WHERE id <= 100
// Assume that three data records corresponding to the primary key values 7, 23, and 37 are read.
// Read data from the target table based on the primary key values.
SELECT * FROM t_map WHERE id = 7 OR id = 23 OR id = 37;
// Read data of shard 2 from the source table.
SELECT * FROM t WHERE id > 101 AND id <= 200
// Assume that five data records corresponding to the primary key values 111, 112, 113, 116, and 117 are read.
// Read data from the target table based on the primary key values.
SELECT * FROM t_map WHERE id = 111 OR id = 112 OR id = 113 OR id = 116 OR id = 117;
Scenarios of IN mode verification
Manual configuration: You can use the
filter.verify.inmod.tablesparameter to specify the tables to which you want to apply IN mode verification. If you set this parameter to.*;.*;.*, all tables will be verified in IN mode.Notice
When you specify this parameter, make sure that the setting is appropriate.
You can apply IN mode verification to all tables in the following scenarios:
Data is synchronized from multiple source tables to one target table. For example, data is migrated from tables A_1, A_2, ..., and A_N tables in the source database to table A in the target database. In this case, you must apply IN mode verification to all tables.
The target table contains a large amount of data. When you create a task to migrate data to the target table, if you select Ignore for Processing Strategy When Target Table Has Records, the system delivers the parameter that specifies to apply IN mode verification to all tables.
Partitioned-table verification. When you migrate data between HASH- or KEY-partitioned tables that use different partitioning algorithms, the same data may be distributed to different partitions in the source and target tables. In this case, excessive data inconsistency errors may be falsely returned if you use the regular verification mode.
Inconsistent partition definitions. When the source and target tables have different partition definitions, we recommend that you use IN mode verification.
Multi-table aggregation. When data is synchronized from multiple source tables to one target table, IN mode verification can avoid excessive false inconsistency errors about data that exists only in the target table.
Notice
IN mode verification has the following limitations:
- Performance: IN mode verification has lower efficiency than regular verification if only one table is verified.
- Feature: IN mode verification cannot verify the scenario where the target table contains more data than the source table.