Overview
This topic provides best practices for end-to-end data migration from mainstream analytical processing (AP) databases, including Huawei Cloud DWS, StarRocks, and Alibaba Cloud AnalyticDB for MySQL (ADB), to OceanBase Database.
OceanBase Database supports efficient and scalable direct load through external tables, making it suitable for large-scale data migration scenarios.
Overall migration solution
We recommend the "export-transit-import" three-stage architecture:
- Export phase: In the source AP database, use a writable external table to export data to a structured file (such as Parquet or ORC) and store it in an object storage service (such as S3 or OSS) or HDFS.
- Transit phase: The data is temporarily stored in the object storage service or distributed file system, serving as an intermediate medium.
- Import phase: OceanBase Database reads the files directly using a read-only external table and leverages direct load capabilities to efficiently write the data to the target table.
This solution offers the following advantages:
- Decouples the source and target systems, minimizing the impact on the production environment;
- Utilizes columnar storage formats to enhance I/O and compression efficiency;
- Supports parallel direct load, significantly boosting throughput.
Notice
Currently, OceanBase Database 4.4.x primarily supports full direct load. For incremental data, we recommend using OceanBase Migration Service (OMS).
Export data from the source database
Supported databases
The following AP databases natively support data export through external tables:
- Huawei Cloud GaussDB(DWS)
- StarRocks
- Alibaba Cloud AnalyticDB for MySQL (ADB)
Recommended export format
Prioritize columnar storage formats: Parquet or ORC.
Advantages of the format
- High compression rate: Significantly reduces the volume of intermediate data, saving storage and network bandwidth;
- Strong type and nested structure support: Avoids issues with delimiters and escaping in text formats like CSV;
- High-performance read/write: Compared to CSV, import/export performance improves by approximately 20%–30%, making it ideal for wide tables and complex data types.
Official documentation reference
Database |
Documentation Link |
|---|---|
| DWS | DWS Data Export Guide |
| StarRocks | Use INSERT INTO FILES to export data |
| ADB | Create an external table Export data to OSS |
Export example
StarRocks export example
INSERT INTO FILES(
"path" = "s3://mybucket/unload/data1",
"format" = "parquet",
"compression" = "uncompressed",
"target_max_file_size" = "1024", -- Unit: bytes (example: 1KB)
"aws.s3.access_key" = "xxxxxxxxxx",
"aws.s3.secret_key" = "yyyyyyyyyy",
"aws.s3.region" = "us-west-2"
)
SELECT * FROM sales_records;
ADB export example
CREATE EXTERNAL TABLE IF NOT EXISTS adb_external_demo.osstest3 (
A STRUCT<var1:STRING, var2:INT>
)
STORED AS PARQUET
LOCATION 'oss://testBucketName/osstest/Parquet';
INSERT INTO adb_external_demo.osstest3 SELECT * FROM t1;
Note
Make sure that the target bucket has the correct access permissions and verify network connectivity.
Import full data into OceanBase Database
OceanBase Database provides two main methods for importing data from object storage:
Use the INSERT INTO URL external table method
This method is suitable for scenarios where you directly specify the file path. It offers flexible syntax and supports regular expression matching for filenames.
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ INTO t1
SELECT * FROM FILES(
location = 's3://data/?host=xxx&access_id=xxx&access_key=xxx',
format = (TYPE = 'PARQUET'),
pattern = 'datafiles$'
);
Use the LOAD DATA statement
This method is ideal for standardized batch import tasks.
LOAD DATA /*+ direct(true,0) parallel(2) */
FROM 's3://data/?host=xxx&access_id=xxx&access_key=xxx'
INTO TABLE t1
FORMAT (type = 'PARQUET');
Note
Both methods support direct(true) write mode and parallelism control, significantly enhancing import performance.
For detailed syntax and parameter descriptions, see: Overview of full direct load
Directly read external data sources (advanced scenarios)
In addition to object storage, OceanBase Database also supports direct access to heterogeneous data sources such as HDFS and ODPS (MaxCompute) through external tables, making it suitable for users with existing data lake architectures.
HDFS data source
Here is an example of the syntax:
INSERT INTO t1
SELECT * FROM files(
LOCATION = 'hdfs://${namenode_host}:${namenode_port}/user?principal=hdfs/hadoop@EXAMPLE.COM&keytab=/path/to/hdfs.keytab&krb5conf=/path/to/krb5.conf&configs=dfs.data.transfer.protection=authentication,privacy',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '\''
),
PATTERN = 'test_tbl1.csv'
);
ODPS (MaxCompute) data source
Here is an example of the syntax:
INSERT INTO t1
SELECT * FROM SOURCE (
TYPE = 'ODPS',
ACCESSID = 'your_id',
ACCESSKEY = 'your_key',
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
PROJECT_NAME = 'my_project',
TABLE_NAME = 'my_table'
);
Reference documentation: CREATE EXTERNAL TABLE (MySQL mode)
Incremental data synchronization strategy
Currently, the direct load feature of OceanBase Database 4.4.x only supports full load. For continuous write scenarios, we recommend using the "full load + incremental load" hybrid approach.
Full load by partition (limited support for incremental load)
If the source data is partitioned by time (e.g., monthly), you can perform a full direct load on new partitions:
-- Append data to specified partitions
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ INTO t1 PARTITION(p0, p1)
SELECT * FROM external_table;
-- Overwrite data in specified partitions
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ OVERWRITE t1 PARTITION(p0, p1)
SELECT * FROM external_table;
Limitations
- Table-level locking: Even if you specify partitions, the import process will lock the entire table, preventing parallel import tasks for multiple partitions;
- Hidden table overhead: A temporary hidden table must be created for the entire table during each import, significantly increasing metadata operation time when there are numerous partitions.
Recommendation: For highly partitioned tables, consider using Partition Exchange to avoid these issues. For more information, see Partition Exchange.
Use OMS for incremental synchronization
For real-time synchronization of incremental changes, we recommend using OceanBase Migration Service (OMS):
Prerequisites: The source system must support CDC (Change Data Capture);
AP Database Limitations: DWS, StarRocks, and ADB typically do not provide native CDC outputs;
Best Practice: Capture incremental logs from the upstream business systems (e.g., MySQL, Oracle) of these AP databases and synchronize them to OceanBase via OMS.
Supported Databases: OMS supports incremental synchronization from Oracle (including OGG) and MySQL, among other popular OLTP databases.
For detailed instructions, see: Use OMS for data migration
