Overview
This topic provides end-to-end best practices for migrating data from mainstream analytical processing (AP) databases, including Huawei Cloud DWS, StarRocks, and Alibaba Cloud ADB (AnalyticDB for MySQL), to OceanBase.
OceanBase supports efficient, scalable direct load through the external table mechanism, making it suitable for large-scale data migration scenarios.
Overall migration solution
We recommend adopting a three-phase architecture of "export - staging - import":
- Export phase: In the source AP database, export data to structured files (such as Parquet or ORC) using writable external tables, and store them in object storage (such as S3 or OSS) or HDFS.
- Staging phase: Temporarily store the data in object storage or a distributed file system as an intermediate medium.
- Import phase: OceanBase directly reads the files mentioned above through read-only external tables and leverages direct load capabilities to efficiently write the data into target tables.
This solution has the following advantages:
- Decouples the source and target systems, reducing impact on the production environment.
- Improves I/O and compression efficiency by using columnar storage formats.
- Supports parallel and direct load, significantly increasing throughput.
Notice
The current OceanBase Database V4.4.x version primarily supports full direct load; for incremental data, it is recommended to use OceanBase Migration Service (OMS).
Export data from the source database
Supported databases
The following AP databases natively support exporting data via external tables:
- Huawei Cloud GaussDB (DWS)
- StarRocks
- Alibaba Cloud AnalyticDB for MySQL (ADB)
Recommended export format
Prioritize columnar storage formats: Parquet or ORC.
Format advantages
- High compression ratio: Significantly reduces intermediate data volume, saving storage and network bandwidth.
- Strong type and nested structure support: Avoids delimiter conflicts and escape issues found in text formats like CSV.
- High-performance read/write: Improves import/export performance by approximately 20%–30% compared to CSV, especially suitable for wide tables and complex data types.
Official documentation reference
database |
Document link |
|---|---|
| DWS | DWS Data Export Guide |
| StarRocks | Use INSERT INTO FILES to export data |
| ADB | Create an external table Export data to OSS |
Export examples
StarRocks export example
INSERT INTO FILES(
"path" = "s3://mybucket/unload/data1",
"format" = "parquet",
"compression" = "uncompressed",
"target_max_file_size" = "1024", -- Unit: bytes (example: 1KB)
"aws.s3.access_key" = "xxxxxxxxxx",
"aws.s3.secret_key" = "yyyyyyyyyy",
"aws.s3.region" = "us-west-2"
)
SELECT * FROM sales_records;
ADB export example
CREATE EXTERNAL TABLE IF NOT EXISTS adb_external_demo.osstest3 (
A STRUCT<var1:STRING, var2:INT>
)
STORED AS PARQUET
LOCATION 'oss://testBucketName/osstest/Parquet';
INSERT INTO adb_external_demo.osstest3 SELECT * FROM t1;
Note
Ensure the target bucket has correct access permissions configured and verify network connectivity.
Full data import to OceanBase
OceanBase provides two mainstream methods for importing data from object storage:
Import using an INSERT INTO URL external table
Suitable for scenarios where file paths can be directly specified. The syntax is flexible and supports regular expression matching of filenames.
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ INTO t1
SELECT * FROM FILES(
location = 's3://data/?host=xxx&access_id=xxx&access_key=xxx',
format = (TYPE = 'PARQUET'),
pattern = 'datafiles$'
);
Import using a LOAD DATA statement
Suitable for standardized batch import tasks.
LOAD DATA /*+ direct(true,0) parallel(2) */
FROM 's3://data/?host=xxx&access_id=xxx&access_key=xxx'
INTO TABLE t1
FORMAT (type = 'PARQUET');
Note
Both methods support the direct(true) direct write mode and parallelism control, which can significantly improve import performance.
For detailed syntax and parameter descriptions, see: Overview of full direct load
Directly read external data sources (Advanced scenarios)
Besides object storage, OceanBase also supports directly accessing heterogeneous data sources such as HDFS and ODPS through external tables, which is suitable for users with existing data lake architectures.
HDFS data source
Syntax example:
INSERT INTO t1
SELECT * FROM files(
LOCATION = 'hdfs://${namenode_host}:${namenode_port}/user?principal=hdfs/hadoop@EXAMPLE.COM&keytab=/path/to/hdfs.keytab&krb5conf=/path/to/krb5.conf&configs=dfs.data.transfer.protection=authentication,privacy',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '\''
),
PATTERN = 'test_tbl1.csv'
);
ODPS (MaxCompute) data source
Syntax example:
INSERT INTO t1
SELECT * FROM SOURCE (
TYPE = 'ODPS',
ACCESSID = 'your_id',
ACCESSKEY = 'your_key',
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
PROJECT_NAME = 'my_project',
TABLE_NAME = 'my_table'
);
Reference document: CREATE EXTERNAL TABLE (MySQL-compatible mode)
Incremental data synchronization strategies
The direct load feature in the current OceanBase Database V4.4.x series only supports full import. For continuous write scenarios, it is recommended to adopt a "full + incremental" hybrid approach.
Full import at the partition level (with limited incremental support)
If the source data is partitioned by time (for example, monthly), you can perform a full direct load for new partitions:
-- Import Specified Partition (Append)
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ INTO t1 PARTITION(p0, p1)
SELECT * FROM external_table;
-- Overwrite Specified Partition
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ OVERWRITE t1 PARTITION(p0, p1)
SELECT * FROM external_table;
Limitations
- Table-level lock limitation: Even when specifying a partition, the entire table is locked during the import process, preventing parallel execution of import tasks for multiple partitions.
- Hidden table overhead: A temporary hidden table must be created for the entire table each time an import occurs. When the number of partitions is large, metadata operations significantly increase in duration.
Optimization suggestion: For tables with a high number of partitions, you can combine the Partition Exchange technique to avoid the above issues.
Use OMS for incremental synchronization
For scenarios requiring real-time synchronization of incremental changes, it is recommended to use OceanBase Migration Service (OMS):
Applicability: The source must have Change Data Capture (CDC) capability.
AP database limitations: DWS, StarRocks, ADB, etc., typically do not provide native CDC output.
Recommended approach: Capture incremental logs from the upstream business systems (such as MySQL or Oracle) of these AP databases and synchronize them to OceanBase via OMS.
Supported scenarios: OMS supports incremental synchronization from mainstream OLTP databases such as Oracle (including OGG) and MySQL.
For detailed operation instructions, see: Use OMS for data migration
