Overview
This topic provides end-to-end best practices for migrating data from mainstream analytical processing (AP) databases, including Huawei Cloud GaussDB(DWS), StarRocks, and Alibaba Cloud AnalyticDB for MySQL (ADB), to OceanBase Database.
OceanBase Database supports efficient and scalable direct load through external tables, making it suitable for large-scale data migration scenarios.
Overall migration solution
We recommend the "export-transfer-import" three-phase architecture:
- Export phase: In the source AP database, use writable external tables to export data as structured files (such as Parquet or ORC) and store them in an object storage service (such as S3 or OSS) or HDFS.
- Transfer phase: The data is temporarily stored in an object storage service or distributed file system, serving as an intermediate medium.
- Import phase: OceanBase Database reads the file directly through a read-only external table and uses direct load capabilities to efficiently write the data to the target table.
This solution offers the following advantages:
- Decouples the source and target systems, minimizing the impact on the production environment;
- Utilizes columnar storage formats to improve I/O and compression efficiency;
- Supports parallel direct load, significantly boosting throughput.
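The export and import phases can be sketched as one SQL statement on each side. This is a minimal illustration only: the bucket path, endpoint, and credentials are placeholders, and StarRocks is assumed as the example source.

```sql
-- Phase 1 (source, StarRocks): export the table as Parquet files to object storage
INSERT INTO FILES(
    "path" = "s3://my-bucket/unload/sales/",
    "format" = "parquet",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
)
SELECT * FROM sales_records;

-- Phase 3 (target, OceanBase): direct-load the Parquet files into the target table
LOAD DATA /*+ direct(true, 0) parallel(16) */
FROM 's3://my-bucket/unload/sales/?host=<endpoint>&access_id=<access_key>&access_key=<secret_key>'
INTO TABLE sales_records
FORMAT (type = 'PARQUET');
```

The transfer phase needs no statement of its own: the object storage bucket is the intermediate medium shared by both sides.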
Notice
Currently, OceanBase Database 4.4.x primarily supports full direct load. For incremental data, we recommend using OceanBase Migration Service (OMS).
Export data from the source database
Supported databases
The following AP databases natively support data export through external tables:
- Huawei Cloud GaussDB(DWS)
- StarRocks
- Alibaba Cloud AnalyticDB for MySQL (ADB)
Recommended export format
Prioritize columnar storage formats: Parquet or ORC.
Advantages of these formats
- High compression rate: Significantly reduces the volume of intermediate data, saving storage and network bandwidth;
- Strong type and nested structure support: Avoids issues with delimiters and escaping in text formats like CSV;
- High-performance read and write: Compared to CSV, import/export performance improves by approximately 20%–30%, making it ideal for wide tables and complex data types.
Official documentation
| Database | Documentation Link |
|---|---|
| DWS | DWS Data Export Guide |
| StarRocks | Unload data using INSERT INTO FILES |
| ADB | Create an external table; Export data to OSS |
Export examples
StarRocks export example
INSERT INTO FILES(
"path" = "s3://mybucket/unload/data1",
"format" = "parquet",
"compression" = "uncompressed",
"target_max_file_size" = "1024", -- Unit: bytes (example: 1 KB)
"aws.s3.access_key" = "xxxxxxxxxx",
"aws.s3.secret_key" = "yyyyyyyyyy",
"aws.s3.region" = "us-west-2"
)
SELECT * FROM sales_records;
ADB export example
CREATE EXTERNAL TABLE IF NOT EXISTS adb_external_demo.osstest3 (
A STRUCT<var1:STRING, var2:INT>
)
STORED AS PARQUET
LOCATION 'oss://testBucketName/osstest/Parquet';
INSERT INTO adb_external_demo.osstest3 SELECT * FROM t1;
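GaussDB(DWS) export example

For GaussDB(DWS), export typically goes through a write-only OBS foreign table. The following is a sketch based on the general DWS foreign-table syntax; the server object `obs_server`, the table schema, and the bucket path are illustrative placeholders, and the exact options may vary by DWS version, so verify against the DWS documentation.

```sql
-- DWS: create a write-only foreign table pointing to an OBS folder
-- (assumes an OBS server object named obs_server was created in advance)
CREATE FOREIGN TABLE sales_records_obs_ext (
    id     BIGINT,
    amount DECIMAL(18, 2)
)
SERVER obs_server
OPTIONS (
    format 'orc',                          -- columnar export format
    foldername '/my-bucket/unload/sales/', -- target OBS folder
    encoding 'utf8'
)
WRITE ONLY;

-- Export the data by inserting into the foreign table
INSERT INTO sales_records_obs_ext SELECT * FROM sales_records;
```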
Note
Make sure the target bucket has the correct access permissions and verify network connectivity.
Import full data to OceanBase
OceanBase Database provides two main methods for importing data from object storage:
Use INSERT INTO with URL external tables
This method is suitable for directly specifying file paths. It offers flexible syntax and supports regular expression matching for filenames.
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ INTO t1
SELECT * FROM FILES(
location = 's3://data/?host=xxx&access_id=xxx&access_key=xxx',
format = (TYPE = 'PARQUET'),
pattern = 'datafiles$'
);
Use LOAD DATA statements
This method is ideal for standardized batch import tasks.
LOAD DATA /*+ direct(true,0) parallel(2) */
FROM 's3://data/?host=xxx&access_id=xxx&access_key=xxx'
INTO TABLE t1
FORMAT (type = 'PARQUET');
Note
Both methods support direct load mode (the direct(true) hint) and parallelism control, which together significantly improve import performance.
For detailed syntax and parameter information, see: Overview of full direct load
Directly read external data sources (advanced scenarios)
In addition to object storage, OceanBase Database also supports direct access to heterogeneous data sources like HDFS and ODPS (MaxCompute) through external tables, making it suitable for users with existing data lake architectures.
HDFS data source
Syntax example:
INSERT INTO t1
SELECT * FROM files(
LOCATION = 'hdfs://${namenode_host}:${namenode_port}/user?principal=hdfs/hadoop@EXAMPLE.COM&keytab=/path/to/hdfs.keytab&krb5conf=/path/to/krb5.conf&configs=dfs.data.transfer.protection=authentication,privacy',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '\''
),
PATTERN = 'test_tbl1.csv'
);
ODPS (MaxCompute) data source
Syntax example:
INSERT INTO t1
SELECT * FROM SOURCE (
TYPE = 'ODPS',
ACCESSID = 'your_id',
ACCESSKEY = 'your_key',
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
PROJECT_NAME = 'my_project',
TABLE_NAME = 'my_table'
);
Reference documentation: CREATE EXTERNAL TABLE (MySQL-compatible mode)
Incremental data synchronization strategy
The direct load feature in OceanBase Database 4.4.x only supports full load. For continuous write scenarios, we recommend using the "full load + incremental load" hybrid approach.
Partition-level full load (a limited form of incremental loading)
If the source data is partitioned by time (e.g., monthly partitions), you can perform a full load on new partitions:
-- Append data to specified partitions
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ INTO t1 PARTITION(p0, p1)
SELECT * FROM external_table;
-- Overwrite data in specified partitions
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ OVERWRITE t1 PARTITION(p0, p1)
SELECT * FROM external_table;
Limitations
- Table-level locking: Even when specifying partitions, the import process will lock the entire table, preventing parallel imports across multiple partitions;
- Hidden table overhead: A temporary hidden table must be created for the entire table during each import. When there are a large number of partitions, this significantly increases metadata operation time.
Recommendation: For highly partitioned tables, you can use Partition Exchange to avoid these issues.
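As a sketch of the partition-exchange approach (table and partition names are illustrative, and it assumes your OceanBase version supports ALTER TABLE ... EXCHANGE PARTITION): direct-load each new batch into a standalone staging table, then swap the staging table with the target partition, so the operation is a metadata exchange rather than a full-table import.

```sql
-- 1. Direct-load the new batch into a standalone staging table
--    (t1_staging must match t1's column definitions and indexes, but is non-partitioned)
INSERT /*+ direct(true, 0) enable_parallel_dml parallel(32) */ INTO t1_staging
SELECT * FROM external_table;

-- 2. Swap the staging table with the target partition; only metadata changes hands
ALTER TABLE t1 EXCHANGE PARTITION p0 WITH TABLE t1_staging WITHOUT VALIDATION;
```

Because each batch is loaded into its own staging table, multiple partitions can be prepared in parallel before being exchanged in.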
Use OMS for incremental synchronization
For real-time synchronization of incremental changes, we recommend using OceanBase Migration Service (OMS):
- Prerequisites: The source database must support CDC (Change Data Capture).
- AP database limitations: DWS, StarRocks, ADB, and other AP databases typically do not provide native CDC output.
- Best practice: Capture incremental logs from the upstream business systems (e.g., MySQL, Oracle) that feed these AP databases and synchronize them to OceanBase via OMS.
- Supported databases: OMS supports incremental synchronization from OLTP databases such as Oracle (including via OGG) and MySQL.