Migrating data from the big data ecosystem refers to transferring data that already resides in offline data warehouses (such as Hive, HBase, or ORC/Parquet files) or other big data platform components, rather than in traditional OLTP databases or real-time message queues. The core objective of this type of migration is to efficiently and reliably migrate or integrate this data, which has already been processed or stored within the big data ecosystem, into OceanBase Database or other target systems. This supports advanced analytics, real-time querying, and seamless integration with other business systems.
Migration scenarios
- Synchronize offline data warehouse data to a real-time analytics system: Synchronize data from offline data warehouses such as Hive and Spark to real-time data processing platforms such as Flink and Kafka to support real-time data processing and streaming analytics.
- Archive offline data to long-term storage: Archive historical data from offline data warehouses to long-term storage solutions such as cloud object storage (S3) or Alibaba Cloud OSS to reduce storage costs and meet data retention policies.
- Integrate offline data with external systems: Migrate Hive data to external systems such as Snowflake for cross-team collaboration, or integrate data from the big data platform with business systems.
In these data migration scenarios, the following considerations are important:
- Storage format conversion: When converting between formats (for example, from Parquet to Avro), factors such as compression ratio, query performance, and compatibility must be considered.
- Efficient transfer of large volumes of data: Use distributed tools like DistCp or Spark, leveraging parallel processing and network optimization to enhance transfer efficiency.
- Metadata synchronization: Ensure that metadata from the source (such as the Hive Metastore), including table schemas and partition information, is correctly synchronized and mapped to the target table structures in OceanBase Database.
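The metadata-synchronization step above can be sketched in Python: taking column definitions as they might be read from the Hive Metastore and emitting MySQL-compatible DDL for an OceanBase MySQL tenant. The type-mapping table and helper names below are illustrative assumptions for this sketch, not the actual mapping applied by OMS or OBLoader.

```python
# Illustrative sketch only: a simplified Hive-to-MySQL type mapping.
# Real migration tools apply a much more complete mapping.
HIVE_TO_MYSQL = {
    "tinyint": "TINYINT",
    "smallint": "SMALLINT",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "TINYINT(1)",
    "string": "VARCHAR(1024)",  # assumed length; size it to your data
    "timestamp": "DATETIME",
    "date": "DATE",
}

def hive_type_to_mysql(hive_type: str) -> str:
    """Translate one Hive type; decimal(p,s) passes through unchanged."""
    t = hive_type.strip().lower()
    if t.startswith("decimal"):
        return t.upper()
    return HIVE_TO_MYSQL[t]

def build_create_table(table, columns, partition_cols=()):
    """Emit MySQL-compatible DDL. Hive partition columns become plain
    columns here; a real migration would map them to table partitions."""
    all_cols = list(columns) + list(partition_cols)
    col_defs = ",\n  ".join(
        f"`{name}` {hive_type_to_mysql(htype)}" for name, htype in all_cols
    )
    return f"CREATE TABLE `{table}` (\n  {col_defs}\n);"

# Hypothetical table metadata, as it might be read from the Hive Metastore.
ddl = build_create_table(
    "ods_orders",
    columns=[("order_id", "bigint"), ("amount", "decimal(10,2)"),
             ("created_at", "timestamp")],
    partition_cols=[("dt", "string")],
)
print(ddl)
```

A production pipeline would also carry over column comments, nullability, and primary keys, and translate Hive partition columns into an OceanBase partitioning clause rather than plain columns.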
Related migration documentation
OceanBase provides various solutions for integrating data from the big data ecosystem. You can refer to the corresponding migration operation documentation based on the type of data source:
Migrate data from a Hive data warehouse
Hive, a commonly used data warehouse, supports the following two migration methods:
- Migrate Hive data using OMS Community Edition
- Migrate Hive data using OBLoader
Migrate data from an HBase database
HBase, a distributed NoSQL database, supports the following migration method:
- Migrate HBase data using OMS Community Edition
Migrate data from files
For data stored in file systems, various formats can be migrated:
- Migrate ORC/Parquet/CSV files using OBLoader
Migrate data from Alibaba Cloud AnalyticDB for MySQL
You can use the data migration feature in the Alibaba Cloud OceanBase Database console to migrate data from Alibaba Cloud AnalyticDB for MySQL:
- Migrate data from an AnalyticDB for MySQL instance to a MySQL-compatible tenant of OceanBase Database
Migrate data from a StarRocks database
Scenario 1: Synchronize data for consistency
The data in RDS (Relational Database Service) and StarRocks is completely consistent, but they serve different purposes (RDS for OLTP and StarRocks for OLAP analytics).
- Migration solution: Refer to Migrate heterogeneous data.
Scenario 2: Synchronize data with differences
The data in StarRocks is not completely consistent with that in RDS. For example:
- StarRocks stores historical data for a longer period.
- StarRocks data has been processed using tools like Flink (such as aggregation, ETL, and the addition of calculated fields).
Because StarRocks is a high-performance analytical database, the Flink-OMT tool is recommended for migrating its data.