Data integration in a big data ecosystem refers to migrating data that already resides in an offline data warehouse (such as Hive, HBase, or ORC/Parquet files) or in another big data platform component, rather than pulling it directly from a traditional OLTP database or a real-time message queue. The core goal is to efficiently and reliably move data that has already been processed or stored in the big data ecosystem to OceanBase AP or another target system, to support advanced analysis, real-time queries, or collaboration with other business systems.
Migration scenarios
- Synchronize data from an offline data warehouse to a real-time analytics system: Move data from an offline platform such as Hive or Spark to a real-time data processing platform such as Flink or Kafka to support real-time data processing and streaming analytics.
- Archive offline data to long-term storage: Archive historical data from an offline data warehouse to long-term storage such as S3 or Alibaba Cloud OSS to reduce storage costs and meet data retention policies.
- Integrate offline data with external systems: Migrate Hive data to an external system such as Snowflake for cross-team collaboration, or integrate data from a big data platform with business systems.
In these types of data migration scenarios, the following considerations apply:
- Storage format conversion: For example, converting data from Parquet to Avro requires weighing the compression ratio, query performance, and compatibility of each format.
- Efficient transfer of large volumes of data: Use distributed tools such as DistCp or Spark for parallel processing and network optimization to improve transfer efficiency.
- Metadata synchronization: Ensure that the metadata (such as table schemas and partition information) from the source (such as Hive Metastore) is correctly synchronized and mapped to the target schema in OceanBase AP.
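As an illustration of the metadata synchronization point above, the following minimal Python sketch turns a Hive-style column list into a MySQL-compatible CREATE TABLE statement. The type mapping, the helper names (map_hive_type, build_create_table), and the choice to append partition columns as ordinary columns are all assumptions made for this example. They are not taken from an OceanBase tool, and a real migration should follow the type mapping given in the relevant migration document.

```python
import re

# Assumed, simplified mapping from Hive primitive types to MySQL-compatible
# types. Illustrative only: check the target system's documentation before
# relying on any entry (for example, Hive TIMESTAMP has nanosecond precision,
# while DATETIME(6) keeps only microseconds).
HIVE_TO_MYSQL = {
    "tinyint": "TINYINT",
    "smallint": "SMALLINT",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "TINYINT(1)",
    "string": "TEXT",
    "date": "DATE",
    "timestamp": "DATETIME(6)",
    "binary": "BLOB",
}


def map_hive_type(hive_type: str) -> str:
    """Map one Hive column type to a target type (hypothetical helper)."""
    t = hive_type.strip().lower()
    # Parameterized types keep their arguments, e.g. decimal(10, 2).
    m = re.fullmatch(r"(decimal|varchar|char)\s*\(([\d\s,]+)\)", t)
    if m:
        args = ",".join(part.strip() for part in m.group(2).split(","))
        return f"{m.group(1).upper()}({args})"
    if t in HIVE_TO_MYSQL:
        return HIVE_TO_MYSQL[t]
    # Complex types (array, map, struct) need a case-by-case decision.
    raise ValueError(f"no mapping defined for Hive type: {hive_type!r}")


def build_create_table(table, columns, partition_columns=()):
    """Build a CREATE TABLE statement from (name, hive_type) pairs.

    Hive partition columns are not stored in the data files, so this sketch
    appends them as ordinary columns of the target table.
    """
    all_columns = list(columns) + list(partition_columns)
    defs = [f"  `{name}` {map_hive_type(t)}" for name, t in all_columns]
    return f"CREATE TABLE `{table}` (\n" + ",\n".join(defs) + "\n);"


if __name__ == "__main__":
    ddl = build_create_table(
        "orders",
        [("id", "bigint"), ("amount", "decimal(10,2)")],
        partition_columns=[("dt", "string")],
    )
    print(ddl)
```

Raising an error for unmapped types, rather than falling back to a generic text type, keeps schema mismatches visible before any data is transferred.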
Related migration documents
OceanBase provides several solutions for integrating data in a big data ecosystem. Refer to the migration document that matches your data source type:
Migrate data from a Hive data warehouse
Hive, a commonly used data warehouse, supports the following two migration methods:
Migrate data from an HBase database
As a distributed NoSQL database, HBase supports the following migration method:
Migrate data from files
For data stored in a file system, multiple formats are supported:
Migrate data from Alibaba Cloud AnalyticDB for MySQL
You can use the data migration feature of the OceanBase console to migrate data from Alibaba Cloud AnalyticDB for MySQL:
Migrate data from a StarRocks database
Scenario 1: Synchronize data for consistency
Data in RDS (Relational Database Service) is completely consistent with that in StarRocks, but the two systems serve different purposes (RDS is used for OLTP, and StarRocks is used for OLAP analysis).
- Migration solution: Refer to Heterogeneous data migration.
Scenario 2: Synchronize only the data that differs
Data in StarRocks is not completely consistent with that in RDS. For example:
- StarRocks stores historical data for a longer period.
- Data in StarRocks has been processed by tools such as Flink (for example, through aggregation, ETL, or newly computed fields).
StarRocks is a high-performance analytical database. We recommend that you use the Flink-OMT tool to migrate its data:
- Migration solution: Refer to Use Flink-OMT to migrate data from StarRocks.
