Data collection is a fundamental part of data management. It refers to the process of obtaining data from data sources (such as databases and log files) and transmitting it to OceanBase Database. Based on the triggering mechanism and data scope, data collection can be categorized into two core types: data subscription (incremental migration) and file import.
Data subscription
Data subscription, also known as incremental migration, is a technique for continuously capturing incremental changes from data sources. It uses change data capture (CDC) to transmit data changes (such as inserts, updates, and deletes) to the target system in real time or near real time. Its core goals are:
- Low-latency synchronization: Ensure data consistency between the target system and the source system.
- Resource efficiency: Transmit only the differential data, not the full data.
- Flexible scalability: Support multiple target systems (such as data warehouses and analytics platforms).
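The idea of transmitting only differential data can be illustrated with a minimal sketch. It assumes a hypothetical `ChangeEvent` record and an in-memory dictionary standing in for the target system; neither is part of any OceanBase API, and a real CDC pipeline would read events from a log or a subscription service rather than a Python list.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change record: one row-level change from the source."""
    op: str                               # "insert", "update", or "delete"
    key: Any                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image (None for deletes)


def apply_changes(target: Dict[Any, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[Any, Dict[str, Any]]:
    """Replay change events, in order, against an in-memory target table."""
    for ev in events:
        if ev.op in ("insert", "update"):
            # Upsert the new row image; only the changed row is touched.
            target[ev.key] = dict(ev.row or {})
        elif ev.op == "delete":
            # Ignore deletes for keys the target never saw.
            target.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op!r}")
    return target


# Example: the target already holds order 1; three changes arrive.
orders = {1: {"status": "new"}}
changes = [
    ChangeEvent("insert", 2, {"status": "new"}),
    ChangeEvent("update", 1, {"status": "paid"}),
    ChangeEvent("delete", 2),
]
apply_changes(orders, changes)
```

Because events are applied in source order, the target converges to the same state as the source without a full copy of the data ever being retransmitted.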
Typical use cases:
- Real-time analysis (such as real-time data synchronization from an order system to a BI platform).
- Disaster recovery and active-active architecture (such as incremental replication from a database to a disaster recovery cluster).
- Cross-system data governance (such as replicating MySQL data to Hive for offline analysis).
For more information, see Overview of data subscription.
File import
OceanBase Database provides a variety of flexible data import methods, so you can import data from multiple data sources into the database. Different import methods suit different scenarios. Choose the import tool based on the data source type and the business scenario. As scenarios become more complex, you can combine multiple import methods.
When importing data, consider the data source, the data file format, and what the import tool supports. If the business scenario fixes the data source and the file format, design the import solution around the tools that support them. If the business already relies on a familiar import tool, check what that tool supports and design the import solution around it to fit the business scenario.
Main import methods
- LOAD DATA syntax: Suitable for large-scale data import. It supports CSV, ORC, and Parquet formats.
- OBLoader: A data import tool provided by OceanBase. It supports batch import of data in various file formats.
- External tables: Suitable for data lake analysis scenarios. They do not require the data to be imported into the database.
- INSERT SQL: Suitable for writing small amounts of data.
- Third-party tools: Tools such as OMS, DataX, Flink, and Canal support data import in different scenarios.
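As an illustration of the INSERT SQL method for small data sets, the following is a minimal sketch of loading CSV rows in batches. It uses Python's built-in `sqlite3` module purely as a stand-in database so that the example is self-contained; the `import_csv` helper, the `orders` table, and the batch size are all assumptions made for illustration. Against OceanBase Database you would use a driver compatible with your tenant mode, and its placeholder style may differ from the `?` used here.

```python
import csv
import io
import sqlite3
from itertools import islice


def import_csv(conn, insert_sql, csv_text, batch_size=1000):
    """Insert CSV rows in batches and return the number of rows written.

    `conn` is any DB-API style connection that offers executemany();
    sqlite3 is used here only as a stand-in for a real database.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    total = 0
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(insert_sql, batch)
        conn.commit()  # commit per batch to keep transactions small
        total += len(batch)
    return total


# Hypothetical table and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
csv_data = "id,amount\n1,9.5\n2,20.0\n3,7.25\n"
rows_written = import_csv(
    conn,
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    csv_data,
    batch_size=2,
)
```

Batching keeps each transaction small, which is adequate for modest volumes. For large files, the methods listed above that are intended for bulk loading, such as LOAD DATA or OBLoader, are the better fit.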
Supported data sources
- File systems: Local files, object storage, and HDFS.
- Databases: Relational databases such as MySQL, Oracle, and PostgreSQL.
- Big data platforms: Platforms such as MaxCompute, StarRocks, Doris, and HBase.
- Real-time data streams: Streaming sources such as Kafka and Flink.
For more information, see Overview of data import.
