Data collection is a fundamental part of data management. It is the process of obtaining data from data sources (such as databases and log files) and transmitting it to OceanBase Database. Based on the triggering mechanism and scope, data collection falls into two core types: data subscription (incremental migration) and file import.
Data subscription
Data subscription, also known as incremental migration, is a technique for continuously capturing incremental changes from a data source. It uses change data capture (CDC) to transmit data changes (inserts, updates, and deletes) to the target system in real time or near real time. The core goals are:
- Low-latency synchronization: Keep the target system consistent with the source system with minimal delay.
- Resource efficiency: Transmit only the changed data rather than the full data set.
- Flexible scalability: Support multiple target systems (such as data warehouses and analytics platforms).
Typical applications:
- Real-time analysis (such as real-time data synchronization from an order system to a BI platform).
- Disaster recovery and active-active architecture (such as incremental replication from a database to a disaster recovery cluster).
- Cross-system data governance (such as subscribing MySQL data to Hive for offline analysis).
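The CDC idea described above, where only inserts, updates, and deletes are transmitted and replayed on the target, can be illustrated with a minimal sketch. It is not the OceanBase subscription API: the `ChangeEvent` type, the `apply_changes` function, and the in-memory dictionary standing in for the target system are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change record; real CDC formats carry more metadata."""
    op: str                               # "insert", "update", or "delete"
    key: Any                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes


def apply_changes(target: Dict[Any, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> int:
    """Replay change events on the target in order; return how many were applied."""
    applied = 0
    for ev in events:
        if ev.op in ("insert", "update"):
            # Treat both as an upsert so replaying the same event is harmless.
            target[ev.key] = dict(ev.row or {})
        elif ev.op == "delete":
            # Deleting a key that is already gone is not an error.
            target.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op}")
        applied += 1
    return applied


# The target starts with one row; only the differences are transmitted.
target = {1: {"status": "new"}}
events = [
    ChangeEvent("insert", 2, {"status": "new"}),
    ChangeEvent("update", 1, {"status": "paid"}),
    ChangeEvent("delete", 2),
]
applied = apply_changes(target, events)
print(applied, target)
```

Treating inserts and updates as upserts is a design choice of this sketch: it keeps replay idempotent, which is useful if a change stream redelivers events after a restart.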
For more information, see Overview of data subscription.
File import
OceanBase Database provides flexible data import methods for loading data from various data sources into the database. Different import methods suit different scenarios, so choose an import tool based on the data source type and business scenario. In more complex scenarios, you can combine multiple import methods.
When importing data, consider the data source, the data file format, and what the import tool supports. If the data source and file format are already fixed, start from the data source and choose an import tool that supports it. If your team is already familiar with a particular import tool, check whether that tool supports your data source and file format and whether it fits the business scenario.
Main import methods
- LOAD DATA syntax: Suitable for large-scale data import. It supports CSV, ORC, and Parquet formats.
- OBLoader: A data import tool provided by OceanBase Database. It supports batch import of data in various file formats.
- External tables: Suitable for data lake analysis. Data does not need to be imported into the database.
- INSERT SQL: Suitable for writing small amounts of data.
- Third-party tools: Tools such as OMS, DataX, Flink, and Canal support data import in different scenarios.
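As an illustration of the INSERT SQL method for small data volumes, the following sketch reads a few CSV rows and writes them in batches through Python's DB-API. Several things here are assumptions: the `orders` table, the `insert_rows` helper, and the sample data are hypothetical, and the standard-library `sqlite3` module is used only as a runnable stand-in for a database connection. With OceanBase Database you would connect through a driver for the compatibility mode you use, and the placeholder style may differ (for example, MySQL-compatible drivers typically use `%s` rather than `?`).

```python
import csv
import io
import sqlite3


def insert_rows(conn, rows, batch_size=500):
    """Insert (id, amount) rows in batches; return the number of rows written."""
    sql = "INSERT INTO orders (id, amount) VALUES (?, ?)"
    total = 0
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)  # one round of parameterized inserts
            total += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        conn.executemany(sql, batch)
        total += len(batch)
    conn.commit()
    return total


# Hypothetical sample data; in practice this would come from a file.
csv_text = "id,amount\n1,9.5\n2,12.0\n3,7.25\n"
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(int(r["id"]), float(r["amount"])) for r in reader]

conn = sqlite3.connect(":memory:")  # stand-in for a real database connection
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
written = insert_rows(conn, rows, batch_size=2)
print(written)
```

Parameterized statements keep values out of the SQL text, and batching limits the number of round trips. For large files, a bulk method such as LOAD DATA or OBLoader from the list above is the better fit.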
Supported data sources
- File systems: Local files, object storage, and HDFS.
- Databases: Relational databases such as MySQL, Oracle, and PostgreSQL.
- Big data platforms: MaxCompute, StarRocks, Doris, and HBase.
- Real-time data streams: Kafka and Flink.
For more information, see Overview of data import.