When using OceanBase Database, you need to adopt different data import strategies based on your specific needs, such as real-time or delayed import. This topic describes how to import data into OceanBase Database in T+0 and T+1 scenarios and provides best-practice examples for data migration.
Background information
T+0 means that data is immediately processed and imported after it is generated. This typically requires the system to support high-concurrency data write operations and ensure the real-time and immediate availability of data.
T+1 means that data is processed and imported on the next business day after it is generated (T is the trading day; T+1 is the following day). Compared with T+0, T+1 scenarios have lower real-time requirements.
Import baseline data
Baseline data refers to the initial dataset of a system or project at a specific time or under certain conditions. It represents the starting state or performance level of the system and serves as a reference and basis for future changes. OceanBase Database provides the following methods for importing baseline data:
Online migration from one database to another
If your data is located in a source supported by OceanBase Migration Service (OMS), we recommend that you use OMS to import the data. If OMS does not support your data source, you can use DataX or perform an offline import.
- For more information about DataX, see DataX User Guide.
Offline migration from a file to a database
If OMS does not support your data source, you can export the data as an offline file and then use obloader, DataX, or LOAD DATA to import the data. obloader supports direct load, which can improve the performance of data import.
- For more information about DataX, see DataX User Guide.
- For more information about LOAD DATA, see LOAD DATA (Oracle-compatible mode) and LOAD DATA (MySQL-compatible mode).
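As a concrete illustration, the following LOAD DATA statement (MySQL-compatible mode) imports a local CSV file. The file path, table name, delimiters, and parallel hint value are placeholders for this sketch; adjust them to your environment.

```sql
-- Hypothetical example: import /tmp/t1.csv into table t1 in MySQL-compatible mode.
-- The path, table name, and delimiters are illustrative placeholders.
LOAD DATA /*+ PARALLEL(4) */ LOCAL INFILE '/tmp/t1.csv'
INTO TABLE t1
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the CSV header row
```

Depending on your client settings, you may also need to enable local infile support on the connection.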
Import incremental data
T+0 scenario
T+0 means that data is processed and imported immediately after it is generated. You can use the incremental migration feature of OMS to import data into OceanBase Database in real time, or use third-party integration tools such as Flink or DataX. DataX and OMS support only data import, whereas Flink also supports real-time data processing and transformation, such as data widening and aggregation, as well as downstream storage, analysis, and services.
T+1 scenario
T+1 means that data is processed and imported on the next business day after it is generated. To optimize data import performance for T+1 scenarios, you can use batch processing:
Offline data processing
You can process data generated in a day and batch import it to OceanBase Database during off-peak hours to minimize the impact on business operations. You can use the incremental migration feature of OMS to import data to OceanBase Database on the next business day, or use third-party integration tools such as Flink or DataX to import data. For example, you can use DataX to synchronize business data to OceanBase Database on a daily basis.
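As a hedged sketch of such a daily batch job, the following DataX job configuration reads a MySQL table and writes it to OceanBase Database over the MySQL protocol. The plugin names (mysqlreader, oceanbasev10writer), hosts, credentials, and table names are assumptions for illustration; consult the plugin documentation bundled with your DataX build for the exact parameters.

```json
{
  "job": {
    "setting": { "speed": { "channel": 4 } },
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "src_user",
          "password": "src_pswd",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://src-host:3306/src_db"],
            "table": ["orders"]
          }]
        }
      },
      "writer": {
        "name": "oceanbasev10writer",
        "parameter": {
          "username": "user@test_tenant",
          "password": "pswd",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://127.0.0.1:2881/test_db"],
            "table": ["orders"]
          }]
        }
      }
    }]
  }
}
```

You can then schedule this job with cron or a workflow scheduler so that it runs during off-peak hours.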
Partition exchange
This method applies when the target table is a partitioned table that already contains data, for example, with one partition per day. Directly importing new data into such a table can be slow. To improve import efficiency, you can use the partition exchange feature. The steps are as follows:
- Create a non-partitioned temporary table with the same structure as the target partitioned table.
- Insert the data to be imported into the temporary table.
- Use the partition exchange command to quickly exchange data between the temporary table and the specific partition of the target partitioned table.
For more information, see Partition exchange (MySQL-compatible mode) and Partition exchange (Oracle-compatible mode).
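The steps above can be sketched as follows in MySQL-compatible mode. All table, column, and partition names here are hypothetical, and the exact EXCHANGE PARTITION syntax may vary by OceanBase Database version, so verify it against the partition exchange documentation.

```sql
-- Hypothetical target table, partitioned by day.
CREATE TABLE orders_daily (
  order_id   BIGINT,
  order_date DATE,
  amount     DECIMAL(20, 2)
) PARTITION BY RANGE COLUMNS (order_date) (
  PARTITION p20240101 VALUES LESS THAN ('2024-01-02'),
  PARTITION p20240102 VALUES LESS THAN ('2024-01-03')
);

-- Step 1: create a non-partitioned temporary table with the same structure.
CREATE TABLE orders_tmp (
  order_id   BIGINT,
  order_date DATE,
  amount     DECIMAL(20, 2)
);

-- Step 2: batch-insert the new day's data into the temporary table.
INSERT INTO orders_tmp VALUES (1, '2024-01-01', 99.90), (2, '2024-01-01', 19.90);

-- Step 3: swap the temporary table's data into the target partition.
ALTER TABLE orders_daily EXCHANGE PARTITION p20240101 WITH TABLE orders_tmp;
```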
Best practices
Build a real-time data warehouse based on DataX and Flink SQL
Data migration method:
Historical data
After you use DataX to export the historical data to a CSV file, use DataX to import the CSV file into OceanBase Database. When exporting the CSV file with DataX, we recommend that you connect directly to OceanBase Database through port 2881 in the configuration file. If you use port 2883, which is the OBProxy port, the command may be routed to a different server; if DataX and the CSV file are not deployed on that server, the file cannot be found.
Real-time data
You can use Flink SQL to extract real-time data and write the data to OceanBase Database. The response time is in milliseconds. After testing, we found that the data can be written to OceanBase Database within 1 second after it is generated.
Build a real-time data warehouse based on Flink CDC and OceanBase Database
This section describes how to build a real-time data warehouse in OceanBase Database by using the Streaming OLAP solution. OceanBase Database supports HTAP: row-store and column-store data can reside in the same table, so a single system can handle both transaction processing and complex analytics.
By using Flink CDC, you can synchronize full and incremental data from MySQL to OceanBase Database to form an ODS layer that downstream jobs can subscribe to. While subscribing to the data, you can read, process, and enrich it, and then write it back to OceanBase Database to form a DWD layer. Aggregating the DWD layer produces a DWS layer, at which point OceanBase Database can serve queries and data consumption. In this solution, OceanBase Database replaces components such as the KV service, the analysis service, and Kafka. In addition, every layer in OceanBase Database is queryable, updatable, and correctable: if data in a layer is incorrect, you can query the table in that layer and correct the data directly, which improves troubleshooting efficiency.
The following example describes how to aggregate order details and write the data to a statistics table in the DWS layer to obtain the daily sales volume of each shop.
Define the OceanBase CDC data source, which reads from a table named orders.
```sql
CREATE TABLE dwd_orders (
  order_id           BIGINT,
  order_user_id      BIGINT,
  order_shop_id      BIGINT,
  order_product_id   VARCHAR,
  order_fee          DECIMAL(20, 2),
  order_updated_time TIMESTAMP(3),
  pay_id             BIGINT,
  pay_create_time    TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'oceanbase-cdc',
  'scan.startup.mode' = 'initial',
  'username' = 'user@test_tenant',
  'password' = 'pswd',
  'tenant-name' = 'test_tenant',
  'database-name' = 'test_db',
  'table-name' = 'orders',
  'hostname' = '127.0.0.1',
  'port' = '2881',
  'rootserver-list' = '127.0.0.1:2882:2881',
  'logproxy.host' = '127.0.0.1',
  'logproxy.port' = '2983'
);
```

Use Flink to calculate the sales volume of each shop and write the data to a table in OceanBase Database to form a statistics layer for shop metrics.
```sql
CREATE TABLE dws_shops (
  shop_id BIGINT,
  ds      STRING,
  -- Total amount of payments made on the same day
  paid_buy_fee_sum DECIMAL(20, 2),
  PRIMARY KEY (shop_id, ds) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/mydatabase',
  'table-name' = 'shops',
  'username' = 'user@test_tenant',
  'password' = 'pswd'
);
```

Read full and incremental data from the dwd_orders table in real time, aggregate and process it, and write the result to the dws_shops table in OceanBase Database. The dws_shops table can in turn be read and processed by another Flink job to form the result table of the next layer. In this way, you can build a layered real-time data warehouse.
```sql
INSERT INTO dws_shops
SELECT
  order_shop_id,
  date_format(pay_create_time, 'yyyyMMdd') AS ds,  -- Format the payment date as yyyyMMdd
  SUM(order_fee) AS paid_buy_fee_sum
FROM dwd_orders
WHERE pay_id IS NOT NULL AND order_fee IS NOT NULL
GROUP BY order_shop_id, date_format(pay_create_time, 'yyyyMMdd');
```
Considerations
After you import data, we recommend that you perform a major compaction to improve query performance.
Batch import
After you import data, perform a major compaction.
Real-time import
After you import data in real time, perform a major compaction. If the real-time import runs continuously, with no clear start or end, and no manual major compaction is performed, the system automatically schedules a major compaction.
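As a sketch, you can trigger a major compaction manually with a SQL statement like the following (OceanBase 4.x style syntax; run it in the relevant tenant and verify the statement and view name against your version's documentation):

```sql
-- Trigger a major compaction for the current tenant.
ALTER SYSTEM MAJOR FREEZE;

-- Observe compaction progress (view name assumed from OceanBase 4.x).
SELECT * FROM oceanbase.DBA_OB_MAJOR_COMPACTION;
```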