Data processing is the cleaning, transformation, integration, and aggregation of raw data to make it more suitable for analysis, application, or presentation. In the AP / data warehouse scenarios of OceanBase, common data processing paths fall into three categories:
- SQL / PL: Use SQL within the tenant to perform data transformation and storage, and orchestrate execution using asynchronous jobs, scheduled tasks, etc.
- Materialized views: Convert repetitive computations into refreshable materialized results, reducing the cost of online analytical processing by combining incremental/real-time processing with query rewriting.
- External computing (e.g., Flink): Perform stream and batch processing or modeling in external engines like Flink, DataWorks, or dbt, and then synchronize or write the results to OceanBase.
SQL/PL batch processing and task scheduling within the database
Perform batch writes, overwrites, and periodic scheduling directly within the database using SQL (and PL packages in Oracle mode). This approach is suitable for scenarios with short data pipelines where computation needs to be close to the data.
INSERT OVERWRITE SELECT operation
The INSERT OVERWRITE SELECT operation is used to insert query results into a target table while overwriting existing data in the target table. It is primarily used in data cleaning and transformation processes to efficiently batch insert processed data into the database. This operation ensures atomicity, meaning that if any exception occurs during execution, the entire operation will be rolled back.
Use cases
- When you need to batch import data and overwrite the target table, the INSERT OVERWRITE SELECT operation provides a convenient method.
- When you need to clean data and replace existing data in the target table with new data.
- The source table can be an internal or external table.
Syntax example
INSERT [/*+PARALLEL(N)*/] OVERWRITE table_name [(list_of_columns)] [PARTITION (partition_name, ...)] select_stmt;
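As a concrete sketch of the syntax above (the table and column names here are hypothetical), a parallel overwrite of an aggregate table from a staging table might look like:

```sql
-- Hypothetical tables: dws_orders (target) and ods_orders (staging source).
-- The PARALLEL hint sets the degree of parallelism for the batch write.
INSERT /*+ PARALLEL(8) */ OVERWRITE dws_orders (order_id, order_date, amount)
SELECT order_id, order_date, SUM(amount)
FROM ods_orders
WHERE order_date >= '2024-01-01'
GROUP BY order_id, order_date;
```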
Considerations
- Use this operation with caution, as it will delete all data in the target table or partition. Ensure that you have backed up your data to prevent accidental data loss.
- Ensure that the data returned by the SELECT query is consistent with the target table structure to avoid data insertion failures.
- The difference between INSERT INTO SELECT and INSERT OVERWRITE SELECT is that INSERT INTO SELECT does not clear the target table or partition before inserting data; processed data is incrementally written to the target table.
- Do not use INSERT OVERWRITE SELECT to write to a table that is simultaneously receiving real-time writes (such as INSERT INTO, REPLACE INTO, DELETE, and UPDATE). Otherwise, the data written by the real-time operations will be discarded.
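To make the difference between the two statements concrete, assume a hypothetical table daily_stats with a partition p20241226 and a staging table staging_stats:

```sql
-- Appends to the partition; existing rows are kept.
INSERT INTO daily_stats PARTITION (p20241226)
SELECT * FROM staging_stats;

-- Clears the partition first, then writes the query result.
INSERT OVERWRITE daily_stats PARTITION (p20241226)
SELECT * FROM staging_stats;
```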
For more information, see Use the INSERT OVERWRITE SELECT statement to insert data in MySQL mode and Use the INSERT OVERWRITE SELECT statement to insert data in Oracle mode.
Submit a job (SUBMIT JOB)
In OceanBase Database, the job submission feature allows users to submit and execute complex data processing tasks. These tasks can include data batch processing, ETL (extract, transform, load) jobs, or other background data processing tasks. When using INSERT INTO or INSERT OVERWRITE SELECT to import data, the default execution mode is synchronous. When the data volume is large, the session may need to stay alive for a long time. Therefore, for large data volumes, it is recommended to submit data import tasks asynchronously.
Use cases
- Regularly execute complex data processing operations, such as batch data import, data cleaning, and report generation.
- Scheduling large-scale data migration or conversion tasks.
Submit job example
-- Submit a job.
SUBMIT JOB INSERT OVERWRITE test
SELECT ex1.c1, t2.c2, SUM(ex1.c3)
FROM ex1, t2
WHERE ex1.id = t2.id
GROUP BY ex1.c1, t2.c2;
-- Query the status of an asynchronous task.
SHOW JOB STATUS WHERE job='job_id';
Jobs typically include multiple steps, such as data extraction, transformation, and loading, ensuring the automated execution of the entire data processing workflow.
For more information, see SUBMIT JOB (MySQL mode).
Scheduled tasks
Scheduled tasks allow you to execute specific jobs at predefined intervals. By using scheduled tasks, you can automate data processing workflows and reduce manual maintenance costs.
Use cases
- Periodically clean up expired data, generate reports, and synchronize data.
- Automatically execute data processing jobs at specific times or during periodic tasks.
Example of creating a scheduled task
- MySQL mode: Use the Event Scheduler to create a scheduled task.
-- Create a scheduled task
CREATE EVENT myevent
ON SCHEDULE EVERY 1 DAY STARTS '2024-12-26 00:00:00'
DO
INSERT INTO test SELECT * FROM ex_test WHERE date_key = curdate();
-- View the scheduled task
SELECT * FROM information_schema.events WHERE event_name='myevent';
- Oracle mode: Use the DBMS_SCHEDULER system package to create a scheduled task.
-- Create a scheduled task
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'myscheduler', -- Job name
job_type => 'STORED_PROCEDURE', -- Job type is stored procedure
job_action => 'data_cleaning', -- Stored procedure to execute
number_of_arguments => 0, -- No parameters required
start_date => SYSDATE, -- Start time of the job
repeat_interval => 'FREQ=DAILY; INTERVAL=1', -- Execute daily
end_date => NULL, -- No end time set, job will run indefinitely
job_class => 'DEFAULT_JOB_CLASS', -- Default job class
enabled => TRUE, -- Enable the job
auto_drop => TRUE, -- Automatically drop the job after completion
comments => 'Daily scheduled data cleaning and conversion', -- Job comments
credential_name => NULL, -- No credentials
destination_name => NULL, -- No destination database
max_run_duration => 0 -- No maximum runtime limit
);
COMMIT;
END;
/
-- View the scheduled task
SELECT * FROM DBA_SCHEDULER_JOBS WHERE job_name = 'myscheduler';
SELECT * FROM DBA_SCHEDULER_JOB_RUN_DETAILS WHERE job_name = 'myscheduler';
Materialized views: precomputation, refresh, and query rewriting
By materializing the results of wide tables, aggregations, and multi-table joins, and refreshing them with full, incremental, or real-time strategies, you can combine columnar storage with query rewriting to reduce repeated scans and aggregations on the base tables.
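As a minimal sketch (the sales table and its columns are hypothetical, and the exact refresh clauses supported may vary by version), a periodically refreshed aggregate materialized view in MySQL mode might look like:

```sql
-- Materialize a daily sales aggregate and refresh it every hour.
CREATE MATERIALIZED VIEW mv_daily_sales
REFRESH COMPLETE
START WITH CURRENT_TIMESTAMP NEXT CURRENT_TIMESTAMP + INTERVAL 1 HOUR
ENABLE QUERY REWRITE
AS
SELECT order_date, SUM(amount) AS total_amount, COUNT(*) AS order_cnt
FROM sales
GROUP BY order_date;
```

With query rewriting enabled, queries that aggregate sales by order_date can be answered from the materialized result instead of rescanning the base table.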
- Materialized view query acceleration: A set of examples covering the sales and items tables, including full, incremental, and real-time refreshes, nested views, columnar storage, query rewriting, and indexes.
- Materialized views (MySQL mode) and Materialized views (Oracle mode): Reference information on concepts, syntax, and dictionary views.
- Materialized view scenarios: Step-by-step SQL examples for order tables, tbl*, and external tables.
External computing (Flink, etc.): ecosystem and stream-batch integration
Complex stream processing, cross-system ETL, and cloud vendor-managed pipelines are typically performed in computing engines outside OceanBase (such as Flink, DataWorks, dbt, and AWS Glue), with the results then synchronized or loaded into OceanBase.
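As an illustrative sketch only (the connection parameters, tenant name, and table names are hypothetical), a Flink SQL job can land aggregated results into an OceanBase MySQL-mode tenant through the standard Flink JDBC connector:

```sql
-- Flink SQL: define a JDBC sink pointing at an OceanBase MySQL-mode tenant.
CREATE TABLE ob_sink (
  user_id BIGINT,
  pv BIGINT,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://obproxy-host:2883/demo_db',
  'table-name' = 'user_pv',
  'username' = 'user@tenant',
  'password' = '******'
);

-- Aggregate a source stream and upsert the result into OceanBase.
INSERT INTO ob_sink
SELECT user_id, COUNT(*) AS pv
FROM clicks
GROUP BY user_id;
```

Because the sink declares a primary key, the JDBC connector writes in upsert mode, so repeated aggregate updates for the same user_id overwrite the previous row rather than duplicating it.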
For more information about tool selection, Flink CDC links, and ODC, see Data processing tools.
References
- Data processing tools (Flink, dbt, DataWorks, AWS Glue, ODC, and so on)
- Query acceleration for materialized views, Materialized views in classic scenarios, Overview of materialized views (MySQL mode), and Overview of materialized views (Oracle mode)
