Data processing involves cleaning, transforming, integrating, and aggregating raw data to make it more suitable for analysis, application, or presentation. In the AP / Data Warehouse scenario of OceanBase, common data processing paths can be categorized into three types:
- SQL / PL: Use SQL within the tenant to transform and store data, and orchestrate execution using asynchronous jobs, scheduled tasks, and so on.
- Materialized Views: Convert repeated computations into refreshable materialized results, reducing the cost of online analytical processing by combining incremental/real-time updates with query rewriting.
- External Computation (e.g., Flink): Perform stream and batch processing or modeling in external engines such as Flink, DataWorks, and dbt, and then synchronize or write the results to OceanBase.
SQL/PL batch processing and task scheduling within the database
Use SQL (and PL packages in Oracle mode) directly within the database to perform batch writes, overwrites, and periodic scheduling. This is suitable for scenarios with short data processing chains where computation needs to be close to the data.
INSERT OVERWRITE SELECT operation
The INSERT OVERWRITE SELECT operation is used to insert the results of a query into a target table and overwrite the existing data in the target table. It is primarily used in data cleaning and transformation processes to efficiently batch insert processed data into the database. This operation ensures atomicity, meaning that if any exceptions occur during execution, the entire operation will be rolled back.
Use cases
- When you need to batch import data and overwrite the target table, the `INSERT OVERWRITE SELECT` operation provides a convenient method.
- When you need to clean data and replace the existing data in the target table with new data.
- When the source table is an internal or external table (both are supported).
Syntax example
INSERT [/*+PARALLEL(N)*/] OVERWRITE table_name [(list_of_columns)] [PARTITION (partition_name, ...)] select_stmt;
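As an illustration, the following hypothetical example (the table and column names are assumptions, not from the original document) rebuilds a daily summary table from a detail table in one atomic overwrite:

```sql
-- Hypothetical example: rebuild a daily summary table from a detail table.
-- daily_sales_summary and sales_detail are illustrative names.
INSERT /*+PARALLEL(4)*/ OVERWRITE daily_sales_summary (sale_date, item_id, total_amount)
SELECT sale_date, item_id, SUM(amount)
FROM sales_detail
GROUP BY sale_date, item_id;
```

If the statement fails partway through, the target table retains its original contents because the operation is atomic.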
Considerations
- Use this operation with caution, as it will delete all data from the target table or partition. Ensure that you have backed up your data to prevent accidental data loss.
- Ensure that the data returned by the `SELECT` query is in a format that matches the structure of the target table to avoid data insertion failures.
- The difference between `INSERT INTO SELECT` and `INSERT OVERWRITE SELECT` is that `INSERT INTO SELECT` does not clear the target table or partition before inserting the processed data. Instead, it incrementally writes the processed data to the target table.
- Do not use `INSERT OVERWRITE SELECT` and real-time write methods (such as `INSERT INTO`, `REPLACE INTO`, `DELETE`, and `UPDATE`) to write data to the same table simultaneously. Otherwise, the data written by the real-time write methods will be discarded.
For more information, see Use the INSERT OVERWRITE SELECT statement to insert data in MySQL mode and Use the INSERT OVERWRITE SELECT statement to insert data in Oracle mode.
Submit Job
In OceanBase Database, the job submission feature allows users to submit and execute complex data processing tasks. These tasks can include data batch processing, ETL (Extract, Transform, Load) jobs, or other background data processing operations. When using INSERT INTO or INSERT OVERWRITE SELECT to import data, the default execution mode is synchronous. For large data volumes, this may require the session to remain active for an extended period. Therefore, for large data scenarios, it is recommended to submit data import tasks asynchronously.
Use cases
- Periodically execute complex data processing operations, such as batch data imports, data cleaning, and report generation.
- Scheduling large-scale data migration or conversion tasks.
Submit job example
-- Submit a job.
SUBMIT JOB INSERT OVERWRITE test
SELECT ex1.c1, t2.c2, SUM(ex1.c3)
FROM ex1, t2
WHERE ex1.id = t2.id
GROUP BY ex1.c1, t2.c2;
-- Query the status of the asynchronous task.
SHOW JOB STATUS WHERE job='job_id';
Jobs typically consist of multiple steps, such as data extraction, transformation, and loading, to ensure the automated execution of the entire data processing workflow.
For more information about the syntax and behavior, see SUBMIT JOB (MySQL mode).
Scheduled tasks
Scheduled tasks allow you to execute specific jobs at predefined intervals. By using scheduled tasks, you can automate data processing workflows and reduce manual maintenance costs.
Use cases
- Regularly clean up expired data, generate reports, and synchronize data.
- Automatically execute data processing jobs at specific times or during periodic tasks.
Example of creating a scheduled task
- MySQL mode: Use the `CREATE EVENT` statement of the Event Scheduler to create a scheduled task.
-- Create a scheduled task
CREATE EVENT myevent
ON SCHEDULE AT '2024-12-26 00:00:00' + INTERVAL 1 DAY
DO
INSERT INTO test SELECT * FROM ex_test WHERE date_key = curdate();
-- View the scheduled task
SELECT * FROM information_schema.events WHERE event_name='myevent';
- Oracle mode: Use the `DBMS_SCHEDULER` system package to create a scheduled task.
-- Create a scheduled task
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'myscheduler', -- Job name
job_type => 'STORED_PROCEDURE', -- Job type is stored procedure
job_action => 'data_cleaning', -- Stored procedure to execute
number_of_arguments => 0, -- No parameters required
start_date => SYSDATE, -- Start time of the job
repeat_interval => 'FREQ=DAILY; INTERVAL=1', -- Execute daily
end_date => NULL, -- No end time set, job will run indefinitely
job_class => 'DEFAULT_JOB_CLASS', -- Default job class
enabled => TRUE, -- Enable the job
auto_drop => TRUE, -- Automatically drop the job after completion
comments => 'Daily scheduled data cleaning and conversion', -- Job comment
credential_name => NULL, -- No credentials
destination_name => NULL, -- No destination database
max_run_duration => 0 -- No maximum runtime limit
);
COMMIT;
END;
/
-- View the scheduled task
SELECT * FROM DBA_SCHEDULER_JOBS WHERE job_name = 'myscheduler';
SELECT * FROM DBA_SCHEDULER_JOB_RUN_DETAILS WHERE job_name = 'myscheduler';
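The job above assumes a stored procedure named `data_cleaning` already exists in the schema. A minimal sketch of such a procedure might look like the following (the table name and the 90-day retention window are illustrative assumptions):

```sql
-- Hypothetical procedure referenced by job_action above.
-- order_history and the 90-day window are illustrative assumptions.
CREATE OR REPLACE PROCEDURE data_cleaning AS
BEGIN
  DELETE FROM order_history WHERE order_date < SYSDATE - 90;
  COMMIT;
END;
/
```

Each daily run of the job then deletes rows older than the retention window in a single transaction.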
Materialized views: precomputation, refresh, and query rewriting
Materialized views allow you to materialize results of wide tables, aggregations, and multi-table joins. You can refresh the materialized views in full, incremental, or real-time mode. By combining columnar storage and query rewriting, you can reduce repeated scans and aggregations on the detail tables.
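For example, a materialized view that pre-aggregates a hypothetical `sales` detail table could be created as follows. The table, columns, and refresh schedule are illustrative assumptions; the exact refresh clause varies by mode and version, so consult the linked overviews for the authoritative syntax:

```sql
-- Hypothetical example: materialize a per-item sales aggregate,
-- refreshed in full once a day. Names and schedule are assumptions.
CREATE MATERIALIZED VIEW mv_sales_by_item
  REFRESH COMPLETE
  START WITH sysdate() NEXT sysdate() + INTERVAL 1 DAY
AS
SELECT item_id, SUM(amount) AS total_amount, COUNT(*) AS order_cnt
FROM sales
GROUP BY item_id;
```

Queries that aggregate `sales` by `item_id` can then be served from the materialized result (or rewritten to it) instead of rescanning the detail table.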
- Accelerate queries using materialized views: A series of examples using the `sales` and `items` tables (full, incremental, real-time, nested, columnar, query rewriting, and index).
- Overview of materialized views (MySQL mode) and Overview of materialized views (Oracle mode): Concepts, syntax, and dictionary views.
- Classic scenarios of materialized views: Step-by-step SQL practices for order tables, `tbl*`, and external tables.
External computing (Flink, etc.): ecosystem and unified stream and batch processing
Complex stream processing, cross-system ETL, and cloud vendor-managed pipelines are typically completed in computing engines external to OceanBase (such as Flink, DataWorks, dbt, and AWS Glue), with the results then synchronized or loaded into OceanBase.
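As a sketch of this pattern, a Flink SQL job could write aggregated results into an OceanBase MySQL-mode tenant through Flink's JDBC connector. The host, port, credentials, and table names below are placeholders, and `source_stream` is an assumed upstream table:

```sql
-- Hypothetical Flink SQL sink backed by an OceanBase MySQL-mode tenant.
-- Connection details and table names are placeholders.
CREATE TABLE ob_sink (
  item_id BIGINT,
  total_amount DECIMAL(18, 2),
  PRIMARY KEY (item_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://<oceanbase-host>:2881/test',
  'table-name' = 'item_summary',
  'username' = '<user>',
  'password' = '<password>'
);

-- Continuously upsert the aggregated results into OceanBase.
INSERT INTO ob_sink
SELECT item_id, SUM(amount)
FROM source_stream
GROUP BY item_id;
```

Because the sink declares a primary key, the JDBC connector performs upserts, so repeated aggregates for the same `item_id` overwrite earlier rows rather than duplicating them.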
For more information about tool selection, Flink CDC links, and ODC, see Data processing tools.
References
- Data processing tools (Flink, dbt, DataWorks, AWS Glue, ODC, and so on)
- Query acceleration for materialized views, Best practices for materialized views, Overview of materialized views (MySQL mode), and Overview of materialized views (Oracle mode)
