Data processing involves cleaning, transforming, integrating, and aggregating raw data to make it more suitable for analysis, application, or presentation. In the AP / Data Warehouse scenario of OceanBase Database, common data processing paths can be categorized into three types:
- SQL / PL: Use SQL within the tenant to perform data conversion and storage, and use asynchronous jobs, scheduled tasks, etc., to orchestrate execution.
- Materialized views: Convert repeated computations into refreshable materialized results, and reduce the cost of online analytical processing by combining incremental/real-time processing and query rewriting.
- External computing (e.g., Flink): Perform stream and batch processing or modeling in external engines such as Flink, DataWorks, and dbt, and then synchronize or write the data to OceanBase Database.
SQL/PL batch processing and task scheduling
Use SQL (and PL packages in Oracle mode) directly in the database to perform batch writes, overwrites, and periodic scheduling. This is suitable for scenarios with short data processing chains where you want to keep computations close to the data.
INSERT OVERWRITE SELECT statement
The INSERT OVERWRITE SELECT statement is used to insert the result of a query into a target table, overwriting any existing data in the target table. It is commonly used in data cleaning and transformation processes to efficiently batch insert processed data into a database. This statement ensures atomicity, meaning that if any exception occurs during execution, the entire operation will be rolled back.
Use cases
- When you need to import data in batches and overwrite the target table.
- When you want to replace existing data in the target table with new data during data cleaning.
- When the source data resides in an internal or external table; both are supported as the source of the SELECT query.
Syntax
INSERT [/*+PARALLEL(N)*/] OVERWRITE table_name [(list_of_columns)] [PARTITION (partition_name, ...)] select_stmt;
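As a sketch of typical usage, assuming a cleaned staging table ods_orders and a target table dws_orders (both table and column names are illustrative):

```sql
-- Overwrite the target table with the cleaned result set.
-- Table and column names are illustrative.
INSERT /*+PARALLEL(8)*/ OVERWRITE dws_orders (order_id, order_date, amount)
SELECT order_id, order_date, amount
FROM ods_orders
WHERE amount IS NOT NULL;
```

The PARALLEL hint is optional; it controls the degree of parallelism of the batch write.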
Considerations
- Use this statement with caution, as it will delete all data from the target table or partition. Make sure to back up your data to prevent accidental data loss.
- Ensure that the data returned by the SELECT query matches the structure of the target table to avoid insertion failures.
- The difference between INSERT INTO SELECT and INSERT OVERWRITE SELECT is that INSERT INTO SELECT does not clear the target table or partition before inserting data; instead, it appends the processed data incrementally.
- Do not use INSERT OVERWRITE SELECT together with real-time write statements (such as INSERT INTO, REPLACE INTO, DELETE, and UPDATE) on the same table at the same time. Otherwise, the data written in real time will be discarded.
For more information, see Use the INSERT OVERWRITE SELECT statement to insert data in MySQL mode and Use the INSERT OVERWRITE SELECT statement to insert data in Oracle mode.
Submit a job (SUBMIT JOB)
In OceanBase Database, the job submission feature allows users to submit and execute complex data processing tasks. These tasks can include data batch processing, ETL (Extract, Transform, Load) jobs, or other background data processing operations. When using INSERT INTO or INSERT OVERWRITE SELECT to import data, the default execution mode is synchronous. For large data volumes, this may require the session to remain active for an extended period. Therefore, for large data scenarios, it is recommended to submit data import tasks asynchronously.
Use cases
- Regular execution of complex data processing operations, such as batch data import, data cleaning, and report generation.
- Scheduling large-scale data migration or transformation tasks.
Submit job example
-- Submit a job.
SUBMIT JOB INSERT OVERWRITE test
SELECT ex1.c1, t2.c2, SUM(ex1.c3)
FROM ex1, t2
WHERE ex1.id = t2.id
GROUP BY ex1.c1, t2.c2;
-- Query the status of an asynchronous task.
SHOW JOB STATUS WHERE job='job_id';
Jobs typically consist of multiple steps, such as data extraction, transformation, and loading, ensuring the entire data processing workflow is automated.
For more information, see SUBMIT JOB (MySQL mode).
Scheduled tasks
A scheduled task is a task that is executed periodically at a specified time. By using scheduled tasks, you can automate data processing and reduce the maintenance cost.
Scenarios
- Periodically clean up expired data, generate reports, and synchronize data.
- Automatically execute data processing tasks at specific times or periodically.
Example
- MySQL mode: Use the CREATE EVENT statement (Event Scheduler) to create a scheduled task.
-- Create a scheduled task.
CREATE EVENT myevent
ON SCHEDULE EVERY 1 DAY STARTS '2024-12-26 00:00:00'
DO
INSERT INTO test SELECT * FROM ex_test WHERE date_key = curdate();
-- View the scheduled task.
SELECT * FROM information_schema.events WHERE event_name='myevent';
- Oracle mode: Use the DBMS_SCHEDULER system package to create a scheduled task.
-- Create a scheduled task.
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'myscheduler', -- The name of the job.
job_type => 'STORED_PROCEDURE', -- The type of the job is a stored procedure.
job_action => 'data_cleaning', -- The stored procedure to be executed.
number_of_arguments => 0, -- No parameters.
start_date => SYSDATE, -- The start time of the job.
repeat_interval => 'FREQ=DAILY; INTERVAL=1', -- The job is executed once a day.
end_date => NULL, -- No end time; the job repeats indefinitely.
job_class => 'DEFAULT_JOB_CLASS', -- The category of the job is default.
enabled => TRUE, -- The job is enabled.
auto_drop => TRUE, -- The job is automatically deleted after it is executed.
comments => 'Daily scheduled data cleaning and conversion', -- The comment of the job.
credential_name => NULL, -- No credential.
destination_name => NULL, -- No destination database.
max_run_duration => 0 -- No limit on the maximum running duration.
);
COMMIT;
END;
/
-- View the scheduled task.
SELECT * FROM DBA_SCHEDULER_JOBS WHERE job_name = 'myscheduler';
SELECT * FROM DBA_SCHEDULER_JOB_RUN_DETAILS WHERE job_name = 'myscheduler';
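The job above assumes that a stored procedure named data_cleaning already exists in the tenant. A minimal sketch of such a procedure (the table names and the 90-day retention rule are illustrative, not part of the original example):

```sql
-- Hypothetical procedure referenced by job_action above.
CREATE OR REPLACE PROCEDURE data_cleaning AS
BEGIN
  -- Remove expired records from a staging table (illustrative name and rule).
  DELETE FROM ods_events WHERE event_time < SYSDATE - 90;
  -- Write the cleaned and converted data to a target table (illustrative names).
  INSERT INTO dws_events
  SELECT event_id, user_id, TRUNC(event_time) AS event_day
  FROM ods_events;
  COMMIT;
END;
/
```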
Materialized views: precomputation, refresh, and query rewriting
Materialized views allow you to materialize results of wide tables, aggregations, and multi-table joins. You can refresh the materialized views in full, incremental, or real-time mode. You can also combine materialized views with columnar storage and query rewriting to reduce repeated scans and aggregations on detail tables.
- Materialized view query acceleration: A series of examples that cover full, incremental, and real-time refreshes, nested views, columnar storage, query rewriting, and indexes.
- Overview of materialized views (MySQL mode), Overview of materialized views (Oracle mode): Concepts, syntax, and dictionary views.
- Classic scenarios of materialized views: Step-by-step SQL practices for order tables, tbl* sample tables, and external tables.
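As a sketch of the pattern, the following precomputes a daily sales aggregate over a hypothetical orders table and schedules an hourly full refresh. All object names are illustrative; see the overviews linked above for the complete refresh syntax and the modes your version supports.

```sql
-- Illustrative: materialize daily order totals (orders is a hypothetical table).
CREATE MATERIALIZED VIEW mv_daily_sales
  REFRESH COMPLETE
  START WITH sysdate() NEXT sysdate() + INTERVAL 1 HOUR
AS
SELECT order_date, SUM(amount) AS total_amount, COUNT(*) AS order_cnt
FROM orders
GROUP BY order_date;
```

With query rewriting enabled, aggregation queries on orders that match this definition can be answered from mv_daily_sales instead of rescanning the detail table.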
External computing (Flink, etc.): ecosystem and unified stream and batch processing
Complex stream processing, cross-system ETL, and cloud vendor-managed pipelines are typically performed in external computing engines (such as Flink, DataWorks, dbt, and AWS Glue), and the results are then synchronized or loaded into OceanBase Database.
For more information about tool selection, Flink CDC links, and ODC, see Data processing tools.
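As one common pattern, a Flink SQL job can aggregate a stream and sink the result into OceanBase Database through the MySQL protocol using Flink's JDBC connector. This is a sketch: the connection values are placeholders, source_events stands for any registered source table, and the Data processing tools page lists the recommended connectors for production use.

```sql
-- Flink SQL sketch: sink aggregated results into OceanBase via the JDBC connector.
-- All connection values below are placeholders.
CREATE TABLE ob_sink (
  user_id BIGINT,
  pv BIGINT,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://oceanbase-host:2883/test',
  'table-name' = 'user_pv',
  'username' = 'user@tenant',
  'password' = '***'
);

-- Continuously upsert per-user page views into OceanBase.
INSERT INTO ob_sink
SELECT user_id, COUNT(*) AS pv
FROM source_events
GROUP BY user_id;
```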
References
- Data processing tools (Flink, dbt, DataWorks, AWS Glue, ODC, and so on)
- Accelerate queries by using materialized views, Best practices for materialized views, Overview of materialized views (MySQL mode), and Overview of materialized views (Oracle mode)
