Data processing refers to a series of operations that transform raw data into a structured, high-quality, and easily usable dataset, making it more suitable for specific analysis, applications, or presentations. These operations typically include data cleaning, data transformation, data integration, data aggregation, data enrichment, data filtering, and data annotation. The goal is to convert messy, unstructured data into a format that is easier to work with and provides better query and analysis performance.
The execution of an efficient data processing task relies on several key factors: high-performance raw data reading, a rich set of functions, efficient query analysis, and high-performance data writing. Additionally, features that align with business characteristics play a crucial role in enhancing the effectiveness of data processing tasks.
This topic will focus on the features of OceanBase Database that support data processing tasks.
INSERT OVERWRITE SELECT operation
The INSERT OVERWRITE SELECT operation is used to insert query results into a target table while overwriting the existing data in that table. It is primarily used in data cleaning and transformation processes to efficiently batch insert processed data into the database. This operation ensures atomicity, meaning that if any exception occurs during execution, the entire operation is rolled back.
Use cases
- The INSERT OVERWRITE SELECT operation provides a convenient method for bulk importing data and overwriting the target table.
- It is suitable for data cleaning scenarios where the target table already contains data and you want to replace the old data with new data.
- The source table can be an internal or external table in the database.
Syntax example
INSERT [/*+PARALLEL(N)*/] OVERWRITE table_name [(list_of_columns)] [PARTITION (partition_name, ...)] select_stmt;
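For illustration, the following statement overwrites a target table with an aggregated, filtered result set. The table and column names here (dws_orders, ods_orders, and their columns) are hypothetical and not part of the original syntax description:

```sql
-- Overwrite the whole target table with the cleaned query result.
-- Table and column names are illustrative examples only.
INSERT /*+PARALLEL(4)*/ OVERWRITE dws_orders (order_id, order_date, amount)
SELECT order_id, order_date, SUM(amount)
FROM ods_orders
WHERE order_date >= '2024-01-01'
GROUP BY order_id, order_date;
```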
Considerations
- Use this operation with caution, as it will delete all data from the target table or partition. Make sure to back up the data to prevent accidental data loss.
- Ensure that the data format returned by the SELECT query matches the structure of the target table to avoid data insertion failures.
- The difference between INSERT INTO SELECT and INSERT OVERWRITE SELECT is that INSERT INTO SELECT does not clear the target table or partition before inserting data. Instead, the processed data is incrementally written to the target table.
- Do not write data to the same table using both INSERT OVERWRITE SELECT and real-time write methods (INSERT INTO, REPLACE INTO, DELETE, UPDATE). Otherwise, the real-time data will be discarded.
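The contrast between the two statements can be sketched as follows, assuming a hypothetical target table t and staging table s:

```sql
-- INSERT INTO SELECT appends the query result; existing rows in t are kept.
INSERT INTO t SELECT * FROM s WHERE c1 > 0;

-- INSERT OVERWRITE SELECT first clears t (or the specified partitions)
-- and then writes the query result, so only the new rows remain.
INSERT OVERWRITE t SELECT * FROM s WHERE c1 > 0;
```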
For more information about this operation, see Use the INSERT OVERWRITE SELECT statement to insert data in MySQL-compatible mode and Use the INSERT OVERWRITE SELECT statement to insert data in Oracle-compatible mode.
Submit a job
In OceanBase Database, the job submission feature allows users to submit and execute complex data processing tasks. These tasks can include data batch processing, ETL (Extract, Transform, Load) jobs, or other background data processing operations. When using INSERT INTO or INSERT OVERWRITE SELECT to import data, the default execution mode is synchronous. When the amount of data is large, the session may need to stay alive for a long time. Therefore, for large-scale data scenarios, we recommend that you submit data import tasks in asynchronous mode.
Use cases
- Regularly execute complex data processing operations, such as batch data import, data cleaning, and report generation.
- Schedule large-scale data migration or conversion tasks.
Submit a job example
-- Submit a job.
SUBMIT JOB INSERT OVERWRITE test
SELECT ex1.c1, t2.c2, SUM(ex1.c3)
FROM ex1, t2
WHERE ex1.id = t2.id
GROUP BY ex1.c1, t2.c2;
-- Query the status of an asynchronous task.
SHOW JOB STATUS WHERE job='job_id';
A job typically includes multiple steps, such as data extraction, transformation, and loading, to ensure the automation of the entire data processing process.
Scheduled tasks
Scheduled tasks are used to periodically execute specified jobs according to a set schedule. By using scheduled tasks, you can automate data processing workflows and reduce manual maintenance costs.
Use cases
- Regularly clean up expired data, generate reports, and synchronize data.
- Automatically execute data processing jobs at specific times or in periodic tasks.
Create a scheduled task example
- MySQL-compatible mode: Use the CREATE EVENT statement of the Event Scheduler to create a scheduled task.
-- Create a scheduled task.
CREATE EVENT myevent
ON SCHEDULE AT '2024-12-26 00:00:00' + INTERVAL 1 DAY
DO
INSERT INTO test SELECT * FROM ex_test WHERE date_key = curdate();
-- View the scheduled task.
SELECT * FROM information_schema.events WHERE event_name='myevent';
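Note that the AT clause in the preceding example schedules a one-time execution. If the task needs to run periodically, the Event Scheduler syntax also supports the EVERY clause; a minimal sketch, reusing the same tables as above:

```sql
-- A recurring variant: run once a day, starting from the given time.
CREATE EVENT myevent_daily
ON SCHEDULE EVERY 1 DAY STARTS '2024-12-26 00:00:00'
DO
INSERT INTO test SELECT * FROM ex_test WHERE date_key = CURDATE();
```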
- Oracle-compatible mode: Use the DBMS_SCHEDULER system package to create a scheduled task.
-- Create a scheduled task.
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'myscheduler', -- The name of the job.
job_type => 'STORED_PROCEDURE', -- The type of the job is a stored procedure.
job_action => 'data_cleaning', -- The stored procedure to be executed.
number_of_arguments => 0, -- No parameters are required.
start_date => SYSDATE, -- The start time of the job.
repeat_interval => 'FREQ=DAILY; INTERVAL=1', -- The job is executed every day.
end_date => NULL, -- The job does not have an end time and will run indefinitely.
job_class => 'DEFAULT_JOB_CLASS', -- The job class is the default class.
enabled => TRUE, -- The job is enabled.
auto_drop => TRUE, -- The job is automatically deleted after it is completed.
comments => 'Daily data cleaning and conversion', -- The comment of the job.
credential_name => NULL, -- No credential is specified.
destination_name => NULL, -- No destination database is specified.
max_run_duration => 0 -- No maximum running time is specified.
);
COMMIT;
END;
/
-- View the scheduled task.
SELECT * FROM DBA_SCHEDULER_JOBS WHERE job_name = 'myscheduler';
SELECT * FROM DBA_SCHEDULER_JOB_RUN_DETAILS WHERE job_name = 'myscheduler';
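To stop or remove the job, the standard DBMS_SCHEDULER procedures DISABLE and DROP_JOB can be used, assuming the Oracle-compatible mode of your OceanBase version supports them:

```sql
BEGIN
  -- Temporarily stop the job from being scheduled.
  DBMS_SCHEDULER.DISABLE('myscheduler');
  -- Permanently remove the job definition.
  DBMS_SCHEDULER.DROP_JOB(job_name => 'myscheduler');
END;
/
```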