Data processing refers to the series of operations and transformations performed on raw data to make it more suitable for specific analysis, applications, or presentation. This typically includes data cleaning, data transformation, data integration, data aggregation, data enrichment, data filtering, data annotation, and other operations. These processes convert raw, disorganized data into structured, high-quality, and easy-to-use datasets, which in turn enable simpler and higher-performance query and analysis capabilities.
The efficient execution of a data processing task relies on high-performance raw data reading, a rich set of functions, efficient query analysis, and high-performance data writing capabilities. It also depends on certain peripheral features tailored to business characteristics.
This topic introduces the features of OceanBase Database that benefit data processing workloads.
INSERT OVERWRITE SELECT operation
The INSERT OVERWRITE SELECT statement inserts the result of a query into a target table and overwrites the existing data in the target table. It is commonly used for data cleansing and transformation, allowing you to efficiently insert processed data into a database in batches. This operation ensures atomicity, which means that if an exception occurs during execution, the operation is rolled back as a whole.
Scenarios
INSERT OVERWRITE SELECT provides a convenient way to batch import data and overwrite the target table.
- It is suitable for data cleansing when the target table contains existing data that you want to replace with new data.
- The source table can be an internal table or an external table.
Syntax
INSERT [/*+PARALLEL(N)*/] OVERWRITE table_name [(list_of_columns)] [PARTITION (partition_name, ...)] select_stmt;
Considerations
- Exercise caution when using this operation, as it will delete all data in the target table or partition. Ensure that you have backed up your business data to prevent accidental deletion.
- Make sure that the data format returned by the SELECT query matches the schema of the target table to avoid insertion failures.
- Compared with INSERT OVERWRITE SELECT, INSERT INTO SELECT does not clear the target table or partition before insertion. Instead, it incrementally writes processed data to the target table.
- Do not write data to the same table using both INSERT OVERWRITE SELECT and real-time write methods (INSERT INTO, REPLACE INTO, DELETE, and UPDATE) simultaneously. Otherwise, the data written in real time will be discarded.
For more information about this operation, see Insert data by using the INSERT OVERWRITE SELECT statement in MySQL-compatible mode and Insert data by using the INSERT OVERWRITE SELECT statement in Oracle-compatible mode.
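The overwrite semantics described above, atomically replacing the target table's contents with a query result, behave like a DELETE plus INSERT wrapped in a single transaction. The following sketch illustrates that equivalence in Python with SQLite (not OceanBase SQL); the `staging` and `target` tables are hypothetical examples:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (id INTEGER, val TEXT);
    CREATE TABLE target  (id INTEGER, val TEXT);
    INSERT INTO staging VALUES (1, 'clean'), (2, 'clean');
    INSERT INTO target  VALUES (9, 'stale');
""")

def overwrite_target(conn):
    """Emulate INSERT OVERWRITE SELECT: clear the target, then load
    the query result. Both steps commit or roll back as one unit."""
    with conn:  # one transaction: rolls back as a whole on error
        conn.execute("DELETE FROM target")
        conn.execute("INSERT INTO target SELECT id, val FROM staging")

overwrite_target(conn)
rows = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
print(rows)  # the stale row is gone; only the cleansed rows remain
```

Because the delete and the insert share one transaction, a failure mid-way leaves the original target data intact, mirroring the atomicity guarantee described above.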
Job submission (Submit Job)
In OceanBase Database, the Submit Job feature allows you to submit and execute complex data processing tasks in the background, such as batch processing, ETL (Extract, Transform, Load) jobs, or other long-running workloads. When you use INSERT INTO or INSERT OVERWRITE SELECT to import data, the statement executes synchronously, so the session must hold a long-lived connection while a large volume of data is processed. In such scenarios, we recommend submitting the import task asynchronously.
Scenarios
- Regularly perform complex data processing operations such as batch data import, data cleansing, and report generation.
- Schedule large-scale data migration or transformation tasks.
Examples
-- Submit a job
SUBMIT JOB INSERT OVERWRITE test
SELECT ex1.c1, t2.c2, SUM(ex1.c3)
FROM ex1, t2
WHERE ex1.id = t2.id
GROUP BY ex1.c1, t2.c2;
-- Query the status of an asynchronous task
SHOW JOB STATUS WHERE job='job_id';
Generally, a job consists of multiple steps, such as data extraction, transformation, and loading, which together automate the entire data processing workflow.
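The submit-and-poll workflow above follows a general pattern: the client starts the job, immediately regains control, and checks status later. This Python sketch simulates the pattern with a thread pool standing in for the database's background executor; the function names are illustrative, not an OceanBase client API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
jobs = {}  # job_id -> Future, mimicking server-side job bookkeeping

def submit_job(job_id, task):
    """Analogue of SUBMIT JOB: start the work and return immediately."""
    jobs[job_id] = executor.submit(task)
    return job_id

def job_status(job_id):
    """Analogue of SHOW JOB STATUS: a non-blocking status check."""
    return "FINISHED" if jobs[job_id].done() else "RUNNING"

def etl_task():
    time.sleep(0.1)  # stand-in for a long INSERT OVERWRITE ... SELECT
    return "loaded"

jid = submit_job("job_1", etl_task)
print(job_status(jid))  # typically still RUNNING right after submission
jobs[jid].result()      # block only when the result is actually needed
print(job_status(jid))  # FINISHED
```

The key point is that the submitting session is free as soon as `submit_job` returns, just as a session that issues SUBMIT JOB does not need to stay connected while the job runs.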
Scheduled tasks
A scheduled task is a task that is executed periodically based on a predefined schedule. You can create scheduled tasks to automate the data processing workflow and reduce the costs of manual maintenance.
Use cases
- Regular data cleanup, report generation, and data synchronization
- Automatic execution of data processing tasks at specific times or during periodic tasks
Examples of creating scheduled tasks
- MySQL-compatible mode: Create a scheduled task by using the Event Scheduler (the CREATE EVENT statement).
-- Create a scheduled task
CREATE EVENT myevent
ON SCHEDULE EVERY 1 DAY STARTS '2024-12-26 00:00:00'
DO
INSERT INTO test SELECT * FROM ex_test WHERE date_key = curdate();
-- View scheduled tasks
SELECT * FROM information_schema.events WHERE event_name='myevent';
- Oracle-compatible mode: Create a scheduled task by using the DBMS_SCHEDULER package.
-- Create a scheduled task
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'myscheduler', -- The job name.
job_type => 'STORED_PROCEDURE', -- The job type. STORED_PROCEDURE indicates a stored procedure.
job_action => 'data_cleaning', -- The name of the stored procedure that is executed.
number_of_arguments => 0, -- No parameters are required.
start_date => SYSDATE, -- The start time when the job is to be executed.
repeat_interval => 'FREQ=DAILY; INTERVAL=1', -- The job is to be executed daily.
end_date => NULL, -- The end time is not set. The job will be executed indefinitely.
job_class => 'DEFAULT_JOB_CLASS', -- The job category is default.
enabled => TRUE, -- The job is enabled.
auto_drop => TRUE, -- The job is to be automatically deleted after it is completed.
comments => 'Clean and convert data on a daily basis.', -- The job comments.
credential_name => NULL, -- No credential is used.
destination_name => NULL, -- No destination database is specified.
max_run_duration => 0 -- No maximum running duration is specified.
);
COMMIT;
END;
/
-- View scheduled tasks
SELECT * FROM DBA_SCHEDULER_JOBS WHERE job_name = 'MYSCHEDULER';
SELECT * FROM DBA_SCHEDULER_JOB_RUN_DETAILS WHERE job_name = 'MYSCHEDULER';
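Both schedulers above express recurrence as a start time plus a fixed interval (EVERY 1 DAY in MySQL mode, FREQ=DAILY; INTERVAL=1 in Oracle mode). As a minimal sketch of that semantics, the fire times such a rule produces can be computed like this:

```python
from datetime import datetime, timedelta

def next_runs(start, interval_days, count):
    """Compute the first `count` fire times of a daily schedule,
    like repeat_interval => 'FREQ=DAILY; INTERVAL=1'."""
    return [start + timedelta(days=interval_days * i) for i in range(count)]

runs = next_runs(datetime(2024, 12, 26), 1, 3)
print([r.isoformat() for r in runs])
# ['2024-12-26T00:00:00', '2024-12-27T00:00:00', '2024-12-28T00:00:00']
```

In practice the database evaluates this rule itself; the sketch only shows how a start date and interval determine the sequence of executions.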