Data processing involves a series of processing and transformation steps to make the data suitable for specific analysis, applications, or display. It typically includes operations such as data cleansing, data transformation, data integration, data aggregation, data enrichment, data filtering, and data annotation. The goal is to convert raw, unstructured data into a structured, high-quality dataset that is easy to use. This enables more efficient and powerful query analysis.
The execution of an efficient data processing task relies on high-performance data reading, a rich set of functions, efficient query analysis, and high-performance data writing. It also depends on features that are tailored to specific business requirements.
This topic will focus on highlighting the features of OceanBase Database that are particularly beneficial for data processing tasks.
INSERT OVERWRITE SELECT operation
The INSERT OVERWRITE SELECT operation is used to insert the query results into the destination table while overwriting the existing data in the destination table. It is mainly used in data cleansing and transformation processes to efficiently batch insert the processed data into the database. This operation ensures atomicity, meaning that if any exception occurs during the execution, the operation will be rolled back as a whole.
Scenarios
- When you need to batch import data and overwrite the destination table, the
INSERT OVERWRITE SELECToperation provides a convenient method. - When the destination table already contains data and you want to replace the old data with new data during data cleansing.
- The source table can be an internal table or an external table in the database.
Syntax example
INSERT [/*+PARALLEL(N)*/] OVERWRITE table_name [(list_of_columns)] [PARTITION (partition_name, ...)] select_stmt;
Considerations
- Exercise caution when using this operation, as it will delete all data in the destination table or partition. Ensure that your business has backed up data to prevent accidental data loss.
- Make sure that the data format returned by the
SELECTquery is consistent with the schema of the destination table to avoid data insertion failures. - The difference between
INSERT INTO SELECTandINSERT OVERWRITE SELECTis thatINSERT INTO SELECTdoes not clear the destination table or partition before inserting the processed data, and the processed data is incrementally written to the destination table. - Do not use
INSERT OVERWRITE SELECTand real-time write methods (INSERT INTO, REPLACE INTO, DELETE, UPDATE) to write data to the same table at the same time. Otherwise, the real-time data will be discarded.
For more information about this operation, see Insert data by using the INSERT OVERWRITE SELECT statement in MySQL mode and Insert data by using the INSERT OVERWRITE SELECT statement in Oracle mode.
Submit a job (Submit Job)
In OceanBase Database, the job submission feature allows you to submit and execute complex data processing tasks. These tasks can include data batch processing, ETL (Extract, Transform, Load) jobs, or other background data processing tasks. When using INSERT INTO or INSERT OVERWRITE SELECT to import data, the default execution mode is synchronous. When a large amount of data is involved, the session may remain active for a long time. Therefore, for large-scale data import scenarios, it is recommended to submit data import tasks asynchronously.
Scenarios
- Regularly execute complex data processing operations, such as batch data import, data cleansing, and report generation.
- Schedule large-scale data migration or conversion tasks.
Submit a job example
-- Submit a job
SUBMIT JOB INSERT OVERWRITE test
SELECT ex1.c1,t2.c2,sum(ex1.c3)
FROM ex1,t2
WHERE ex1.id = t2.id
GROUP BY ex1.c1,t2.c2;
-- Query the status of an asynchronous task
SHOW JOB STATUS WHERE job='job_id';
A job typically consists of multiple steps, such as data extraction, transformation, and loading, to ensure the automation of the entire data processing process.
Scheduled tasks
Scheduled tasks are used to periodically execute specified jobs based on the set schedule. Scheduled tasks can automate data processing workflows, reducing the costs associated with manual maintenance.
Scenarios
- Periodically clearing expired data, generating reports, and synchronizing data.
- Automatically executing data processing jobs at specific times or on a periodic basis.
Example
- MySQL mode: Create scheduled tasks using the
Event Schedulercommand.
-- Create a scheduled task
CREATE EVENT myevent
ON SCHEDULE AT '2024-12-26 00:00:00' + INTERVAL 1 DAY
DO
INSERT INTO test SELECT * FROM ex_test WHERE date_key = curdate();
-- View scheduled tasks
SELECT * FROM information_schema.events WHERE event_name='myevent';
- Oracle mode: Create scheduled tasks using the
DBMS_SCHEDULERsystem package.
-- Create a scheduled task
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'myscheduler', -- Job name
job_type => 'STORED_PROCEDURE', -- Job type is stored procedure
job_action => 'data_cleaning', -- Stored procedure for job execution
number_of_argument => 0, -- No parameters required
start_date => SYSDATE, -- Start date for job execution
repeat_interval => 'FREQ=DAILY; INTERVAL=1', -- Execute daily
end_date => NULL, -- No end date, job runs indefinitely
job_class => 'DEFAULT_JOB_CLASS', -- Default job class
enabled => TRUE, -- Enable the job
auto_drop => TRUE, -- Automatically drop the job after execution
comments => 'Daily scheduled data cleaning and transformation', -- Job comment
credential_name => NULL, -- No credential
destination_name => NULL, -- No destination database
max_run_duration => 0 -- No maximum run duration limit
);
COMMIT;
END;
/
-- View scheduled tasks
SELECT * FROM DBA_SCHEDULER_JOBS WHERE job_name = 'myscheduler';
SELECT * from DBA_SCHEDULER_JOB_RUN_DETAILS WHERE job_name ='myscheduler';
