This topic provides best practices for performing large-scale DML operations (INSERT, UPDATE, and DELETE) in OceanBase Database. It focuses on performance optimization strategies for using PDML and provides recommendations for database configuration. Parallel Data Manipulation Language (PDML) is a high-performance data processing technology provided by OceanBase Database, primarily used to accelerate large-scale INSERT, UPDATE, and DELETE operations. PDML improves data operation throughput and efficiency by decomposing and distributing data manipulation tasks across multiple parallel execution units.
How PDML works
PDML improves performance through the following mechanisms:
- Task parallelism: Breaks down large DML operations into multiple independent subtasks
- Data partition processing: For partitioned tables, each partition can be processed in parallel by different execution threads
- Resource optimization: Makes full use of multi-core CPUs and distributed computing resources
PDML use cases
- Large-scale data operations (batch processing jobs)
- Data warehouse ETL processes
- Archiving or purging historical data
- Large-scale data initialization loading
Enable PDML
OceanBase Database allows you to enable PDML in the following ways:
Specify the degree of parallelism by using SQL hints
You can add hints to an SQL statement to specify the degree of parallelism.
DELETE /*+ enable_parallel_dml parallel(4) */ FROM table_name WHERE condition; UPDATE /*+ enable_parallel_dml parallel(8) */ table_name SET column_name = value WHERE condition; INSERT /*+ enable_parallel_dml parallel(6) */ INTO target_table SELECT * FROM source_table WHERE condition;Here,
enable_parallel_dmlis used to enable PDML, andparallel(N)is used to specify the degree of parallelism.Scenarios: This method is suitable for temporarily optimizing the performance of a single SQL statement without affecting other operations. For example, you can use this method to quickly optimize a specific query or data modification operation.
Specify the degree of parallelism at the session level
You can set the degree of parallelism at the session level, which affects all subsequent DML operations in the current session.
In MySQL-compatible mode:
SET _force_parallel_query_dop = 3; SET _force_parallel_dml_dop = 3; -- DML statements after this will automatically apply parallel processing. DELETE FROM table_name WHERE column_name = 'EXPIRED';In Oracle-compatible mode:
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 3; ALTER SESSION FORCE PARALLEL DML PARALLEL 3; -- DML statements after this will automatically apply parallel processing. DELETE FROM table_name WHERE column_name = 'EXPIRED'; -- Disable parallel processing at the session level. ALTER SESSION DISABLE PARALLEL QUERY;Scenarios: This method is suitable for batch processing jobs that need to perform multiple parallel DML operations within a single session. For example, during ETL processes, you can perform multiple data conversion and loading operations within the same session.
Enable auto DOP
You can enable auto DOP so that the database automatically adjusts the degree of parallelism based on system resources and workload.
/* Enable auto DOP at the GLOBAL level */ SET global parallel_degree_policy = AUTO; /* Enable auto DOP at the SESSION level */ SET session parallel_degree_policy = AUTO; SET parallel_degree_policy = AUTO; /* Enable auto DOP at the query level by using a hint */ SELECT /*+PARALLEL(AUTO)*/ * FROM ...;Scenarios: This method is suitable for scenarios where the database automatically optimizes the degree of parallelism without manual specification. For example, when resources are sufficient, the database can automatically increase the degree of parallelism to speed up queries.
Specify the degree of parallelism at the table level
You can specify the default degree of parallelism when you create or modify a table.
CREATE TABLE sales_history ( id INT PRIMARY KEY, sale_date DATE, amount DECIMAL(10,2) ) parallel 8; -- Modify an existing table. ALTER TABLE customer_data parallel 6;When you perform DML operations on a table with a specified degree of parallelism and do not explicitly disable parallel processing, OceanBase Database will automatically execute those operations in parallel (PDML).
Scenarios: This method is suitable for large tables that frequently require batch processing, as it can significantly improve DML performance over the long term. For example, for a large log table that often requires data cleansing and transformation, you can set the table-level degree of parallelism to enhance processing efficiency.
The parallelism modes are prioritized in the following order:
- Hint-based parallelism at the table level
- Hint-based parallelism at the global level
- Session-level parallelism
- Table-level parallelism
Use case: Data archiving
A platform generates a large amount of order data every day. As time goes by, the volume of historical order data increases, which leads to a decline in query performance. To improve query efficiency and reduce storage costs, the platform decides to archive order data older than one year into a history table.
Challenges
- Large data volume: Millions of orders need to be archived daily.
- Limited time window: The archiving operation needs to be completed during off-peak hours, leaving a very narrow window.
- Minimal business impact: The archiving process must not significantly affect online operations.
Solution
Use the PDML feature of OceanBase Database to execute data archiving in parallel.
Create a historical table: Create a historical table named
order_historywith the same schema as the orders table to store the archived data.CREATE TABLE order_history LIKE orders;Enable PDML: Set table-level parallelism to enable PDML for the orders table.
ALTER TABLE orders parallel 8;Perform data archiving: Use the
INSERT INTO ... SELECTstatement to migrate historical data from the orders table to the history table, and use theDELETEstatement to delete the historical data from the orders table.INSERT /*+ enable_parallel_dml parallel(8) */ INTO order_history SELECT * FROM orders WHERE order_date < DATE_SUB(CURDATE(), INTERVAL 1 YEAR); DELETE /*+ enable_parallel_dml parallel(8) */ FROM orders WHERE order_date < DATE_SUB(CURDATE(), INTERVAL 1 YEAR);Monitor the archiving progress: Use the monitoring tool of OceanBase Database to track PDML execution progress and resource usage.
PDML usage limitations
Use batch optimization
- Scenario: Use batch DML (such as multi-queries or array binding)
- Cause: Enabling PDML and batch DML simultaneously can lead to lock conflicts and transaction isolation issues.
Unsupported DML statement types
- Scenario: Only INSERT, DELETE, UPDATE, and MERGE operations are supported.
INSERT FROM SELECT
- Scenario: The values to be inserted are obtained from the SELECT clause.
Auto-increment columns for primary keys or partitioning keys
- Scenario: The target table contains an auto-increment column for a primary key or a partitioning key.
- Cause: PDML cannot guarantee monotonicity of auto-increment columns, so it does not support this feature.
IGNORE keyword
- Scenario: DML statements that contain the IGNORE keyword (such as INSERT IGNORE or DELETE IGNORE)
- Cause: The IGNORE keyword ignores errors such as unique key conflicts, but parallel data processing in PDML can cause data consistency issues.
Multi-table operations
- Scenario: DML operations involving multiple tables, such as JOIN or subquery updates
- Cause: PDML only supports single-table operations. Multi-table operations require distributed transaction coordination, which is complex.
ON DUPLICATE KEY UPDATE
- Scenario: The INSERT statement contains ON DUPLICATE KEY UPDATE (that is, INSERT_UP)
- Condition: The target table has a unique index or global index, or the primary key or unique key needs to be updated
- Cause: Unique key conflict detection and updates require atomicity. Parallel execution in PDML may lead to inconsistent conflict detection.
Explicitly disable PDML for DML statements
- Scenario: DML statements are marked for disabling PDML through optimizer rules or conversion logic.
- Cause: Some optimization rules, such as triggers or foreign key constraints, rely on serial execution logic.
Cross-database operations
- Scenario: DML operations involve DBLinks across databases.
- Cause: Cross-database transactions require coordination with external resources, and parallel execution of PDML may introduce consistency risks.
Nested SQL
- Scenario: DML statements contain foreign key constraints or triggers
- Cause: The execution of foreign key constraints and triggers depends on the relationships between tables. Parallel execution of PDML may lead to data consistency issues.
Partition-specific operations
- Scenario: A DELETE statement specifies particular partitions by using the PARTITION clause, such as DELETE FROM t1 PARTITION(p0)
- Cause: The parallel execution after partition pruning may conflict with the partition scheduling strategy of PDML
Non-online DDL operations
- Scenarios:
- DML operations performed during non-online DDL operations (such as index creation or table schema changes)
- UPDATE statements that specify partitions
- Modifying the ON UPDATE CURRENT_TIMESTAMP column
- Cause: PDML may interfere with internal state synchronization and cause timestamp inconsistencies
Variable assignment
- Scenario: Statements contain variable assignment operations, especially when these assignments appear in subqueries or nested statements.
- Cause: Parallel execution in PDML requires statements to be deterministic (without side effects). Variable assignments may introduce runtime state dependencies (such as cross-shard variable modifications), which can compromise the atomicity and consistency of parallel execution.
Orthogonal scenarios
- Scenarios:
- Foreign key constraints
- Triggers
- SQL statements containing PL/UDFs
HINT disabled
- Scenario: PDML is disabled by using hints
Use batch optimization
- Scenario: Use batch DML (such as multi-queries or array binding)
- Cause: Enabling PDML and batch DML simultaneously can result in lock conflicts and transaction isolation issues.
Unsupported DML statement types
- Scenario: Only INSERT,DELETE, UPDATE, and MERGE operations are supported.
INSERT FROM SELECT
- Scenario: The values to be inserted are obtained from the SELECT clause.
Auto-increment columns for primary keys or partitioning keys
- Scenario: The target table contains an auto-increment column for a primary key or a partitioning key.
- Cause: PDML cannot guarantee the monotonicity of auto-increment columns, so it does not support this feature.
IGNORE keyword
- Scenario: DML statements that contain the IGNORE keyword, such as INSERT IGNORE or DELETE IGNORE.
- Cause: The IGNORE keyword ignores errors such as unique key conflicts, but parallel execution of PDML statements can lead to data consistency issues.
Multi-table operations
- Scenario: DML operations involving multiple tables (such as JOIN or subquery updates)
- Cause: PDML only supports single-table operations. Multi-table operations require distributed transaction coordination, which is complex.
ON DUPLICATE KEY UPDATE
- Scenario: The INSERT statement contains ON DUPLICATE KEY UPDATE (i.e., INSERT_UP)
- Cause: Unique key conflict detection and updates require atomicity, and parallel execution of PDML may lead to inconsistent conflict detection.
Explicitly disable PDML for DML statements
- Scenario: DML statements are marked for disabling PDML through optimizer rules or conversion logic.
- Cause: Certain optimization rules, such as triggers or foreign key constraints, rely on serial execution logic.
Cross-database operations
- Scenario: DML operations involve DBLinks across databases
- Cause: Cross-database transactions require coordination with external resources, and parallel execution of PDML may introduce consistency risks
Nested SQL statements
- Scenario: DML statements contain foreign key constraints or triggers.
- Cause: The execution of foreign key constraints and triggers depends on the relationships between tables. Parallel execution of PDML may lead to data consistency issues.
Specify partitions
- Scenario: A DELETE statement specifies specific partitions by using the PARTITION clause (such as DELETE FROM t1 PARTITION(p0))
- Cause: The parallel execution after partition pruning may conflict with the partition scheduling strategy of PDML
Non-online DDL operations
- Scenarios:
- DML operations during non-online DDL operations (such as index creation or table schema changes)
- UPDATE statements that specify partitions
- Modifying the ON UPDATE CURRENT_TIMESTAMP column
- Reasons: PDML may interfere with internal state synchronization and cause timestamp inconsistencies.
Variable assignment
- Scenario: The operation involves variable assignment, and the variable assignment occurs within a subquery or nested statement.
- Cause: Parallel execution of PDML requires statements to be deterministic (with no side effects). Variable assignment can introduce runtime state dependencies (such as variable modifications across shards), which can compromise the atomicity and consistency of parallel execution.
Orthogonal scenarios
- Scenario:
- Foreign key constraints
- Triggers
- SQL statements contain PL UDFs
HINT disabled
- Scenario: Disable PDML by using hints
References
- Set parallel execution parameters
- Perform parameter tuning for PX
- Set the degree of parallelism for parallel execution
- Enable and disable parallel queries
- Parallel execution methods and their priorities
- Classification and optimization of parallel execution
- Best practices for optimizing batch DML performance
- Best practices for creating indexes on large tables
- Best practices for handling slow queries
- Diagnose parallel execution issues