Best practices for PDML processing in OceanBase Database-OceanBase

This topic provides best practices for performing large-scale DML operations (INSERT, UPDATE, and DELETE) in OceanBase Database. It focuses on performance optimization strategies for using PDML and provides recommendations for database configuration. Parallel Data Manipulation Language (PDML) is a high-performance data processing technology provided by OceanBase Database, primarily used to accelerate large-scale INSERT, UPDATE, and DELETE operations. PDML improves data operation throughput and efficiency by decomposing and distributing data manipulation tasks across multiple parallel execution units.

How PDML works

PDML improves performance through the following mechanisms:

Task parallelism: Breaks down large DML operations into multiple independent subtasks
Data partition processing: For partitioned tables, each partition can be processed in parallel by different execution threads
Resource optimization: Makes full use of multi-core CPUs and distributed computing resources

PDML use cases

Large-scale data operations (batch processing jobs)
Data warehouse ETL processes
Archiving or purging historical data
Large-scale data initialization loading

Enable PDML

OceanBase Database allows you to enable PDML in the following ways:

Specify the degree of parallelism by using SQL hints

You can add hints to an SQL statement to specify the degree of parallelism.
```
DELETE /*+ enable_parallel_dml parallel(4) */ FROM table_name WHERE condition;

UPDATE /*+ enable_parallel_dml parallel(8) */ table_name SET column_name = value WHERE condition;

INSERT /*+ enable_parallel_dml parallel(6) */ INTO target_table SELECT * FROM source_table WHERE condition;
```
Here, enable_parallel_dml is used to enable PDML, and parallel(N) is used to specify the degree of parallelism.

Scenarios: This method is suitable for temporarily optimizing the performance of a single SQL statement without affecting other operations. For example, you can use this method to quickly optimize a specific query or data modification operation.

Specify the degree of parallelism at the session level

You can set the degree of parallelism at the session level, which affects all subsequent DML operations in the current session.

In MySQL-compatible mode:

SET _force_parallel_query_dop = 3;
SET _force_parallel_dml_dop = 3;

-- DML statements after this will automatically apply parallel processing.
DELETE FROM table_name WHERE column_name = 'EXPIRED';

In Oracle-compatible mode:

ALTER SESSION FORCE PARALLEL QUERY PARALLEL 3;
ALTER SESSION FORCE PARALLEL DML PARALLEL 3;

-- DML statements after this will automatically apply parallel processing.
DELETE FROM table_name WHERE column_name = 'EXPIRED';

-- Disable parallel processing at the session level.
ALTER SESSION DISABLE PARALLEL QUERY;

Scenarios: This method is suitable for batch processing jobs that need to perform multiple parallel DML operations within a single session. For example, during ETL processes, you can perform multiple data conversion and loading operations within the same session.

Enable auto DOP

You can enable auto DOP so that the database automatically adjusts the degree of parallelism based on system resources and workload.
```
/* Enable auto DOP at the GLOBAL level */
SET global parallel_degree_policy = AUTO;

/* Enable auto DOP at the SESSION level */
SET session parallel_degree_policy = AUTO;
SET parallel_degree_policy = AUTO;

/* Enable auto DOP at the query level by using a hint */
SELECT /*+PARALLEL(AUTO)*/ * FROM ...;
```
Scenarios: This method is suitable for scenarios where the database automatically optimizes the degree of parallelism without manual specification. For example, when resources are sufficient, the database can automatically increase the degree of parallelism to speed up queries.
Specify the degree of parallelism at the table level

You can specify the default degree of parallelism when you create or modify a table.
```
CREATE TABLE sales_history (
id INT PRIMARY KEY,
sale_date DATE,
amount DECIMAL(10,2)
) parallel 8;

-- Modify an existing table.
ALTER TABLE customer_data parallel 6;
```
When you perform DML operations on a table with a specified degree of parallelism and do not explicitly disable parallel processing, OceanBase Database will automatically execute those operations in parallel (PDML).

Scenarios: This method is suitable for large tables that frequently require batch processing, as it can significantly improve DML performance over the long term. For example, for a large log table that often requires data cleansing and transformation, you can set the table-level degree of parallelism to enhance processing efficiency.

The parallelism modes are prioritized in the following order:

Hint-based parallelism at the table level
Hint-based parallelism at the global level
Session-level parallelism
Table-level parallelism

Use case: Data archiving

A platform generates a large amount of order data every day. As time goes by, the volume of historical order data increases, which leads to a decline in query performance. To improve query efficiency and reduce storage costs, the platform decides to archive order data older than one year into a history table.

Challenges

Large data volume: Millions of orders need to be archived daily.
Limited time window: The archiving operation needs to be completed during off-peak hours, leaving a very narrow window.
Minimal business impact: The archiving process must not significantly affect online operations.

Solution

Use the PDML feature of OceanBase Database to execute data archiving in parallel.

Create a historical table: Create a historical table named order_history with the same schema as the orders table to store the archived data.
```
CREATE TABLE order_history LIKE orders;
```
Enable PDML: Set table-level parallelism to enable PDML for the orders table.
```
ALTER TABLE orders parallel 8;
```

Perform data archiving: Use the INSERT INTO ... SELECT statement to migrate historical data from the orders table to the history table, and use the DELETE statement to delete the historical data from the orders table.

INSERT /*+ enable_parallel_dml parallel(8) */ INTO order_history
SELECT * FROM orders WHERE order_date < DATE_SUB(CURDATE(), INTERVAL 1 YEAR);

DELETE /*+ enable_parallel_dml parallel(8) */ FROM orders
WHERE order_date < DATE_SUB(CURDATE(), INTERVAL 1 YEAR);

Monitor the archiving progress: Use the monitoring tool of OceanBase Database to track PDML execution progress and resource usage.

PDML usage limitations

V4.3.5

V4.2.5

Use batch optimization

Scenario: Use batch DML (such as multi-queries or array binding)
Cause: Enabling PDML and batch DML simultaneously can lead to lock conflicts and transaction isolation issues.

Unsupported DML statement types

Scenario: Only INSERT, DELETE, UPDATE, and MERGE operations are supported.

INSERT FROM SELECT

Scenario: The values to be inserted are obtained from the SELECT clause.

Auto-increment columns for primary keys or partitioning keys

Scenario: The target table contains an auto-increment column for a primary key or a partitioning key.
Cause: PDML cannot guarantee monotonicity of auto-increment columns, so it does not support this feature.

IGNORE keyword

Scenario: DML statements that contain the IGNORE keyword (such as INSERT IGNORE or DELETE IGNORE)
Cause: The IGNORE keyword ignores errors such as unique key conflicts, but parallel data processing in PDML can cause data consistency issues.

Multi-table operations

Scenario: DML operations involving multiple tables, such as JOIN or subquery updates
Cause: PDML only supports single-table operations. Multi-table operations require distributed transaction coordination, which is complex.

ON DUPLICATE KEY UPDATE

Scenario: The INSERT statement contains ON DUPLICATE KEY UPDATE (that is, INSERT_UP)
Condition: The target table has a unique index or global index, or the primary key or unique key needs to be updated
Cause: Unique key conflict detection and updates require atomicity. Parallel execution in PDML may lead to inconsistent conflict detection.

Explicitly disable PDML for DML statements

Scenario: DML statements are marked for disabling PDML through optimizer rules or conversion logic.
Cause: Some optimization rules, such as triggers or foreign key constraints, rely on serial execution logic.

Cross-database operations

Scenario: DML operations involve DBLinks across databases.
Cause: Cross-database transactions require coordination with external resources, and parallel execution of PDML may introduce consistency risks.

Nested SQL

Scenario: DML statements contain foreign key constraints or triggers
Cause: The execution of foreign key constraints and triggers depends on the relationships between tables. Parallel execution of PDML may lead to data consistency issues.

Partition-specific operations

Scenario: A DELETE statement specifies particular partitions by using the PARTITION clause, such as DELETE FROM t1 PARTITION(p0)
Cause: The parallel execution after partition pruning may conflict with the partition scheduling strategy of PDML

Non-online DDL operations

Scenarios:
DML operations performed during non-online DDL operations (such as index creation or table schema changes)
UPDATE statements that specify partitions
Modifying the ON UPDATE CURRENT_TIMESTAMP column
Cause: PDML may interfere with internal state synchronization and cause timestamp inconsistencies

Variable assignment

Scenario: Statements contain variable assignment operations, especially when these assignments appear in subqueries or nested statements.
Cause: Parallel execution in PDML requires statements to be deterministic (without side effects). Variable assignments may introduce runtime state dependencies (such as cross-shard variable modifications), which can compromise the atomicity and consistency of parallel execution.

Orthogonal scenarios

Scenarios:
Foreign key constraints
Triggers
SQL statements containing PL/UDFs

HINT disabled

Scenario: PDML is disabled by using hints

Use batch optimization

Scenario: Use batch DML (such as multi-queries or array binding)
Cause: Enabling PDML and batch DML simultaneously can result in lock conflicts and transaction isolation issues.

Unsupported DML statement types

Scenario: Only INSERT,DELETE, UPDATE, and MERGE operations are supported.

INSERT FROM SELECT

Scenario: The values to be inserted are obtained from the SELECT clause.

Auto-increment columns for primary keys or partitioning keys

Scenario: The target table contains an auto-increment column for a primary key or a partitioning key.
Cause: PDML cannot guarantee the monotonicity of auto-increment columns, so it does not support this feature.

IGNORE keyword

Scenario: DML statements that contain the IGNORE keyword, such as INSERT IGNORE or DELETE IGNORE.
Cause: The IGNORE keyword ignores errors such as unique key conflicts, but parallel execution of PDML statements can lead to data consistency issues.

Multi-table operations

Scenario: DML operations involving multiple tables (such as JOIN or subquery updates)
Cause: PDML only supports single-table operations. Multi-table operations require distributed transaction coordination, which is complex.

ON DUPLICATE KEY UPDATE

Scenario: The INSERT statement contains ON DUPLICATE KEY UPDATE (i.e., INSERT_UP)
Cause: Unique key conflict detection and updates require atomicity, and parallel execution of PDML may lead to inconsistent conflict detection.

Explicitly disable PDML for DML statements

Scenario: DML statements are marked for disabling PDML through optimizer rules or conversion logic.
Cause: Certain optimization rules, such as triggers or foreign key constraints, rely on serial execution logic.

Cross-database operations

Scenario: DML operations involve DBLinks across databases
Cause: Cross-database transactions require coordination with external resources, and parallel execution of PDML may introduce consistency risks

Nested SQL statements

Scenario: DML statements contain foreign key constraints or triggers.
Cause: The execution of foreign key constraints and triggers depends on the relationships between tables. Parallel execution of PDML may lead to data consistency issues.

Specify partitions

Scenario: A DELETE statement specifies specific partitions by using the PARTITION clause (such as DELETE FROM t1 PARTITION(p0))
Cause: The parallel execution after partition pruning may conflict with the partition scheduling strategy of PDML

Non-online DDL operations

Scenarios:
DML operations during non-online DDL operations (such as index creation or table schema changes)
UPDATE statements that specify partitions
Modifying the ON UPDATE CURRENT_TIMESTAMP column
Reasons: PDML may interfere with internal state synchronization and cause timestamp inconsistencies.

Variable assignment

Scenario: The operation involves variable assignment, and the variable assignment occurs within a subquery or nested statement.
Cause: Parallel execution of PDML requires statements to be deterministic (with no side effects). Variable assignment can introduce runtime state dependencies (such as variable modifications across shards), which can compromise the atomicity and consistency of parallel execution.

Best practices for PDML processing in OceanBase Database

How PDML works

PDML use cases

Enable PDML

Use case: Data archiving

Challenges

Solution

PDML usage limitations

Use batch optimization

Unsupported DML statement types

INSERT FROM SELECT

Auto-increment columns for primary keys or partitioning keys

IGNORE keyword

Multi-table operations

ON DUPLICATE KEY UPDATE

Explicitly disable PDML for DML statements

Cross-database operations

Nested SQL

Partition-specific operations

Non-online DDL operations

Variable assignment

Orthogonal scenarios

HINT disabled

Use batch optimization

Unsupported DML statement types

INSERT FROM SELECT

Auto-increment columns for primary keys or partitioning keys

IGNORE keyword

Multi-table operations

ON DUPLICATE KEY UPDATE

Explicitly disable PDML for DML statements

Cross-database operations

Nested SQL statements

Specify partitions

Non-online DDL operations

Variable assignment

Orthogonal scenarios

HINT disabled

References