This topic describes parallel import and data compression.
Parallel import
In addition to analytical queries, another crucial aspect of operational OLAP is the parallel import of large volumes of data, also known as batch processing. The parallel execution framework of OceanBase Database can execute DML statements concurrently (parallel DML, or PDML), enabling concurrent writes across the machines of a multi-node database while preserving the consistency of large transactions. Combined with an asynchronous flushing mechanism, PDML significantly improves how the LSM-tree storage engine handles large transactions when memory is constrained.
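PDML is enabled per statement with optimizer hints, as used later in this walkthrough. A minimal sketch (target_table and source_table are placeholder names):
-- Enable parallel DML for this statement only.
-- parallel(8) sets the degree of parallelism; choose a value that fits
-- your tenant's CPU quota.
INSERT /*+ parallel(8) enable_parallel_dml */ INTO target_table
SELECT * FROM source_table;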
Try out PDML with the following example. Based on the lineitem table of the TPC-H benchmark, create a table named lineitem2 with the same schema. Then execute an INSERT INTO ... SELECT statement to insert all 6 million rows from the lineitem table into the lineitem2 table. Perform the following steps to compare the results with and without PDML enabled.
First, copy the schema of the lineitem table to create the lineitem2 table. Note that in OceanBase Database, you can scale out a table by partition. In this example, the lineitem2 table also has 16 partitions:
obclient [test]> SHOW CREATE TABLE lineitem\G
*************************** 1. row ***************************
Table: lineitem
Create Table: CREATE TABLE `lineitem` (
`l_orderkey` bigint(20) NOT NULL,
`l_partkey` bigint(20) NOT NULL,
`l_suppkey` bigint(20) NOT NULL,
`l_linenumber` bigint(20) NOT NULL,
`l_quantity` bigint(20) NOT NULL,
`l_extendedprice` bigint(20) NOT NULL,
`l_discount` bigint(20) NOT NULL,
`l_tax` bigint(20) NOT NULL,
`l_returnflag` char(1) DEFAULT NULL,
`l_linestatus` char(1) DEFAULT NULL,
`l_shipdate` date NOT NULL,
`l_commitdate` date DEFAULT NULL,
`l_receiptdate` date DEFAULT NULL,
`l_shipinstruct` char(25) DEFAULT NULL,
`l_shipmode` char(10) DEFAULT NULL,
`l_comment` varchar(44) DEFAULT NULL,
PRIMARY KEY (`l_orderkey`, `l_linenumber`),
KEY `I_L_ORDERKEY` (`l_orderkey`) BLOCK_SIZE 16384 LOCAL,
KEY `I_L_SHIPDATE` (`l_shipdate`) BLOCK_SIZE 16384 LOCAL
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPACT COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 TABLEGROUP = 'x_tpch_tg_lineitem_order_group'
partition by key(l_orderkey)
(partition p0,
partition p1,
partition p2,
partition p3,
partition p4,
partition p5,
partition p6,
partition p7,
partition p8,
partition p9,
partition p10,
partition p11,
partition p12,
partition p13,
partition p14,
partition p15)
1 row in set
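One way to copy the schema is MySQL-style CREATE TABLE ... LIKE, assuming your OceanBase version supports it; otherwise, reuse the CREATE TABLE statement above with the table name changed:
-- Create lineitem2 with the same columns, indexes, and partitioning as lineitem.
obclient [test]> CREATE TABLE lineitem2 LIKE lineitem;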
Perform the insertion without enabling PDML
After creating the lineitem2 table, insert the data without enabling PDML. Because this is a large transaction involving 6 million rows, first increase the default query and transaction timeouts of OceanBase Database (both in microseconds):
obclient [test]> SET ob_query_timeout = 1000000000;
obclient [test]> SET ob_trx_timeout = 1000000000;
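1000000000 μs is roughly 16.7 minutes. To confirm that the new values took effect in the current session, you can list them (an optional sanity check):
obclient [test]> SHOW VARIABLES LIKE 'ob_%timeout';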
The execution result is as follows:
obclient [test]> INSERT INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (1 min 47.312 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As you can see, it takes about 107 seconds for OceanBase Database to insert 6 million rows in a single transaction with PDML disabled.
Enable PDML
Next, add a hint to enable PDML. Before you insert the data again, clear the previously inserted data, and then check the execution time:
obclient [test]> TRUNCATE TABLE lineitem2;
Query OK, 0 rows affected (0.108 sec)

obclient [test]> INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (22.117 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As you can see, with PDML enabled, OceanBase Database inserts 6 million rows in about 22 seconds, roughly a 5x performance improvement over the serial run. This feature helps users who need to process large volumes of data in batches.
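To confirm that the optimizer actually chose a parallel plan before running a long statement, you can EXPLAIN it first; a PDML plan contains parallel execution (PX) operators. The exact operator names vary by version, so treat this as a quick sanity check:
obclient [test]> EXPLAIN INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;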
Data compression
OceanBase Database has developed its own storage engine based on the LSM-tree structure. In this architecture, baseline data is stored in SSTables on disk, and incremental changes are stored in MemTables in memory. This allows OceanBase Database to store data on disk in a more compact format. In addition, because the baseline data on disk is not frequently updated, OceanBase Database applies general-purpose compression algorithms to it, achieving a high compression ratio without compromising query or write performance. The following sections describe how to import a large amount of external data into OceanBase Database and observe the compression ratio.
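The compression codec is specified per table, as seen in the COMPRESSION = 'zstd_1.3.8' clause of the lineitem schema above. A minimal sketch of declaring a compressed table (t_demo is a placeholder, and codec availability may vary by version):
-- Baseline data (SSTables) for this table is compressed with zstd.
obclient [test]> CREATE TABLE t_demo (
    id BIGINT PRIMARY KEY,
    payload VARCHAR(255)
) COMPRESSION = 'zstd_1.3.8' BLOCK_SIZE = 16384;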
Data preparation
First, use the data preparation tool CreateData.jar to generate 50 million rows of simulated data in the /home/soft directory. The data generation process may take several minutes. You can also use other tools to generate test data.
# mkdir /home/soft/
# java -jar CreateData.jar /home/soft/ 50000000
filePath is : /home/soft/
Start Time : 2022-07-10 15:33:00
End Time : 2022-07-10 15:52:53
# du -sh *
10G     t_bigdata_loader.unl
OceanBase Database allows you to import CSV data through multiple methods. This topic introduces the LOAD DATA statement.
Rename the generated file and verify its size:
# mv t_bigdata_loader.unl t_f1.csv
# du -sh t_f1.csv
10G     t_f1.csv

The t_f1.csv file contains 8 columns of data of different types, produced by the random generator. The first few rows look like this:
1|1896404182|1980-06-01|2004-10-25 13:30:39|7470.689|33062564.9527|nOLqnBYtnp|BzWYjZjeodtBNzXSMyBduMNzwDPSiVmhVgPJMeEkeAwKBCorzblwovIHDKBsQhbVjQnIdoeTsiLXTNwyuAcuneuNaol|
2|572083440|2018-11-09|1998-07-11 01:23:28|6891.054|66028434.4013|UzqteeMaHP|vQWbWBXEWgUqUTzqsOSciiOuvWVcZSrlEOQDwDVGmvGRQYWmhCFdEkpsUsqrWEpKtmxSwURHIHxvmlXHUIxmfelYboeGEuScKKqzpuNLryFsStaFTTRqSsVlCngFFjHnEnpaCnWsdwztbiHJyoGkaxrFmyPAmVregfydArrUZsgRqBpQ|
3|1139841892|2006-10-07|1999-06-26 17:02:22|286.43692|51306547.5055|KJJtylgxkv|BuBdFTBIIFsEPVxsVBRqAnFXSBdtZDgfumUhIx|
4|1777342512|1982-12-18|2017-11-19 07:56:35|2986.242|85860387.8696|rTkUBWhdPt|JSazOTAmvtCBrINttDwublNJNRFDIiWkHtWZXmWgKHoZCKGqmmETkIcYLXiSgKkoaATNgjvPxVGjeCOODLEWqrQHqowbMjOLOKrtirWEOpUSxiUudZduTCUvZElKzZfggvCBNthwzKJc|
...

In the test tenant, create a database named test and a table named t_f1, then import the data in the t_f1.csv file into that table, as sketched below.
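The original walkthrough does not show the t_f1 DDL. The following sketch is inferred from the eight '|'-separated fields in the sample rows above; the column names and exact types are assumptions, so adjust them to match your generator's output:
obclient [test]> CREATE DATABASE IF NOT EXISTS test;
obclient [test]> USE test;
obclient [test]> CREATE TABLE t_f1 (
    c1 BIGINT,         -- sequence number (assumed name)
    c2 BIGINT,         -- random integer
    c3 DATE,           -- random date
    c4 DATETIME,       -- random datetime
    c5 FLOAT,          -- random float
    c6 DECIMAL(18,4),  -- random decimal
    c7 VARCHAR(16),    -- short random string
    c8 VARCHAR(256)    -- long random string
);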
Data import
You can use the built-in LOAD DATA statement of OceanBase Database to import data. LOAD DATA supports parallel import. Before you import the data, perform the following setup. Note that LOAD DATA can only read data files from the local disk of the OBServer node; to import data from a remote host, use the obloader tool of OceanBase Database.
obclient [test]> SET GLOBAL secure_file_priv = "/";
obclient [test]> GRANT FILE ON *.* TO username;
Notice
For security reasons, you can execute the SQL statement that modifies the secure_file_priv parameter only from a local client. For more information, see secure_file_priv.
After the setup is complete, reconnect the session for the changes to take effect. Then set the transaction timeout for the new session again to ensure that the statement does not time out during execution.
obclient [test]> SET ob_query_timeout=1000000000;
obclient [test]> SET ob_trx_timeout=1000000000;
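Before running the import, you can verify that the secure_file_priv change took effect in the new session (an optional check):
obclient [test]> SHOW GLOBAL VARIABLES LIKE 'secure_file_priv';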
Then, run the import statement:
obclient [test]> LOAD DATA /*+ parallel(16) */ INFILE '/home/soft/t_f1.csv' INTO TABLE t_f1 FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
In this test, importing 10 GB of data takes about 4 minutes with parallel import enabled. The CPU parallelism of the tenant in this topic is 16; set an appropriate degree of parallelism based on your own configuration. Higher parallelism leads to faster imports.
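The parallelism a tenant can actually deliver is capped by its parallel execution resources. In recent OceanBase versions you can inspect the tenant-level parallel_servers_target variable (the variable name is an assumption; check your version's documentation):
obclient [test]> SHOW VARIABLES LIKE 'parallel_servers_target';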
After the import is complete, query the number of records in the table. (The occupied space is examined after the major compaction below.)
obclient [test]> SELECT COUNT(*) FROM t_f1;
+----------+
| count(*) |
+----------+
| 50000000 |
+----------+
The query result shows that the table contains 50 million records.

Initiate a major compaction
To view the compression effect on the baseline data, log in to the sys tenant of the cluster as the root user and initiate a major compaction, so that the incremental data is merged with the baseline data and compressed. You can manually trigger a major compaction as follows:
# obclient -h127.0.0.1 -P2881 -uroot@sys -Doceanbase -A -p -c
obclient [oceanbase]> ALTER SYSTEM MAJOR FREEZE;
Query the state of the major compaction. When STATUS returns IDLE, the major compaction is complete.
obclient [oceanbase]> SELECT * FROM oceanbase.CDB_OB_MAJOR_COMPACTION;
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
| TENANT_ID | FROZEN_SCN          | FROZEN_TIME                | GLOBAL_BROADCAST_SCN | LAST_SCN            | LAST_FINISH_TIME           | START_TIME                 | STATUS | IS_ERROR | IS_SUSPENDED | INFO |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
|         1 | 1679248800404017149 | 2023-03-20 02:00:00.404017 |  1679248800404017149 | 1679248800404017149 | 2023-03-20 02:00:44.035785 | 2023-03-20 02:00:00.442481 | IDLE   | NO       | NO           |      |
|      1001 | 1679248804191204504 | 2023-03-20 02:00:04.191205 |  1679248804191204504 | 1679248804191204504 | 2023-03-20 02:00:46.094551 | 2023-03-20 02:00:04.218608 | IDLE   | NO       | NO           |      |
|      1002 | 1679248802450394471 | 2023-03-20 02:00:02.450394 |  1679248802450394471 | 1679248802450394471 | 2023-03-20 02:00:33.818514 | 2023-03-20 02:00:02.484814 | IDLE   | NO       | NO           |      |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
3 rows in set

Execute the following statement from the sys tenant to view the data storage usage after the data is imported into OceanBase Database:
obclient [oceanbase]> SELECT b.table_name, a.svr_ip, a.data_size/1024/1024/1024
    FROM CDB_OB_TABLET_REPLICAS a, CDB_OB_TABLE_LOCATIONS b
    WHERE a.tablet_id = b.tablet_id AND b.table_name = 'T_F1';
+------------+---------------+----------------------------+
| table_name | svr_ip        | a.data_size/1024/1024/1024 |
+------------+---------------+----------------------------+
| t_f1       | xxx.xx.xxx.xx |             6.144531250000 |
+------------+---------------+----------------------------+
The size of the compressed table is approximately 6.14 GB, so the compression ratio is about 10/6.14 ≈ 1.63.