This topic describes parallel import and data compression.
Parallel import
In addition to analytical queries, operational OLAP workloads often import large amounts of data in parallel, typically as batch jobs. The parallel execution framework of OceanBase Database allows DML statements to be executed in parallel (Parallel DML, or PDML), enabling concurrent writes across servers while preserving the consistency of large transactions. This mechanism significantly improves the performance of the LSM-tree-based storage engine when large transactions put pressure on memory.
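The idea behind PDML can be illustrated with a minimal, hypothetical sketch in plain Python (not OceanBase internals): one logical DML statement is split into chunks that independent workers write concurrently, while the engine remains responsible for committing the whole statement atomically.

```python
from concurrent.futures import ThreadPoolExecutor

def write_chunk(chunk):
    # Stand-in for one parallel worker writing its share of rows.
    return len(chunk)

def parallel_dml(rows, dop=16):
    # Split the row set into `dop` chunks and write them concurrently;
    # the real engine would still commit all chunks as one transaction.
    chunks = [rows[i::dop] for i in range(dop)]
    with ThreadPoolExecutor(max_workers=dop) as pool:
        return sum(pool.map(write_chunk, chunks))

print(parallel_dml(list(range(6001))))  # -> 6001, every row written exactly once
```

The degree of parallelism (`dop`) plays the same role as the `parallel(16)` hint used later in this topic.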
To try out PDML, we use the lineitem table of the TPC-H benchmark as an example. First, create an empty table named lineitem2 with the same schema as lineitem. Then, execute an INSERT INTO ... SELECT statement to copy all of the roughly 6 million rows from lineitem into lineitem2. We will run the statement both with and without PDML and compare the results and performance.
First, copy the schema of the lineitem table to create the lineitem2 table. Note that in OceanBase Database, table data can be scaled out by partition. In this example, the lineitem2 table also has 16 partitions:
obclient [test]> SHOW CREATE TABLE lineitem\G
*************************** 1. row ***************************
Table: lineitem
Create Table: CREATE TABLE `lineitem` (
`l_orderkey` bigint(20) NOT NULL,
`l_partkey` bigint(20) NOT NULL,
`l_suppkey` bigint(20) NOT NULL,
`l_linenumber` bigint(20) NOT NULL,
`l_quantity` bigint(20) NOT NULL,
`l_extendedprice` bigint(20) NOT NULL,
`l_discount` bigint(20) NOT NULL,
`l_tax` bigint(20) NOT NULL,
`l_returnflag` char(1) DEFAULT NULL,
`l_linestatus` char(1) DEFAULT NULL,
`l_shipdate` date NOT NULL,
`l_commitdate` date DEFAULT NULL,
`l_receiptdate` date DEFAULT NULL,
`l_shipinstruct` char(25) DEFAULT NULL,
`l_shipmode` char(10) DEFAULT NULL,
`l_comment` varchar(44) DEFAULT NULL,
PRIMARY KEY (`l_orderkey`, `l_linenumber`),
KEY `I_L_ORDERKEY` (`l_orderkey`) BLOCK_SIZE 16384 LOCAL,
KEY `I_L_SHIPDATE` (`l_shipdate`) BLOCK_SIZE 16384 LOCAL
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPACT COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 TABLEGROUP = 'x_tpch_tg_lineitem_order_group'
partition by key(l_orderkey)
(partition p0,
partition p1,
partition p2,
partition p3,
partition p4,
partition p5,
partition p6,
partition p7,
partition p8,
partition p9,
partition p10,
partition p11,
partition p12,
partition p13,
partition p14,
partition p15)
1 row in set
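The partition by key(l_orderkey) clause above routes each row to one of the 16 partitions by hashing the partition key. A minimal sketch of the routing idea, using Python's built-in hash() as a stand-in for OceanBase's internal hash function:

```python
from collections import Counter

# Illustrative only: OceanBase applies its own internal hash to the
# partition key; Python's hash() is just a stand-in for the concept.
def partition_for(l_orderkey, n_partitions=16):
    return hash(l_orderkey) % n_partitions

# With a roughly uniform key space, rows spread evenly across partitions,
# which is what lets PDML workers write all 16 partitions concurrently.
counts = Counter(partition_for(k) for k in range(160))
print(sorted(counts.values()))  # 16 partitions, roughly 10 keys each
```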
Run the import without enabling PDML
After the lineitem2 table is created, insert the data into it without enabling PDML. Because this is a large transaction involving about 6 million rows, first increase the default query and transaction timeout values of OceanBase Database, which are specified in microseconds:
obclient [test]> SET ob_query_timeout = 1000000000;
obclient [test]> SET ob_trx_timeout = 1000000000;
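Both variables are specified in microseconds, so a quick conversion shows how much headroom 1000000000 μs provides:

```python
# ob_query_timeout and ob_trx_timeout are in microseconds.
timeout_us = 1_000_000_000
timeout_s = timeout_us / 1_000_000
print(timeout_s, "seconds =", round(timeout_s / 60, 1), "minutes")  # 1000.0 s ≈ 16.7 min
```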
The execution result is as follows:
obclient [test]> INSERT INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (1 min 47.312 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As we can see, without PDML, OceanBase Database takes about 107 seconds to insert the 6 million rows in a single transaction.
Run the import with PDML enabled
Before inserting again, truncate the lineitem2 table to clear the previously inserted rows. Then run the INSERT with a hint that enables PDML and check the execution time:
obclient [test]> TRUNCATE TABLE lineitem2;
Query OK, 0 rows affected (0.108 sec)

obclient [test]> INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (22.117 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As we can see, with PDML enabled, OceanBase Database takes about 22 seconds to insert the 6 million rows into the lineitem2 table, roughly a 5x improvement. This feature is valuable for workloads that need to process large batches of data.
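The speedup follows directly from the two timings reported above:

```python
serial_s = 107.312    # 1 min 47.312 sec, without PDML
parallel_s = 22.117   # with /*+ parallel(16) enable_parallel_dml */
print(round(serial_s / parallel_s, 2))  # -> 4.85, i.e. roughly a 5x speedup
```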
Data compression
OceanBase Database has developed its own storage engine based on the LSM-tree structure. In this architecture, data is divided into baseline data (SSTables) and incremental data (MemTables). Baseline data is stored on disk, while incremental modifications are tracked in memory. This allows on-disk data to be stored in a more compact format. In addition, because baseline data is not frequently updated, OceanBase Database applies general-purpose compression algorithms to it to achieve a high compression ratio. As a result, OceanBase Database can compress stored data significantly without compromising query or write performance. The following sections describe how to import a large amount of external data into OceanBase Database and observe the compression ratio.
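The baseline/incremental split can be sketched in a few lines of Python. This is an illustrative toy, not OceanBase internals: reads consult the in-memory MemTable first and fall back to the on-disk baseline, both modeled here as plain dicts.

```python
# Minimal LSM-tree sketch (illustrative only): writes go to a MemTable,
# reads overlay the MemTable on the SSTable baseline, and a major
# compaction merges increments into a new baseline.
class MiniLSM:
    def __init__(self, baseline):
        self.sstable = dict(baseline)  # baseline data: compact, rarely rewritten
        self.memtable = {}             # incremental data: all new writes land here

    def put(self, key, value):
        self.memtable[key] = value     # writes never modify the baseline in place

    def get(self, key):
        # Newer MemTable entries shadow older baseline entries.
        if key in self.memtable:
            return self.memtable[key]
        return self.sstable.get(key)

    def major_compact(self):
        # Merging increments into a new baseline is the point at which
        # block-level compression can be applied cheaply.
        self.sstable = {**self.sstable, **self.memtable}
        self.memtable = {}

store = MiniLSM({"a": 1})
store.put("b", 2)
print(store.get("a"), store.get("b"))  # -> 1 2
store.major_compact()
print(store.memtable)                  # -> {}
```

Because the baseline is rewritten only during compactions, compressing it aggressively does not slow down the write path.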
Data preparation
First, use the data preparation tool CreateData.jar to generate 50 million rows of simulated data in the /home/soft directory. Generation may take several minutes. You can also use other tools to generate test data.
# mkdir /home/soft/
# java -jar CreateData.jar /home/soft/ 50000000
filePath is : /home/soft/
Start Time : 2022-07-10 15:33:00
End Time : 2022-07-10 15:52:53
# du -sh *
10G t_bigdata_loader.unl
OceanBase Database allows you to import CSV data through various methods. This topic shows how to import data by using the LOAD DATA statement.
Rename the generated file and verify its size:
# mv t_bigdata_loader.unl t_f1.csv
# du -sh t_f1.csv
10G t_f1.csv
View the content of the t_f1.csv file. The file contains 8 columns of randomly generated data corresponding to different data types:
1|1896404182|1980-06-01|2004-10-25 13:30:39|7470.689|33062564.9527|nOLqnBYtnp|BzWYjZjeodtBNzXSMyBduMNzwDPSiVmhVgPJMeEkeAwKBCorzblwovIHDKBsQhbVjQnIdoeTsiLXTNwyuAcuneuNaol|
2|572083440|2018-11-09|1998-07-11 01:23:28|6891.054|66028434.4013|UzqteeMaHP|vQWbWBXEWgUqUTzqsOSciiOuvWVcZSrlEOQDwDVGmvGRQYWmhCFdEkpsUsqrWEpKtmxSwURHIHxvmlXHUIxmfelYboeGEuScKKqzpuNLryFsStaFTTRqSsVlCngFFjHnEnpaCnWsdwztbiHJyoGkaxrFmyPAmVregfydArrUZsgRqBpQ|
3|1139841892|2006-10-07|1999-06-26 17:02:22|286.43692|51306547.5055|KJJtylgxkv|BuBdFTBIIFsEPVxsVBRqAnFXSBdtZDgfumUhIx|
4|1777342512|1982-12-18|2017-11-19 07:56:35|2986.242|85860387.8696|rTkUBWhdPt|JSazOTAmvtCBrINttDwublNJNRFDIiWkHtWZXmWgKHoZCKGqmmETkIcYLXiSgKkoaATNgjvPxVGjeCOODLEWqrQHqowbMjOLOKrtirWEOpUSxiUudZduTCUvZElKzZfggvCBNthwzKJc|
....
Before you explore the data compression feature of OceanBase Database, create a table named t_f1 in the test database of the test tenant and import the data from the CSV file into it. For more information, see the topic on how to create a tenant.
Data import
You can use the LOAD DATA statement in OceanBase Database to import data; the statement supports parallel import. Note that LOAD DATA can import only local data files on the OBServer node. To import data remotely, use the obloader tool. Before you import data, perform the following setup:
obclient [test]> SET GLOBAL secure_file_priv = "/";
obclient [test]> GRANT FILE ON *.* TO username;
Notice
For security reasons, you can execute the statement to modify the secure_file_priv parameter only through a local connection. For more information, see secure_file_priv.
After the setup is complete, reconnect the session for the changes to take effect. Then increase the session's query and transaction timeout values so the import does not time out:
obclient [test]> SET ob_query_timeout=1000000000;
obclient [test]> SET ob_trx_timeout=1000000000;
Run the following statement to import data:
obclient [test]> LOAD DATA /*+ parallel(16) */ INFILE '/home/soft/t_f1.csv' INTO TABLE t_f1 FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
As you can see, with parallel import enabled, it takes about 4 minutes to import 10 GB of data. The tenant in this example has a CPU parallelism of 16; set an appropriate degree of parallelism based on your own configuration. Higher parallelism generally results in a faster import.
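A quick back-of-the-envelope calculation gives the effective import throughput implied by these numbers:

```python
data_mb = 10 * 1024   # 10 GB source file
import_s = 4 * 60     # about 4 minutes, as observed above
print(round(data_mb / import_s, 1), "MB/s")  # -> 42.7 MB/s
```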
After the import is complete, query the number of records in the table and the space it occupies.
Verify that the table contains 50 million records:
obclient [test]> SELECT COUNT(*) FROM t_f1;
+----------+
| count(*) |
+----------+
| 50000000 |
+----------+
Perform a major compaction
To view the compression effect on the baseline data, log in to the sys tenant of the cluster as the root user and run the major compaction statement, which merges the incremental data into the baseline data:
# obclient -h127.0.0.1 -P2881 -uroot@sys -Doceanbase -A -p -c
obclient [oceanbase]> ALTER SYSTEM MAJOR FREEZE;
Query the status of the compaction. When STATUS returns IDLE, the compaction is complete:
obclient [oceanbase]> SELECT * FROM oceanbase.CDB_OB_MAJOR_COMPACTION;
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
| TENANT_ID | FROZEN_SCN          | FROZEN_TIME                | GLOBAL_BROADCAST_SCN | LAST_SCN            | LAST_FINISH_TIME           | START_TIME                 | STATUS | IS_ERROR | IS_SUSPENDED | INFO |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
|         1 | 1679248800404017149 | 2023-03-20 02:00:00.404017 |  1679248800404017149 | 1679248800404017149 | 2023-03-20 02:00:44.035785 | 2023-03-20 02:00:00.442481 | IDLE   | NO       | NO           |      |
|      1001 | 1679248804191204504 | 2023-03-20 02:00:04.191205 |  1679248804191204504 | 1679248804191204504 | 2023-03-20 02:00:46.094551 | 2023-03-20 02:00:04.218608 | IDLE   | NO       | NO           |      |
|      1002 | 1679248802450394471 | 2023-03-20 02:00:02.450394 |  1679248802450394471 | 1679248802450394471 | 2023-03-20 02:00:33.818514 | 2023-03-20 02:00:02.484814 | IDLE   | NO       | NO           |      |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
3 rows in set
Run the following statement in the sys tenant to view the storage usage after the data is imported into OceanBase Database:
obclient [oceanbase]> SELECT b.table_name, a.svr_ip, a.data_size/1024/1024/1024 FROM CDB_OB_TABLET_REPLICAS a, CDB_OB_TABLE_LOCATIONS b WHERE a.tablet_id = b.tablet_id AND b.table_name = 'T_F1';
+------------+---------------+----------------------------+
| table_name | svr_ip        | a.data_size/1024/1024/1024 |
+------------+---------------+----------------------------+
| t_f1       | xxx.xx.xxx.xx |             6.144531250000 |
+------------+---------------+----------------------------+
The size of the compressed table is approximately 6.14 GB, so the compression ratio relative to the 10 GB CSV file is about 10/6.14 ≈ 1.63.
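The ratio follows directly from the measured sizes:

```python
csv_gb = 10.0                # size of the source CSV file
stored_gb = 6.144531250      # data_size reported after the major compaction
print(round(csv_gb / stored_gb, 2))  # -> 1.63
```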