This topic describes how to try out parallel import and data compression.
Parallel import
In addition to analytical queries, another important part of operational online analytical processing (OLAP) lies in the parallel import of massive amounts of data, namely the batch data processing capability. The parallel execution framework of OceanBase Database allows you to execute data manipulation language (DML) statements in parallel, which is known as parallel DML (PDML) operations. It supports concurrent data writes to multiple servers in a multi-node database without compromising data consistency for large transactions. The asynchronous minor compaction mechanism enhances the support of the LSM-tree-based storage engine for large transactions when memory is tight.
This topic describes how to try out the PDML feature of OceanBase Database. First, create an empty table named lineitem2 with the same schema as that of the lineitem table used in running the TPC-H benchmark. Then, execute the INSERT INTO ...SELECT statement to insert 6 million rows of data of the lineitem table into the lineitem2 table. In the following examples, the statement is executed with PDML disabled and enabled respectively for comparison.
First, copy the schema of the lineitem table and create an empty table named lineitem2. Table partitioning is used to manage large tables in OceanBase Database. In this example, 16 partitions are used. Therefore, the lineitem2 table must also be split into 16 partitions:
obclient [test]> SHOW CREATE TABLE lineitem\G;
*************************** 1. row ***************************
Table: lineitem
Create Table: CREATE TABLE `lineitem` (
`l_orderkey` bigint(20) NOT NULL,
`l_partkey` bigint(20) NOT NULL,
`l_suppkey` bigint(20) NOT NULL,
`l_linenumber` bigint(20) NOT NULL,
`l_quantity` bigint(20) NOT NULL,
`l_extendedprice` bigint(20) NOT NULL,
`l_discount` bigint(20) NOT NULL,
`l_tax` bigint(20) NOT NULL,
`l_returnflag` char(1) DEFAULT NULL,
`l_linestatus` char(1) DEFAULT NULL,
`l_shipdate` date NOT NULL,
`l_commitdate` date DEFAULT NULL,
`l_receiptdate` date DEFAULT NULL,
`l_shipinstruct` char(25) DEFAULT NULL,
`l_shipmode` char(10) DEFAULT NULL,
`l_comment` varchar(44) DEFAULT NULL,
PRIMARY KEY (`l_orderkey`, `l_linenumber`),
KEY `I_L_ORDERKEY` (`l_orderkey`) BLOCK_SIZE 16384 LOCAL,
KEY `I_L_SHIPDATE` (`l_shipdate`) BLOCK_SIZE 16384 LOCAL
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPACT COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 TABLEGROUP = 'x_tpch_tg_lineitem_order_group'
partition by key(l_orderkey)
(partition p0,
partition p1,
partition p2,
partition p3,
partition p4,
partition p5,
partition p6,
partition p7,
partition p8,
partition p9,
partition p10,
partition p11,
partition p12,
partition p13,
partition p14,
partition p15)
1 row in set
Execute the statement with PDML disabled
PDML is disabled by default. After you create the lineitem2 table, execute the statement with PDML disabled. Because this is a large transaction involving 6 million rows of data, you must increase the default transaction timeout interval (in μs) of OceanBase Database:
# SET ob_query_timeout = 1000000000;
# SET ob_trx_timeout = 1000000000;
Insert data. The execution result is as follows:
obclient [test]> INSERT INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (1 min 47.312 sec)
Records: 6001215 Duplicates: 0 Warnings: 0

You can see that OceanBase Database takes 107s to insert 6 million rows of data in a single transaction when PDML is disabled.
Execute the statement with PDML enabled
Add a hint to the statement to enable PDML. Before you execute the statement again, clear the data inserted last time.
obclient [test]> TRUNCATE TABLE lineitem2;
obclient [test]> INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;
Execution result:
obclient> TRUNCATE TABLE lineitem2;
Query OK, 0 rows affected (0.108 sec)
obclient> INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (22.117 sec)
Records: 6001215 Duplicates: 0 Warnings: 0

You can see that the time required by OceanBase Database to insert 6 million rows of data into the same table is reduced to about 22 seconds when PDML is enabled. After the PDML feature is enabled, the database performance is about 5 times the performance when this feature is disabled. This feature can be useful in scenarios where you want to process data in batches.
Data compression
OceanBase Database has its own storage engine based on the LSM-tree structure. Data is roughly classified into baseline data and incremental data. Baseline data is stored in SSTables on the disk, and incremental data is stored in MemTables in the memory. In this way, data can be stored on the disk more compactly. In addition, baseline data on the disk is not frequently updated, and OceanBase Database re-compresses baseline data based on a general compression algorithm. Therefore, data stored in OceanBase Database has a high compression ratio. However, data compression does not degrade the query or write performance. This section describes how to import a large amount of data to OceanBase Database and check the data compression ratio.
Generate data
First, use the CreateData.jar tool to generate 50 million rows of test data in the /home/soft directory. It takes about 10 minutes to generate the data. You can also use other tools to generate test data.
#mkdir /home/soft/
#java -jar CreateData.jar /home/soft/ 50000000
filePath is : /home/soft/
Start Time : 2022-07-10 15:33:00
End Time : 2022-07-10 15:52:53
#du -sh *
10G t_bigdata_loader.unl
OceanBase Database allows you to import data from CSV files in multiple ways. This section describes how to run the LOAD DATA command to import data.
First, name the generated file and check the actual file size.
# mv t_bigdata_loader.unl t_f1.csv # du -sh t_f1.csv 10G t_f1.csvThe following example shows the content of the test file
t_f1.csvgenerated by the tool by using a random algorithm. The file has eight data columns of different data types. Therefore, when you try out the data compression feature of OceanBase Database, you must first create a table in the tenant and import the records in the CSV file into the table.1|1896404182|1980-06-01|2004-10-25 13:30:39|7470.689|33062564.9527|nOLqnBYtnp|BzWYjZjeodtBNzXSMyBduMNzwDPSiVmhVgPJMeEkeAwKBCorzblwovIHDKBsQhbVjQnIdoeTsiLXTNwyuAcuneuNaol| 2|572083440|2018-11-09|1998-07-11 01:23:28|6891.054|66028434.4013|UzqteeMaHP|vQWbWBXEWgUqUTzqsOSciiOuvWVcZSrlEOQDwDVGmvGRQYWmhCFdEkpsUsqrWEpKtmxSwURHIHxvmlXHUIxmfelYboeGEuScKKqzpuNLryFsStaFTTRqSsVlCngFFjHnEnpaCnWsdwztbiHJyoGkaxrFmyPAmVregfydArrUZsgRqBpQ| 3|1139841892|2006-10-07|1999-06-26 17:02:22|286.43692|51306547.5055|KJJtylgxkv|BuBdFTBIIFsEPVxsVBRqAnFXSBdtZDgfumUhIx| 4|1777342512|1982-12-18|2017-11-19 07:56:35|2986.242|85860387.8696|rTkUBWhdPt|JSazOTAmvtCBrINttDwublNJNRFDIiWkHtWZXmWgKHoZCKGqmmETkIcYLXiSgKkoaATNgjvPxVGjeCOODLEWqrQHqowbMjOLOKrtirWEOpUSxiUudZduTCUvZElKzZfggvCBNthwzKJc| ...Create a table named
t_f1in thetestdatabase of thetesttenant. For more information about how to create a tenant, see "Manage tenants."# obclient -hxxx.xxx.xxx.xxx -P2881 -uroot@test -Dtest -A -p -c obclient [test]> CREATE TABLE t_f1(id DECIMAL(10,0),id2 DECIMAL(10,0),id3 DATE,id4 DATE,id5 FLOAT,id6 FLOAT,id7 VARCHAR(30),id8 VARCHAR(300));
Import data
Run the built-in LOAD DATA command of OceanBase Database to import data. The LOAD DATA command also supports parallel import. Before you import data, complete the following settings. The LOAD DATA command allows you to import data only from a local OBServer node. If you want to import remote data, you can use the OBLOADER tool of OceanBase Database.
obclient [test]> SET GLOBAL secure_file_priv = "";
obclient [test]> GRANT FILE ON *.* to username;
Notice
For security reasons, the SQL statement for modifying secure_file_priv can only be executed locally and cannot be executed on a remote OBClient. In other words, you must log on to an OBClient (or a MySQL client) on the server where the observer process is located to execute it. For more information, see secure_file_priv.
After you complete the settings, reconnect to the session to make the settings take effect. Then, set the transaction timeout period for the session to ensure that the execution does not exit due to timeout.
obclient [test]> SET ob_query_timeout=1000000000;
obclient [test]> SET ob_trx_timeout=1000000000;
Execute the import statement:
obclient [test]> LOAD DATA /*+ parallel(16) */ INFILE '/home/soft/t_f1.csv' INTO table t_f1 fields TERMINATED BY '\|' LINES TERMINATED BY '\n';
You can see that OceanBase Database takes about 4 minutes to import 10 GB of data when parallel import is enabled. In this example, the tenant is configured with 16 CPU cores. You can set an appropriate degree of parallelism (DOP) based on your configurations. A higher DOP indicates faster data import.
After the data is imported, check the number of records in the table and the space occupied by the table.
The table contains 50 million records.
obclient [test]> SELECT COUNT(*) FROM t_f1; +----------+ | count(*) | +----------+ | 50000000 | +----------+Trigger a major compaction.
To check the compression result of baseline data, log on as the
rootuser to thesystenant of the cluster and initiate a major compaction to compact and compress the incremental data and baseline data. You can execute the following SQL statement to initiate a major compaction:# obclient -h127.0.0.1 -P2881 -uroot@sys -Doceanbase -A -p -c obclient[oceanbase]> ALTER SYSTEM MAJOR FREEZE;Execute the following SQL statement to query the state of the major compaction. If the statement returns
IDLE, the major compaction is completed.obclient [oceanbase]> SELECT * FROM oceanbase.CDB_OB_MAJOR_COMPACTION; +-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+ | TENANT_ID | FROZEN_SCN | FROZEN_TIME | GLOBAL_BROADCAST_SCN | LAST_SCN | LAST_FINISH_TIME | START_TIME | STATUS | IS_ERROR | IS_SUSPENDED | INFO | +-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+ | 1 | 1679248800404017149 | 2023-03-20 02:00:00.404017 | 1679248800404017149 | 1679248800404017149 | 2023-03-20 02:00:44.035785 | 2023-03-20 02:00:00.442481 | IDLE | NO | NO | | | 1001 | 1679248804191204504 | 2023-03-20 02:00:04.191205 | 1679248804191204504 | 1679248804191204504 | 2023-03-20 02:00:46.094551 | 2023-03-20 02:00:04.218608 | IDLE | NO | NO | | | 1002 | 1679248802450394471 | 2023-03-20 02:00:02.450394 | 1679248802450394471 | 1679248802450394471 | 2023-03-20 02:00:33.818514 | 2023-03-20 02:00:02.484814 | IDLE | NO | NO | | 1 row in setExecute the following SQL statement in the
systenant to query the space occupied by the imported data:obclient [oceanbase]> select b.table_name,a.svr_ip,data_size/1024/1024/1024 from CDB_OB_TABLET_REPLICAS a,DBA_OB_TABLE_LOCATIONS b where a.tablet_id=b.tablet_id and b.table_name='T_F1'; +------------+---------------+----------------------------+ | table_name | svr_ip | a.data_size/1024/1024/1024 | +------------+---------------+----------------------------+ | t_f1 | xxx.xx.xxx.xx | 6.144531250000 | +------------+---------------+----------------------------+
The compressed table is about 6.145 GB in size, and the compression ratio (uncompressed size divided by compressed size) is 10/6.145, namely 1.62.