This topic describes parallel import and data compression.
Scenarios where they are commonly applied include the following:
Data migration: When you use parallel import to migrate a large amount of data between systems, data compression can significantly speed up the data migration process and reduce the required transmission bandwidth and storage resources.
Backup and restore: Data compression can reduce the size of backup files to save storage space. Parallel import can speed up the restore process.
Columnstore tables: A query on a columnstore table needs to access only the columns involved, rather than entire rows. The system can scan and process columns while they are still compressed, which reduces I/O operations and accelerates the query, improving overall query performance. After you import data to a columnstore table, you can initiate a data compression task to optimize the table's read performance. Note that minor compactions on a columnstore table proceed at a slower pace.
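For reference, a columnstore table is declared with a column group clause at creation time. The following is a minimal sketch with a hypothetical table, assuming a version of OceanBase Database (V4.3.0 or later) where this syntax is available:
obclient [test]> CREATE TABLE orders_cs (
    o_id BIGINT PRIMARY KEY,
    o_date DATE,
    o_amount DECIMAL(10, 2)
) WITH COLUMN GROUP (each column);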
Parallel import
Apart from analytical queries, operational OLAP also involves large-scale batch data import. The parallel execution framework of OceanBase Database allows DML statements to be executed in parallel (parallel DML, or PDML), enabling concurrent writes across servers while ensuring the consistency of large transactions. This greatly improves the performance of the LSM-tree-based storage engine when it handles large transactions under memory pressure.
Let's try out PDML through an example. Based on the lineitem table of the TPC-H benchmark, create a table named lineitem2 with the same schema. Then execute an INSERT INTO ... SELECT statement to insert all 6 million rows from the lineitem table into the lineitem2 table. We will run the operation both with and without PDML enabled and compare the results.
First, copy the schema of the lineitem table to create the lineitem2 table. Note that in OceanBase Database, table data is scaled out by partition. In this example, the lineitem table contains 16 partitions, so the lineitem2 table must contain the same number of partitions:
obclient [test]> SHOW CREATE TABLE lineitem\G
*************************** 1. row ***************************
Table: lineitem
Create Table: CREATE TABLE `lineitem` (
`l_orderkey` bigint(20) NOT NULL,
`l_partkey` bigint(20) NOT NULL,
`l_suppkey` bigint(20) NOT NULL,
`l_linenumber` bigint(20) NOT NULL,
`l_quantity` bigint(20) NOT NULL,
`l_extendedprice` bigint(20) NOT NULL,
`l_discount` bigint(20) NOT NULL,
`l_tax` bigint(20) NOT NULL,
`l_returnflag` char(1) DEFAULT NULL,
`l_linestatus` char(1) DEFAULT NULL,
`l_shipdate` date NOT NULL,
`l_commitdate` date DEFAULT NULL,
`l_receiptdate` date DEFAULT NULL,
`l_shipinstruct` char(25) DEFAULT NULL,
`l_shipmode` char(10) DEFAULT NULL,
`l_comment` varchar(44) DEFAULT NULL,
PRIMARY KEY (`l_orderkey`, `l_linenumber`),
KEY `I_L_ORDERKEY` (`l_orderkey`) BLOCK_SIZE 16384 LOCAL,
KEY `I_L_SHIPDATE` (`l_shipdate`) BLOCK_SIZE 16384 LOCAL
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPACT COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 TABLEGROUP = 'x_tpch_tg_lineitem_order_group'
partition by key(l_orderkey)
(partition p0,
partition p1,
partition p2,
partition p3,
partition p4,
partition p5,
partition p6,
partition p7,
partition p8,
partition p9,
partition p10,
partition p11,
partition p12,
partition p13,
partition p14,
partition p15)
1 row in set
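With the schema above, you can now create lineitem2. One simple way, assuming your version supports the MySQL-compatible CREATE TABLE ... LIKE syntax (which also copies the partitioning definition); otherwise, paste the SHOW CREATE TABLE output and change the table name:
obclient [test]> CREATE TABLE lineitem2 LIKE lineitem;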
Perform the operation without enabling PDML
After the lineitem2 table is created, insert the data into the table without enabling PDML. Because this is a large transaction involving 6 million rows, first increase the default query and transaction timeout periods of OceanBase Database (in microseconds):
obclient [test]> SET ob_query_timeout = 1000000000;
obclient [test]> SET ob_trx_timeout = 1000000000;
The execution result is as follows:
obclient [test]> INSERT INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (1 min 47.312 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As we can see, without PDML, it takes OceanBase Database 107 seconds to insert the 6 million rows in a single transaction.
Enable PDML
Now add a hint to enable PDML. Before you insert the data again, clear the previously inserted rows, then check the execution time:
obclient [test]> TRUNCATE TABLE lineitem2;
Query OK, 0 rows affected (0.108 sec)
obclient [test]> INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (22.117 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As we can see, with PDML enabled, OceanBase Database inserts the 6 million rows into the lineitem2 table in about 22 seconds, roughly a 5x performance improvement over the serial execution. This feature helps users who need to process large amounts of data in batches.
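If you want to confirm that a parallel plan is chosen before running the statement, you can inspect the execution plan. The exact operator names vary by OceanBase version, but PX-related operators (for example, PX COORDINATOR) in the output indicate parallel execution:
obclient [test]> EXPLAIN INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;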
Data compression
OceanBase Database has developed its own storage engine based on the LSM-tree structure. In this architecture, data is divided into baseline data (SSTable) and incremental data (MemTable). The baseline data is stored on disk, and incremental modifications are stored in memory. This setup allows the baseline data on disk to be stored in a more compact manner. In addition, since the baseline data on disk is not frequently updated, OceanBase Database uses general compression algorithms to further compress the baseline data, achieving a high compression ratio. This approach does not compromise query and write performance. The following describes how to import a large amount of external data and observe the compression ratio in OceanBase Database.
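As the lineitem schema above shows, the compression algorithm is declared in the table definition. The following is a minimal sketch with a hypothetical table name, assuming zstd 1.3.8 is among the algorithms supported by your version:
obclient [test]> CREATE TABLE t_compressed (
    id BIGINT PRIMARY KEY,
    payload VARCHAR(255)
) COMPRESSION = 'zstd_1.3.8' ROW_FORMAT = COMPACT BLOCK_SIZE = 16384;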
Data preparation
First, use the data preparation tool CreateData.jar to generate 50 million rows of simulated data in the /home/soft directory. The data generation process may take several minutes. You can also use other tools to generate test data.
# mkdir /home/soft/
# java -jar CreateData.jar /home/soft/ 50000000
filePath is : /home/soft/
Start Time : 2022-07-10 15:33:00
End Time : 2022-07-10 15:52:53
# du -sh *
10G     t_bigdata_loader.unl
OceanBase Database allows you to import CSV data into the database through various methods. This topic describes how to use the LOAD DATA statement to import data.
Rename the generated file and verify its size:
# mv t_bigdata_loader.unl t_f1.csv
# du -sh t_f1.csv
10G     t_f1.csv
The t_f1.csv file contains 8 columns of data of different data types, generated by using a random algorithm:
1|1896404182|1980-06-01|2004-10-25 13:30:39|7470.689|33062564.9527|nOLqnBYtnp|BzWYjZjeodtBNzXSMyBduMNzwDPSiVmhVgPJMeEkeAwKBCorzblwovIHDKBsQhbVjQnIdoeTsiLXTNwyuAcuneuNaol
2|572083440|2018-11-09|1998-07-11 01:23:28|6891.054|66028434.4013|UzqteeMaHP|vQWbWBXEWgUqUTzqsOSciiOuvWVcZSrlEOQDwDVGmvGRQYWmhCFdEkpsUsqrWEpKtmxSwURHIHxvmlXHUIxmfelYboeGEuScKKqzpuNLryFsStaFTTRqSsVlCngFFjHnEnpaCnWsdwztbiHJyoGkaxrFmyPAmVregfydArrUZsgRqBpQ
3|1139841892|2006-10-07|1999-06-26 17:02:22|286.43692|51306547.5055|KJJtylgxkv|BuBdFTBIIFsEPVxsVBRqAnFXSBdtZDgfumUhIx
4|1777342512|1982-12-18|2017-11-19 07:56:35|2986.242|85860387.8696|rTkUBWhdPt|JSazOTAmvtCBrINttDwublNJNRFDIiWkHtWZXmWgKHoZCKGqmmETkIcYLXiSgKkoaATNgjvPxVGjeCOODLEWqrQHqowbMjOLOKrtirWEOpUSxiUudZduTCUvZElKzZfggvCBNthwzKJc
....
Before you can try out the data compression feature of OceanBase Database, create a table named t_f1 in the test database of the test tenant and import the data in the CSV file into it. For more information, see the topic on how to create a tenant.
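The exact CREATE TABLE statement for t_f1 is not shown in this topic. The following is a plausible sketch inferred from the 8 sample columns above, with hypothetical column names and types; adjust it to match your generated data:
obclient [test]> CREATE TABLE t_f1 (
    c1 BIGINT,
    c2 BIGINT,
    c3 DATE,
    c4 DATETIME,
    c5 FLOAT,
    c6 DECIMAL(18, 4),
    c7 CHAR(10),
    c8 VARCHAR(255)
);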
Data import
You can use the built-in LOAD DATA statement of OceanBase Database to import data; it supports parallel import. Note that LOAD DATA can import only files that are local to an OBServer node. To import data remotely, use the obloader tool of OceanBase Database. Before you import data, perform the following setup:
obclient [test]> SET GLOBAL secure_file_priv = "/";
obclient [test]> GRANT FILE ON *.* TO username;
Notice
For security reasons, the statement that modifies the secure_file_priv parameter can be executed only through a local connection. For more information, see secure_file_priv.
After the setup is complete, reconnect to the database for the settings to take effect. Then set the query and transaction timeout periods for the session again to ensure that the import does not fail due to a timeout during execution.
obclient [test]> SET ob_query_timeout=1000000000;
obclient [test]> SET ob_trx_timeout=1000000000;
Run the following statement to import data:
obclient [test]> LOAD DATA /*+ parallel(16) */ INFILE '/home/soft/t_f1.csv' INTO TABLE t_f1 FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
With parallel import enabled, it takes about 4 minutes to import the 10 GB of data. The CPU parallelism of the tenant in this example is set to 16; set an appropriate degree of parallelism based on your own configuration. Within the limits of the tenant's CPU and I/O resources, higher parallelism generally leads to faster import.
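To see the parallel execution settings currently in effect for the tenant, you can list the parallel-related variables; the available variable names (such as parallel_servers_target) may differ across versions:
obclient [test]> SHOW VARIABLES LIKE '%parallel%';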
After the import is complete, query the number of records in the table and the storage space it occupies.
There are 50 million records in the table.
obclient [test]> SELECT COUNT(*) FROM t_f1;
+----------+
| count(*) |
+----------+
| 50000000 |
+----------+
Perform a major compaction
Log in to the sys tenant of the cluster as the root user and trigger a major compaction to compress the incremental data together with the baseline data. You can manually trigger the compaction by using the following method:
# obclient -h127.0.0.1 -P2881 -uroot@sys -Doceanbase -A -p -c
obclient [oceanbase]> ALTER SYSTEM MAJOR FREEZE;
Query the state of the compaction. When the following query returns IDLE, the compaction is complete.
obclient [oceanbase]> SELECT * FROM oceanbase.CDB_OB_MAJOR_COMPACTION;
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
| TENANT_ID | FROZEN_SCN          | FROZEN_TIME                | GLOBAL_BROADCAST_SCN | LAST_SCN            | LAST_FINISH_TIME           | START_TIME                 | STATUS | IS_ERROR | IS_SUSPENDED | INFO |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
|         1 | 1679248800404017149 | 2023-03-20 02:00:00.404017 |  1679248800404017149 | 1679248800404017149 | 2023-03-20 02:00:44.035785 | 2023-03-20 02:00:00.442481 | IDLE   | NO       | NO           |      |
|      1001 | 1679248804191204504 | 2023-03-20 02:00:04.191205 |  1679248804191204504 | 1679248804191204504 | 2023-03-20 02:00:46.094551 | 2023-03-20 02:00:04.218608 | IDLE   | NO       | NO           |      |
|      1002 | 1679248802450394471 | 2023-03-20 02:00:02.450394 |  1679248802450394471 | 1679248802450394471 | 2023-03-20 02:00:33.818514 | 2023-03-20 02:00:02.484814 | IDLE   | NO       | NO           |      |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
3 rows in set
Execute the following statement in the sys tenant to check the data storage usage after the data is imported to OceanBase Database:
obclient [oceanbase]> SELECT b.table_name, a.svr_ip, a.data_size/1024/1024/1024 FROM CDB_OB_TABLET_REPLICAS a, CDB_OB_TABLE_LOCATIONS b WHERE a.tablet_id = b.tablet_id AND b.table_name = 'T_F1';
+------------+---------------+----------------------------+
| table_name | svr_ip        | a.data_size/1024/1024/1024 |
+------------+---------------+----------------------------+
| t_f1       | xxx.xx.xxx.xx |             6.144531250000 |
+------------+---------------+----------------------------+
The size of the compressed table is approximately 6.145 GB. The original CSV file is 10 GB, so the compression ratio is about 10/6.145 ≈ 1.63.