This topic describes parallel import and data compression.
Scenarios where they are commonly used include:
Data migration: Parallel import can significantly speed up data migration when you need to migrate a large amount of data between systems. Data compression can reduce the required transmission bandwidth and the amount of storage resources.
Backup and restore: Data compression can reduce the size of backup files to save storage space. Parallel import can speed up data restore.
Columnstore tables: A query on a columnstore table usually needs to access only the columns involved rather than entire rows, and the column data can be scanned and processed while it remains compressed. This reduces I/O and improves overall query performance. We recommend that you compress data in columnstore tables after you import the data in batches; note that major compactions of columnstore tables proceed more slowly. A columnstore table declaration is sketched after this list.
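The following is a minimal sketch of such a declaration. It assumes OceanBase Database V4.3 or later, where the WITH COLUMN GROUP clause is available; the table and column names are illustrative:
obclient [test]> CREATE TABLE orders_cs (
    o_id BIGINT PRIMARY KEY,
    o_amount DECIMAL(10, 2),
    o_date DATE
) WITH COLUMN GROUP (each column);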
Parallel import
Apart from analytical queries, operational OLAP also involves the parallel import of large amounts of data, also known as batch processing. The parallel execution framework of OceanBase Database allows DML statements to be executed in parallel (Parallel DML, or PDML), enabling concurrent writes across servers while preserving the consistency of large transactions. The asynchronous minor compaction feature also significantly improves support for large transactions in the LSM-tree-based storage engine when memory is tight.
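Besides the per-statement hint demonstrated later in this example, parallel DML can also be enabled for a whole session. The following is a sketch; the hidden session variable below is taken from the OceanBase Database documentation on parallel DML and should be treated as version-dependent:
obclient [test]> SET SESSION _FORCE_PARALLEL_DML_DOP = 16;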
Try out PDML in this example: based on the lineitem table of the TPC-H benchmark, create a table named lineitem2 with the same schema, and then use an INSERT INTO ... SELECT statement to insert all 6,001,215 rows (about 6 million) from the lineitem table into the lineitem2 table. Perform the following operations to compare the results with PDML disabled and enabled.
First, copy the schema of the lineitem table to create the lineitem2 table. Note that OceanBase Database can scale out a table's data by partition; in this example, the lineitem2 table also has 16 partitions:
obclient [test]> SHOW CREATE TABLE lineitem\G
*************************** 1. row ***************************
Table: lineitem
Create Table: CREATE TABLE `lineitem` (
`l_orderkey` bigint(20) NOT NULL,
`l_partkey` bigint(20) NOT NULL,
`l_suppkey` bigint(20) NOT NULL,
`l_linenumber` bigint(20) NOT NULL,
`l_quantity` bigint(20) NOT NULL,
`l_extendedprice` bigint(20) NOT NULL,
`l_discount` bigint(20) NOT NULL,
`l_tax` bigint(20) NOT NULL,
`l_returnflag` char(1) DEFAULT NULL,
`l_linestatus` char(1) DEFAULT NULL,
`l_shipdate` date NOT NULL,
`l_commitdate` date DEFAULT NULL,
`l_receiptdate` date DEFAULT NULL,
`l_shipinstruct` char(25) DEFAULT NULL,
`l_shipmode` char(10) DEFAULT NULL,
`l_comment` varchar(44) DEFAULT NULL,
PRIMARY KEY (`l_orderkey`, `l_linenumber`),
KEY `I_L_ORDERKEY` (`l_orderkey`) BLOCK_SIZE 16384 LOCAL,
KEY `I_L_SHIPDATE` (`l_shipdate`) BLOCK_SIZE 16384 LOCAL
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPACT COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 TABLEGROUP = 'x_tpch_tg_lineitem_order_group'
partition by key(l_orderkey)
(partition p0,
partition p1,
partition p2,
partition p3,
partition p4,
partition p5,
partition p6,
partition p7,
partition p8,
partition p9,
partition p10,
partition p11,
partition p12,
partition p13,
partition p14,
partition p15)
1 row in set
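Based on the schema above, one simple way to create lineitem2 is the following sketch; it assumes the MySQL mode of OceanBase Database, where CREATE TABLE ... LIKE is supported (whether attributes such as the table group carry over may depend on your version):
obclient [test]> CREATE TABLE lineitem2 LIKE lineitem;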
Perform the insertion without enabling PDML
After the lineitem2 table is created, insert the data without enabling PDML. Because this is a large transaction that inserts 6 million rows, first increase the query and transaction timeout values of OceanBase Database (in μs):
obclient [test]> SET ob_query_timeout = 1000000000;
obclient [test]> SET ob_trx_timeout = 1000000000;
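To confirm that the new values have taken effect in the current session, you can query them back, for example:
obclient [test]> SHOW VARIABLES LIKE 'ob_%timeout';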
The execution result is as follows:
obclient [test]> INSERT INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (1 min 47.312 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As we can see, it takes 107 seconds for OceanBase Database to insert 6 million rows in a single transaction when PDML is disabled.
Enable PDML
Add a hint to enable PDML. Note that before you insert the data again, you must clear the previously inserted data. The execution result is as follows:
obclient [test]> TRUNCATE TABLE lineitem2;
Query OK, 0 rows affected (0.108 sec)
obclient [test]> INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (22.117 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As we can see, with PDML enabled, OceanBase Database inserts the 6 million rows in approximately 22 seconds, about a 5x improvement over the 107 seconds of the serial execution. This feature is useful for batch processing workloads.
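To check whether the optimizer actually produces a parallel DML plan before running such a statement, you can inspect the plan with EXPLAIN. This is a sketch; the exact operator names in the output vary by version, but a PDML plan generally contains PX (parallel execution) operators and a distributed INSERT:
obclient [test]> EXPLAIN INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;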
Data compression
OceanBase Database has developed its own storage engine based on the LSM-tree structure, in which data is divided into baseline data (SSTables) and incremental data (MemTables). Baseline data is stored on disk, while incremental modifications are kept in memory. This approach not only stores on-disk data more compactly but also allows OceanBase Database to achieve a high compression ratio by re-compressing the data with general-purpose compression algorithms, without compromising query or write performance. The following describes how to import a large amount of external data and observe its compression ratio in OceanBase Database.
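The compression algorithm is a per-table attribute. As the lineitem DDL earlier in this topic shows, it is specified with the COMPRESSION option; the following minimal sketch uses an illustrative table name and the same zstd algorithm:
obclient [test]> CREATE TABLE t_demo (
    id BIGINT PRIMARY KEY,
    payload VARCHAR(255)
) COMPRESSION = 'zstd_1.3.8' BLOCK_SIZE = 16384;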
Data preparation
First, use the data preparation tool CreateData.jar to generate 50 million rows of simulated data in the /home/soft directory. Data generation may take several minutes. You can also use other tools to generate test data.
# mkdir /home/soft/
# java -jar CreateData.jar /home/soft/ 50000000
filePath is : /home/soft/
Start Time : 2022-07-10 15:33:00
End Time : 2022-07-10 15:52:53
# du -sh *
10G t_bigdata_loader.unl
OceanBase Database allows you to import CSV-format data into the database through multiple methods. This topic shows how to import data by using the LOAD DATA statement.
Name the generated file and verify its size:
# mv t_bigdata_loader.unl t_f1.csv
# du -sh t_f1.csv
10G t_f1.csv
View the content of the t_f1.csv file. You will find that the random algorithm has generated 8 columns of data based on predefined rules, which correspond to different data types. Therefore, before you explore the data compression feature of OceanBase Database, create a table in the tenant and import the records from the CSV file into the table.
1|1896404182|1980-06-01|2004-10-25 13:30:39|7470.689|33062564.9527|nOLqnBYtnp|BzWYjZjeodtBNzXSMyBduMNzwDPSiVmhVgPJMeEkeAwKBCorzblwovIHDKBsQhbVjQnIdoeTsiLXTNwyuAcuneuNaol|
2|572083440|2018-11-09|1998-07-11 01:23:28|6891.054|66028434.4013|UzqteeMaHP|vQWbWBXEWgUqUTzqsOSciiOuvWVcZSrlEOQDwDVGmvGRQYWmhCFdEkpsUsqrWEpKtmxSwURHIHxvmlXHUIxmfelYboeGEuScKKqzpuNLryFsStaFTTRqSsVlCngFFjHnEnpaCnWsdwztbiHJyoGkaxrFmyPAmVregfydArrUZsgRqBpQ|
3|1139841892|2006-10-07|1999-06-26 17:02:22|286.43692|51306547.5055|KJJtylgxkv|BuBdFTBIIFsEPVxsVBRqAnFXSBdtZDgfumUhIx|
4|1777342512|1982-12-18|2017-11-19 07:56:35|2986.242|85860387.8696|rTkUBWhdPt|JSazOTAmvtCBrINttDwublNJNRFDIiWkHtWZXmWgKHoZCKGqmmETkIcYLXiSgKkoaATNgjvPxVGjeCOODLEWqrQHqowbMjOLOKrtirWEOpUSxiUudZduTCUvZElKzZfggvCBNthwzKJc|
....
In the test tenant and test database, create a table named t_f1, as sketched below. For more information, see the topic on how to create a tenant.
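The exact DDL of t_f1 is not shown in this topic. The following is a plausible sketch inferred from the eight pipe-delimited columns in the sample rows above; all column names are hypothetical:
obclient [test]> CREATE TABLE t_f1 (
    c1 BIGINT,          -- sequence number
    c2 BIGINT,          -- random integer
    c3 DATE,            -- random date
    c4 DATETIME,        -- random datetime
    c5 FLOAT,           -- random float
    c6 DECIMAL(16, 4),  -- random decimal
    c7 VARCHAR(10),     -- short random string
    c8 VARCHAR(200)     -- long random string
);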
Data import
You can use the built-in LOAD DATA statement of OceanBase Database to import data. LOAD DATA supports parallel import, but it can only load data files that reside locally on the OBServer node; to import data remotely, use the obloader tool of OceanBase Database. Before you import data, perform the following setup.
obclient [test]> SET GLOBAL secure_file_priv = "/";
obclient [test]> GRANT FILE ON *.* TO username;
Notice
For security reasons, you can execute the SQL statement that modifies the secure_file_priv parameter only through a local connection. For more information, see secure_file_priv.
After the setup is completed, reconnect to make the changes take effect.
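For example, a local reconnection to the test tenant might look like the following sketch (the user name, tenant name, and password are placeholders; the host and port match the sys tenant connection example later in this topic):
# obclient -h127.0.0.1 -P2881 -uusername@test -p -c -A test
Then set the transaction timeout period of the session to ensure that the transaction does not time out during execution: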
obclient [test]> SET ob_query_timeout=1000000000;
obclient [test]> SET ob_trx_timeout=1000000000;
Then run the import statement:
obclient [test]> LOAD DATA /*+ parallel(16) */ INFILE '/home/soft/t_f1.csv' INTO TABLE t_f1 FIELDS TERMINATED BY '\|' LINES TERMINATED BY '\n';
As you can see, with parallel import enabled, it takes about 4 minutes to import the 10 GB of data. The CPU parallelism of the tenant in this topic is 16. Set an appropriate degree of parallelism based on your own configuration; in general, higher parallelism yields a faster import, up to the CPU capacity of the tenant.
After the import is completed, query the number of records and the occupied space size in the table.
There are 50 million records in the table:
obclient [test]> SELECT COUNT(*) FROM t_f1;
+----------+
| count(*) |
+----------+
| 50000000 |
+----------+
Perform a major compaction
To view the compression effect of the baseline data, log in to the sys tenant of the cluster as the root user and manually trigger a major compaction, so that the incremental data is compacted and compressed together with the baseline data:
# obclient -h127.0.0.1 -P2881 -uroot@sys -Doceanbase -A -p -c
obclient [oceanbase]> ALTER SYSTEM MAJOR FREEZE;
Run the following query until the STATUS column returns IDLE, which indicates that the major compaction is completed.
obclient [oceanbase]> SELECT * FROM oceanbase.CDB_OB_MAJOR_COMPACTION;
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
| TENANT_ID | FROZEN_SCN          | FROZEN_TIME                | GLOBAL_BROADCAST_SCN | LAST_SCN            | LAST_FINISH_TIME           | START_TIME                 | STATUS | IS_ERROR | IS_SUSPENDED | INFO |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
|         1 | 1679248800404017149 | 2023-03-20 02:00:00.404017 |  1679248800404017149 | 1679248800404017149 | 2023-03-20 02:00:44.035785 | 2023-03-20 02:00:00.442481 | IDLE   | NO       | NO           |      |
|      1001 | 1679248804191204504 | 2023-03-20 02:00:04.191205 |  1679248804191204504 | 1679248804191204504 | 2023-03-20 02:00:46.094551 | 2023-03-20 02:00:04.218608 | IDLE   | NO       | NO           |      |
|      1002 | 1679248802450394471 | 2023-03-20 02:00:02.450394 |  1679248802450394471 | 1679248802450394471 | 2023-03-20 02:00:33.818514 | 2023-03-20 02:00:02.484814 | IDLE   | NO       | NO           |      |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
3 rows in set
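If you only want the state, you can also poll just the relevant columns; an empty result from the following query means that no major compaction is in progress for any tenant:
obclient [oceanbase]> SELECT TENANT_ID, STATUS FROM oceanbase.CDB_OB_MAJOR_COMPACTION WHERE STATUS != 'IDLE';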
Execute the following query in the sys tenant to view the storage usage of the data imported to OceanBase Database.
obclient [oceanbase]> SELECT b.table_name, a.svr_ip, a.data_size/1024/1024/1024 FROM CDB_OB_TABLET_REPLICAS a, CDB_OB_TABLE_LOCATIONS b WHERE a.tablet_id = b.tablet_id AND b.table_name = 'T_F1';
+------------+---------------+----------------------------+
| table_name | svr_ip        | a.data_size/1024/1024/1024 |
+------------+---------------+----------------------------+
| t_f1       | xxx.xx.xxx.xx |             6.144531250000 |
+------------+---------------+----------------------------+
The size of the compressed table is approximately 6.145 GB, so the compression ratio is 10/6.145 ≈ 1.63.