This topic describes parallel import and data compression.
Parallel import
In addition to analytical queries, another crucial aspect of operational OLAP is the parallel import of large volumes of data, also known as batch processing. The parallel execution framework of OceanBase Database can execute DML statements concurrently (parallel DML, or PDML), enabling concurrent writes across the machines of a multi-node database while preserving the consistency of large transactions. Combined with an asynchronous flushing mechanism, PDML significantly improves how the LSM-tree storage engine handles large transactions when memory is constrained.
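PDML is enabled per statement with optimizer hints, as used later in this walkthrough. A minimal sketch (target_table and source_table are placeholder names):
-- Enable parallel DML for this statement only.
-- parallel(8) sets the degree of parallelism; choose a value that fits
-- your tenant's CPU quota.
INSERT /*+ parallel(8) enable_parallel_dml */ INTO target_table
SELECT * FROM source_table;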
Try out PDML with the following example. Based on the lineitem table of the TPC-H benchmark, create a table named lineitem2 with the same schema. Then execute an INSERT INTO ... SELECT statement to insert all 6 million rows from the lineitem table into the lineitem2 table. Perform the following steps to compare the results with and without PDML enabled.
First, copy the schema of the lineitem table to create the lineitem2 table. Note that in OceanBase Database, you can scale out a table by partition. In this example, the lineitem2 table also has 16 partitions:
obclient [test]> SHOW CREATE TABLE lineitem\G
*************************** 1. row ***************************
Table: lineitem
Create Table: CREATE TABLE `lineitem` (
`l_orderkey` bigint(20) NOT NULL,
`l_partkey` bigint(20) NOT NULL,
`l_suppkey` bigint(20) NOT NULL,
`l_linenumber` bigint(20) NOT NULL,
`l_quantity` bigint(20) NOT NULL,
`l_extendedprice` bigint(20) NOT NULL,
`l_discount` bigint(20) NOT NULL,
`l_tax` bigint(20) NOT NULL,
`l_returnflag` char(1) DEFAULT NULL,
`l_linestatus` char(1) DEFAULT NULL,
`l_shipdate` date NOT NULL,
`l_commitdate` date DEFAULT NULL,
`l_receiptdate` date DEFAULT NULL,
`l_shipinstruct` char(25) DEFAULT NULL,
`l_shipmode` char(10) DEFAULT NULL,
`l_comment` varchar(44) DEFAULT NULL,
PRIMARY KEY (`l_orderkey`, `l_linenumber`),
KEY `I_L_ORDERKEY` (`l_orderkey`) BLOCK_SIZE 16384 LOCAL,
KEY `I_L_SHIPDATE` (`l_shipdate`) BLOCK_SIZE 16384 LOCAL
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPACT COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 TABLEGROUP = 'x_tpch_tg_lineitem_order_group'
partition by key(l_orderkey)
(partition p0,
partition p1,
partition p2,
partition p3,
partition p4,
partition p5,
partition p6,
partition p7,
partition p8,
partition p9,
partition p10,
partition p11,
partition p12,
partition p13,
partition p14,
partition p15)
1 row in set
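One way to copy the schema is MySQL-style CREATE TABLE ... LIKE, assuming your OceanBase version supports it; otherwise, reuse the CREATE TABLE statement above with the table name changed:
-- Create lineitem2 with the same columns, indexes, and partitioning as lineitem.
obclient [test]> CREATE TABLE lineitem2 LIKE lineitem;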
Perform the insertion without enabling PDML
After creating the lineitem2 table, insert the data without enabling PDML. Because this is a large transaction involving 6 million rows, first increase the default query and transaction timeouts of OceanBase Database (both in microseconds):
obclient [test]> SET ob_query_timeout = 1000000000;
obclient [test]> SET ob_trx_timeout = 1000000000;
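1000000000 μs is roughly 16.7 minutes. To confirm that the new values took effect in the current session, you can list them (an optional sanity check):
obclient [test]> SHOW VARIABLES LIKE 'ob_%timeout';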
The execution result is as follows:
obclient [test]> INSERT INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (1 min 47.312 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As you can see, it takes about 107 seconds for OceanBase Database to insert 6 million rows in a single transaction with PDML disabled.
Enable PDML
Next, add a hint to enable PDML. Before you insert the data again, clear the previously inserted data, and then check the execution time:
obclient [test]> TRUNCATE TABLE lineitem2;
Query OK, 0 rows affected (0.108 sec)

obclient [test]> INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;
Query OK, 6001215 rows affected (22.117 sec)
Records: 6001215 Duplicates: 0 Warnings: 0
As you can see, with PDML enabled, OceanBase Database inserts 6 million rows in about 22 seconds, roughly a 5x performance improvement over the serial run. This feature helps users who need to process large volumes of data in batches.
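To confirm that the optimizer actually chose a parallel plan before running a long statement, you can EXPLAIN it first; a PDML plan contains parallel execution (PX) operators. The exact operator names vary by version, so treat this as a quick sanity check:
obclient [test]> EXPLAIN INSERT /*+ parallel(16) enable_parallel_dml */ INTO lineitem2 SELECT * FROM lineitem;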
Data compression
OceanBase Database has developed its own storage engine based on the LSM-tree structure. In this architecture, baseline data is stored in SSTables on disk, and incremental changes are stored in MemTables in memory. This allows OceanBase Database to store data on disk in a more compact format. In addition, because the baseline data on disk is not frequently updated, OceanBase Database applies general-purpose compression algorithms to it, achieving a high compression ratio without compromising query or write performance. The following sections describe how to import a large amount of external data into OceanBase Database and observe the compression ratio.
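The compression codec is specified per table, as seen in the COMPRESSION = 'zstd_1.3.8' clause of the lineitem schema above. A minimal sketch of declaring a compressed table (t_demo is a placeholder, and codec availability may vary by version):
-- Baseline data (SSTables) for this table is compressed with zstd.
obclient [test]> CREATE TABLE t_demo (
    id BIGINT PRIMARY KEY,
    payload VARCHAR(255)
) COMPRESSION = 'zstd_1.3.8' BLOCK_SIZE = 16384;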
Data preparation
First, use the data preparation tool CreateData.jar to generate 50 million rows of simulated data in the /home/soft directory. The data generation process may take several minutes. You can also use other tools to generate test data.
# mkdir /home/soft/
# java -jar CreateData.jar /home/soft/ 50000000
filePath is : /home/soft/
Start Time : 2022-07-10 15:33:00
End Time : 2022-07-10 15:52:53
# du -sh *
10G     t_bigdata_loader.unl
OceanBase Database allows you to import CSV data through multiple methods. This topic introduces the LOAD DATA statement.
Rename the generated file and verify its size:
# mv t_bigdata_loader.unl t_f1.csv
# du -sh t_f1.csv
10G     t_f1.csv

The t_f1.csv file contains 8 columns of data of different types, produced by the random generator. The first few rows look like this:
1|1896404182|1980-06-01|2004-10-25 13:30:39|7470.689|33062564.9527|nOLqnBYtnp|BzWYjZjeodtBNzXSMyBduMNzwDPSiVmhVgPJMeEkeAwKBCorzblwovIHDKBsQhbVjQnIdoeTsiLXTNwyuAcuneuNaol|
2|572083440|2018-11-09|1998-07-11 01:23:28|6891.054|66028434.4013|UzqteeMaHP|vQWbWBXEWgUqUTzqsOSciiOuvWVcZSrlEOQDwDVGmvGRQYWmhCFdEkpsUsqrWEpKtmxSwURHIHxvmlXHUIxmfelYboeGEuScKKqzpuNLryFsStaFTTRqSsVlCngFFjHnEnpaCnWsdwztbiHJyoGkaxrFmyPAmVregfydArrUZsgRqBpQ|
3|1139841892|2006-10-07|1999-06-26 17:02:22|286.43692|51306547.5055|KJJtylgxkv|BuBdFTBIIFsEPVxsVBRqAnFXSBdtZDgfumUhIx|
4|1777342512|1982-12-18|2017-11-19 07:56:35|2986.242|85860387.8696|rTkUBWhdPt|JSazOTAmvtCBrINttDwublNJNRFDIiWkHtWZXmWgKHoZCKGqmmETkIcYLXiSgKkoaATNgjvPxVGjeCOODLEWqrQHqowbMjOLOKrtirWEOpUSxiUudZduTCUvZElKzZfggvCBNthwzKJc|
...

In the test tenant, create a database named test and a table named t_f1, then import the data in the t_f1.csv file into that table, as sketched below.
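The original walkthrough does not show the t_f1 DDL. The following sketch is inferred from the eight '|'-separated fields in the sample rows above; the column names and exact types are assumptions, so adjust them to match your generator's output:
obclient [test]> CREATE DATABASE IF NOT EXISTS test;
obclient [test]> USE test;
obclient [test]> CREATE TABLE t_f1 (
    c1 BIGINT,         -- sequence number (assumed name)
    c2 BIGINT,         -- random integer
    c3 DATE,           -- random date
    c4 DATETIME,       -- random datetime
    c5 FLOAT,          -- random float
    c6 DECIMAL(18,4),  -- random decimal
    c7 VARCHAR(16),    -- short random string
    c8 VARCHAR(256)    -- long random string
);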
Data import
You can use the built-in LOAD DATA statement of OceanBase Database to import data. LOAD DATA supports parallel import. Before you import the data, perform the following setup. Note that LOAD DATA can only read data files from the local disk of the OBServer node; to import data from a remote host, use the obloader tool of OceanBase Database.
obclient [test]> SET GLOBAL secure_file_priv = "/";
obclient [test]> GRANT FILE ON *.* TO username;
Notice
For security reasons, you can execute the SQL statement that modifies the secure_file_priv parameter only from a local client. For more information, see secure_file_priv.
After the setup is complete, reconnect the session for the changes to take effect. Then set the transaction timeout for the new session again to ensure that the statement does not time out during execution.
obclient [test]> SET ob_query_timeout=1000000000;
obclient [test]> SET ob_trx_timeout=1000000000;
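Before running the import, you can verify that the secure_file_priv change took effect in the new session (an optional check):
obclient [test]> SHOW GLOBAL VARIABLES LIKE 'secure_file_priv';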
Then, run the import statement:
obclient [test]> LOAD DATA /*+ parallel(16) */ INFILE '/home/soft/t_f1.csv' INTO TABLE t_f1 FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
In this test, importing 10 GB of data takes about 4 minutes with parallel import enabled. The CPU parallelism of the tenant in this topic is 16; set an appropriate degree of parallelism based on your own configuration. Higher parallelism leads to faster imports.
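The parallelism a tenant can actually deliver is capped by its parallel execution resources. In recent OceanBase versions you can inspect the tenant-level parallel_servers_target variable (the variable name is an assumption; check your version's documentation):
obclient [test]> SHOW VARIABLES LIKE 'parallel_servers_target';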
After the import is complete, query the number of records in the table. (The occupied space is examined after the major compaction below.)
obclient [test]> SELECT COUNT(*) FROM t_f1;
+----------+
| count(*) |
+----------+
| 50000000 |
+----------+
The query result shows that the table contains 50 million records.

Initiate a major compaction
To view the compression effect on the baseline data, log in to the sys tenant of the cluster as the root user and initiate a major compaction, so that the incremental data is merged with the baseline data and compressed. You can manually trigger a major compaction as follows:
# obclient -h127.0.0.1 -P2881 -uroot@sys -Doceanbase -A -p -c
obclient [oceanbase]> ALTER SYSTEM MAJOR FREEZE;
Query the state of the major compaction. When STATUS returns IDLE, the major compaction is complete.
obclient [oceanbase]> SELECT * FROM oceanbase.CDB_OB_MAJOR_COMPACTION;
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
| TENANT_ID | FROZEN_SCN          | FROZEN_TIME                | GLOBAL_BROADCAST_SCN | LAST_SCN            | LAST_FINISH_TIME           | START_TIME                 | STATUS | IS_ERROR | IS_SUSPENDED | INFO |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
|         1 | 1679248800404017149 | 2023-03-20 02:00:00.404017 |  1679248800404017149 | 1679248800404017149 | 2023-03-20 02:00:44.035785 | 2023-03-20 02:00:00.442481 | IDLE   | NO       | NO           |      |
|      1001 | 1679248804191204504 | 2023-03-20 02:00:04.191205 |  1679248804191204504 | 1679248804191204504 | 2023-03-20 02:00:46.094551 | 2023-03-20 02:00:04.218608 | IDLE   | NO       | NO           |      |
|      1002 | 1679248802450394471 | 2023-03-20 02:00:02.450394 |  1679248802450394471 | 1679248802450394471 | 2023-03-20 02:00:33.818514 | 2023-03-20 02:00:02.484814 | IDLE   | NO       | NO           |      |
+-----------+---------------------+----------------------------+----------------------+---------------------+----------------------------+----------------------------+--------+----------+--------------+------+
3 rows in set

Execute the following statement from the sys tenant to view the data storage usage after the data is imported into OceanBase Database:
obclient [oceanbase]> SELECT b.table_name, a.svr_ip, a.data_size/1024/1024/1024
    FROM CDB_OB_TABLET_REPLICAS a, CDB_OB_TABLE_LOCATIONS b
    WHERE a.tablet_id = b.tablet_id AND b.table_name = 'T_F1';
+------------+---------------+----------------------------+
| table_name | svr_ip        | a.data_size/1024/1024/1024 |
+------------+---------------+----------------------------+
| t_f1       | xxx.xx.xxx.xx |             6.144531250000 |
+------------+---------------+----------------------------+
The size of the compressed table is approximately 6.14 GB, so the compression ratio is about 10/6.14 ≈ 1.63.