This topic describes the technical mechanism of OBLOADER & OBDUMPER.
OBLOADER
You can use this tool to import data into OceanBase Database. In general, the data import procedure of the OBLOADER client consists of two phases: preprocessing and data writing. During preprocessing, OBLOADER performs the following steps:
Queries the metadata of the table into which data is to be imported.
Filters and matches files.
Logically splits large files.
Calculates the partitions of subfiles and distributes them across nodes.
In the data writing phase, OBLOADER works like a multi-producer, multi-consumer model. Parsing threads, functioning as producers, continuously parse data from files, cleanse the data, and store it in a buffer. Writing threads, functioning as consumers, continuously extract data from the buffer and write it to the database.
The numbers of producers and consumers and the size of the buffer are the main factors that affect the performance of the OBLOADER client. You can specify the number of producers by using the --thread option. OBLOADER calculates the buffer size based on the value of the --thread option and the maximum heap memory of the Java Virtual Machine (JVM). We recommend that you do not specify the buffer size manually. However, if necessary, you can specify it by using the --buffer-size option, whose value must be a power of 2.
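For reference, the following sketch shows how these tuning options might be combined in an OBLOADER command. The connection and file options (host, port, user, table, and input path) are placeholders for illustration only; --thread and --buffer-size are the tuning options described above.

```bash
# Illustrative sketch only; adjust the connection and file options to your environment.
# --thread sets the number of parsing (producer) threads.
# --buffer-size, if specified at all, must be a power of 2; by default OBLOADER
# calculates the buffer size from --thread and the maximum JVM heap memory.
sh bin/obloader \
    -h <host> -P <port> -u <user> -p '<password>' -t <tenant> -D <database> \
    --csv --table '<table>' -f /path/to/data/ \
    --thread 16 \
    --buffer-size 1024
```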
If you import data by using the direct load method, OBLOADER also goes through a task submission phase. In this phase, the OBLOADER client waits for OBServer nodes to sort the newly written data and build indexes on it, which can take a considerable amount of time.
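If your OBLOADER and OceanBase Database versions support direct load, the import command might additionally carry a direct load switch, as in the hedged sketch below; the --direct option name is an assumption to verify against your version's documentation.

```bash
# Hedged sketch of a direct load import; the --direct option name is an assumption.
# During the task submission phase, the client waits while OBServer nodes sort
# the newly written data and build indexes.
sh bin/obloader \
    -h <host> -P <port> -u <user> -p '<password>' -t <tenant> -D <database> \
    --csv --table '<table>' -f /path/to/data/ \
    --direct
```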
Logically split large files
OBLOADER does not actually generate several subfiles on disk. It logically splits a large file by recording multiple byte positions of the file and concurrently reads data starting from these byte positions during parsing, thereby improving parsing speed.
The size of each logical subfile is specified by the --block-size option, whose default value is 64 MB. Splitting is based on line breaks, so in most cases data is not truncated. The only exception, which is unlikely, is data that itself contains line breaks. You can adjust the file splitting strategy by using the following method:
Use a text editor to open the <root directory of the tool>/bin/obloader script. Find the Java startup parameter -Dfile.split=unsafe and change its value from unsafe to safe.
After the change takes effect, OBLOADER will use both delimiters and line breaks for logical splitting.
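For reference, the parameter in the bin/obloader script might look like the following sketch before and after the change. The JAVA_OPTS variable name is an assumption; the exact line depends on your OBLOADER version.

```bash
# Before the change: logical splitting uses only line breaks.
JAVA_OPTS="$JAVA_OPTS -Dfile.split=unsafe"

# After the change: logical splitting uses both delimiters and line breaks.
JAVA_OPTS="$JAVA_OPTS -Dfile.split=safe"
```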
Calculate partitions
To calculate the partition of a subfile, OBLOADER extracts the first data record in the subfile and then queries the node where the partition leader resides. After that, OBLOADER groups the subfiles that are processed concurrently and distributes them to different nodes for load balancing.
A failure in partition calculation or node query does not affect data import, but OBLOADER prints WARN logs. If you use OceanBase Database of a version earlier than V4.0, the failure may be caused by missing credentials of the sys tenant. If you use OceanBase Database V4.0 or later, the failure may be caused by not specifying the tenant name with the -t option.
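As a hedged example, the following sketch shows how the tenant name can be supplied with the -t option, and how sys tenant credentials might be supplied for versions earlier than V4.0; the --sys-user and --sys-password option names are assumptions to verify against your OBLOADER version.

```bash
# Illustrative only. -t supplies the tenant name (V4.0 or later); the sys tenant
# credential options for earlier versions are assumptions and may differ by version.
sh bin/obloader \
    -h <host> -P <port> -u <user> -p '<password>' \
    -t <tenant> \
    --sys-user root --sys-password '<sys_password>' \
    --csv --table '<table>' -f /path/to/data/
```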
OBDUMPER
You can use this tool to export data from OceanBase Database. OBDUMPER uses a table sharding strategy based on schema definitions to run multiple database or table tasks in parallel, thereby improving data export performance. However, the performance is affected by various factors, such as the table schema design, the network bandwidth for data transmission, and the I/O throughput of data persistence.

The architecture of OBDUMPER has been extended to support more file and storage formats. Rowstore formats such as CSV and SQL are suitable for tasks that involve a small amount of data. For example, you can directly open a CSV file in Excel to view and analyze the data. Tables that contain large amounts of data can be stored in columnstore formats such as ORC and Parquet. OBDUMPER also allows you to export data to different storage media.
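For example, a CSV export of a single table might look like the following sketch; the connection and file options are placeholders, and the options that select columnstore formats such as ORC or Parquet depend on your OBDUMPER version.

```bash
# Illustrative sketch only; adjust the connection and file options to your environment.
# --csv selects the rowstore CSV format; check your OBDUMPER version for the
# options that select ORC or Parquet output.
sh bin/obdumper \
    -h <host> -P <port> -u <user> -p '<password>' -t <tenant> -D <database> \
    --csv --table '<table>' -f /path/to/export/
```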
We recommend that you add primary keys to tables. We also recommend that you split large tables into partitions that contain roughly the same amount of data.
Performance test
This section provides the performance metrics of OBDUMPER in a TPC-H test, where OBDUMPER exports sample data from a 100 GB dataset in different file formats. By default, data compression is enabled for exports to ORC and Parquet files. Therefore, the export performance is slightly lower than that of exports to CSV and SQL files. In actual business scenarios, most users are willing to trade a small amount of performance for lower storage usage. The following table describes the export performance of OBDUMPER.
Data compression test
| Business model | Format | Average row export performance | Average byte export performance | Total time | Storage size |
|---|---|---|---|---|---|
| TPC-H 100 GB | CSV | 1,028,880 rows/s | 144 MB/s | 15.23 min | 117 GB |
| TPC-H 100 GB | ORC | 883,218 rows/s | 97 MB/s | 18.41 min | 27 GB |
| TPC-H 100 GB | Parquet | 843,158 rows/s | 93 MB/s | 16.03 min | 25 GB |
| TPC-H 100 GB | Delimited Text | 1,063,841 rows/s | 134 MB/s | 14.73 min | 107 GB |
By default, OBDUMPER uses the zstd algorithm when it exports data to ORC or Parquet files. The compression ratio of the zstd algorithm in this scenario is about 4:1, which cannot be achieved by compressing an exported CSV file. The compression ratio depends not only on the compression algorithm but also on the characteristics of the data. In other words, the same algorithm can yield significantly different results when compressing different types of data. We recommend that you select the compression algorithm based on the characteristics of your business data. The compression ratio is calculated by using the following formula: Compression ratio ≈ Uncompressed size/Compressed size.
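For example, based on the table above, the same TPC-H 100 GB dataset produces about 107 GB of delimited text files but only about 27 GB of ORC files, which gives a ratio of roughly 107/27 ≈ 4:1 and matches the stated compression ratio of the zstd algorithm in this scenario.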