Data compression overview

OceanBase Database V2.2.77

Last updated: 2023-08-18 09:26:34

OceanBase Database provides an advanced compression feature built on a new hybrid row-column storage architecture, high-performance data encoding methods, and a comprehensive set of data compression algorithms. Together, these significantly reduce the storage space required to store the data.

Hybrid row-column storage architecture

In OceanBase Database, disk space is allocated in macroblocks with a fixed size of 2 MB. Each macroblock consists of several variable-size microblocks. The default microblock size is 16 KB.
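As a rough illustration of these sizes, the following sketch computes how many default-size microblocks could fit in one macroblock. The function name is hypothetical, and the calculation assumes no headers or metadata and ignores the fact that real microblocks vary in size, so it is only an upper-bound estimate.

```python
MACROBLOCK_SIZE = 2 * 1024 * 1024      # fixed macroblock size from the text: 2 MB
DEFAULT_MICROBLOCK_SIZE = 16 * 1024    # default microblock size from the text: 16 KB


def microblocks_per_macroblock(micro_size: int = DEFAULT_MICROBLOCK_SIZE) -> int:
    """Upper bound on whole microblocks of `micro_size` bytes in one macroblock.

    Assumption: block headers and other metadata are ignored.
    """
    if micro_size <= 0:
        raise ValueError("micro_size must be positive")
    return MACROBLOCK_SIZE // micro_size


print(microblocks_per_macroblock())
```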

During major compaction, the database analyzes the data to be stored. If the number of rows to be stored reaches a specified threshold, the database analyzes the data column by column and decides, based on the data characteristics, whether to store it by column or by row. The original row storage architecture thereby becomes a hybrid row-column storage architecture: all data of a given row is stored in the same microblock, but the rows within a microblock are stored by column.
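The layout described above can be sketched as follows. This is a minimal illustration, not OceanBase's on-disk format: the function names are hypothetical, and blocks are cut by a fixed row count (`rows_per_block`) instead of by byte size.

```python
from typing import Any, Dict, List, Sequence, Tuple

Block = Dict[str, List[Any]]


def pack_microblocks(rows: Sequence[Tuple[Any, ...]],
                     column_names: Sequence[str],
                     rows_per_block: int) -> List[Block]:
    """Group rows into blocks; inside each block, store values column by column."""
    blocks: List[Block] = []
    for start in range(0, len(rows), rows_per_block):
        chunk = rows[start:start + rows_per_block]
        # Every row of the chunk stays in this block, but the values are
        # laid out per column rather than per row.
        block = {name: [row[i] for row in chunk]
                 for i, name in enumerate(column_names)}
        blocks.append(block)
    return blocks


def read_row(block: Block, column_names: Sequence[str], idx: int) -> Tuple[Any, ...]:
    """Reassemble one row from the column arrays of a single block."""
    return tuple(block[name][idx] for name in column_names)


rows = [(1, "a"), (2, "b"), (3, "c")]
blocks = pack_microblocks(rows, ["id", "name"], rows_per_block=2)
print(blocks)
```

Because a row never spans blocks, reading a full row touches only one block, while scans over a single column read one contiguous array per block.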

Data encoding

Data encoding improves query performance because encoded data can be queried directly without decoding.

OceanBase Database analyzes data and selects an appropriate encoding algorithm for the data. OceanBase Database supports multiple data encoding methods, including the following common and effective ones:

Dictionary encoding

During dictionary encoding, data is deduplicated and the deduplicated values form a dictionary. Each location that held an original value stores a reference to the corresponding dictionary index instead. Data in the dictionary is sorted by type, which facilitates data compression. Predicates can also be pushed down directly to the dictionary during computation, enabling fast iteration based on binary logic.
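A minimal sketch of dictionary encoding with a pushed-down equality predicate might look like the following. The function names are hypothetical. Two assumptions are made here: the dictionary is sorted by value, and the predicate is resolved with a binary search over it. The actual ordering and lookup strategy in OceanBase may differ.

```python
from bisect import bisect_left
from typing import Any, List, Sequence, Tuple


def dict_encode(values: Sequence[Any]) -> Tuple[List[Any], List[int]]:
    """Deduplicate values into a sorted dictionary and replace each value by its index."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    refs = [index[v] for v in values]
    return dictionary, refs


def dict_decode(dictionary: Sequence[Any], refs: Sequence[int]) -> List[Any]:
    """Rebuild the original column from the dictionary and the references."""
    return [dictionary[r] for r in refs]


def rows_equal_to(dictionary: Sequence[Any], refs: Sequence[int], target: Any) -> List[int]:
    """Evaluate `column = target` on the dictionary first, then compare integer refs only."""
    pos = bisect_left(dictionary, target)  # assumption: binary search on the sorted dictionary
    if pos == len(dictionary) or dictionary[pos] != target:
        return []  # target is not in the dictionary, so no row can match
    return [row for row, r in enumerate(refs) if r == pos]


dictionary, refs = dict_encode(["cn", "us", "cn", "de", "us", "cn"])
print(dictionary, refs, rows_equal_to(dictionary, refs, "us"))
```

The predicate is compared against each distinct value at most once, and the per-row work is a cheap integer comparison.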

RLE encoding

In run-length encoding (RLE), runs of consecutive equal values, such as 100, 100, 100, 120, 120, 120, 150, 150, ..., are collapsed, and only the start row number and the value of each run are retained.

RLE encoding is commonly used to process ordered data such as index prefix and index suffix in databases.
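The following sketch applies this idea to the sample sequence from the description. The function names are hypothetical, and the total row count is passed to the decoder separately, which is an assumption about how the length of the last run would be recovered.

```python
from typing import Any, List, Sequence, Tuple


def rle_encode(values: Sequence[Any]) -> List[Tuple[int, Any]]:
    """Keep only (start_row, value) for each run of consecutive equal values."""
    runs: List[Tuple[int, Any]] = []
    for row, value in enumerate(values):
        if not runs or runs[-1][1] != value:
            runs.append((row, value))
    return runs


def rle_decode(runs: Sequence[Tuple[int, Any]], total_rows: int) -> List[Any]:
    """Expand runs back into rows; a run ends where the next one starts."""
    out: List[Any] = []
    for i, (start, value) in enumerate(runs):
        end = runs[i + 1][0] if i + 1 < len(runs) else total_rows
        out.extend([value] * (end - start))
    return out


column = [100, 100, 100, 120, 120, 120, 150, 150]
print(rle_encode(column))
```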

Difference encoding

Difference encoding is a type of numeric encoding, applicable to integer data distributed within a small value range. By calculating the minimum and maximum values in the data range and subtracting the minimum value from each data record, this method can encode data with a smaller bit width.
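As a minimal sketch of the calculation described above, the code below subtracts the minimum from every value and reports the bit width needed for the largest difference. The function names are hypothetical, and the input is assumed to be a non-empty list of integers. Bit packing of the differences is left out.

```python
from typing import List, Sequence, Tuple


def delta_encode(values: Sequence[int]) -> Tuple[int, int, List[int]]:
    """Return (base, bit_width, deltas), where base is the minimum value."""
    lo, hi = min(values), max(values)
    bit_width = (hi - lo).bit_length()  # bits needed for the largest difference
    return lo, bit_width, [v - lo for v in values]


def delta_decode(base: int, deltas: Sequence[int]) -> List[int]:
    """Add the base back to every stored difference."""
    return [base + d for d in deltas]


base, width, deltas = delta_encode([1000000, 1000003, 1000001, 1000007])
print(base, width, deltas)
```

In this example the raw values need 20 bits each, while the stored differences fit in 3 bits.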

Constant encoding

During constant encoding, the database identifies the most common value as a constant and records only the values that differ from this constant, together with the row numbers of those values.
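A minimal sketch of constant encoding, assuming a non-empty column, could be written as follows. The function names and the choice to store the total row count alongside the exceptions are illustrative assumptions.

```python
from collections import Counter
from typing import Any, List, Sequence, Tuple


def const_encode(values: Sequence[Any]) -> Tuple[Any, List[Tuple[int, Any]], int]:
    """Return (constant, exceptions, total_rows) for a non-empty column."""
    constant = Counter(values).most_common(1)[0][0]  # most frequent value
    exceptions = [(row, v) for row, v in enumerate(values) if v != constant]
    return constant, exceptions, len(values)


def const_decode(constant: Any, exceptions: Sequence[Tuple[int, Any]], total_rows: int) -> List[Any]:
    """Fill every row with the constant, then patch the exception rows."""
    out = [constant] * total_rows
    for row, value in exceptions:
        out[row] = value
    return out


print(const_encode([0, 0, 0, 7, 0, 0, 9, 0]))
```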

OceanBase Database also provides other encoding methods, including string prefix encoding, Hex encoding, inter-column equivalence encoding, and inter-column substring encoding, so that you can encode different types of data in different business scenarios.

During major compaction, OceanBase Database selects an appropriate encoding method based on the data characteristics and calculates the resulting compression ratio. If the compression ratio is low, OceanBase Database rolls back the encoding and selects another encoding method. This mechanism ensures that the data encoding process does not undermine data write performance. If you are familiar with the data characteristics, you can also manually specify the encoding method during table creation.
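The select, measure, and roll back loop described above can be sketched as follows. Everything in this block is an illustrative assumption: the size estimators, the 8-byte raw value cost, the candidate order, and the `min_ratio` threshold are hypothetical and are not taken from OceanBase.

```python
from typing import Callable, List, Sequence, Tuple

RAW_BYTES_PER_VALUE = 8  # assumption: each unencoded integer costs 8 bytes


def raw_size(values: Sequence[int]) -> int:
    return len(values) * RAW_BYTES_PER_VALUE


def rle_size(values: Sequence[int]) -> int:
    """Estimated size: one (start row, value) pair per run."""
    runs = sum(1 for i, v in enumerate(values) if i == 0 or values[i - 1] != v)
    return runs * 2 * RAW_BYTES_PER_VALUE


def dict_size(values: Sequence[int]) -> int:
    """Estimated size: dictionary entries plus bit-packed references."""
    distinct = len(set(values))
    ref_bits = max(1, (distinct - 1).bit_length())
    return distinct * RAW_BYTES_PER_VALUE + (len(values) * ref_bits + 7) // 8


def choose_encoding(values: Sequence[int],
                    candidates: List[Tuple[str, Callable[[Sequence[int]], int]]],
                    min_ratio: float = 1.2) -> str:
    """Try candidates in order and keep the first whose ratio reaches min_ratio."""
    original = raw_size(values)
    for name, estimate in candidates:
        encoded = estimate(values)
        if encoded > 0 and original / encoded >= min_ratio:
            return name
        # Ratio too low: discard this attempt and try the next candidate.
    return "raw"  # no candidate paid off, so keep the data unencoded


candidates = [("rle", rle_size), ("dict", dict_size)]
print(choose_encoding([5] * 100, candidates))
print(choose_encoding([1, 2, 3, 4] * 25, candidates))
print(choose_encoding(list(range(100)), candidates))
```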

Data compression

Generally, both data compression and decompression consume resources and affect query performance. The larger the data, the higher the resource usage and the greater the performance impact. After data is encoded, its size is usually greatly reduced, which means less storage space, lower resource usage for compression and decompression, and a smaller performance impact.

OceanBase Database supports various compression algorithms including lz4_1.0, snappy_1.0, zlib_1.0, and zstd_1.0.
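To illustrate general-purpose compression of a block of data, the sketch below uses the `zlib` module from the Python standard library. This is only an analogy: it is not OceanBase's `zlib_1.0` implementation, and the helper names and the compression level are assumptions chosen for the example.

```python
import zlib


def compress_block(block: bytes, level: int = 6) -> bytes:
    """Compress one block of bytes with zlib at the given level."""
    return zlib.compress(block, level)


def decompress_block(data: bytes) -> bytes:
    """Restore the original block."""
    return zlib.decompress(data)


def compression_ratio(block: bytes) -> float:
    """Original size divided by compressed size (higher means better compression)."""
    return len(block) / len(compress_block(block))


# A repetitive column of 8-byte integers, similar to the RLE sample values.
block = b"".join(v.to_bytes(8, "little") for v in [100, 120, 150] * 600)
packed = compress_block(block)
print(len(block), len(packed))
```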