OceanBase Database provides an advanced compression feature based on a new hybrid row-column storage architecture, high-performance data encoding methods, and a series of comprehensive data compression algorithms. The feature significantly reduces the storage space required for data compressed with the same backend.
General compression
General compression is a process in which a compression algorithm is used to directly compress data blocks regardless of the data structure. This compression method encodes data blocks based on the characteristics of binary data to reduce the redundancy of stored data. Compressed data cannot be randomly accessed. Both compression and decompression are performed on data blocks. OceanBase Database supports four compression algorithms: zlib, snappy, lz4, and zstd. The compression level of zstd and lz4 algorithms is 1. The compression level of the zlib algorithm is 6. The snappy algorithm uses the default compression level. The four algorithms are used to compress microblocks of 16 KB during compression testing in OceanBase Database. During the testing, the snappy and lz4 algorithms achieve high compression speeds with low compression ratios, and the zlib and zstd algorithms achieve high compression ratios with low compression speeds. The lz4 and snappy algorithms have similar compression ratios, but the lz4 algorithm has higher compression and decompression speeds. The zstd and zlib algorithms have similar compression ratios, but the zstd algorithm has higher compression and decompression speeds.
In MySQL mode, OceanBase Database allows you to use one of the preceding compression algorithms. In Oracle mode, OceanBase Database is compatible with the compression algorithms of Oracle Database and allows you to use either the lz4 or zstd algorithm.
Encoding
Based on general compression, OceanBase Database provides a hybrid row-column storage encoding method for databases. In contrast to general compression, encoding is a process in which a compression algorithm compresses data blocks based on the format and semantics of data in data blocks. OceanBase Database is a relational database in which data is organized by table. Each column of a table represents a fixed type of data. Therefore, data in the same column is similar in a logical sense. In specific scenarios, data in adjacent rows of a business table are also similar. Therefore, you can compress and store data by column to improve compression performance. To compress data by column, OceanBase Database introduces microblocks in the encoding format. Unlike microblocks in the flat format in which all data is serialized by row, microblocks in the encoding format are stored in hybrid row-column storage mode. Logically, a microblock still stores a set of row data, but the data is encoded by column. Fixed-length encoded data is stored in the columnstore area of the microblock, and variable-length data is stored in the variable-length area by row. Data in microblocks in encoding format can be randomly accessed. If you want to read a specific row of data in a microblock, you can decode only the row of data. This prevents specific decompression algorithms from decompressing the entire data block when you want to read only a part of data in the data block. To reduce projection overheads, you can also decode only specified columns during vectorized execution.
OceanBase Database selects an appropriate encoding algorithm when it analyzes data. OceanBase Database supports multiple data encoding methods, including the following common and effective methods:
Dictionary encoding
During dictionary encoding, the data of small cardinality is deduplicated and the deduplicated data forms a dictionary. The locations that store original data store references to the specific dictionary subscript. In addition, data in the dictionary is sorted by type, which facilitates data compression. Predicates can be directly pushed down to the dictionary during computing to implement fast iteration based on the binary logic.
Run-length encoding (RLE)
During RLE, consecutive equal data records such as 100, 100, 100, 120, 120, 120, 150, 150, ... are deduplicated, and only the start row numbers and values of the repeated data records are retained.
RLE is commonly used to process the data of small cardinality in databases, such as index prefixes and index suffixes.
Delta encoding
Delta encoding is a type of numeric encoding that applies to integer data distributed within a small value range. By calculating the minimum and maximum values in the data range and subtracting the minimum value from each data record, this method can encode data with a smaller bit width.
Constant encoding
During constant encoding, the database identifies the most common data record as a constant and records only the values that are not equal to this constant and row numbers of the values.
OceanBase Database also provides a variety of other encoding methods, such as prefix encoding, hex encoding, column equal encoding, and column substring encoding. During major compaction, OceanBase Database selects an appropriate encoding method based on the data characteristics and calculates the compression ratio of the data. If the compression ratio is low, OceanBase Database rolls back the encoding and selects another encoding method. This mechanism ensures that the data encoding process does not undermine the data writing performance. If you are familiar with the data characteristics, you can also manually specify the encoding method during table creation.
More information
For more information about data compression and encoding, see Compression and encoding.