This topic introduces the storage structure of OceanBase Database.

The storage engine of OceanBase Database is based on the LSM-tree architecture. Data is divided into static baseline data (stored in an SSTable) and dynamic incremental data (stored in a MemTable). An SSTable is read-only and stored on the disk. After an SSTable is generated, it is not modified. A MemTable can be read and written and is stored in the memory. Data related to DML operations, such as inserting, updating, and deleting, is first written into the MemTable. After the size of the MemTable reaches the specified threshold, its data is compacted with the baseline data and stored in the SSTable on the disk. After OceanBase Database receives a user query, it queries both the SSTable and the MemTable, merges the query results, and then returns the results to the SQL layer. OceanBase Database implements both block cache and row cache in the memory to prevent random read of the baseline data.
When the incremental data in a MemTable reaches a specified size, it is compacted with the baseline data and then persisted to the disk. The system performs a daily major compaction during idle hours every night.
OceanBase Database is essentially a storage engine for baseline and incremental data. OceanBase Database integrates the benefits of the LSM-tree architecture and the storage engines of specific traditional relational databases.
Traditional databases divide data into many pages. OceanBase Database borrows this idea and divides data files into macroblocks of 2 MB. Each macroblock is split into multiple variable-length microblocks. During major compactions, data is reused based on the granularity of macroblocks, and non-updated macroblocks are not accessed or read again. This minimizes write amplification during major compactions and significantly reduces compaction costs compared with databases built on the traditional LSM-tree architecture.
OceanBase Database uses the baseline/incremental data design, which means that some data is baseline data and some is incremental data. In principle, every query reads both the baseline and incremental data. Therefore, OceanBase Database has undergone many optimizations, especially for single-row queries. In addition to the data block cache, OceanBase Database also caches rows. Row cache significantly accelerates single-row queries. OceanBase Database builds bloom filters to filter queries for rows that do not exist and caches these filters. Most operations related to online transaction processing (OLTP) business are small queries. OceanBase Database optimizes small queries to eliminate the overhead of parsing the entire data block as traditional databases do. This makes the performance of OceanBase Database comparable to that of an in-memory database. In addition, OceanBase Database can use effective compression algorithms, since the baseline data is read-only and is continuously stored. This ensures a high compression ratio and greatly reduces the cost without affecting the query performance.
By leveraging the benefits of traditional databases, OceanBase Database provides a more general relational database storage engine based on the LSM-tree architecture and achieves the following benefits:
High cost efficiency: In the LSM-tree architecture, written data is no longer updated. OceanBase Database integrates an in-house hybrid row-column encoding method with general compression algorithms, which increases the data compression ratio by more than 10 times compared with traditional databases.
Ease of use: Unlike other LSM-tree-based databases, OceanBase Database allows you to write data of active transactions to disks. This ensures that large or long-running transactions can be run and rolled back as expected. The multi-level major and minor compaction mechanism helps you strike a balance between performance and storage space.
High performance: For common point queries, OceanBase Database provides multi-level cache acceleration to minimize the response delay. For range scans, the storage engine supports calculation pushdown of filter conditions based on data encoding characteristics and provides native vectorization.
High reliability: In addition to end-to-end data verification, OceanBase Database also compares multiple replicas and compares primary tables with index tables during global major compactions based on its native distributed capabilities to ensure data correctness. In addition, OceanBase Database supports regular scans for background threads to prevent silent errors.
Features of the storage engine
The storage engine of OceanBase Database provides the following features:
Data storage
Data organization
Like other LSM-tree-based databases, OceanBase Database also classifies data into incremental data (stored in a MemTable) and static data (stored in an SSTable). An SSTable is read-only and stored on the disk. After an SSTable is generated, it is not modified. A MemTable can be read and written and is stored in the memory. Data related to DML operations, such as inserting, updating, and deleting, is first written into the MemTable. After the size of the MemTable reaches the specified threshold, its data is compacted with the baseline data and stored in the SSTable on the disk.
In addition, SSTables in OceanBase Database are further classified into mini SSTables, minor SSTables, and major SSTables. Data in the MemTable is compacted to mini SSTables. Multiple mini SSTables are compacted into a minor SSTable on a periodic basis. During daily compactions specific to OceanBase Database, all mini SSTables and minor SSTables of each partition are compacted into a major SSTable.
Storage structure
In OceanBase Database, the basic storage unit of each partition is an SSTable, while the basic storage units of the entire database are macroblocks. When OceanBase Database is started, OceanBase Database divides each data file into macroblocks of 2 MB. Each SSTable is a collection of macroblocks.
Each macroblock is further split into multiple microblocks. The concept of microblock is similar to that of page or block in traditional databases. However, based on the LSM-tree architecture, microblocks in OceanBase Database are compressed. When you create a table, you can set the
block_sizeparameter to specify the size of microblocks before compression.OceanBase Database stores microblocks in the encoding or flat format based on the storage format that you specify. Microblock data in the encoding format is stored in hybrid row-column storage mode. Microblock data in the flat format is distributed in rows.
Compression and encoding
OceanBase Database encodes and compresses microblock data based on the mode specified in the user table. If encoding is enabled in the user table, OceanBase Database uses the dictionary, run-length, constant, or differential encoding method to encode data of each microblock by column. After compression of data in each column, OceanBase Database performs inter-column equivalence encoding or inter-column substring encoding for multiple columns. Encoding helps compress your data to a smaller size. In addition, the extracted in-column characteristic information can speed up subsequent queries.
After data is encoded, OceanBase Database allows you to compress microblock data in a lossless manner by using general compression algorithms. This further increases the data compression ratio.
Minor and major compactions
Minor compaction
The concept of minor compaction in OceanBase Database is similar to that of compaction in other LSM-tree-based databases. During minor compactions, data in MemTables is written to SSTables, and data in multiple SSTables is compacted. OceanBase Database adopts the leveled and size-tiered compaction strategy. A database is divided into three layers. The L1 and L2 layers are at a fixed level, and the L0 layer is size-tiered. In the L0 layer, compactions are performed based on the write amplification factor and the number of SSTables.
Major compaction
In OceanBase Database, the concept of major compaction, also known as daily compaction, has a minor difference from other LSM-tree-based databases. This concept was first proposed to indicate an overall compaction of the entire cluster at about 2:00 every morning. Major compactions are initiated by RootService of each tenant based on the write status or user settings. RootService selects a global snapshot in each major compaction of each tenant and performs major compactions for all partitions of the tenant by using the snapshot data. In this way, each major compaction of all data of the tenant generates an SSTable based on the unified snapshot. This helps you integrate incremental data regularly, improves read performance, and provides a natural data verification point. With global consistency snapshots, OceanBase Database supports physical data verification between multiple replicas and between primary tables and index tables.
Queries, reads, and writes
Insert
All data tables in OceanBase Database are clustered index tables. OceanBase Database maintains hidden primary keys even for heap tables without primary keys. Therefore, when you insert data, OceanBase Database checks whether data with the same primary key already exists in the current data table before it writes the data to the MemTable. To speed up the query for duplicate primary keys, background threads asynchronously schedule and build bloom filters for SSTables based on the frequency of macroblock duplication assessment.
Update
OceanBase Database is built on the LSM-tree architecture. Therefore, each update in OceanBase Database inserts a new data row. In contrast to CLOGs, data written to the MemTable during an update contains only new values of the updated column and the corresponding primary key column. The updated row does not necessarily contain data of all table columns. Continuous background compactions merge these incremental updates to speed up your queries.
Delete
Similar to updates, deletions do not directly apply to the original data. Instead, a row of data is written based on the primary key of the deleted row, and a line header is used to mark the deletion. A large number of deletions are not suited to an LSM-tree-based database. Even if a data range is deleted, the database must iterate all rows with deletion markers in this range and merge the rows before it can confirm the deletion status. OceanBase Database provides an internal range deletion marker logic to resolve this issue. OceanBase Database also allows you to explicitly specify a table mode. In this way, OceanBase Database can recycle deleted rows by using special minor and major compactions to speed up your queries.
Query variables
Data is incrementally updated. Therefore, when you query each row of data, OceanBase Database must go through all MemTables and SSTables from the latest to the earliest versions, merge data corresponding to the primary keys in all tables, and return the merged data. OceanBase Database speeds up data access based on caches. In large queries, the SQL layer pushes down filter conditions to the storage layer, and the storage layer performs quick underlying filtering based on characteristics of stored data. In addition, OceanBase Database supports batch calculations and responses in vectorization scenarios.
Multi-level caches
To improve performance, OceanBase Database supports multi-level caches: block cache for data microblocks in queries, row cache for each SSTable, fuse row cache for fusion query results, and Bloom filter cache for null judgment during insertion. All caches in a tenant share the memory. When data is written to the MemTable in an excessively fast manner, you can flexibly schedule memory resources of cached objects for data writes.
Data Verification
As a financial-grade relational database, OceanBase Database always prioritizes data quality and security. OceanBase Database verifies every piece of persistent data in all processes. By leveraging the inherent benefits of multi-replica storage, OceanBase Database also verifies data between replicas to ensure overall data consistency.
Logical verification
In a common deployment mode, each user table in OceanBase Database has multiple replicas. During daily major compaction of a tenant, consistent baseline data is generated for all replicas based on a globally unified snapshot version. When the compaction is completed, OceanBase Database compares data of all replicas with the data checksum to ensure exact consistency. Furthermore, OceanBase Database compares data with the checksum of indexed columns based on the index of the user table. In this way, program errors do not cause inaccuracy of the data returned to you.
Physical verification
OceanBase Database records checksums in microblocks (the smallest I/O granularity in data storage), macroblocks, SSTables, and partitions to verify data in each data read operation. To prevent issues caused by underlying storage hardware, OceanBase Database immediately verifies data after it is written to macroblocks during minor or major compaction. In addition, each server runs a background data inspection thread to scan and verify overall data on a regular basis to detect silent errors on disks.