
The storage engine of OceanBase Database is based on the LSM-Tree architecture. It divides data into static baseline data (stored in SSTables) and dynamic incremental data (stored in MemTables). An SSTable is stored on disk, is read-only, and is never modified after it is generated. A MemTable resides in memory and supports both reads and writes. DML operations, such as inserts, updates, and deletes, are first written to the MemTable. When the MemTable reaches a certain size, it is written to disk as an SSTable. During query execution, OceanBase Database queries both SSTables and MemTables, merges the results, and returns the merged result to the SQL layer. OceanBase Database also implements a Block Cache and a Row Cache in memory to avoid random reads of baseline data.
When the incremental data in memory reaches a certain size, OceanBase Database triggers compaction between the incremental data and the baseline data and writes the incremental data to disk. In addition, the system automatically performs a daily major compaction during idle periods at night.
OceanBase Database is essentially a baseline-plus-incremental storage engine. While retaining the advantages of the LSM-Tree architecture, it also draws on some strengths of traditional relational database storage engines.
Traditional databases divide data into many pages. OceanBase Database also adopts this idea by splitting data files into macroblocks with a basic granularity of 2 MB. Each macroblock is further divided into multiple variable-length microblocks. During compaction, data is reused at the macroblock granularity. Macroblocks that contain no updated data are not reopened or read again. This minimizes write amplification during compaction and significantly reduces compaction costs compared with traditional LSM-Tree databases.
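The macroblock reuse described above can be illustrated with a minimal Python sketch. The `MacroBlock` class and the `compact` function are hypothetical names introduced for illustration, not OceanBase internals, and the sketch assumes that updates only touch keys that fall inside existing macroblock key ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MacroBlock:
    """Hypothetical read-only macroblock: a sorted, non-empty tuple of (key, value) rows."""
    rows: tuple

    @property
    def key_range(self):
        return self.rows[0][0], self.rows[-1][0]


def compact(baseline, updates):
    """Merge `updates` (key -> value) into `baseline` (list of MacroBlock).

    Macroblocks whose key range contains no updated key are reused as-is.
    Returns (new_baseline, number_of_reused_macroblocks).
    """
    new_blocks, reused = [], 0
    for block in baseline:
        low, high = block.key_range
        touched = {k: v for k, v in updates.items() if low <= k <= high}
        if not touched:
            new_blocks.append(block)  # reuse without reading or rewriting
            reused += 1
            continue
        merged = dict(block.rows)
        merged.update(touched)  # incremental data overrides baseline data
        new_blocks.append(MacroBlock(tuple(sorted(merged.items()))))
    return new_blocks, reused


baseline = [
    MacroBlock(((1, "a"), (2, "b"))),
    MacroBlock(((3, "c"), (4, "d"))),
    MacroBlock(((5, "e"), (6, "f"))),
]
new_blocks, reused = compact(baseline, {4: "D"})
print(reused)  # number of macroblocks carried over unchanged
```

In this sketch only the middle macroblock is rewritten; the other two are carried into the new baseline by reference, which is the source of the reduced write amplification.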
Because OceanBase Database adopts a baseline-plus-incremental design, some data resides in the baseline layer and some in the incremental layer. In principle, each query needs to read both the baseline data and the incremental data. To reduce this cost, OceanBase Database provides many optimizations, especially for single-row queries. In addition to caching data blocks, OceanBase Database also caches rows, which greatly accelerates single-row queries. For "empty queries", where the requested row does not exist, OceanBase Database builds Bloom filters and caches them. Most OLTP workloads consist mainly of small queries. Through small-query optimization, OceanBase Database avoids the overhead that traditional databases incur by parsing an entire data block, and achieves performance close to that of an in-memory database. In addition, because baseline data is read-only and stored internally in a contiguous format, OceanBase Database can use more aggressive compression algorithms. This provides a high compression ratio without affecting query performance and greatly reduces storage costs.
By drawing on some strengths of classic databases, OceanBase Database provides a more general-purpose relational database storage engine based on the LSM-Tree architecture. It has the following features:
Low cost: Taking advantage of the LSM-Tree characteristic that written data is no longer updated in place, OceanBase Database combines its self-developed hybrid row-column encoding with general-purpose compression algorithms. As a result, its data storage compression ratio can be more than 10 times higher than that of traditional databases.
Ease of use: Unlike other LSM-Tree databases, OceanBase Database supports persisting active transactions to disk to ensure that large transactions and long-running transactions can run normally or be rolled back. Its multi-level major and minor compaction mechanisms help users achieve a better balance between performance and storage space.
High performance: For common point queries, OceanBase Database provides multi-level cache acceleration to ensure extremely low response latency. For range scans, the storage engine can use data encoding characteristics to support predicate pushdown, and provides native vectorized execution support.
High reliability: In addition to end-to-end data verification, OceanBase Database uses the advantages of its native distributed architecture. During global major compaction, it ensures the correctness of user data through multi-replica comparison and verification between primary tables and index tables. It also provides background threads that periodically scan data to prevent silent errors.
Storage engine features
In terms of functional modules, the storage engine of OceanBase Database can be broadly divided into the following parts.
Data storage
Data organization
Like other LSM-Tree databases, OceanBase Database divides data into two layers: in-memory incremental data, namely the MemTable, and persistent static data, namely the SSTable. An SSTable is read-only and is no longer modified after it is generated. It is stored on disk. A MemTable supports both reads and writes and resides in memory. DML operations, such as inserts, updates, and deletes, are first written to the MemTable. When the MemTable reaches a certain size, it is written to disk as an SSTable.
In OceanBase Database, SSTables are further divided into three types: Mini SSTable, Minor SSTable, and Major SSTable. A MemTable is written to a Mini SSTable on disk through Mini Compaction. When the number of Mini SSTables reaches a certain threshold, Minor Compaction is triggered to generate Mini SSTables or Minor SSTables. When OceanBase Database’s unique daily major compaction starts, each partition’s original baseline SSTable, namely the Major SSTable, is merged with all Mini SSTables and Minor SSTables into a new Major SSTable.
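The three SSTable types and the compactions that produce them can be sketched in Python as follows. This is a simplified model under stated assumptions: the `Partition` class, its method names, and the `memtable_limit` and `mini_limit` thresholds are hypothetical, tables are modeled as plain dictionaries, and minor compaction always produces a single Minor SSTable.

```python
class Partition:
    """Toy model of one partition's MemTable and SSTable tiers."""

    def __init__(self, memtable_limit=2, mini_limit=2):
        self.memtable_limit = memtable_limit  # assumed row-count threshold
        self.mini_limit = mini_limit          # assumed Mini SSTable count threshold
        self.memtable = {}
        self.mini = []    # Mini SSTables, oldest first
        self.minor = []   # Minor SSTables, oldest first
        self.major = {}   # baseline Major SSTable

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.mini_compaction()

    def mini_compaction(self):
        # Freeze the MemTable and persist it as a Mini SSTable.
        self.mini.append(dict(self.memtable))
        self.memtable = {}
        if len(self.mini) >= self.mini_limit:
            self.minor_compaction()

    def minor_compaction(self):
        # Merge all Mini SSTables, letting newer values win.
        merged = {}
        for table in self.mini:
            merged.update(table)
        self.minor.append(merged)
        self.mini = []

    def major_compaction(self):
        # Merge the baseline with every Minor and Mini SSTable, oldest first.
        new_major = dict(self.major)
        for table in self.minor + self.mini:
            new_major.update(table)
        self.major, self.minor, self.mini = new_major, [], []


p = Partition()
for key, value in [("k1", 1), ("k2", 2), ("k3", 3), ("k4", 4), ("k1", 9), ("k5", 5)]:
    p.write(key, value)
p.major_compaction()
print(p.major)
```

Applying the tables from oldest to newest is what lets the later write to `k1` override the earlier one in the new baseline.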
Storage structure
In OceanBase Database, the basic storage unit of each partition is an SSTable, and the basic granularity of all storage is the macroblock. When the database starts, it divides the entire data file into fixed-length 2 MB macroblocks. Each SSTable is essentially a collection of multiple macroblocks.
Each macroblock is further divided into multiple microblocks. The concept of a microblock is similar to the page/block concept in traditional databases. However, with the help of LSM-Tree characteristics, microblocks in OceanBase Database are compressed and variable-length. The uncompressed size of a microblock can be specified by the block_size parameter when a table is created.
Depending on the storage format specified by the user, microblocks can be stored in either encoding format or flat format. In an encoding-format microblock, internal data is stored in a hybrid row-column format. In a flat-format microblock, all data rows are stored in a flat layout.
Compression and encoding
OceanBase Database encodes and compresses data in microblocks according to the mode specified for the user table. When encoding is enabled for a user table, data in each microblock is encoded by column. Encoding rules include dictionary encoding, run-length encoding, constant encoding, and delta encoding. After each column is compressed, additional inter-column encoding rules, such as equality and substring encoding across multiple columns, are applied. Encoding not only greatly reduces data size; the per-column feature information extracted during encoding can also accelerate subsequent queries.
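To make the four per-column rules concrete, here is a small Python sketch of their textbook forms. The function names and output layouts are assumptions for illustration only and do not reflect the on-disk microblock format of OceanBase Database.

```python
from collections import Counter


def dict_encode(column):
    """Dictionary encoding: store distinct values once, then small references."""
    dictionary = sorted(set(column))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in column]


def rle_encode(column):
    """Run-length encoding: collapse consecutive repeats into (value, count)."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def const_encode(column):
    """Constant encoding: one dominant value plus a map of exceptions."""
    constant, _ = Counter(column).most_common(1)[0]
    exceptions = {i: v for i, v in enumerate(column) if v != constant}
    return constant, exceptions


def delta_encode(column):
    """Delta encoding: a base value plus small non-negative offsets."""
    base = min(column)
    return base, [value - base for value in column]


print(dict_encode(["cn", "us", "cn"]))
print(rle_encode([1, 1, 1, 2, 2]))
print(const_encode([0, 0, 7, 0]))
print(delta_encode([1000, 1003, 1001]))
```

Each rule suits a different data shape: low-cardinality columns favor dictionary encoding, sorted or repetitive columns favor run-length encoding, nearly uniform columns favor constant encoding, and tightly clustered numeric columns favor delta encoding.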
After encoding and compression, OceanBase Database also supports applying a user-specified general-purpose lossless compression algorithm to microblock data to further improve the data compression ratio.
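The second, general-purpose compression stage can be sketched with Python's standard `zlib` module. Here `zlib` is only a stand-in for whichever algorithm the user specifies, JSON serialization stands in for the encoded microblock layout, and `pack_microblock` and `unpack_microblock` are hypothetical helper names.

```python
import json
import zlib


def pack_microblock(rows, level=6):
    """Serialize rows, then apply lossless general-purpose compression."""
    encoded = json.dumps(rows, separators=(",", ":")).encode("utf-8")
    return zlib.compress(encoded, level)


def unpack_microblock(blob):
    """Reverse pack_microblock: decompress, then deserialize."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


rows = [[i, "active"] for i in range(200)]
blob = pack_microblock(rows)
print(len(json.dumps(rows)), "->", len(blob), "bytes")
```

Because the compression is lossless, `unpack_microblock(blob)` returns exactly the rows that were packed.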
Minor and major compaction
Minor compaction
Minor compaction includes two processes: mini compaction and minor compaction. When the size of the MemTable in memory exceeds a certain threshold, the data in the MemTable needs to be written to a mini SSTable on disk to release memory. This process is called mini compaction. As user data continues to be written, the number of mini SSTables keeps increasing. When the number of mini SSTables exceeds a certain threshold, minor compaction is automatically triggered in the background.
Major compaction
Major compaction in OceanBase Database is slightly different from compaction in other LSM-Tree databases. As the term "daily major compaction" suggests, the original goal was to perform a cluster-wide compaction once a day at around 2:00 a.m. Major compaction is generally scheduled by the Root Service (RS) of each tenant based on the write status or user settings. Each major compaction of a tenant selects a global snapshot point, and all partitions in the tenant perform major compaction using the data at this snapshot point. In this way, each major compaction generates the corresponding SSTables for all tenant data based on the same unified snapshot point. This mechanism not only helps users periodically consolidate incremental data and improve read performance, but also provides a natural data verification point. With a globally consistent point, OceanBase Database can internally perform multi-dimensional physical data verification across multiple replicas and between primary tables and index tables.
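The role of the global snapshot point can be sketched in Python. In this simplified model, every row version carries an integer version number, and `major_compact` is a hypothetical function that keeps, for each key, the newest version that is not later than the snapshot.

```python
def major_compact(row_versions, snapshot):
    """Build baseline data for one partition as of `snapshot`.

    row_versions: iterable of (key, version, value) tuples in any order.
    Returns a dict mapping key -> value visible at the snapshot.
    """
    latest = {}
    for key, version, value in row_versions:
        if version > snapshot:
            continue  # written after the snapshot; stays incremental
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return {key: value for key, (_, value) in latest.items()}


partitions = {
    "p1": [("a", 5, "x"), ("a", 12, "y")],
    "p2": [("b", 9, "z"), ("c", 11, "w")],
}
snapshot = 10  # one snapshot point shared by every partition of the tenant
baseline = {name: major_compact(rows, snapshot) for name, rows in partitions.items()}
print(baseline)
```

Because every partition uses the same snapshot value, the resulting baselines describe one consistent point in time, which is what makes cross-replica and table-versus-index comparison meaningful.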
Query, read, and write operations
Insert
In OceanBase Database, all data tables can be regarded as index-clustered tables. Even for heap tables without a primary key, OceanBase Database internally maintains a hidden primary key for them. Therefore, when a user inserts data, before writing new user data into the MemTable, OceanBase Database needs to check whether data with the same primary key already exists in the current table. To accelerate this duplicate primary key check, a background thread asynchronously schedules Bloom filter construction for each SSTable based on the duplicate-check frequency of different macroblocks.
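How a Bloom filter speeds up the duplicate primary key check can be shown with a minimal Python sketch. The `BloomFilter` class, its sizing parameters, and `primary_key_exists` are illustrative assumptions, not the structures that OceanBase Database builds internally.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def primary_key_exists(key, sstable_rows, bloom):
    """Check for a duplicate key, reading the SSTable only when necessary."""
    if not bloom.might_contain(key):
        return False  # definitely absent, so the SSTable read is skipped
    return key in sstable_rows  # possible hit: confirm against real data


sstable_rows = {pk: f"row-{pk}" for pk in range(1, 51)}
bloom = BloomFilter()
for pk in sstable_rows:
    bloom.add(pk)
print(primary_key_exists(7, sstable_rows, bloom))
```

For inserts of new keys, which is the common case, most checks end at the filter and never touch the SSTable.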
Update
As an LSM-Tree database, each update in OceanBase Database also inserts a new row of data. Unlike the Clog, updated data written into the MemTable contains only the new values of the updated columns and the corresponding primary key columns. In other words, the updated row does not necessarily contain data for all columns of the table. During continuous background compaction, these incremental updates are gradually merged together to accelerate user queries.
Delete
Similar to updates, delete operations do not directly act on the original data. Instead, OceanBase Database writes a row containing the primary key of the row to be deleted and uses a row-header marker to indicate the delete operation. Large numbers of delete operations are costly for LSM-Tree databases: even after a data range has been completely deleted, the database still needs to iterate through all delete-marker rows in that range and complete the merge before it can confirm the deleted state. To address this scenario, OceanBase Database provides internal range-delete marker logic. It also allows users to explicitly specify the table mode, so that these deleted rows can be reclaimed earlier through special minor and major compaction methods to accelerate queries.
Query
Because of the incremental update strategy, when querying each row of data, OceanBase Database needs to traverse all MemTables and SSTables by version from newest to oldest, merge the data with the corresponding primary key in each table, and return the result. During data access, caches are used as needed for acceleration. For large-query scenarios, the SQL layer pushes down filter conditions to the storage layer. The storage layer uses stored data characteristics for fast low-level filtering, and supports batch computation and result return in vectorized scenarios.
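The newest-to-oldest merge of a single row can be sketched in Python as follows. The sketch models each table as a dictionary from primary key to a partial row, uses a hypothetical `DELETE` sentinel in place of the row-header delete marker, and ignores transaction visibility and caching.

```python
DELETE = object()  # hypothetical stand-in for the row-header delete marker


def fuse_row(key, tables, num_columns):
    """Merge one row across `tables`, which are ordered newest first.

    Each table maps key -> dict of column values (possibly partial) or DELETE.
    Returns the fused row, or None if the row does not exist.
    """
    fused = {}
    for table in tables:
        entry = table.get(key)
        if entry is None:
            continue  # this table holds no version of the row
        if entry is DELETE:
            break  # older versions are hidden by the delete
        for column, value in entry.items():
            fused.setdefault(column, value)  # keep the newest value per column
        if len(fused) == num_columns:
            break  # every column is resolved, so older tables are skipped
    return fused or None


memtable = {1: {"id": 1, "balance": 70}}
minor_sstable = {1: {"id": 1, "name": "bob"}, 2: DELETE}
major_sstable = {
    1: {"id": 1, "name": "alice", "balance": 100},
    2: {"id": 2, "name": "carol", "balance": 5},
}
tables = [memtable, minor_sstable, major_sstable]
print(fuse_row(1, tables, num_columns=3))
print(fuse_row(2, tables, num_columns=3))
```

The early exit once all columns are resolved shows why compacting partial updates together reduces the number of tables a query has to visit.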
Multi-level cache
To improve performance, OceanBase Database supports a multi-level cache system. It provides Block Cache for data microblocks, Row Cache for each SSTable, Fuse Row Cache for merged query results, and Bloom Filter Cache for empty-checks during inserts. All caches under the same tenant share memory. When MemTable writes are too fast, memory can be flexibly reclaimed from existing cache objects for write operations.
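The shared-memory behavior can be sketched in Python. The `TenantMemory` class below is a hypothetical model: it assumes a single byte budget per tenant and a least-recently-used eviction order across all caches, neither of which is confirmed by the description above.

```python
from collections import OrderedDict


class TenantMemory:
    """Toy model of caches and MemTable writes sharing one tenant budget."""

    def __init__(self, limit):
        self.limit = limit
        self.memtable_bytes = 0
        self.cache = OrderedDict()  # (cache_name, key) -> size, oldest first

    def cache_bytes(self):
        return sum(self.cache.values())

    def cache_put(self, cache_name, key, size):
        entry = (cache_name, key)
        self.cache[entry] = size
        self.cache.move_to_end(entry)  # mark as most recently used
        self._evict_until_fits()

    def memtable_write(self, size):
        self.memtable_bytes += size
        self._evict_until_fits()  # reclaim cache memory for the write

    def _evict_until_fits(self):
        while self.cache and self.memtable_bytes + self.cache_bytes() > self.limit:
            self.cache.popitem(last=False)  # drop the least recently used entry


tenant = TenantMemory(limit=100)
tenant.cache_put("block_cache", "microblock-1", 40)
tenant.cache_put("row_cache", "row-1", 40)
tenant.memtable_write(50)
print(list(tenant.cache))
```

In this run the write pushes total usage past the budget, so the oldest cache entry is evicted while the MemTable data is kept.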
Data verification
As a financial-grade relational database, OceanBase Database has always placed data quality and security first. Every component involving persisted data throughout the data lifecycle is protected by data verification. At the same time, by leveraging the inherent advantages of multi-replica storage, OceanBase Database also adds inter-replica data verification to further verify overall data consistency.
Logical verification
In common deployment modes, each user table in OceanBase Database has multiple replicas. During a tenant’s daily major compaction, all replicas generate consistent baseline data based on the globally unified snapshot version. Based on this characteristic, OceanBase Database compares the data checksums of all replicas after major compaction is completed to ensure that they are completely consistent. Furthermore, based on the indexes of user tables, OceanBase Database also compares checksums of indexed columns to ensure that the data finally returned to users is not incorrect due to internal program issues.
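Both comparisons can be sketched in Python with an order-independent column checksum. The checksum construction (a wrapped sum of CRC32 values) and the function names are assumptions chosen for illustration, not the checksum algorithm used by OceanBase Database.

```python
import zlib


def column_checksum(values):
    """Order-independent checksum: sum of per-value CRC32s, wrapped to 32 bits."""
    total = 0
    for value in values:
        total = (total + zlib.crc32(repr(value).encode("utf-8"))) & 0xFFFFFFFF
    return total


def replicas_consistent(replicas, column):
    """True if every replica yields the same checksum for `column`."""
    checksums = {column_checksum(row[column] for row in rows) for rows in replicas}
    return len(checksums) == 1


def index_consistent(table_rows, index_rows, column):
    """True if the indexed column matches between the table and its index."""
    return column_checksum(r[column] for r in table_rows) == column_checksum(
        r[column] for r in index_rows
    )


table = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
index = [{"name": "bob", "id": 2}, {"name": "alice", "id": 1}]  # index order
print(replicas_consistent([table, list(table)], "name"))
print(index_consistent(table, index, "name"))
```

An order-independent checksum matters for the index comparison because an index stores the same column values in a different row order than the primary table.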
Physical verification
For data storage, OceanBase Database records checksums at every level, starting from the microblock, which is the smallest I/O granularity for data storage, and continuing through the macroblock, SSTable, and partition levels. Data is verified every time it is read. To guard against issues caused by the underlying storage hardware, OceanBase Database immediately revalidates data macroblocks after writing them during minor or major compaction. Finally, each server has a background data inspection thread that periodically scans and verifies all data to detect silent disk errors early.
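Verify-on-read and layered checksums can be sketched in Python with CRC32 from the standard `zlib` module. The block layout and the function names are hypothetical, and CRC32 is only a stand-in for whatever checksum the storage engine actually records.

```python
import zlib


def write_microblock(payload: bytes) -> dict:
    """Store the payload together with its checksum."""
    return {"payload": payload, "checksum": zlib.crc32(payload)}


def read_microblock(block: dict) -> bytes:
    """Verify the checksum on every read before returning data."""
    if zlib.crc32(block["payload"]) != block["checksum"]:
        raise IOError("microblock checksum mismatch")
    return block["payload"]


def macroblock_checksum(microblocks) -> int:
    """Fold the microblock checksums into one macroblock-level checksum."""
    checksum = 0
    for block in microblocks:
        checksum = zlib.crc32(block["checksum"].to_bytes(4, "big"), checksum)
    return checksum


block = write_microblock(b"row-1|row-2")
print(read_microblock(block))

corrupted = dict(block, payload=b"row-1|row-3")  # simulate a silent disk error
try:
    read_microblock(corrupted)
except IOError as error:
    print("detected:", error)
```

The same folding step could be repeated to derive SSTable-level and partition-level checksums from the level below.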
