
The storage engine of OceanBase Database is built on the LSM-tree architecture, which stores data as static baseline data in SSTables and dynamic incremental data in MemTables. SSTables are read-only and stored on disk, while MemTables support both reads and writes and are stored in memory. During a DML operation (insert, update, or delete), the data changes are first written into the MemTable. When the MemTable reaches a specified size, its data is flushed to the disk to become an SSTable. During a query, the storage engine reads both the SSTables and the MemTable, merges the query results, and returns the merged results to the SQL layer. To avoid random reads of baseline data, the storage engine maintains a block cache and a row cache in memory.
When the incremental data in memory reaches a specified size, it is merged with the baseline data and written to the disk. In addition, the system automatically performs a daily major compaction during off-peak hours at night.
The storage engine of OceanBase Database is a baseline plus incremental storage engine that combines the advantages of the LSM-tree architecture and conventional relational database storage engines.
Conventional databases store data in pages. OceanBase Database adopts a similar approach by dividing data files into 2 MB macroblocks, which are further divided into variable-length microblocks. During a major compaction, data is reused at the macroblock level: macroblocks whose data has not been updated are reused directly without being reopened or rewritten. This reduces write amplification and makes a major compaction significantly cheaper than in a conventional LSM-tree database.
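The sketch below illustrates this macroblock-level reuse in concept only; the class, the helper functions, and the overlap test are illustrative assumptions and do not reflect OceanBase Database's actual implementation.

```python
# Illustrative sketch of macroblock-level reuse during a major compaction.
# Names and structures are hypothetical, not OceanBase internals.
from dataclasses import dataclass

MACROBLOCK_SIZE = 2 * 1024 * 1024  # macroblocks are fixed at 2 MB

@dataclass
class Macroblock:
    block_id: int
    key_range: tuple        # (min_key, max_key) covered by this macroblock
    data: bytes

def compact_partition(baseline: list, updated_ranges: list) -> list:
    """Build the new baseline, rewriting only macroblocks whose key range
    intersects updated data; untouched macroblocks are reused as-is."""
    new_baseline = []
    for block in baseline:
        if any(overlaps(block.key_range, r) for r in updated_ranges):
            new_baseline.append(rewrite_with_increments(block))  # real rewrite
        else:
            new_baseline.append(block)  # reused: no read or rewrite needed
    return new_baseline

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def rewrite_with_increments(block):
    # Placeholder: merge incremental rows into the block and write a new one.
    return Macroblock(block.block_id, block.key_range, block.data)
```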
OceanBase Database stores baseline data and incremental data separately, so every query needs to read both. To optimize such queries, OceanBase Database applies optimizations not only at the data block level but also at the row level: in addition to caching data blocks, it also caches rows, which significantly accelerates single-row queries. For "empty queries" that return no rows, a Bloom filter is constructed and cached to quickly determine whether the queried data exists. Most operations in OLTP workloads are small queries, and this row-level optimization avoids the overhead that conventional databases incur when parsing entire data blocks, achieving performance comparable to that of in-memory databases. In addition, because baseline data is read-only and stored contiguously, OceanBase Database can apply more aggressive compression to it without compromising query performance, which significantly reduces storage costs.
By integrating the advantages of conventional databases and the LSM-tree architecture, OceanBase Database provides a more general relational database storage engine and has the following characteristics:
Cost-effective. Leveraging the fact that data written to disk in the LSM-tree architecture is never updated in place, OceanBase Database applies self-developed hybrid row-column encoding and general-purpose compression algorithms to achieve a compression ratio 10 times higher than that of conventional databases.
User-friendly. Unlike other LSM-tree databases, OceanBase Database ensures the performance of large or long transactions by allowing writes and rollbacks for active transactions during a major compaction, and uses multi-level minor and major compactions to help users strike an optimal balance between performance and space.
High performance. For common point queries, the storage engine uses multi-level caches to ensure extremely low response latency. For range scans, it leverages data encoding features to push query filter conditions down to the encoding layer and provides native vectorization support.
High reliability. In addition to end-to-end data verification, the storage engine takes advantage of native distributed features to verify the correctness of user data by comparing replicas during a major compaction and by comparing the primary and index tables. The storage engine also uses a background thread to periodically scan data to preemptively identify and correct silent data errors.
Components of the storage engine
The storage engine of OceanBase Database can be divided into the following components based on their features.
Data storage
Data organization
Like other LSM-tree databases, OceanBase Database stores data in incremental MemTables and static SSTables. MemTables are writable and stored in memory, while SSTables are read-only and stored on disk. DML operations such as INSERT, UPDATE, and DELETE are first written to MemTables. Once a MemTable reaches a certain size, it is flushed to the disk to become an SSTable.
In OceanBase Database, SSTables are further classified into mini SSTables, minor SSTables, and major SSTables. MemTables can be flushed to mini SSTables, and multiple mini SSTables can be merged into a minor SSTable or another mini SSTable. At the start of the daily major compaction in OceanBase Database, all minor SSTables and the baseline major SSTable in each partition are merged into a new major SSTable.
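A minimal sketch of how these tiers relate to one another follows; the types, the row representation, and the merge logic are simplifications for illustration and are not OceanBase Database's actual data structures.

```python
# Simplified model of the SSTable tiers and the transitions between them.
from enum import Enum, auto

class TableType(Enum):
    MEMTABLE = auto()
    MINI_SSTABLE = auto()    # produced by flushing a frozen MemTable
    MINOR_SSTABLE = auto()   # produced by merging several mini SSTables
    MAJOR_SSTABLE = auto()   # baseline produced by the daily major compaction

def flush(memtable):
    """MemTable -> mini SSTable."""
    return {"type": TableType.MINI_SSTABLE, "rows": sorted(memtable["rows"])}

def merge_minis(minis):
    """Several mini SSTables -> one minor SSTable (or a larger mini SSTable)."""
    rows = sorted(r for t in minis for r in t["rows"])
    return {"type": TableType.MINOR_SSTABLE, "rows": rows}

def major_compact(major, minors):
    """Baseline major SSTable + all minor SSTables -> new major SSTable."""
    rows = sorted(set(major["rows"]) | {r for t in minors for r in t["rows"]})
    return {"type": TableType.MAJOR_SSTABLE, "rows": rows}
```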
Storage structure
In OceanBase Database, the basic storage unit of each partition is an SSTable, and the basic storage granularity is the macroblock. When the database starts, the data file is divided into macroblocks of a fixed size of 2 MB. Each SSTable is a collection of macroblocks.
Macroblocks are further divided into microblocks. The concept of microblocks is similar to that of pages or blocks in a traditional database. However, microblocks in OceanBase Database are variable-length and can be compressed. The size of a microblock can be specified by using the block_size parameter when a table is created. Microblocks can be stored in the encoding format or the flat format, depending on the specified storage format. In an encoded microblock, data is stored in a hybrid row-column mode; in a flat microblock, data rows are stored in a flattened manner.
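The following sketch models this layout: fixed-size macroblocks holding variable-length microblocks. The field names, the 16 KB default, and the fill logic are assumptions made for illustration, not OceanBase Database's on-disk format.

```python
# Illustrative layout model: a fixed-size macroblock holding variable-length
# microblocks.
from dataclasses import dataclass, field
from typing import List

MACROBLOCK_SIZE = 2 * 1024 * 1024   # fixed 2 MB
DEFAULT_BLOCK_SIZE = 16 * 1024      # assumed target microblock size before
                                    # compression; set per table via block_size

@dataclass
class Microblock:
    fmt: str                 # "encoded" (hybrid row-column) or "flat" (row by row)
    compressed: bytes        # variable length after encoding/compression

@dataclass
class Macroblock:
    microblocks: List[Microblock] = field(default_factory=list)

    def try_add(self, mb: Microblock) -> bool:
        used = sum(len(m.compressed) for m in self.microblocks)
        if used + len(mb.compressed) > MACROBLOCK_SIZE:
            return False     # macroblock is full; start a new 2 MB block
        self.microblocks.append(mb)
        return True
```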
Compression and encoding
OceanBase Database applies compression and encoding to the data in microblocks based on the compression and encoding settings specified for the user table. When encoding is enabled for a table, the data in each microblock is encoded column by column, using encoding rules such as dictionary, run-length, constant, and delta encoding. After the data within a column is encoded, equal-value or substring encoding can also be applied across columns. Encoding not only compresses the data significantly but also accelerates subsequent queries, because column features are extracted during encoding.
After encoding, OceanBase Database allows you to further compress the data in microblocks by using a general-purpose lossless compression algorithm that you specify, thereby improving the compression ratio.
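The sketch below shows two of the per-column encodings mentioned above (dictionary and run-length) followed by a general-purpose compression pass. zlib stands in for whichever lossless algorithm is configured; none of this is OceanBase Database's actual codec.

```python
# Sketch: per-column encoding, then a general-purpose compressor.
import zlib

def dictionary_encode(column):
    """Replace repeated values with small integer codes plus a dictionary."""
    dictionary, codes, index = [], [], {}
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse runs of identical codes into [value, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

column = ["cn", "cn", "cn", "us", "us", "cn"]
dictionary, codes = dictionary_encode(column)   # ['cn', 'us'], [0, 0, 0, 1, 1, 0]
runs = run_length_encode(codes)                 # [[0, 3], [1, 2], [0, 1]]
microblock = zlib.compress(repr((dictionary, runs)).encode())  # final pass
```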
Minor and major compactions
Minor compaction
A minor compaction is the process of flushing data from MemTables to mini SSTables on the disk to release the memory space. When the size of data in MemTables exceeds the specified threshold, a minor compaction is initiated to flush the data into a mini SSTable. As user data is written, the number of mini SSTables increases. When the number of mini SSTables exceeds the specified threshold, a minor compaction is automatically triggered in the background.
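The trigger logic can be pictured as below; the threshold names and values are illustrative assumptions, not OceanBase Database parameters.

```python
# Hypothetical trigger logic for minor compactions: flush a MemTable once it
# exceeds a size threshold, and merge mini SSTables once there are too many.
MEMTABLE_FLUSH_THRESHOLD = 256 * 1024 * 1024   # assumed size trigger
MAX_MINI_SSTABLE_COUNT = 8                     # assumed count trigger

def maybe_schedule_minor_compaction(memtable_bytes, mini_sstables):
    tasks = []
    if memtable_bytes >= MEMTABLE_FLUSH_THRESHOLD:
        # Freeze the active MemTable and flush it into a new mini SSTable.
        tasks.append("flush_memtable_to_mini_sstable")
    if len(mini_sstables) >= MAX_MINI_SSTABLE_COUNT:
        # Merge mini SSTables in the background to keep the read path short.
        tasks.append("merge_mini_sstables")
    return tasks
```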
Major compaction
A major compaction, also known as a daily major compaction in OceanBase Database, is different from the major compaction in other LSM-tree databases. As the name suggests, it was originally designed as a whole-cluster compaction performed at around 2:00 a.m. every day. A major compaction is initiated by the root service of each tenant based on the write status or a schedule set by the user. Each major compaction selects a global snapshot point and then compacts all partitions in the tenant based on the data at that snapshot point, generating baseline SSTables for all data in the tenant at the same snapshot. This mechanism lets users regularly consolidate incremental data to improve read performance, and it also provides a natural data verification point: based on the globally consistent snapshot, OceanBase Database can perform multi-dimensional physical data verification across replicas and between primary and index tables.
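The following is a conceptual sketch of a tenant-level major compaction driven by one global snapshot version. The classes, the row representation, and the CRC32 checksum are simplifications chosen for illustration; they are not how OceanBase Database's root service or checksums are implemented.

```python
# Conceptual sketch: compact every partition against the same snapshot version
# and record a checksum per partition for later cross-replica comparison.
import zlib

class Partition:
    def __init__(self, pid, baseline_rows, incremental_rows):
        self.pid = pid
        self.baseline_rows = baseline_rows          # {key: (version, value)}
        self.incremental_rows = incremental_rows    # {key: (version, value)}

    def compact_to(self, snapshot_version):
        """Merge baseline rows with increments visible at the snapshot version."""
        merged = dict(self.baseline_rows)
        for key, (ver, val) in self.incremental_rows.items():
            if ver <= snapshot_version:
                merged[key] = (ver, val)
        return merged

def run_major_compaction(partitions, snapshot_version):
    checksums = {}
    for p in partitions:
        new_major = p.compact_to(snapshot_version)
        # Checksums are later compared across replicas (see "Data verification").
        checksums[p.pid] = zlib.crc32(repr(sorted(new_major.items())).encode())
    return checksums
```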
Queries and writes
Insert
In OceanBase Database, all data tables, including heap tables without primary keys, are organized as index-clustered tables. Therefore, when data is inserted, the system checks whether the table already contains a row with the same primary key before writing the new data into the MemTable. To speed up this repeated primary key check, the system asynchronously builds a Bloom filter for each SSTable in a background thread, with different recheck frequencies at the macroblock level.
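The sketch below shows the idea of the duplicate-key check with a per-SSTable Bloom filter that lets the insert path skip SSTables that definitely do not contain the key. The hash scheme and data structures are illustrative only.

```python
# Sketch of the duplicate-primary-key check on insert.
import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 16, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.nbits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def primary_key_exists(key, memtable, sstables):
    """Return True if `key` is already present in the MemTable or any SSTable."""
    if key in memtable:
        return True
    for sst in sstables:                       # newest to oldest
        if sst["bloom"] is not None and not sst["bloom"].may_contain(key):
            continue                           # definitely absent; skip the read
        if key in sst["rows"]:                 # otherwise do the real lookup
            return True
    return False
```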
Update
As an LSM-tree database, OceanBase Database inserts a new row of data for each update. The data updated in the MemTable includes only the new values of the updated columns and the primary key of the updated row. Therefore, an updated row does not necessarily contain the data of all columns of the table. During ongoing background compactions, incremental updates are merged to accelerate queries.
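A minimal sketch of how such partial update rows are fused with the base row at read or compaction time follows; the row layout is illustrative, not OceanBase Database's storage format.

```python
# Sketch: fuse delta rows that carry only the primary key and changed columns.
def fuse_row(base_row, deltas):
    """Apply delta rows, oldest first, on top of the base row."""
    fused = dict(base_row)
    for delta in deltas:
        fused.update(delta)
    return fused

base = {"id": 42, "name": "alice", "balance": 100, "city": "sh"}
deltas = [{"id": 42, "balance": 80}, {"id": 42, "city": "hz"}]
print(fuse_row(base, deltas))
# {'id': 42, 'name': 'alice', 'balance': 80, 'city': 'hz'}
```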
Delete
Similar to updates, deletions are not applied to the original data in place. Instead, a row containing the primary key is written into the table, and the deletion is marked in its row header. Large numbers of deletions are unfriendly to an LSM-tree database: even if all data in a range has been deleted, the database still needs to iterate over every delete-marked row in that range and perform a compaction before it can confirm the deletion status. To address this issue, OceanBase Database has a built-in range deletion marking mechanism. Additionally, you can explicitly specify a table mode to enable explicit row deletion and accelerate queries through a special minor compaction mechanism.
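The sketch below illustrates why per-row delete marks are costly for range scans and how a range-level delete mark can short-circuit them. It is a generic LSM-tree illustration, not OceanBase Database's exact mechanism.

```python
# Per-row delete marks versus a range-level delete mark.
def scan_range(rows, lo, hi, range_delete_marks):
    # If the whole range is covered by a range-delete mark, skip it outright.
    if any(m_lo <= lo and hi <= m_hi for m_lo, m_hi in range_delete_marks):
        return []
    # Otherwise every row in the range, including delete-marked ones, must be
    # iterated so that deleted keys can be filtered out.
    return [(k, v) for k, (deleted, v) in sorted(rows.items())
            if lo <= k <= hi and not deleted]
```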
Query
When you query a specific row of data, the system traverses all MemTables and SSTables from newest to oldest by version and fuses the data that matches the queried key in each table. During data access, the system uses caches to accelerate reads. For large queries, the SQL layer pushes filter conditions down to the storage layer, which uses data features for fast filtering at the lowest level and supports batched computation and result return in vectorized execution.
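A single-row lookup can be pictured as below: walk the tables from newest to oldest and fuse the fragments that match the key, stopping once the row is complete or a delete mark is reached. The structures and the delete convention are illustrative assumptions.

```python
# Sketch: fuse row fragments across MemTables and SSTables, newest first.
def get_row(key, tables_new_to_old, all_columns):
    fused = {}
    for table in tables_new_to_old:            # MemTables first, then SSTables
        fragment = table.get(key)
        if fragment is None:
            continue
        if fragment.get("__deleted__"):
            break                              # older versions are invisible
        for col, val in fragment.items():
            fused.setdefault(col, val)         # newer values win
        if len(fused) == len(all_columns):
            break                              # row fully assembled; stop early
    return fused or None
```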
Multi-level cache
To enhance performance, OceanBase Database provides a multi-level cache system: a block cache for the microblocks read by queries, a row cache for each SSTable, a fuse row cache for fused query results, and a Bloom filter cache for empty checks during insertion. All caches in the same tenant share memory. When MemTables are written too quickly, the system can flexibly reclaim memory from the caches for writes.
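A hypothetical read path through these caches is sketched below: the fuse row cache is checked first, then each SSTable's row cache, then the block cache, and finally disk. The cache names match the text above, but the lookup logic and keying are simplifications, not OceanBase Database's implementation.

```python
# Sketch of a cached single-row read.
def read_row(key, fuse_row_cache, row_caches, block_cache, read_from_disk):
    row = fuse_row_cache.get(key)
    if row is not None:
        return row                            # fused result of a previous query
    fragments = []
    for sstable_id, row_cache in row_caches:  # one row cache per SSTable
        fragment = row_cache.get(key)
        if fragment is None:
            # Fall back to the microblock that covers the key:
            # try the block cache first, then read from disk.
            block = block_cache.get((sstable_id, key))
            if block is None:
                block = read_from_disk(sstable_id, key)
                block_cache[(sstable_id, key)] = block
            fragment = block.get(key)
            if fragment is not None:
                row_cache[key] = fragment     # populate the row cache
        if fragment is not None:
            fragments.append(fragment)
    row = fuse(fragments)
    fuse_row_cache[key] = row
    return row

def fuse(fragments):
    fused = {}
    for fragment in fragments:                # newest to oldest
        for col, val in fragment.items():
            fused.setdefault(col, val)
    return fused
```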
Data verification
As a financial-grade relational database, OceanBase Database prioritizes data quality and security. Data verification is performed at every stage of the data persistence path along the full data link, and, taking advantage of the inherent multi-replica storage, verification is also performed between replicas to further ensure overall data consistency.
Logical verification
In a common deployment mode, each user table in OceanBase Database has multiple replicas. During the daily major compaction of a tenant, all replicas generate baseline data based on the same global snapshot version, and the system compares the checksums of all replicas after the compaction completes to ensure that they are consistent. In addition, based on the indexes of user tables, the system compares the checksums of index columns between the primary table and its index tables to ensure that the data returned to users is correct, guarding against potential program errors.
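The post-compaction logical checks can be sketched as below: every replica of a partition must report the same data checksum, and the checksum of an indexed column must match between the primary table and its index. The CRC32 checksum and the function signatures are illustrative stand-ins.

```python
# Sketch of replica and index checksum comparison after a major compaction.
import zlib

def column_checksum(rows, column):
    return zlib.crc32(repr(sorted(r[column] for r in rows)).encode())

def verify_replicas(replica_checksums):
    assert len(set(replica_checksums.values())) == 1, \
        f"replica checksum mismatch: {replica_checksums}"

def verify_index(primary_rows, index_rows, indexed_column):
    assert column_checksum(primary_rows, indexed_column) == \
           column_checksum(index_rows, indexed_column), \
        f"index checksum mismatch on column {indexed_column!r}"
```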
Physical verification
For data storage, OceanBase Database records checksums at the microblock, macroblock, SSTable, and partition levels and verifies the data each time it is read. To guard against errors introduced by the underlying storage hardware, the system also verifies the data in a macroblock right after it is written during a minor compaction. In addition, each server runs a scheduled background scan thread that scans and verifies all data to detect silent disk errors early.
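The block-level part of this physical verification can be pictured as below: a checksum stored with each microblock is recomputed on every read, and after writes during a minor compaction, to catch corruption from the storage layer. The CRC32 checksum and block layout are illustrative, not OceanBase Database's actual format.

```python
# Sketch of microblock checksum verification on read.
import zlib

def write_microblock(payload: bytes) -> dict:
    return {"payload": payload, "checksum": zlib.crc32(payload)}

def read_microblock(block: dict) -> bytes:
    if zlib.crc32(block["payload"]) != block["checksum"]:
        raise IOError("microblock checksum mismatch: possible silent disk error")
    return block["payload"]
```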