
The storage engine of OceanBase Database is built on the LSM-tree architecture, which divides data into baseline data stored in SSTables and incremental data stored in MemTables. SSTables are read-only and stored on disk. MemTables support read and write operations and store data in memory. During an insert, update, or delete operation, data is written to MemTables. When the size of a MemTable reaches the specified threshold, it is flushed to disk to become an SSTable. During a query, the storage engine queries both SSTables and MemTables and merges the query results before returning them to the SQL layer. Additionally, the storage engine maintains a block cache and a row cache in memory to avoid random reads of baseline data.
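To make this read/write split concrete, the following minimal Python sketch (all names are illustrative, not OceanBase APIs) models writes going to a MemTable, a threshold-triggered flush into an immutable SSTable, and reads that consult both, newest first:

```python
# Minimal LSM-style sketch: writes go to a MemTable; when it grows past a
# threshold it is frozen into a read-only SSTable; reads consult the
# MemTable first, then SSTables from newest to oldest.
class MiniLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}            # mutable, in-memory incremental data
        self.sstables = []            # immutable "on-disk" baseline, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: the MemTable becomes a read-only SSTable on "disk".
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # newest first
            if key in sstable:
                return sstable[key]
        return None

db = MiniLSM()
for i in range(10):
    db.put(f"k{i}", i)
assert db.get("k3") == 3
```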
When the size of the incremental data in memory reaches the specified threshold, the incremental data is merged with the baseline data and written to disk. In addition, the system automatically performs a daily major compaction during off-peak hours every night.
The storage engine of OceanBase Database is a baseline-incremental storage engine that combines the advantages of the LSM-tree architecture and traditional relational database storage engines.
A traditional database divides data into many pages, and OceanBase Database adopts a similar approach by dividing data files into macroblocks of 2 MB each. Macroblocks are further divided into variable-length microblocks. During a major compaction, data is reused at the macroblock granularity: macroblocks that contain no updated data are reused directly instead of being opened, read, and rewritten. This reduces write amplification and makes a major compaction significantly cheaper than in a traditional LSM-tree database.
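The saving from macroblock-level reuse can be sketched as follows; this is an illustrative model with hypothetical structures, not OceanBase's actual compaction code:

```python
def major_compact(macroblocks, updates):
    """Rewrite only macroblocks whose keys intersect the incremental updates;
    all other macroblocks are reused by reference, with no read or write."""
    new_blocks, rewritten = [], 0
    for block in macroblocks:
        touched = block["keys"] & updates.keys()
        if touched:
            rows = dict(block["rows"])               # open and read the block
            rows.update({k: updates[k] for k in touched})
            new_blocks.append({"keys": set(rows), "rows": rows})
            rewritten += 1                           # written as a new block
        else:
            new_blocks.append(block)                 # reused as-is: zero I/O
    return new_blocks, rewritten

blocks = [{"keys": {"a", "b"}, "rows": {"a": 1, "b": 2}},
          {"keys": {"c", "d"}, "rows": {"c": 3, "d": 4}}]
blocks, rewritten = major_compact(blocks, {"a": 10})
assert rewritten == 1    # only the first macroblock was read and rewritten
```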
The storage engine of OceanBase Database is a baseline-incremental design, so each query must read both baseline data and incremental data and merge the results. To optimize queries at the single-row level, OceanBase Database adopts various techniques. In addition to caching data blocks, it also caches rows, which significantly accelerates single-row queries. For "empty searches" that return no rows, Bloom filters are built and cached to quickly reject search requests that do not need to be processed. Most operations in OLTP workloads are small queries. OceanBase Database optimizes these small queries to avoid the overhead of parsing entire data blocks that traditional databases incur, achieving performance comparable to that of in-memory databases. Because baseline data is read-only and stored contiguously, OceanBase Database can apply aggressive compression algorithms to it, achieving a high compression ratio without compromising query performance and thereby significantly reducing costs.
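A toy Bloom filter shows why empty searches become cheap: a negative answer is definitive, so the request is rejected without touching any baseline data. This sketch is purely illustrative and unrelated to OceanBase's actual implementation:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, a small false-positive rate."""
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent": the empty search is answered
        # without reading any data block.
        return all((self.bits >> p) & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
assert bf.might_contain("row-42")      # present keys always pass
print(bf.might_contain("row-43"))      # absent keys are almost always rejected
```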
By integrating the advantages of classical databases and the LSM-tree architecture, OceanBase Database provides a more general relational database storage engine and has the following characteristics:
Cost-effective. Leveraging the fact that data written under the LSM-tree architecture is never updated in place, OceanBase Database applies self-developed hybrid row-column encoding together with general compression algorithms to achieve a significantly higher storage compression ratio than traditional databases.
User-friendly. Unlike other databases with the LSM-tree architecture, OceanBase Database ensures the smooth execution or rollback of large and long transactions through the disk-based storage of incremental data and supports multiple major compactions and flushes to help users achieve the optimal balance between performance and space.
High performance. To ensure extremely low response latency for common point queries, OceanBase Database uses multiple caches to accelerate queries. For range scans, the storage engine leverages data encoding features to push down query filter conditions and supports native vectorized execution.
High reliability. In addition to end-to-end data verification, OceanBase Database leverages its distributed nature to verify the accuracy of user data by comparing replicas during global major compactions and by checking the consistency between primary tables and their indexes. It also runs a background thread that periodically scans data to preemptively identify and correct silent data errors.
Components of the storage engine
The storage engine of OceanBase Database can be divided into the following components based on their features.
Data storage
Data organization
Like other LSM-tree databases, OceanBase Database stores data as memory-based incremental data (MemTables) and disk-based static data (SSTables). MemTables are writable and store data in memory. SSTables are read-only and store data on disk. DML operations such as INSERT, UPDATE, and DELETE are first written to MemTables. When a MemTable reaches the specified size, its data is flushed to disk to become an SSTable.
In OceanBase Database, SSTables are further classified into mini SSTables, minor SSTables, and major SSTables. A MemTable is flushed to a mini SSTable on disk through a mini compaction. When the number of mini SSTables reaches the specified threshold, multiple mini SSTables are merged into a minor SSTable through a minor compaction. During the daily major compaction that is specific to OceanBase Database, all minor SSTables and the baseline major SSTable in a partition are merged into a new major SSTable.
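The flow from MemTable to mini, minor, and major SSTables can be summarized in a short sketch; the class, method names, and threshold are made up for illustration:

```python
class PartitionStore:
    MINI_LIMIT = 4     # assumed threshold for triggering a minor compaction

    def __init__(self):
        self.mini, self.minor, self.major = [], [], {}

    def mini_compaction(self, memtable):
        """Flush a frozen MemTable to disk as a mini SSTable."""
        self.mini.append(dict(memtable))
        if len(self.mini) >= self.MINI_LIMIT:
            self.minor_compaction()

    def minor_compaction(self):
        """Merge all mini SSTables into a single minor SSTable."""
        merged = {}
        for table in self.mini:                 # oldest to newest
            merged.update(table)
        self.minor.append(merged)
        self.mini = []

    def major_compaction(self):
        """Merge the baseline major SSTable with all incremental SSTables."""
        merged = dict(self.major)
        for table in self.minor + self.mini:    # oldest to newest
            merged.update(table)
        self.major = merged
        self.minor, self.mini = [], []

store = PartitionStore()
store.mini_compaction({"a": 1})
store.mini_compaction({"a": 2, "b": 3})
store.major_compaction()
assert store.major == {"a": 2, "b": 3}
```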
Storage structure
In OceanBase Database, each partition's basic storage unit is an SSTable, and the basic granularity of storage is a macroblock. When the database starts, the data file is divided into macroblocks of a fixed size of 2 MB. Each SSTable is actually a collection of macroblocks.
Each macroblock is divided into multiple microblocks. The concept of a microblock is similar to that of a page or block in a traditional database, except that microblocks in OceanBase Database are variable-length and compressed, in keeping with the LSM-tree structure. The size of microblocks can be specified by using the block_size parameter when a table is created. Microblocks can be stored in encoding format or flat format, depending on the storage format specified by the user. In a microblock stored in encoding format, data is stored in a hybrid row-column mode; in a microblock stored in flat format, all data rows are stored in a flattened manner.
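The fixed/variable split can be pictured with a small packing sketch: fixed 2 MB macroblocks hold a sequence of variable-length microblocks. The function and the sample sizes are illustrative only:

```python
import os

MACROBLOCK_SIZE = 2 * 1024 * 1024   # macroblocks are fixed at 2 MB

def pack_macroblocks(microblocks):
    """Greedily pack variable-length (compressed) microblocks into
    fixed-size macroblocks; each macroblock holds whole microblocks."""
    macroblocks, current, used = [], [], 0
    for block in microblocks:
        if used + len(block) > MACROBLOCK_SIZE and current:
            macroblocks.append(current)          # this macroblock is full
            current, used = [], 0
        current.append(block)
        used += len(block)
    if current:
        macroblocks.append(current)
    return macroblocks

# Compressed microblocks come out at varying sizes (block_size is only a
# pre-compression target), hence the variable-length packing.
microblocks = [os.urandom(n) for n in (9_000, 15_000, 12_000)]
assert len(pack_macroblocks(microblocks)) == 1   # all fit in one macroblock
```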
Compression and encoding
OceanBase Database uses encoding and compression to reduce the size of the data in microblocks, depending on the mode specified for the user table. When encoding is enabled for a user table, the data in each microblock is encoded column by column. The encoding rules include dictionary, run-length, constant, and difference encoding. After each column is encoded, inter-column rules such as column equality and column substring matching are further applied. Encoding significantly compresses the data, and the extracted columnar features can accelerate subsequent queries.
On top of encoding, OceanBase Database allows you to apply a general lossless compression algorithm of your choice to the microblock data, further improving the compression ratio.
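A condensed sketch of the encode-then-compress pipeline: dictionary and run-length encoding on a column, followed by a general lossless compressor. Here zlib and JSON stand in for whatever algorithm and physical layout are actually configured, and the helper names are hypothetical:

```python
import json, zlib

def dictionary_encode(column):
    """Map each distinct value to a small integer code."""
    codes = {v: i for i, v in enumerate(dict.fromkeys(column))}
    return list(codes), [codes[v] for v in column]

def run_length_encode(column):
    """[a, a, a, b, b] -> [[a, 3], [b, 2]]; effective on low-cardinality columns."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

city = ["SH"] * 400 + ["BJ"] * 300 + ["SZ"] * 100
dictionary, codes = dictionary_encode(city)      # values -> integer codes
runs = run_length_encode(codes)                  # 800 codes -> 3 runs
raw = json.dumps(city).encode()
packed = zlib.compress(json.dumps([dictionary, runs]).encode())
print(len(raw), "->", len(packed))               # orders of magnitude smaller
```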
Minor and major compactions
Minor compaction
A flush is initiated when the memory usage of MemTables exceeds the specified threshold: the data in the MemTable is written to disk to become a mini SSTable through a mini compaction. As user data is written, the number of mini SSTables grows. When the number of mini SSTables exceeds the specified threshold, a minor compaction is automatically triggered in the background to merge multiple mini SSTables into a single minor SSTable.
Major compaction
A major compaction, also known as a daily major compaction in OceanBase Database, differs from major compactions in other LSM-tree databases. As the name suggests, it was originally designed to run as a whole-cluster compaction at around 2 AM every day. A major compaction is initiated by the root service of each tenant based on the write status or user settings. Each major compaction selects a global snapshot point, and all partitions in the tenant are compacted based on the data at that snapshot point, so that after each major compaction all data of the tenant is based on the same snapshot. This mechanism helps users periodically consolidate incremental data to improve read performance, and it provides a natural data verification point: with the globally consistent snapshot, OceanBase Database can perform multi-dimensional physical data verification across replicas and between primary tables and their indexes.
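The global snapshot idea, in sketch form: one snapshot version is chosen, every partition compacts the data visible at that version, and the resulting checksums become directly comparable across replicas. All names and structures here are hypothetical:

```python
import zlib

def major_compact_partition(rows, snapshot_version):
    """Keep, for each key, the newest version at or below the snapshot."""
    baseline = {}
    for key, version, value in sorted(rows):     # sorted by key, then version
        if version <= snapshot_version:
            baseline[key] = value
    checksum = zlib.crc32(repr(sorted(baseline.items())).encode())
    return baseline, checksum

# Every replica compacts against the same snapshot, so identical data
# yields identical checksums: a natural cross-replica verification point.
rows = [("k1", 5, "a"), ("k1", 9, "b"), ("k2", 3, "c")]
_, checksum_replica1 = major_compact_partition(rows, snapshot_version=8)
_, checksum_replica2 = major_compact_partition(rows, snapshot_version=8)
assert checksum_replica1 == checksum_replica2
```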
Queries and writes
Insertion
In OceanBase Database, all data tables, whether or not they have primary keys, are treated as index-clustered tables; hidden primary keys are maintained for data tables without primary keys. Therefore, when you insert data, OceanBase Database must check whether a row with the same primary key already exists before writing the new user data. To speed up this duplicate primary key check, an asynchronous background thread builds Bloom filters for macroblocks based on how frequently they are queried.
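The insert path's existence check can be sketched as: consult the MemTable, then each SSTable's filter, and only read a data block when the filter cannot rule the key out. This is illustrative only; a plain Python set stands in for the real Bloom filter:

```python
def primary_key_exists(key, memtable, sstables):
    """Duplicate primary key check on insert. Each SSTable carries a filter
    of its keys; a set stands in for the Bloom filter, which would allow
    rare false positives but never false negatives."""
    if key in memtable:
        return True
    for sstable in reversed(sstables):        # newest first
        if key not in sstable["bloom"]:
            continue                          # definitely absent: skip the read
        if key in sstable["rows"]:            # only now read the (cached) block
            return True
    return False

def insert(key, value, memtable, sstables):
    if primary_key_exists(key, memtable, sstables):
        raise ValueError(f"duplicate primary key: {key}")
    memtable[key] = value

memtable = {}
sstables = [{"bloom": {"pk1"}, "rows": {"pk1": ("pk1", "alice")}}]
insert("pk2", ("pk2", "bob"), memtable, sstables)       # succeeds
try:
    insert("pk1", ("pk1", "dup"), memtable, sstables)   # duplicate detected
except ValueError as e:
    print(e)
```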
Updates
As an LSM-tree database, OceanBase Database inserts a new row of data for each update. Unlike a clog (commit log) record, an update in a MemTable includes only the new values of the updated columns plus the primary key of the updated row, so an updated row does not necessarily contain all columns of the table. During ongoing background compactions, these incremental updates are merged to speed up queries.
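Because an update row carries only the primary key and the changed columns, reading a row means overlaying partial images from old to new. A minimal fusion sketch with hypothetical structures:

```python
def fuse_row(primary_key, tables_old_to_new):
    """Overlay partial row images: each table maps primary_key to a partial
    column dict, and newer images override older column values."""
    row = None
    for table in tables_old_to_new:
        partial = table.get(primary_key)
        if partial is not None:
            row = {**(row or {}), **partial}
    return row

major   = {"pk1": {"id": "pk1", "name": "alice", "city": "SH"}}
update1 = {"pk1": {"id": "pk1", "city": "BJ"}}    # only the changed column
update2 = {"pk1": {"id": "pk1", "name": "bob"}}
assert fuse_row("pk1", [major, update1, update2]) == \
       {"id": "pk1", "name": "bob", "city": "BJ"}
```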
Deletion
Similar to an update, a deletion is performed by writing a deletion record that contains the primary key of the deleted row and is marked with a deletion flag. Large numbers of deletions are unfriendly to an LSM-tree database: even after all data in a range has been deleted, the database must still iterate over all the delete-marked rows in that range to confirm the deleted state during data fusion. To address this issue, OceanBase Database has a built-in range deletion marking mechanism. Additionally, you can specify a table mode that initiates a special minor compaction, thereby accelerating the recycling of deleted rows to speed up queries.
Queries
When you query a row of data, OceanBase Database traverses the MemTables and SSTables from the newest version to the oldest to find the data with the specified primary key and fuses the versions it finds. During data access, the caches are used to accelerate retrieval. For large queries, the SQL layer pushes filter conditions down to the storage layer, which filters quickly based on data characteristics, leveraging the storage layer architecture to speed up queries. Additionally, vectorized batch computation and batched result returns are supported in batch processing scenarios.
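Putting updates and deletes together, a point query can scan from newest to oldest and stop as soon as the row is fully resolved or a delete marker is met. The following sketch (hypothetical structures, with a fixed column count standing in for the table schema) illustrates that order:

```python
DELETE = object()   # sentinel for a delete marker

def point_get(primary_key, tables_new_to_old, n_columns):
    """Scan from newest to oldest, overlaying partial row images.
    A delete marker hides everything older than itself; the scan also
    stops early once all columns have been resolved."""
    row = {}
    for table in tables_new_to_old:
        image = table.get(primary_key)
        if image is DELETE:
            break                      # older versions are logically gone
        if image:
            row = {**image, **row}     # newer values win
        if len(row) == n_columns:
            break                      # row fully resolved, stop scanning
    return row or None

memtable = {"pk1": {"name": "bob"}}
minor    = {"pk2": DELETE}
major    = {"pk1": {"name": "alice", "city": "SH"},
            "pk2": {"name": "carol", "city": "BJ"}}
assert point_get("pk1", [memtable, minor, major], 2) == {"name": "bob", "city": "SH"}
assert point_get("pk2", [memtable, minor, major], 2) is None
```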
Multi-level cache
To enhance performance, OceanBase Database supports a multi-level cache system: a block cache for queried data microblocks, a row cache for each SSTable, a fuse row cache for fused query results, and a Bloom filter cache for empty checks during insertion. All caches in the same tenant share the same memory space. When data is written to MemTables too fast, memory can be dynamically reclaimed from the other caches for writes.
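The lookup order across these caches can be sketched as follows; the cache names follow the text, but the structure is heavily simplified and the eviction and tenant memory-balancing logic is omitted:

```python
class MultiLevelCache:
    """Illustrative lookup order: fuse row cache -> row cache -> block
    cache -> disk read."""
    def __init__(self, read_from_disk):
        self.fuse_row_cache = {}   # final fused rows from earlier queries
        self.row_cache = {}        # per-SSTable row images
        self.block_cache = {}      # decompressed microblocks
        self.read_from_disk = read_from_disk

    def get(self, key):
        if key in self.fuse_row_cache:
            return self.fuse_row_cache[key]
        if key in self.row_cache:
            row = self.row_cache[key]
        else:
            block = self.block_cache.get(key) or self.read_from_disk(key)
            self.block_cache[key] = block
            row = block[key]                 # extract the row from the block
            self.row_cache[key] = row
        self.fuse_row_cache[key] = row
        return row

reads = []
def disk_read(key):
    reads.append(key)
    return {key: f"value-of-{key}"}

cache = MultiLevelCache(disk_read)
cache.get("k1"); cache.get("k1")
assert reads == ["k1"]   # the second lookup is served entirely from cache
```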
Data verification
As a financial-grade relational database, OceanBase Database prioritizes data quality and security. Data verification is performed at every stage of the persistence layer along the entire data chain. Additionally, verification is performed between replicas, leveraging the inherent advantage of multi-replica storage to further confirm the consistency of the overall data.
Logical verification
In a common deployment mode, each user table in OceanBase Database has multiple replicas. During a daily major compaction for a tenant, all replicas of the tenant generate baseline data based on a globally unified snapshot version and then compare the checksums of the baseline data to verify its consistency. In addition, the checksums of the indexed columns are compared between the primary table and its index tables to ensure that the data returned to users is correct, no matter which table a query reads from.
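In sketch form, both logical checks amount to comparing checksums that were computed independently: replicas must agree on the baseline checksum, and an index table's indexed-column checksum must match the one recomputed from the primary table. The helper below is hypothetical and deliberately order-independent:

```python
import zlib

def column_checksum(rows, columns):
    """Order-independent checksum over the selected columns (illustrative)."""
    return sum(zlib.crc32(repr([row[c] for c in columns]).encode())
               for row in rows) & 0xFFFFFFFF

primary = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
index   = [{"name": "alice", "id": 1}, {"name": "bob", "id": 2}]

# Cross-replica check: all replicas computed the same baseline checksum.
replica_checksums = [column_checksum(primary, ["id", "name"])] * 3
assert len(set(replica_checksums)) == 1

# Primary/index check: the indexed columns must carry identical values.
assert column_checksum(primary, ["name"]) == column_checksum(index, ["name"])
```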
Physical verification
For data storage, OceanBase Database records checksums at the minimum I/O granularity, the microblock, as well as in each macroblock, SSTable, and partition. Data is verified each time it is read. To guard against issues in the underlying storage hardware, the data in a macroblock is verified after it is read from disk and before it is written back during a minor compaction. In addition, a background scanning thread periodically checks all the data on a server to detect silent disk errors in advance.
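Physically, a self-verifying block can be as simple as a checksum stored alongside the payload and rechecked on every read. A minimal sketch using CRC32 (the checksum algorithm and on-disk layout are assumptions, not OceanBase's actual format):

```python
import zlib

def write_microblock(payload: bytes) -> bytes:
    """Prepend a CRC32 so the block is self-verifying on disk."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def read_microblock(block: bytes) -> bytes:
    """Verify the checksum on every read to catch corruption early."""
    stored, payload = int.from_bytes(block[:4], "big"), block[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("microblock checksum mismatch: silent data corruption")
    return payload

block = write_microblock(b"row data ...")
assert read_microblock(block) == b"row data ..."
corrupted = block[:-1] + bytes([block[-1] ^ 0xFF])   # flip one bit on "disk"
try:
    read_microblock(corrupted)
except IOError as e:
    print(e)
```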