OceanBase Database has focused on the LSM-tree architecture from the very beginning. The database supports various types of typical transaction processing (TP) business based on continuous updates of features, and adapts to various extreme loads through continuous performance optimization. With abundant engineering practice and experience, OceanBase Database develops a set of cutting-edge fully featured storage engine based on the LSM-tree architecture. Based on its technology accumulation, OceanBase Database extends the storage engine to support columnar storage in V4.3.0. This implements integrated (column + row) storage on one OBServer node with one set of code and one architecture, thus achieving balance between the performance of TP queries and that of analytical processing (AP) queries.
Overall architecture
As a native distributed database, OceanBase Database stores user data in multiple replicas by default. To leverage its advantages of multi-replica deployment and improve user experience in strong verification of data and reuse of migrated data, OceanBase Database makes several targeted changes in the design of the self-developed LSM-tree-based storage engine. First, user data is divided into the following two parts:
Baseline data
Unlike other databases with LSM-tree-based storage engines, OceanBase Database introduces the concept of daily major compaction based on distributed multi-replica deployment. A tenant selects a global version on a regular basis or based on your operation, and a major compaction is initiated for all replicas of tenant data based on the version. This way, baseline data of the version is generated. The baseline data of the same version is physically consistent for all replicas.
Incremental data
All user data written after the latest version of baseline data is incremental data. Incremental data can be memory data written into MemTables or disk data compacted into SSTables. Replicas of incremental data are maintained independently and are not necessarily consistent. Unlike baseline data that is generated based on a specified version, incremental data contains multi-version data.
Random updates in columnar storage scenarios are controllable. On this basis, OceanBase Database provides a set of columnar storage implementation methods transparent to upper-layer business based on the characteristics of baseline data and incremental data:
Baseline data is stored by column, and incremental data is stored by row. Your DML operations are not affected, and upstream and downstream data is seamlessly synchronized. You can perform transaction operations on columnstore tables the same way as on rowstore tables.
In columnstore mode, the data of each column is stored as an independent SSTable, and the SSTables of all columns are combined into a virtual SSTable as baseline data for columnstore.
- Based on the storage mode that you specified when you created the table, baseline data can be stored by row, column, or both row and column (with redundancy).
In addition to implementing the columnar storage mode in the storage engine, OceanBase Database optimizes and adapts the optimizer, executor, storage, and other related modules for columnar storage. This facilitates you to migrate data from other AP databases or upgrade your OceanBase Database to use columnar storage to meet your AP needs. The migration or upgrade brings no effect to your business, and you can experience the same performance improvements as those brought by row-based storage after you switch to columnar storage. This way, OceanBase Database integrates TP and AP and provides a hybrid transaction/analytical processing (HTAP) engine to support different types of business with one engine and one set of code.
Integrated SQL
OceanBase Database provides a new cost model and statistics for columnar storage. The optimizer automatically selects a plan based on costs and the storage mode of data tables.
OceanBase Database provides a new vectorized execution engine for critical operators, and adaptively determines whether to use vectorized execution and the batch size for vectorized execution.
Integrated storage
You can specify the columnar storage, row-based storage, or hybrid row-column storage mode for a table based on the type of its business load. The storage mode does not affect operations, such queries and backup and restore, on the table.
Columnstore tables support all online and offline DDL operations, data types, and secondary indexes. You can use columnar storage the same way you use row-based storage.
Integrated transactions
Incremental data is stored by row. The logic of recording logs for data changes made in transactions and that of multiversion concurrency control (MVCC) are consistent with those in row-based storage.
Core features
Adaptive major compactions
Major compactions in columnar storage are significantly different from those in row-based storage. Specifically, because incremental data is stored by row, it must be merged with baseline data before it is split and saved to a separate SSTable for each column. The compaction time and resource usage will significantly increase compared with those in row-based storage.
To speed up major compactions of columnstore tables, OceanBase Database greatly optimizes the compaction process. Like rowstore tables, columnstore tables support horizontal splitting and parallel major compactions for speed-up. In addition, columnstore tables support vertical splitting. Major compactions for multiple columns in a columnstore table are combined into one major compaction task. The database can automatically increase or decrease the number of columns in a task based on system resources. This achieves balance between the major compaction speed and the memory overheads.
Column encoding algorithm
Data stored in OceanBase Database undergoes two stages of compression: hybrid row-column encoding provided by OceanBase Database and general compression. As a built-in algorithm of the database, hybrid row-column encoding supports direct queries without decompression. It also supports speed-up of queries, especially AP queries, based on encoded information.
Hybrid row-column encoding is designed mainly for row structures. Therefore, OceanBase Database provides a new column encoding algorithm for columnstore tables. Compared with the original encoding algorithm, the new algorithm supports comprehensive vectorized execution of queries, supports single instruction, multiple data (SIMD) optimization for compatibility with various instruction sets, and greatly increases the compression ratio for numeric types. This way, the new algorithm makes great improvements in terms of the performance and compression ratio.
Skip index
Regular columnstore databases pre-aggregate column data at a specific granularity. The aggregation results are persisted along with the data. When you query or request to access column data, the database can filter data by pre-aggregated data. This significantly reduces data access overheads and I/O consumption.
OceanBase Database supports the skip index feature in the columnar engine. The data of each column is aggregated at the microblock granularity for calculating the maximum value, minimum value, and total number of NULLs. Then the aggregation and accumulation are performed upwards, layer by layer, to obtain values at the macroblock, SSTable, and larger granularities. When you initiate a query, the system continuously drills down to select aggregated values at the appropriate granularity based on the scan range for filtering and aggregated output.
Query pushdown
OceanBase Database preliminarily supports simple query pushdown since V3.2.x. OceanBase Database V4.x and later fully support vectorized storage and support more complex pushdown. In the columnar engine, the pushdown feature is further enhanced and expanded in the following aspects:
All query filters are pushed down. At the same time, the database further utilizes the skip index feature and encoded information for speed-up based on the filter type.
Regular aggregation functions are pushed down. Currently, the database can push down the
COUNT(),MAX(),MIN(),SUM(), andAVG()functions to the storage engine in scenarios without theGROUP BYoperator.The
GROUP BYoperator is pushed down. For a column with a small number of distinct values (NDV), theGROUP BYoperator can be pushed down to the storage for calculation. This can significantly speed up calculation based on the dictionary information in microblocks.
Use columnar storage
Create a columnstore table
When you create a table, you can specify WITH COLUMN GROUP(each column) to create the table as a columnstore table.
obclient> CREATE TABLE tt_column_store (c1 int PRIMARY KEY, c2 int , c3 int) WITH COLUMN GROUP (each column);
You can query the CREATE TABLE statement after the table is created.
obclient> SHOW CREATE TABLE tt_column_store;
The result is as follows:
+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tt_column_store | CREATE TABLE `tt_column_store` (
`c1` int(11) NOT NULL,
`c2` int(11) DEFAULT NULL,
`c3` int(11) DEFAULT NULL,
PRIMARY KEY (`c1`)
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 WITH COLUMN GROUP(each column) |
+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
Create a hybrid rowstore-columnstore table
If you want to achieve balance between AP business and TP business and can accept a specific degree of data redundancy, you can specify all columns in the WITH COLUMN GROUP syntax to enable redundancy of rowstore data.
obclient> CREATE TABLE tt_column_row (c1 int PRIMARY KEY, c2 int , c3 int) WITH COLUMN GROUP (all columns, each column);
You can query the CREATE TABLE statement after the table is created.
obclient> SHOW CREATE TABLE tt_column_row;
The result is as follows:
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tt_column_row | CREATE TABLE `tt_column_row` (
`c1` int(11) NOT NULL,
`c2` int(11) DEFAULT NULL,
`c3` int(11) DEFAULT NULL,
PRIMARY KEY (`c1`)
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 WITH COLUMN GROUP(all columns, each column) |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
Columnstore scan
Query whether the execution plan contains a columnstore scan plan
The columnstore table
tt_column_storeis used as an example.obclient> EXPLAIN SELECT * FROM tt_column_store;The result is as follows:
+--------------------------------------------------------------------------------------------------------+ | Query Plan | +--------------------------------------------------------------------------------------------------------+ | ================================================================= | | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| | | ----------------------------------------------------------------- | | |0 |COLUMN TABLE FULL SCAN|tt_column_store|1 |7 | | | ================================================================= | | Outputs & filters: | | ------------------------------------- | | 0 - output([tt_column_store.c1], [tt_column_store.c2], [tt_column_store.c3]), filter(nil), rowset=16 | | access([tt_column_store.c1], [tt_column_store.c2], [tt_column_store.c3]), partitions(p0) | | is_index_back=false, is_global_index=false, | | range_key([tt_column_store.c1]), range(MIN ; MAX)always true | +--------------------------------------------------------------------------------------------------------+ 11 rows in setIn the query result,
COLUMN TABLE FULL SCANin the plan indicates the range scan for the columnstore table.COLUMN TABLE GETin the plan indicates the get operation with a specified primary key on the columnstore table. Here is an example:obclient> EXPLAIN SELECT * FROM tt_column_store WHERE c1 = 1;The result is as follows:
+--------------------------------------------------------------------------------------------------------+ | Query Plan | +--------------------------------------------------------------------------------------------------------+ | =========================================================== | | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| | | ----------------------------------------------------------- | | |0 |COLUMN TABLE GET|tt_column_store|1 |14 | | | =========================================================== | | Outputs & filters: | | ------------------------------------- | | 0 - output([tt_column_store.c1], [tt_column_store.c2], [tt_column_store.c3]), filter(nil), rowset=16 | | access([tt_column_store.c1], [tt_column_store.c2], [tt_column_store.c3]), partitions(p0) | | is_index_back=false, is_global_index=false, | | range_key([tt_column_store.c1]), range[1 ; 1], | | range_cond([tt_column_store.c1 = 1]) | +--------------------------------------------------------------------------------------------------------+ 12 rows in setSpecify whether to perform columnstore scan for a hybrid rowstore-columnstore table by using hints
The optimizer determines whether to perform rowstore scan or columnstore scan for a hybrid rowstore-columnstore table based on costs. For example, for full table scan in a simple scenario, the system uses row-based storage for generating a plan by default.
The hybrid rowstore-columnstore table
tt_column_rowis used as an example.obclient> EXPLAIN SELECT * FROM tt_column_row;The result is as follows:
+--------------------------------------------------------------------------------------------------+ | Query Plan | +--------------------------------------------------------------------------------------------------+ | ======================================================== | | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| | | -------------------------------------------------------- | | |0 |TABLE FULL SCAN|tt_column_row|1 |3 | | | ======================================================== | | Outputs & filters: | | ------------------------------------- | | 0 - output([tt_column_row.c1], [tt_column_row.c2], [tt_column_row.c3]), filter(nil), rowset=16 | | access([tt_column_row.c1], [tt_column_row.c2], [tt_column_row.c3]), partitions(p0) | | is_index_back=false, is_global_index=false, | | range_key([tt_column_row.c1]), range(MIN ; MAX)always true | +--------------------------------------------------------------------------------------------------+ 11 rows in setYou can also use the
USE_COLUMN_TABLEhint to forcibly perform columnstore scan for thett_column_rowtable.obclient> EXPLAIN SELECT/*+ USE_COLUMN_TABLE(tt_column_row) */ * FROM tt_column_row;The result is as follows:
+--------------------------------------------------------------------------------------------------+ | Query Plan | +--------------------------------------------------------------------------------------------------+ | =============================================================== | | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| | | --------------------------------------------------------------- | | |0 |COLUMN TABLE FULL SCAN|tt_column_row|1 |7 | | | =============================================================== | | Outputs & filters: | | ------------------------------------- | | 0 - output([tt_column_row.c1], [tt_column_row.c2], [tt_column_row.c3]), filter(nil), rowset=16 | | access([tt_column_row.c1], [tt_column_row.c2], [tt_column_row.c3]), partitions(p0) | | is_index_back=false, is_global_index=false, | | range_key([tt_column_row.c1]), range(MIN ; MAX)always true | +--------------------------------------------------------------------------------------------------+ 11 rows in setSimilarly, you can use the
NO_USE_COLUMN_TABLEhint to forcibly forbid columnstore scan for the table.obclient> EXPLAIN SELECT /*+ NO_USE_COLUMN_TABLE(tt_column_row) */ c2 FROM tt_column_row;The result is as follows:
+------------------------------------------------------------------+ | Query Plan | +------------------------------------------------------------------+ | ======================================================== | | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| | | -------------------------------------------------------- | | |0 |TABLE FULL SCAN|tt_column_row|1 |3 | | | ======================================================== | | Outputs & filters: | | ------------------------------------- | | 0 - output([tt_column_row.c2]), filter(nil), rowset=16 | | access([tt_column_row.c2]), partitions(p0) | | is_index_back=false, is_global_index=false, | | range_key([tt_column_row.c1]), range(MIN ; MAX)always true | +------------------------------------------------------------------+ 11 rows in set