In scenarios involving complex analysis of large-scale data or ad-hoc queries on massive data, columnar storage is a key capability of an AP database. Unlike row-based storage, which organizes data by rows, columnar storage organizes data by columns. This approach allows analytical queries to scan only the relevant columns, avoiding the need to scan entire rows. As a result, it reduces I/O and memory usage, thereby improving computational speed. Additionally, columnar storage inherently offers better conditions for data compression, enabling higher compression ratios, which reduces storage space and network bandwidth.
OceanBase Database, based on the LSM-Tree (Log Structured Merge-Tree) data storage structure, continuously optimizes its high-concurrency transaction processing (TP) capabilities. This allows it to enhance performance in scenarios such as random writes, real-time updates, and strong consistency. Through extensive engineering experience, OceanBase has built a self-controlled storage engine technology system. The hierarchical merge feature and static data organization capability of LSM-Tree make it inherently suitable for OLAP scenarios characterized by batch writes and infrequent updates. By employing columnar data compression, hierarchical merge strategies, and storage fragmentation optimization, OceanBase meets the efficient scanning needs of analytical workloads while supporting a unified architecture for both TP and AP workloads.
In V4.3, building on its existing technology, OceanBase Database continues to expand its storage engine to support columnar storage, achieving storage integration. This means a single codebase, architecture, and OBServer instance can handle both columnar and row-based data. This integration ensures that OceanBase Database can effectively balance the performance of TP and AP queries.
Columnar Engine Architecture
OceanBase Database, as a native distributed database, stores user data in multiple replicas by default. To leverage the benefits of multi-replica storage, including strong data verification and data migration reuse, a custom-designed LSM-Tree storage engine has been developed with various optimizations. Specifically, user data can be divided into two parts: baseline data and incremental data.
Baseline data
Globally consistent version control: OceanBase Database breaks away from the traditional LSM-Tree design paradigm. Based on the distributed multi-replica architecture, OceanBase Database implements the "daily major compaction" mechanism. The system can periodically or on-demand trigger the selection of a global version number. Tenants will periodically or select a global version number based on user operations. All replicas then perform a major compaction based on this version, generating a baseline for the version. As a result, all replicas have the same baseline data for the same version.
Multi-mode storage support: The baseline data supports row-oriented, column-oriented, and hybrid storage formats. Users can choose the appropriate format for their specific business scenarios through table creation settings.
Incremental data
- Dynamic multi-version management: All writes after the latest baseline version, including data written to the MemTable and data that has been flushed to an SSTable, are considered incremental data. Each replica independently maintains its own multi-version records, which do not need to be consistent with one another. Unlike baseline data, which is generated based on a specified version, incremental data includes all multi-version data.
- Prefer rowstore strategy: Incremental data is forced to use the rowstore mode to ensure that the transaction processing (TP) link is fully compatible with the native rowstore architecture, sharing key components like transaction logs and lock mechanisms.

Row and column storage hybrid architecture
Taking advantage of the characteristics of columnar storage scenarios, OceanBase Database leverages its own baseline and incremental data features to propose a columnar storage implementation that is transparent to the upper layers:
- Decoupling of storage form: Baseline data is organized in columnar format, with each column stored as a separate SSTable, logically grouped into tables. Incremental data is stored in row format, and DML operations are fully synchronized without any impact on the upstream and downstream processes.
- Dynamic Sync Engine: Builds a bidirectional sync pipeline for row and column data, supporting smooth migration/upgrade of OLAP systems with rowstore upgrades. Business operations remain unaffected by storage format differences.
- **Smart routing: **The system automatically selects the optimal row or column access paths based on load characteristics, from the optimizer to the executor. This fully exploits the columnar storage performance advantages in AP scenarios while preserving native support for TP transactions in row stores.
Key Integration Capabilities
Dimension |
Key Technology |
|---|---|
| SQL Integration |
|
| Storage integration |
|
| Transaction integration |
|

Main characteristics of columnar storage
Characteristic 1: Adaptive compaction
The data consolidation behavior of the new columnar storage mode is significantly different from that of row-based storage. All incremental data is row-based and needs to be merged with baseline data and then split into independent SSTables in each column. This leads to a substantial increase in consolidation time and resource usage compared with row-based storage. To accelerate the consolidation of columnar tables, the storage layer has optimized the compaction process. In addition to horizontal splitting for parallel merging, which is similar to row-based tables, columnar tables also support vertical splitting for accelerated merging. Multiple column consolidation tasks are combined into one task, and the number of columns in a task can be flexibly adjusted based on system resources to maintain a good balance between consolidation speed and memory usage.
Feature 2: Columnar encoding algorithm
OceanBase Database applies compression to data stored in two levels. At the first level, it applies the row-column hybrid encoding compression technology it has developed independently. At the second level, it applies a generic compression technology. Because the row-column hybrid encoding compression is an algorithm built in the database, it allows direct query without decompression and enables query filtering acceleration by using encoding information. However, the original row-column hybrid encoding compression algorithm is still based on rows. Therefore, OceanBase Database implements a new columnar encoding algorithm for columnar tables. Compared with the original algorithm, the new algorithm supports complete vectorized execution of queries, supports SIMD optimization across instruction sets, and significantly improves the compression ratio for numeric types, thereby achieving a significant improvement in both performance and compression ratio compared with the original algorithm.
Feature 3: Skip Index
Common columnar databases typically precompute aggregated values for each column at various granularities and persist these results alongside the data. When users query specific columns, the database can filter data using these precomputed aggregated values, significantly reducing data access overhead and minimizing unnecessary I/O. In the columnar engine, we also introduced skip index support, allowing for multi-dimensional aggregation of each column's microblock, including maximum value, minimum value, and total null count. These aggregations are then cumulatively computed and aggregated into larger units such as macroblocks and SSTables. Users can then drill down based on their query's scan range to select the most appropriate granular aggregated value for data filtering and output.
Feature 4: Query pushdown
Starting from V4.x, the storage layer's operators and expressions in OceanBase Database have been comprehensively adapted for vectorized execution, and some scenarios support query pushdown. The columnar engine further enhances and extends query pushdown functionality, including:
All queries are filtered, and the filters can be further accelerated using skip indexes and encoded information based on their types.
Supported to push down commonly used aggregate functions for scenarios where the group by condition is not specified. The count, max, min, sum, and avg aggregate functions can be pushed down to the storage engine.
Supports the storage-side pushdown of GROUP BY operations. It significantly improves the performance of GROUP BY operations on columns with a small number of distinct values (NDVs) by leveraging the dictionary information within microblocks.
For more information about columnstore, see Columnstore engine.
Shared storage mode
OceanBase Database is supported in two deployment modes: Shared-Nothing (SN) mode and Shared-Storage (SS) mode.
Shared-Storage mode is an architecture where data is centrally stored on shared storage devices and can be accessed by multiple database nodes. This architecture is used in public clouds. It is primarily used to simplify storage management, improve resource utilization, and support flexible high-availability failover in specific scenarios.
Reduce storage redundancy: In a traditional shared-nothing architecture, each node maintains a separate copy of the data, leading to potential waste of storage resources. In contrast, shared storage allows multiple computing nodes to access the same data, reducing the number of copies and thereby lowering overall storage costs.
Fast failure recovery: When a compute node fails, other nodes can access data from shared storage, avoiding the need for complex replication synchronization or migration processes, enabling faster service takeover.
Simplified O&M management: A unified storage pool facilitates backup, snapshots, and monitoring, enhancing the maintainability of the database system.
This section describes the features of the dual-replica architecture of OceanBase Database in a separation-of-compute-and-storage mode. This architecture allows you to:
- Cost-efficient: Full data is stored in object storage, hot data is cached in local disks, thus lowering storage costs while guaranteeing P99 query performance. It features elastic scaling, with compute and storage resources that can be scaled independently. In comparison to the Shared-Nothing architecture, it has an average performance loss of 0.3%~1.7%.
- High Availability: A 2F dual-replica deployment ensures high availability for computing nodes. Combined with an independent log storage service based on Paxos, the system guarantees RPO=0 and RTO<8s.
The decoupled architecture of OceanBase Database based on single-tenant instances is capable of:
Extreme cost-effectiveness: Full data is stored in low-cost object storage, with hot data cached in local disks. It uses a single-replica format, reducing the computing costs by 2-3 times compared with 2F1A and 3F clusters. It also supports fast addition and deletion of computing nodes within a single zone to meet changing performance requirements.
High availability: Objects stored in the same city are redundant. This feature supports disaster recovery at the data center level. When a single-replica system fails, it can quickly start a new compute node with the help of a standalone log service based on Paxos and shared storage.
Vectorized execution engine
Vectorized execution is an efficient technique for batch processing data. In analytical queries, vectorized execution can significantly improve execution performance. OceanBase Database introduced the vectorized execution engine in V3.2, but it is disabled by default. Starting from OceanBase Database V4.0, the vectorized execution engine is enabled by default. In OceanBase Database V4.3, the vectorized engine 2.0 was implemented, which greatly improved the execution performance of the vectorized engine through optimizations in data formats, operator implementations, and storage vectorization.

Overview of data formats
In vectorized engine 2.0, a new columnar data format was introduced. The data description information (null, len, ptr) is stored separately for each column, avoiding redundant storage. Three data formats were designed for different data types and use cases: fixed-length data format, variable-length discrete format, and variable-length continuous format.
- Fixed-length data format: The length value is stored only once, eliminating the need for redundant storage. Direct access is possible, resulting in better data locality. Compared to version 1.0, this format saves space, improves efficiency, and eliminates the need for pointer swizzling operations.
- Variable-length discrete format: Each data element may not be contiguous in memory. Each data element is described by an address pointer and length. This format avoids deep copying during encoding and is suitable for short-circuit computation scenarios, eliminating the need for data reorganization.
- Variable-length continuous format: Data is stored contiguously in memory, with length information and offset addresses described by an offset array. This format improves data access efficiency but requires reorganization and deep copying during short-circuit computation and columnar encoding projection. This format is primarily used for columnar materialization scenarios.
Performance optimization of operators and expressions
Vectorized engine 2.0 comprehensively optimized operators and expressions. The core idea is to leverage the information from the new format and specialized data structures to reduce CPU cache misses and instruction overhead, thereby improving overall execution performance. The main optimizations include:
- Utilizing batch data attribute information: Maintain batch data feature information to eliminate special handling and filtering for NULL values, optimizing SIMD calculations.
- Algorithm and data structure optimization: Optimized intermediate result materialization structures to support row-wise and column-wise materialization. The Sort operator separates sort keys from non-sort keys for materialization, reducing cache misses during sorting and improving overall efficiency.
- Specialized implementation optimization: Optimized for specific scenarios, such as encoding multiple fixed-length join keys into a single fixed-length column and specialized implementations for aggregation calculations, significantly improving execution efficiency.
Storage vectorization optimization
The storage layer fully supports the new vectorized formats. SIMD acceleration is used for projection, predicate pushdown, aggregation pushdown, and groupby pushdown operations. Customized templates are used for projection based on column types and lengths, simplifying calculations. Predicate calculations are directly performed on column-encoded data, enabling rapid computation of complex expressions. Aggregation pushdown leverages pre-aggregated information from intermediate layers for efficient processing of statistical functions, while groupby pushdown significantly accelerates performance by utilizing encoded data information.
Overall performance, the vectorized execution engine in OceanBase Database significantly outperforms the non-vectorized engine, with performance improvements of an order of magnitude:
Real-time write
OceanBase Database adopts the Log-Structured Merge-Tree (LSM-Tree) architecture, which ensures the real-time write capability of the database. This topic describes the real-time write capability of OceanBase Database.
Core storage mechanism
LSM-Tree
To support efficient real-time writes, OceanBase Database stores incremental data in the LSM-Tree structure. The LSM-Tree is a special tree structure designed to optimize write operations. Its core idea is to record write operations in an in-memory structure and asynchronously batch write them to disk when a certain amount is reached. This significantly reduces the number of disk I/O operations and improves write performance. OceanBase Database periodically merges incremental data with baseline data through dumping and compaction processes to ensure data consistency and integrity.
Real-time write capability
The LSM-Tree architecture of OceanBase Database ensures excellent performance in handling real-time data writes. Whether it's small-scale data updates or large-scale data imports, OceanBase Database can quickly respond and ensure real-time writes. Its main features are as follows:
- Efficient write processing: OceanBase Database processes write operations in the LSM-Tree, reducing disk operations and improving write efficiency.
- Immediate query availability: Data written to the LSM-Tree's in-memory structure can be queried immediately, ensuring real-time data availability.
- Optimized data compaction process: Intelligent dumping and compaction strategies support efficient queries.
- Strong concurrent processing capability: Leveraging a distributed architecture, OceanBase Database can parallelly process write operations across multiple nodes, significantly enhancing real-time data processing capabilities.

Materialized views
Materialized views (MV) is a key feature for AP business scenarios. It improves query performance and simplifies complex query logic by precomputing and storing the query results of views, thereby reducing real-time calculations. Materialized views are commonly used for rapid report generation and data analysis scenarios.
OceanBase Database supports asynchronous materialized views. That is, when the data of the base table changes, the data of the materialized view is not updated immediately. This ensures the execution performance of DML operations on the base table, but the data of the materialized view will be delayed compared with that of the base table. The data of the materialized view can be updated through complete refresh or incremental refresh.
- Complete refresh directly re-executes the query statement corresponding to the materialized view, recalculates, and overwrites the original view result data. It is suitable for scenarios with low delay requirements, low data update frequency of the base table, or small data volume. For example, it can be used for data summary reports that are updated on a daily or weekly basis.
- Incremental refresh processes only the data changes since the last refresh, which greatly reduces the time and resources required for refreshes. To ensure precise incremental refreshes, OceanBase Database provides a materialized view log (MLOG) mechanism similar to that in Oracle databases. It records the incremental update data of the base table to ensure that the materialized view can be quickly incrementally refreshed. Incremental refreshes are suitable for scenarios with high delay requirements, large data volume, and frequent data changes. For example, in a real-time transaction system, data may change every minute or even every second.
OceanBase Database also supports real-time materialized views for real-time data analysis. Real-time materialized views use the materialized view log mechanism to capture and process changes in base table data. During queries, it calculates and integrates data changes online. Even if the materialized view does not physically store the latest change data, it can ensure that the query results are consistent with those obtained by directly querying the base table. At the same time, it uses the query rewriting capability of materialized views to achieve transparent query acceleration.
You can specify a primary key or create indexes for materialized views to optimize the performance of single-row lookups, range queries, or join operations based on the primary key or index. If a materialized view is a large wide table formed by multi-table JOINs, you can create a columnar storage materialized view to improve query performance in some cases. You can also create partitioned materialized views to reduce the amount of data operated on by using partition pruning.
OceanBase Database V4.3.5 and later support nested materialized views (Nested Materialized View, Nested MV). With this feature, you can easily build new materialized views based on existing materialized views. This feature is suitable for the ETL process in data warehouses. Specifically, nested materialized views can generate intermediate results during the data conversion and loading phases. These results can serve as input for subsequent processing, further optimizing the entire data processing workflow.
For more information about materialized views, see Overview (MySQL mode) and Overview (Oracle mode).
Optimizer
The query optimizer of OceanBase Database is designed for HTAP hybrid workloads and real-time analytics. Typically, transactional workloads have the following characteristics: small data access per query, high response time (RT) requirements, and higher throughput. Analytical workloads, on the other hand, involve large data access per query and relatively lower throughput. The execution plans and optimization methods required for different types of workloads also vary significantly.
- For transactional workloads, it is typically necessary to create appropriate index structures for filtering predicates and join predicates. Each table should select the most suitable index path to significantly reduce data scanning.
- For analytical workloads, full-table scans are usually performed using columnar storage, and skipindex is used to quickly skip the scanning of some data blocks. The response time of queries can be reduced by increasing the parallelism.
The OceanBase AP optimizer is built upon the capabilities of the TP optimizer and has been specifically enhanced for various complex query scenarios.
More comprehensive query rewriting capabilities: The query rewriting module supports a wide range of rewriting algorithms. Different rewriting algorithms are matched to different scenarios and perform equivalent transformations to improve the business SQL queries. Currently, the optimizer supports rule-based rewriting algorithms such as view merging, subquery lifting, inner join elimination, outer join elimination, and elimination of always-true or always-false conditions. It also supports cost-based rewriting algorithms such as OR expansion, JA subquery lifting, Win Magic, and Group-By Placement.
Distributed plan generation: OceanBase Database uses a one-phase approach for distributed plan generation, which is different from the two-phase approach commonly used in other database systems. In the two-phase approach, an optimal single-machine plan is first generated and then distributed to simplify the optimizer's design. However, this approach faces more challenges in HTAP scenarios. For example, in a single-machine scenario, the optimal NESTED LOOP JOIN algorithm may generate a large amount of cross-node data access after being distributed, resulting in significantly lower execution efficiency compared to the distributed HASH JOIN algorithm. OceanBase Database's one-phase plan generation framework considers factors such as data distribution characteristics and parallelization during the enumeration of join orders and join algorithms. It comprehensively evaluates various factors to select the overall optimal distributed execution plan.
Row and column path selection: To effectively handle HTAP hybrid workloads, the query optimizer employs an intelligent optimization capability for automatic row and column path selection. This means that for a given query, the optimizer intelligently decides whether to read data from row store (Row Store) or column store (Column Store) to achieve optimal execution performance. OceanBase Database has designed a specific cost model for column store scans and introduced a new statistical information mechanism to accurately evaluate the benefits of column store SkipIndex for column store scans. This ensures accurate decisions between row store indexes and column store scans.
Easier-to-use Auto DOP: While databases typically use parallel execution to accelerate complex SQL queries, it is challenging to easily assess whether to enable parallel execution and the appropriate degree of parallelism in real-world scenarios. OceanBase Database provides the Auto DOP feature, which allows the optimizer to evaluate the required execution time of a query and automatically determine whether to enable parallel execution and the appropriate degree of parallelism, ensuring that SQL queries can achieve optimal performance by default.
Execution plan management: OceanBase Database's SQL Plan Management (SPM) technology is a key technology for ensuring the long-term stability of real-time analytics. It intelligently manages the evolution of execution plans to address scenarios where the optimizer may select a worse new execution plan after data volume changes, statistical information updates, or database version upgrades. It prevents execution plan regression through gray-scale validation using real traffic.
For more information, see Overview of statistics and cost-based optimization.
Smart materialized views
The core value of smart materialized views lies in the fact that users only need to define the data processing results through SQL, and OceanBase Database automatically manages data refreshes and dependency calculations. This means that enterprises do not need to write complex ETL scripts or manage data pipelines. Instead, they just need to define the target data structure, and OceanBase Database will automatically handle data refreshes, dependency management, and performance optimization. This approach greatly simplifies the data engineering process, reduces operational costs, and ensures data freshness and consistency.
OceanBase Database supports asynchronous materialized views. When the data in the base table changes, the materialized view is not immediately updated. This ensures the performance of DML operations on the base table. However, the data in the materialized view may be delayed relative to the base table, and it needs to be refreshed and updated in a timely manner. The update of materialized view data supports two strategies: full refresh and incremental refresh.
- A full refresh re-executes the query statement corresponding to the materialized view, fully calculates, and overwrites the original view result data. This strategy is suitable for scenarios with low latency requirements, low base table data update frequency, or small data volumes. For example, data summary reports that are updated once a day or a week.
- An incremental refresh processes only the data changes since the last refresh. This approach significantly reduces the time and resources required for refreshes. To achieve precise incremental refreshes, OceanBase Database implements a materialized view log feature similar to Oracle's MLOG (Materialized View Log). This feature records detailed incremental updates of the base table, ensuring that the materialized view can quickly perform incremental refreshes. The incremental refresh strategy is particularly suitable for scenarios with high latency requirements, large data volumes, and frequent changes. For example, in real-time transaction systems, data may change every minute or even every second.
OceanBase Database also supports real-time materialized views for real-time data analysis. Real-time materialized views use the materialized view log mechanism to capture and process changes in the base table data. During queries, the system calculates and integrates the data changes in real time, ensuring that the query results are consistent with those obtained by directly querying the base table, even if the materialized view does not physically store the latest changes. Additionally, the query rewriting capability of materialized views enables transparent query acceleration.
OceanBase Database allows you to specify a primary key or create an index for a materialized view, optimizing the performance of single-row lookups, range queries, and join operations based on the primary key or index. If a materialized view is a large wide table formed by joining multiple tables, you can create a columnar storage materialized view to improve the performance of certain queries. Furthermore, by creating partitioned materialized views, you can leverage partition pruning to reduce the amount of data processed.
In the latest OceanBase Database V4.3.5 version, the Nested Materialized View (Nested MV) feature was introduced. This feature allows you to easily build new materialized views based on existing ones, which is particularly useful in data warehouse ETL processes. Nested materialized views can generate intermediate results during data transformation and loading, which can then be used as input for subsequent processing steps, further optimizing the efficiency of the entire data processing workflow.
For more information about materialized views, see Overview of materialized views (MySQL mode) and Overview of materialized views (Oracle mode).
Key features of smart materialized views:
- Declarative SQL definition: Smart materialized views allow users to declaratively define data processing results without manually managing transformation steps. Users simply write standard SQL queries to specify data transformation logic, and OceanBase Database handles the execution and maintenance of these queries.
- Automatic refresh mechanism: OceanBase Database automatically handles the cascading relationships of data refreshes, including scheduling and execution, based on the user-specified target data freshness requirements. Users only need to define the desired level of data freshness, such as a 30-second or 5-minute target lag, and OceanBase Database will automatically ensure the data meets this requirement.
- Optimized incremental processing: Smart materialized views use automatic incremental view maintenance technology, calculating only the changes since the last refresh instead of performing a full refresh. This significantly improves performance and reduces computational costs.
- Dependency management: Smart materialized views automatically track data dependencies, ensuring that refreshes occur in the correct order when base data changes. This eliminates the need to manually manage complex data pipeline dependencies.
Multimodal data types
OceanBase Database is a distributed, multimodal, and integrated database. In real-time analytics scenarios, it can process structured, semi-structured, and unstructured data simultaneously within a single database using the SQL engine.
The storage foundation for multimodal data is LOB (Large Object). LOB plays a crucial role in AI storage and data processing. In AI scenarios, multimodal data such as images, text, audio, and video are all large objects. The performance of LOB significantly impacts the computational efficiency of AI pre-training and other scenarios. OceanBase Database provides an efficient LOB implementation at the storage layer. SQL supports 512 MB of storage, and the DBMS.Lob package enables efficient access to large objects up to the TB level.
In data analysis scenarios, OceanBase Database natively supports multimodal data types such as Array, RoaringBitmap, and Map. It can efficiently store and query structured data and directly support semi-structured data and complex aggregations.
- RoaringBitmap/Map: Suitable for advanced data mining scenarios such as large-scale tag analysis, audience targeting, and deduplication aggregation.
- Array: Provides great flexibility for complex business scenarios such as log retrieval, behavior tracking, and multi-dimensional tagging, truly enabling a single database to handle diverse workloads.
For detailed information about the efficient compressed bitmap data type (RoaringBitmap), see Efficient compressed bitmap data type (RoaringBitmap).
For detailed information about the array data type, see Array data type.
JSON, as the most widely used semi-structured data type, has extensive applications in transaction processing, analytics, and even AI scenarios.
- In transaction processing scenarios, JSON serves as a schemaless, elastic columnar storage and computation engine.
- In analytics scenarios, JSON's multi-value indexing supports flexible calculations for multi-dimensional tags.
- In AI scenarios, JSON acts as a bridge between models and applications, and it also serves as a standard input/output format in data processing pipelines such as agents and workflows.
To support JSON across various workloads, OceanBase Database provides rich computational expressions and JSON-based multi-value indexing. At the storage layer, it supports JSON Binary, optimizes random read/write operations for JSON data, and enables structured encoding for JSON. By extracting structured information from similar JSON data, it significantly improves the storage compression ratio and query performance based on JSON paths. For detailed information about the JSON data type, see JSON data type.
In the era of generative AI, the processing of multimodal data has become increasingly important. OceanBase Database enhances its vector and full-text search capabilities to better support hybrid search requirements in knowledge retrieval scenarios. Based on a fully self-developed approach combining vector algorithms and databases, OceanBase Database's vector capabilities outperform those of open-source vector databases. In VectorDBbench, OceanBase Database outperforms typical open-source competitors.
Special indexes
In traditional relational databases, indexes such as B-trees and hash indexes are primarily used for exact value queries on structured data (such as numbers and dates). However, in practical business scenarios, with the increasing prevalence of semi-structured data (such as JSON documents), unstructured text (such as logs and long texts), and multidimensional analysis, OceanBase Database provides two types of special indexes:
- Full-text indexes: Implemented based on inverted indexes, these indexes use tokenization technology to build keyword mappings for text content. They are suitable for log analysis, document retrieval, and other scenarios.
- Multi-value indexes: These indexes are built at the element level for JSON array fields. They expand arrays into virtual row records and construct B-tree indexes, significantly improving the query efficiency for set data.
By designing differentiated data structures, these indexes provide targeted optimizations for different data types (text/JSON) and query modes (fuzzy matching), collectively forming an acceleration layer for complex data queries.
Full-text indexes
In relational databases, indexes are typically used to accelerate precise value matching queries. However, traditional B-tree indexes often fail to meet performance requirements when handling large volumes of text data and fuzzy searches. In such cases, performing a full table scan to find matching data row by row can lead to performance bottlenecks, especially when dealing with large amounts of text and data. Additionally, complex query requirements such as approximate matching and relevance sorting are difficult to achieve through simple SQL rewriting.
To address these issues, OceanBase Database currently supports full-text indexes compatible with MySQL. Full-text indexes preprocess text content to build keyword indexes, significantly improving the efficiency of full-text searches.
Full-text indexes are used to quickly search for text data. Their main features include:
- Full-text search: By building full-text indexes, you can comprehensively index entire documents or large segments of text, enabling more flexible and efficient searches.
- Quick lookup: Users can quickly find matching text in the database based on input keywords, greatly reducing search time.
- Efficient handling of large volumes of text: Full-text indexes can effectively handle various types of text data, including articles, reports, web pages, and emails, providing users with precise and fast search experiences.
- Support for complex queries: In addition to basic keyword searches, full-text indexes also support complex query requirements such as approximate matching and relevance sorting, greatly enriching the database's search capabilities.
By introducing full-text indexes, OceanBase Database can significantly improve query performance when dealing with large-scale text data and complex retrieval requirements, allowing users to efficiently obtain the information they need.
For more information about full-text indexes, see Full-text indexes (MySQL mode).
Multi-value indexes
Multi-value indexes are a special index feature in OceanBase Database's MySQL mode, primarily used for handling JSON documents and set data types. They are suitable for scenarios where multiple values or multiple attributes need to be queried. Their main features are as follows:
- You can create indexes on arrays or sets.
- Currently, they are applicable to JSON documents.
- They can improve the query efficiency of searches based on JSON array elements.
Multi-value indexes are particularly useful in AP (Analytics Processing) scenarios. AP scenarios typically involve complex data analysis and report generation, and multi-value indexes can accelerate these operations. For example, in a data warehouse, multi-value indexes can be used to accelerate multidimensional data analysis and improve the efficiency of report generation. Specific applicable scenarios are as follows:
- Many-to-many association queries: Multi-value indexes can optimize queries involving many-to-many relationships between entities. For example, in the relationship between actors and movies, you can store all actors of a movie in a JSON array and use a JSON multi-value index to quickly query all movies an actor has participated in.
- Tag and category queries: When entities have multiple tags or categories, multi-value indexes can accelerate relevant queries. For example, multiple tags of a product can be stored in a JSON array, and a JSON multi-value index can be used to quickly find products with specific tags.
For more information about multi-value indexes, see Multi-value indexes.
MySQL compatibility
OceanBase Database is highly compatible with the MySQL ecosystem, allowing for seamless migration of MySQL applications and leveraging existing OLAP tools and technologies. This compatibility ensures that businesses can quickly adapt and innovate in data analysis and business insights.
Syntax compatibility: OceanBase Database fully supports MySQL's SQL standard syntax, including Data Definition Language (DDL), Data Manipulation Language (DML), and Data Control Language (DCL). This means that most MySQL queries, table definitions, index creation, and permission management statements can run in OceanBase Database with minimal syntax adjustments, significantly reducing migration costs and the learning curve.
- Seamless migration: Existing MySQL applications can be quickly migrated to OceanBase Database, minimizing code changes during the process.
- Skill reuse: MySQL developers and DBAs do not need to learn new database syntax, shortening the adaptation period.
- Ecosystem integration: The compatibility with MySQL syntax allows OceanBase Database to integrate smoothly with existing BI, ETL, and data visualization tools.
View compatibility: OceanBase Database is compatible with MySQL's information_schema views, such as:
- Table information queries: Supports views like TABLES and COLUMNS, allowing users to query the structure and column information of all tables in the database. This is crucial for data dictionary management and integration with third-party tools.
- Permission management: Supports views like SCHEMATA and SCHEMA_PRIVILEGES, helping administrators easily view and manage database and table permissions.
Many database management, monitoring, and analysis tools rely on INFORMATION_SCHEMA to retrieve database status and schema information. OceanBase Database's compatibility with this schema allows these tools to run directly on OceanBase Database without customization. OceanBase Database supports various OLAP tools, such as:
OceanBase Database supports a full stack of data integration from batch to stream processing:
- Seamlessly integrates with mainstream tools like Flink, Kafka, OMS, OBLOADER, DataWorks, and dbt, making it easy to connect to existing ETL, real-time synchronization, and data warehouse environments.
- The database kernel supports features like direct load, external tables for files/ODPS/HDFS, and External Catalog, significantly lowering the development and operational barriers for multi-source data integration and data lake linkage.
- Highly compatible with MySQL and Oracle syntax, allowing mainstream BI, analytics, and stream processing tools (such as Tableau, PowerBI, Flink, and Quick BI) to be migrated with zero code changes, achieving "out-of-the-box" functionality.
OceanBase Database not only innovates in its core engine but also integrates with many ecosystem products:
- Automated orchestration: Deeply integrated with scheduling platforms like DolphinScheduler, n8n, and Airflow, enabling automatic orchestration and maintenance of multi-stage data processing tasks, enhancing large-scale data governance and multi-system collaboration efficiency.
- Observability: Seamlessly integrated with monitoring tools like Prometheus and Grafana, supporting real-time performance monitoring, alerts, and intelligent operations for multi-tenant, node, and cluster environments, facilitating stable operation of large-scale clusters and multi-business lines.
- Data visualization empowerment: Fully compatible with mainstream BI platforms like Superset, Tableau, QuickBI, and Guan Yuan BI, supporting self-service analysis for multiple roles, interactive dashboards, and complex business reports, driving business insights and data-driven decisions.
For more information about OceanBase Database's current OLAP ecosystem integration, see Ecosystem integration.
