Columnar storage engine
In scenarios involving large-scale data complex analysis or ad-hoc queries on massive data, columnar storage is a key capability of an AP database. Columnar storage is a data organization method that differs from row-based storage by arranging table data by columns. When data is stored in a columnar format, analysis scenarios can scan only the relevant columns for query computations, avoiding full-row scans. This reduces resource usage such as I/O and memory, thereby improving computational speed. Additionally, columnar storage inherently offers better conditions for data compression, making it easier to achieve high compression ratios, thus reducing storage space and network transmission bandwidth.
OceanBase Database is built on the LSM-Tree (Log Structured Merge-Tree) data storage structure. By continuously optimizing its high-concurrency transaction processing (TP) capabilities, it has enhanced performance in scenarios such as random writes, real-time updates, and strong consistency. This has accumulated extensive engineering experience and established a self-controlled storage engine technology system. At the same time, the hierarchical merge feature and static data organization capability of LSM-Tree make it naturally suitable for OLAP scenarios characterized by batch writes and low-frequency updates. Through columnar data compression, hierarchical merge strategies, and storage fragmentation optimization, it meets the efficient scanning needs of analytical workloads while supporting unified architecture for TP and AP mixed workloads.
In V4.3, based on its existing technology, OceanBase Database continues to expand its storage engine to support columnar storage, achieving storage integration. This means that a single codebase, architecture, and OBServer can handle both columnar and row-based data. This approach truly balances the performance of TP and AP queries.
Columnar Engine Architecture
OceanBase Database, as a native distributed database, stores user data with multiple replicas by default. To leverage the advantages of multiple replicas, providing data integrity checks and enabling data migration reuse are key enhancements. Our self-developed LSM-Tree storage engine also includes extensive targeted design. First, user data can be divided into two parts: baseline data and incremental data.
Baseline Data
Global consistency version control: OceanBase Database, leveraging its distributed multi-replica architecture, has broken away from conventional LSM-tree design principles by implementing the 'merge daily' mechanism. It triggers the global version number selection regularly or on demand. Tenants then choose a global version number either periodically or based on user operations. All replicas perform major compactions based on this version to generate baseline data that is physically consistent across all replicas of the same version.
Polytype storage: Baseline data supports row, column, and mixed row-column storage. Users can flexibly select the appropriate storage format through table creation, meeting the storage needs of various business scenarios.
Incremental data
- Dynamic multi-version management: Any writes after the latest baseline version (including newly written data in the memory MemTable and data that has been flushed to disk SSTables) are categorized as incremental data. Each replica maintains its own multi-version records independently, without maintaining consistency among them. Unlike baseline data, which is generated based on a specified version, incremental data contains all multi-version data.
- RowStore Preference Policy: Incremental data must be stored using the rowstore mode. This ensures full compatibility between the TP (Transaction Processing) link and the native rowstore architecture, allowing for shared use of core components such as transaction logs and locking mechanisms.

Hybrid row-columnar storage architecture
Based on its characteristics, OceanBase Database, which has baseline and incremental data features, proposes a columnar storage implementation that is transparent to the upper layers.
- Decoupling of storage formats: The baseline data is organized in a columnar storage format (each column is stored as an independent SSTable, logically combined into a table), while incremental data remains row-oriented. DML operations and synchronization with upstream and downstream systems are fully transparent.
- Dynamic Synchronization Engine: Builds bidirectional data synchronization pipelines at the row and column levels, enabling smooth transitions for OLAP system migrations and row-store upgrades, without requiring applications to be aware of storage format differences.
- Intelligent routing mechanism: The system automatically selects the optimal row/column access paths based on the workload characteristics from the optimizer to the executor. This leverages the performance advantages of columnar storage in analytical processing (AP) scenarios while retaining native support for transactional processing (TP) in row-based storage.
Key Integrated Capabilities
Capability dimensions |
Key technical realizations |
|---|---|
| Unified SQL engine |
|
| Storage Integration |
|
| Transation Inlining |
|

Columnar replicas
To meet the physical resource isolation requirements in mixed workloads for hybrid TP/OLAP scenarios, OceanBase Database provides support for Column Store Replicas (C-replicas). A C-replica is a new replica type with read-only characteristics. It stores baseline data of user tables only in columnar format. A C-replica is deployed in an independent zone. ODP provides access to a C-replica and supports weak-consistency reads. This way, you can leverage the columnar batch processing advantages of a C-replica to accelerate queries. However, this does not affect existing OLTP services. In an 2F1A configuration, a 2F1A1C architecture not only meets the physical isolation requirements for TP and AP services but also reduces storage overhead by one replica compared with the 2F1A configuration where all tables are stored in row-column format.
Columnar read-only replicas follow the same rules for replica distribution and weak read release mechanism as regular read-only replicas. The main difference lies in the storage structure of the basic data. Like regular read-only replicas, columnar replicas do not participate in leader election or log sync voting. They, however, fully include core components such as static data tables, commit logs, and in-memory data tables.
For more information about columnstore replicas, see Columnstore replica.
Major features of columnar storage
Feature 1: Adaptive Compaction
With the introduction of the new columnar storage mode, the behavior of data major compactions has significantly changed compared to rowstore data. Since all incremental data is stored in rowstore format, major compactions need to combine this data with baseline data and then split it into independent SSTables for each column. This results in a notable increase in major compaction time and resource usage compared to rowstore data. To accelerate major compactions for columnar tables, the storage layer has adapted and optimized the compaction process. In addition to supporting horizontal splitting and parallel major compactions like rowstore tables, columnar tables also benefit from vertical splitting for accelerated major compactions. Multiple column merges can be combined into a single major compaction task, and the number of columns within a task can be dynamically adjusted based on available system resources. This ensures an optimal balance between major compaction speed and memory overhead.
Feature 2: Columnar encoding algorithm
In OceanBase Database, data is stored after being compressed in two stages. In the first stage, the data is compressed by using the hybrid row-column encoding algorithm, which is developed by OceanBase. In the second stage, the data is compressed using a general compression algorithm. The hybrid row-column encoding algorithm is an algorithm in the database, so it can support direct query without decompression. It also supports query filtering and acceleration based on the encoding information. However, the original hybrid row-column encoding algorithm is still row-oriented. Therefore, for columnar storage, OceanBase Database has independently developed a new columnar encoding algorithm. Compared with the original hybrid row-column encoding algorithm, the new columnar encoding algorithm can support vectorization in query execution and can implement SIMD optimization across different instruction sets. It can significantly improve the compression ratio for data of numeric types and comprehensively improve the performance and compression ratio of the original algorithm.
Feature 3: Skip Index
In common columnar databases, for each column's data, a certain level of granularity is used for pre-aggregation calculations. The results of these aggregations are stored together with the data. When a user requests to query the data, the database can filter out data using the pre-aggregated results, significantly reducing data access overhead and unnecessary I/O consumption. In the columnar engine, we've also added support for skip indexes. This means that for each column's data, aggregations such as max, min, and null count are performed at the microblock level. These aggregations are then aggregated and accumulated to obtain macroblock and SSTable-level aggregation values. During user queries, data can be filtered and aggregated based on the appropriate granularity of these aggregated values within the scanning range.
Feature 4: Query pushdown
Since OceanBase Database V4.x, the storage layer has fully adapted to vectorized execution and supports query pushdown in some scenarios. For the columnar storage engine, the query pushdown feature has been further enhanced and extended, specifically:
All queries will be filtered downward, and skipping indexes and encoding information will be further utilized to accelerate the process.
Aggregation function pushdown, for non-group-by scenarios, the count, max, min, sum, and avg aggregation functions can be pushed down to the storage engine.
support pushing down the group by operation to storage computing on the column with fewer distinct values. This uses dictionary information within microblocks for significant acceleration.
For more information about column stores, see Columnstore engine.
Shared storage mode
OceanBase Database supports two deployment modes: Shared-Nothing (SN) and Shared-Storage (SS) mode.
Shared-Storage (SS) mode is an architecture where data is centrally stored on shared storage devices, allowing multiple database nodes to access the same data. This mode is commonly used in public clouds. It simplifies storage management, improves resource utilization, and supports flexible high availability switching in specific scenarios.
Reduced storage redundancy: In a traditional Shared-Nothing architecture, each node holds an independent copy of the data, leading to potential storage resource waste. In contrast, shared storage allows multiple compute nodes to access the same data, reducing the number of replicas and lowering overall storage costs.
Fast failover: When a compute node fails, other nodes can directly access the data from the shared storage without relying on complex replica synchronization or migration processes, enabling faster service takeover.
Simplified operations and maintenance: A unified storage pool facilitates backup, snapshot, and monitoring operations, enhancing the maintainability of the database system.
The OceanBase Database's compute-storage separation architecture with two replicas can achieve the following:
- Cost-effective: Full data is stored in object storage, while hot data is cached in local disks, reducing storage costs while maintaining P99 query performance. It offers elastic scaling capabilities, allowing independent scaling of compute and storage resources. Performance loss compared to the Shared-Nothing architecture is typically between 0.3% and 1.7%.
- High availability: The 2F dual-replica deployment ensures high availability for compute nodes. Combined with a Paxos-based independent log storage service, the system guarantees RPO=0 and RTO<8s.
The OceanBase Database's compute-storage separation architecture with a single replica can achieve the following:
High cost-effectiveness: Full data is stored in low-cost object storage, while hot data is cached in local disks. The single-replica architecture reduces compute costs by 2-3 times compared to 2F1A and 3F cluster architectures. It also supports quick addition or removal of compute nodes within a single zone to meet changing performance needs.
High availability: With redundant object storage within the same region, it supports data center-level disaster recovery. In the single-replica architecture, combined with a Paxos-based independent log service and shared storage, the system can quickly launch new compute nodes when a node fails, ensuring high availability.
Vectorized execution engine
Vectorized execution is an efficient technique for batch processing data, and it significantly improves the execution performance of analytical queries. OceanBase Database introduced the vectorized execution engine in V3.2, but it is disabled by default. Starting from OceanBase Database V4.0, the vectorized execution engine is enabled by default. In OceanBase Database V4.3, the vectorized engine 2.0 was implemented, which greatly improved the execution performance of the vectorized engine through optimizations in data formats, operator implementations, and storage vectorization.

Data formats
In vectorized engine 2.0, a new columnar data format was introduced. In this format, data description information (null, len, ptr) is stored separately for each column, avoiding redundant storage. Three data formats were designed for different data types and use cases: fixed-length format, variable-length discrete format, and variable-length continuous format.
- Fixed-length format: The length value is stored only once, and no redundant storage is needed. Data locality is better. Compared with version 1.0, this format saves space, improves efficiency, and eliminates the need for pointer swizzling.
- Variable-length discrete format: Each data item may be discontinuous in memory. Each data item is described by an address pointer and a length. This format avoids deep copying during encoding and is suitable for short-circuit computation scenarios, avoiding data reorganization.
- Variable-length continuous format: Data is stored continuously in memory. The length information and offset address are described by an offset array. This format improves data access efficiency, but it requires reorganization and deep copying during short-circuit computation and columnar encoding projection. This format is mainly used for columnar materialization.
Operator and expression performance optimization
Vectorized engine 2.0 comprehensively optimizes operators and expressions. The core idea is to utilize the information of the new format and specialized data structures to reduce CPU cache misses and instruction overhead, thereby improving overall execution performance. The main optimizations include:
- Utilizing batch data attribute information: Maintaining batch data feature information eliminates the need for special handling and filtering of NULL values, optimizing SIMD calculations.
- Algorithm and data structure optimization: Optimizing intermediate result materialization structures supports row/column materialization. The Sort operator separates sort keys from non-sort keys, reducing cache misses during sorting and improving overall efficiency.
- Specialized implementation optimization: Optimizing for specific scenarios, such as encoding multiple fixed-length join keys into a single fixed-length column and specialized implementations for aggregation calculations, significantly improves execution efficiency.
Storage vectorization optimization
The storage layer fully supports the new vectorized formats. SIMD acceleration is used for projection, predicate pushing down, aggregation pushing down, and groupby pushing down. Projection templates are customized based on column types and lengths, simplifying calculations. Predicate calculations are performed directly on column-encoded data, enabling fast computation of complex expressions. Aggregation pushing down utilizes pre-aggregated information from intermediate layers for efficient handling of statistical functions. Groupby pushing down leverages encoded data information to significantly accelerate performance.
Real-time write
OceanBase Database adopts the log-structured merge-tree (LSM-tree) architecture, which ensures the real-time write capability of the database. This topic describes the real-time write capability of OceanBase Database.
Core storage mechanism
LSM-tree
To support efficient real-time writes, OceanBase Database stores incremental data in an LSM-tree. An LSM-tree is a special tree structure designed to optimize write operations. The core idea is to record write operations in a structure in memory and asynchronously batch write them to disk when a certain amount is reached. This significantly reduces the number of disk I/O operations and improves write performance. OceanBase Database periodically merges incremental data with baseline data through dumping and compaction processes to ensure data consistency and integrity.
Real-time write capability
The LSM-tree architecture of OceanBase Database ensures excellent performance in handling real-time data writes. Whether it's small-scale data updates or large-scale data imports, OceanBase Database can quickly respond and ensure real-time writes. Its main features are as follows:
- Efficient write processing: OceanBase Database processes write operations in an LSM-tree, reducing disk operations and improving write efficiency.
- Immediate query availability: Data written to the LSM-tree's in-memory structure is immediately available for queries, ensuring real-time data availability.
- Optimized data compaction: Intelligent dumping and compaction strategies support efficient queries.
- Strong concurrency handling: Leveraging a distributed architecture, OceanBase Database can parallelize write operations across multiple nodes, significantly enhancing real-time data processing capabilities.

Optimizer
The query optimizer of OceanBase Database is designed for HTAP hybrid workloads and real-time analytics. Transactional workloads typically have the following characteristics: small data access per query, high response time (RT) requirements, and high throughput. Analytical workloads typically have the following characteristics: large data access per query and relatively low throughput. The execution plans and optimization methods required for different types of workloads also vary significantly.
- For transactional workloads, it is usually necessary to create appropriate index structures for filtering predicates and join predicates on base tables. Each table should select the most suitable index path to significantly reduce data scanning.
- For analytical workloads, it is usually necessary to perform full-table scans on columnar storage and use skipindex to quickly skip the scanning of some data blocks. The query response time can be reduced by increasing the parallelism.
The OceanBase AP optimizer is based on the TP optimizer and has been enhanced for various complex query scenarios.
More comprehensive query rewriting capabilities: The query rewriting module supports a wide range of rewriting algorithms. Different rewriting algorithms are matched to different scenarios and perform equivalent transformations to improve the business SQL queries. Currently, the following rule-based rewriting algorithms are supported: view merging, subquery lifting, inner join elimination, outer join elimination, and elimination of always-true or always-false conditions. Additionally, the following cost-based rewriting algorithms are supported: OR expansion, JA subquery lifting, Win Magic, and Group-By Placement.
Distributed plan generation: OceanBase Database uses a one-phase approach for distributed plan generation, which is different from the two-phase approach commonly used in other database systems. In the two-phase approach, an optimal single-machine plan is first generated and then distributed. However, this approach faces more challenges in HTAP scenarios. For example, in a single-machine scenario, the optimal NESTED LOOP JOIN algorithm may result in a significant amount of cross-node data access after distribution, leading to a much lower overall execution efficiency compared to the distributed HASH JOIN algorithm. The one-phase plan generation framework of OceanBase Database considers factors such as data distribution characteristics and parallelization during the enumeration of join orders and join algorithms. It comprehensively evaluates various factors to select the overall optimal distributed execution plan.
Row and column path selection: To effectively handle HTAP hybrid workloads, the query optimizer employs an intelligent optimization capability for automatic row and column path selection. This means that for a given query, the optimizer intelligently decides whether to read data from the row store or the column store based on the data access characteristics to achieve optimal execution performance. OceanBase Database has designed a specific cost model for columnar scans and introduced a new statistical information mechanism to accurately evaluate the benefits of columnar SkipIndex on columnar scans. This ensures accurate decision-making between row store indexes and columnar scans.
Easier-to-use Auto DOP: Most databases use parallel execution to accelerate complex SQL queries. However, in real-world scenarios, it is challenging to easily evaluate whether to enable parallelism and the appropriate degree of parallelism. OceanBase Database provides the Auto DOP feature, which allows the optimizer to automatically determine whether to enable parallelism and the appropriate degree of parallelism based on the estimated execution time of the query. This ensures that SQL queries can achieve optimal performance by default.
Execution plan management: The SQL Plan Management (SPM) technology in OceanBase Database is crucial for ensuring the stable and continuous operation of real-time analytical workloads. It intelligently manages the evolution of execution plans to address scenarios where the optimizer may select a suboptimal new execution plan due to significant changes in data volume, updates to statistical information, or database version upgrades. By using real traffic for gray-scale validation, the SPM technology prevents execution plan regression.
For more information, see Overview of statistics and cost-based optimization.
Materialized views
The core value of smart materialized views lies in the fact that users only need to define the data processing results through SQL, and OceanBase Database automatically manages data refreshes and computational dependencies. This means that enterprises do not need to write complex ETL scripts or manage data pipelines. They only need to define the target data structure, and OceanBase Database will automatically handle data refreshes, dependency management, and performance optimization. This approach greatly simplifies the data engineering process, reduces operational costs, and ensures data freshness and consistency.
OceanBase Database supports asynchronous materialized views. This means that when the data in the base table changes, the materialized view is not immediately updated. This ensures the execution performance of DML operations on the base table. However, the data in the materialized view may be delayed relative to the base table and needs to be refreshed and updated in a timely manner. The update of materialized view data supports two strategies: full refresh and incremental refresh.
- A full refresh directly re-executes the query statements corresponding to the materialized view, fully calculates, and overwrites the original view result data. This strategy is suitable for scenarios with low latency requirements, low update frequency, or small data volumes, such as daily or weekly data summary reports.
- An incremental refresh only processes the data changes since the last refresh. This method significantly reduces the time and resources required for refreshes. To achieve accurate incremental refreshes, OceanBase Database implements a materialized view log feature similar to Oracle's MLOG (Materialized View Log). By meticulously recording incremental updates to the base table, OceanBase Database ensures that materialized views can quickly perform incremental refreshes. This incremental refresh strategy is particularly suitable for scenarios with high latency requirements, large data volumes, and frequent changes, such as in real-time transaction systems where data may change every minute or even every second.
OceanBase Database also supports real-time materialized views, enabling real-time data analysis. Real-time materialized views use the materialized view log mechanism to capture and process changes in the base table data. During queries, they perform online calculations and integrate data changes. Even if the materialized view does not physically store the latest changes, it ensures that users receive query results consistent with those obtained by directly querying the base table. Additionally, OceanBase Database leverages the query rewriting capabilities of materialized views to achieve transparent query acceleration.
OceanBase Database allows users to specify a primary key or create an index for a materialized view, optimizing the performance of single-row lookups, range queries, and join scenarios based on the primary key or index. If a materialized view is a large wide table formed by joining multiple tables, creating a columnar storage materialized view can enhance the performance of certain queries. Furthermore, by creating partitioned materialized views, users can leverage partition pruning to reduce the amount of data processed.
In the latest OceanBase Database V4.3.5 release, the Nested Materialized View (Nested MV) feature was introduced. This allows users to easily build new materialized views based on existing ones, which is particularly useful in data warehouse ETL processes. Nested materialized views can generate intermediate results during the data transformation and loading stages, which can then be used as inputs for subsequent processing steps, further optimizing the overall data processing efficiency.
For more information about materialized views, including detailed descriptions and usage guidance, see Overview of materialized views (MySQL mode) and Overview of materialized views (Oracle mode).
Key features of smart materialized views:
- Declarative SQL Definition: Smart materialized views allow users to declaratively define data processing results without manually managing transformation steps. Users simply write standard SQL queries to specify data transformation logic, and OceanBase Database handles the execution and maintenance of these queries.
- Automatic Refresh Mechanism: OceanBase Database automatically handles the cascading relationships of data refreshes based on the user-specified target data freshness requirements, including scheduling and execution. Users only need to define the desired level of data freshness, such as a 30-second or 5-minute target lag, and OceanBase Database will ensure the data meets this requirement.
- Incremental Processing Optimization: Smart materialized views use automatic incremental view maintenance technology, calculating only the changes since the last refresh instead of performing a full refresh. This significantly improves performance and reduces computational costs.
- Dependency Management: Smart materialized views automatically track data dependencies, ensuring that refreshes occur in the correct order when base data changes. This eliminates the need to manually manage complex data pipeline dependencies.
Multimodal data types
OceanBase Database is a distributed, multimodal database that can process structured, semi-structured, and unstructured data in a single database using the SQL engine in real-time analytics scenarios.
The Large Object (LOB) serves as the storage foundation for multimodal data. LOB plays a crucial role in AI scenarios, where multimodal data such as images, text, audio, and video are all large objects. The performance of LOB significantly impacts the computational efficiency in AI pre-training and other scenarios. OceanBase Database provides a highly efficient LOB implementation at the storage layer, supporting up to 512 MB of storage under SQL. Using the DBMS.Lob package, it can efficiently handle large objects at the TB level.
In data analysis scenarios, OceanBase Database natively supports multimodal data types such as Array, RoaringBitmap, and Map. These data types enable efficient storage and querying of structured data, as well as direct support for semi-structured data and complex aggregation analysis.
- RoaringBitmap/Map: Ideal for advanced data mining scenarios such as large-scale tag analysis, audience targeting, and deduplication aggregation.
- Array: Offers great flexibility for complex business needs such as log retrieval, behavior tracking, and multi-dimensional tagging, truly achieving one database for multiple capabilities.
For detailed information on the RoaringBitmap data type, see RoaringBitmap data type.
For detailed information on the Array data type, see Array data type.
JSON, the most widely used semi-structured data type, is extensively applied in transaction, analytics, and even AI scenarios.
- In transaction scenarios, JSON serves as a schemaless, elastic columnar storage and computing solution.
- In analytics scenarios, JSON's multi-value indexing is ideal for flexible multi-dimensional tag calculations.
- In AI scenarios, JSON acts as a bridge between models and applications, and it also serves as a standard input/output format in data processing pipelines such as agents and workflows.
To support JSON across various workloads, OceanBase Database provides rich computational expressions and JSON-based multi-value indexing. At the storage layer, it supports JSON Binary, optimizes random read/write operations for JSON data, and enables structured encoding for JSON. By extracting structured information from similar JSON data, it significantly improves the storage compression ratio and query performance based on JSON paths. For detailed information on the JSON data type, see JSON data type.
In the era of generative AI, the processing of multimodal data is becoming increasingly important. OceanBase Database enhances vector and full-text search capabilities to better support hybrid search requirements in knowledge retrieval scenarios. Based on a fully self-developed approach combining vector algorithms and databases, OceanBase's vector capabilities outperform open-source vector databases in VectorDBbench performance.
Special indexes
In traditional relational databases, indexes such as B-Trees and Hash indexes are primarily used for exact value queries on structured data (such as numbers and dates). However, in real-world scenarios, with the increasing prevalence of semi-structured data (such as JSON documents), unstructured text (such as logs and long texts), and multidimensional analysis, OceanBase Database provides two types of special indexes:
- Full-text indexes: Implemented based on inverted indexes, full-text indexes use tokenization technology to establish keyword mappings for text content. These are suitable for log analysis and document retrieval.
- Multivalued indexes: These indexes are created for JSON array fields at the element level. The arrays are expanded into virtual row records, and B-Tree indexes are built, significantly improving the query efficiency for set data.
These two types of indexes, through their differentiated data structure designs, provide targeted optimizations for different data types (text/JSON) and query patterns (fuzzy matching), collectively forming an acceleration layer for complex data queries.
Full-text indexes
In relational databases, indexes are typically used to accelerate exact value matching queries. However, traditional B-Tree indexes often fail to meet performance requirements when handling large volumes of text data and fuzzy searches. In such cases, performing a full table scan to find matching data row by row can lead to performance bottlenecks, especially when dealing with large amounts of text and data. Additionally, complex query needs such as approximate matching and relevance sorting are difficult to achieve through simple SQL rewriting.
To address these challenges, OceanBase Database now supports full-text indexes compatible with MySQL. Full-text indexes preprocess text content and build keyword indexes, significantly improving the efficiency of full-text searches.
Full-text indexes are used to quickly search for text data. Their main features include:
- Full-text search: By building full-text indexes, the entire document or large segments of text can be indexed, enabling more flexible and efficient searches.
- Quick lookup: Users can quickly find matching text in the database based on input keywords, greatly reducing search time.
- Efficient handling of large volumes of text: Full-text indexes can effectively handle various types of text data, including articles, reports, web pages, and emails, providing users with accurate and fast search experiences.
- Support for complex queries: In addition to basic keyword searches, full-text indexes also support complex query needs such as approximate matching and relevance sorting, greatly enriching the database's search capabilities.
By introducing full-text indexes, OceanBase Database can significantly improve query performance when dealing with large-scale text data and complex retrieval requirements, allowing users to efficiently obtain the information they need.
For more information about full-text indexes, see Full-text indexes (MySQL mode).
Multivalued indexes
Multivalued indexes are a special index feature in OceanBase Database's MySQL mode, primarily used for handling JSON documents and set data types. They are suitable for scenarios where multiple values or multiple attributes need to be queried. Their main features are as follows:
- Indexes can be created on arrays or sets.
- Currently, they are applicable to JSON documents.
- They can improve the query efficiency for searches based on JSON array elements.
Multivalued indexes are particularly useful in AP (Analytical Processing) scenarios. AP scenarios typically involve complex data analysis and report generation, and multivalued indexes can accelerate these operations. For example, in a data warehouse, multivalued indexes can be used to speed up multidimensional data analysis and improve report generation efficiency. Specific applicable scenarios include:
- Many-to-many association queries: Multivalued indexes can optimize queries for many-to-many relationships between entities. For example, in the relationship between actors and movies, all actors of a movie can be stored in a JSON array, and a JSON multivalued index can be used to quickly query all movies an actor has participated in.
- Tag and category queries: When entities have multiple tags or categories, multivalued indexes can accelerate relevant queries. For example, multiple tags of a product can be stored in a JSON array, and a JSON multivalued index can be used to quickly find products with specific tags.
For more information about multivalued indexes, see Multivalued indexes.
MySQL ecosystem compatibility
OceanBase Database is not only committed to delivering exceptional performance and scalability but also strives to ensure high compatibility with the MySQL ecosystem. This compatibility allows for seamless migration of MySQL applications to OceanBase Database and enables the effective utilization of existing OLAP tools and technologies, facilitating rapid iteration and innovation in data analysis and business insights.
Syntax compatibility: OceanBase Database fully supports the SQL standard syntax of MySQL, including, but not limited to, Data Definition Language (DDL), Data Manipulation Language (DML), and Data Control Language (DCL). This means that SQL statements you previously used in MySQL, such as data queries, table structure definitions, index creation, and permission management, can almost be directly executed in OceanBase Database without significant syntax adjustments, significantly reducing migration costs and the learning curve.
- Seamless migration: Existing MySQL applications can be quickly migrated to OceanBase Database, minimizing the need for code modifications during the migration process.
- Skill reuse: MySQL developers and DBAs do not need to learn new database syntax, shortening the adaptation period.
- Ecosystem integration: The compatibility with MySQL ecosystem syntax allows OceanBase Database to better integrate with existing BI, ETL, and data visualization toolchains.
View compatibility: OceanBase Database is compatible with MySQL's information_schema views, such as:
- Table information queries: Supports views like TABLES and COLUMNS, allowing users to query the structure and column information of all tables in the database. This is crucial for data dictionary management and integration with third-party tools.
- Permission management: Supports views like SCHEMATA and SCHEMA_PRIVILEGES, helping administrators easily view and manage database and table permissions.
Many database management, monitoring, and analysis tools rely on INFORMATION_SCHEMA to retrieve database status and schema information. The compatibility of OceanBase Database with INFORMATION_SCHEMA allows these tools to run directly on OceanBase Database without customization. OceanBase Database supports various OLAP ecosystem tools, such as:
OceanBase Database supports full-stack data integration from batch to streaming:
- Seamlessly integrates with mainstream tools like Flink, Kafka, OMS, OBLOADER, DataWorks, and dbt, making it easy to connect with existing ETL, real-time synchronization, and data warehouse environments in enterprises.
- The database kernel supports features like direct load, external tables for files/ODPS/HDFS, and External Catalog, significantly lowering the development and operational overhead for integrating multi-source data and connecting with data lakes.
- Highly compatible with MySQL and Oracle syntax, mainstream BI, analytics, and stream processing tools (such as Tableau, Power BI, Flink, and Quick BI) can be zero-code migrated and integrated, truly achieving "out-of-the-box" usability.
OceanBase Database innovates not only in its core engine but also in adapting to various ecosystem products:
- Automated orchestration and scheduling: Deeply integrated with scheduling platforms like DolphinScheduler, n8n, and Airflow, enabling automatic orchestration and maintenance of multi-step data processing tasks, enhancing large-scale data governance and multi-system collaboration efficiency.
- Observability: Seamlessly integrated with monitoring tools like Prometheus and Grafana, supporting real-time performance monitoring, alerts, and intelligent operations for multi-tenant, node, and cluster environments, facilitating stable operation of large-scale clusters and multi-business lines.
- Data visualization empowerment: Fully compatible with mainstream BI platforms such as Superset, Tableau, QuickBI, and Guan Yuan BI, supporting self-service analysis by multiple roles, interactive dashboards, and complex business reports, aiding in business insights and data-driven decision-making.
For more information about the current OLAP ecosystem integration of OceanBase Database, see Ecosystem integration.
