Core features of OceanBase AP|V4.4.2|OceanBase Database| docs|Distributed Database

The key capabilities of AP are summarized as follows:

Columnar storage engine

In scenarios involving large-scale data analytics or ad-hoc queries on massive datasets, columnar storage is a key capability of an AP database. Columnar storage is a method of organizing data files that differs from row-based storage by physically arranging table data by columns. When data is stored in a columnar format, analysis queries can scan only the columns required for computation, avoiding full-row scans. This reduces resource consumption such as I/O and memory usage while improving computation speed. Additionally, columnar storage inherently provides better conditions for data compression, often achieving higher compression ratios, which reduces storage space requirements and network bandwidth.

Based on the Log Structured Merge-Tree (LSM-Tree) data storage structure, OceanBase has continuously optimized its high-concurrency transaction processing (TP) capabilities, continuously improving performance in scenarios such as random writes, real-time updates, and strong consistency. Through extensive engineering practice experience, it has developed an independently controllable storage engine technology system. Meanwhile, the hierarchical compaction feature and static data organization capability of LSM-Tree make it naturally well-suited for OLAP scenarios characterized by batch writes and infrequent updates. By leveraging columnar data compression, hierarchical compaction strategies, and storage fragmentation optimization, it meets the efficient scanning requirements of analytical workloads while supporting unified architecture for mixed TP and AP workloads.

In V4.3, building upon its existing technical foundation, OceanBase further expanded its storage engine to support columnar storage, achieving storage integration. With one codebase, one architecture, and one OBServer, columnar and row data coexist, truly balancing TP and AP query performance.

Columnar storage engine architecture

As a native distributed database, OceanBase Database stores user data in multiple replicas by default. To leverage the advantages of multi-replica deployment and provide users with enhanced experiences such as strong data verification and reuse of migrated data, its self-developed LSM-Tree storage engine incorporates several targeted designs. First, the entire user data can be broadly divided into two parts: baseline data and incremental data.

Baseline data

Global consistent version control: Breaking away from the traditional LSM-Tree design paradigm, OceanBase Database leverages the foundation of its distributed multi-replica architecture to implement a "daily major compaction" mechanism. The system can periodically or on-demand trigger the selection of a global version number. A tenant regularly or based on user operations selects a global version number. All replicas complete a major compaction based on this version to generate the baseline data for that version, ensuring the baseline data of the same version is physically identical across all replicas.
Multi-state storage support: Baseline data supports three physical forms: row-based, columnar, and hybrid row-column storage. Users can flexibly choose through table creation configurations to meet storage requirements of different business scenarios.

Incremental data

Dynamic multi-version management: All writes after the latest baseline version (whether data freshly written to MemTable or data already flushed to SSTable) are classified as incremental data. Each replica independently maintains multi-version records without consistency. Unlike baseline data which is generated based on a specified version, incremental data contains all multi-version data.
Row-first strategy: Incremental data is forced to adopt the row-based mode, ensuring the transaction processing (TP) pipeline is fully compatible with the native row-based architecture, sharing core components such as transaction logs and locking mechanisms.

Hybrid row-column storage architecture

Leveraging the controllable update characteristics of columnar storage scenarios, OceanBase Database combines the properties of its baseline and incremental data to propose a columnar storage implementation method that is transparent to upper-layer applications:

Storage form decoupling: Baseline data is organized in a columnar format (each column as an independent SSTable, virtually combined into a logical table), while incremental data remains row-based. DML operations and synchronization with upstream and downstream are completely seamless.
Dynamic synchronization engine: A bidirectional synchronization pipeline for row and column data is built at the underlying layer, supporting smooth transition for OLAP system migration or row-based upgrade. Business operations do not need to be aware of storage format differences.
Intelligent routing mechanism: From the optimizer to the executor, the system automatically selects the optimal row/column access path based on workload characteristics, fully unleashing the performance advantages of columnar storage in AP scenarios while retaining the native support of row-based storage for TP transactions.

Key integrated capabilities

Capability Dimension	Key technologies
Unified SQL	Columnar storage-aware cost model and statistics system Vectorized engine supports adaptive batch processing (Adaptive Batch Size) Adaptive selection of execution engines for hybrid plans
Integrated storage	Dynamically configurable table-level storage modes (row-based, columnar, or hybrid) Transparent DDL support (online schema changes and secondary indexes)
Integrated transactions	Ensures ACID compliance for incremental row-based storage. Shared transaction logs and multi-version control in row-based storage.

Capability Dimension

Key technologies

Unified SQL

Columnar storage-aware cost model and statistics system

Vectorized engine supports adaptive batch processing (Adaptive Batch Size)

Adaptive selection of execution engines for hybrid plans

Integrated storage

Dynamically configurable table-level storage modes (row-based, columnar, or hybrid)

Transparent DDL support (online schema changes and secondary indexes)

Integrated transactions

Ensures ACID compliance for incremental row-based storage.

Shared transaction logs and multi-version control in row-based storage.

Columnar storage replica

To meet the requirement for strong physical isolation of TP and AP resources in HTAP mixed-workload scenarios, OceanBase Database supports Column Store Replica (C-Replica). A C-Replica is a new type of replica with read-only properties, and all baseline data of user tables on the replica is stored in columnar format. C-Replicas are deployed in independent zones. OLAP services access C-Replicas through a dedicated ODP endpoint and execute queries with weak-consistency reads, leveraging the advantages of batch processing in columnar storage to accelerate queries without impacting existing OLTP services. In the 2F1A deployment mode, compared to the all-row-column hybrid storage method, the 2F1A1C architecture not only achieves strong physical isolation of TP and AP but also saves the storage overhead of one replica.

Columnar replicas fully adhere to the rules of regular read-only replicas in terms of replica distribution strategy and weak-consistency read release mechanism. The main difference between them lies in the underlying data storage structure. Similar to regular read-only replicas, columnar storage replicas do not participate in leader election or log synchronization voting processes, but they still contain all core components such as static data tables, committed logs, and in-memory data tables.

For more information about columnar storage replicas, see Columnar storage replica. For deployment and independent ODP access steps of columnar storage replicas, see Deploy and use columnar storage replicas.

Main features of columnar storage

Feature 1: Adaptive Compaction

After introducing the new columnar storage mode, data compaction behavior has changed significantly compared to the original row-based storage. Since all incremental data is row-based, it needs to be merged with baseline data and then split into independent SSTables for each column. As a result, the compaction time and resource consumption are relatively higher than with row-based storage. To accelerate compaction for columnar tables, the storage layer has also adapted and optimized the Compaction process. For columnar tables, in addition to horizontal splitting and parallel compaction acceleration like row-based tables, vertical splitting acceleration has been added. Columnar tables combine multiple-column compactions into a single task, and the number of columns within a task can be independently adjusted based on system resources, ensuring a better balance between overall compaction speed and memory overhead.
Feature 2: Columnar Encoding Algorithm

OceanBase Database compresses data in two stages. The first stage is OceanBase's proprietary hybrid row-column encoding compression, and the second stage is general-purpose compression. Since the hybrid row-column encoding is a built-in database algorithm, it supports direct querying without decompression and can utilize encoding information to accelerate query filtering. However, the original hybrid row-column encoding algorithm still leaned towards row organization. Therefore, a new columnar encoding algorithm has been implemented for columnar tables. Compared to the original encoding algorithm, the new algorithm supports fully vectorized execution of queries, compatible SIMD optimizations for different instruction sets, and significantly improves compression ratio for numeric types, achieving a comprehensive upgrade in performance and compression ratio over the original algorithm.
Feature 3: Skip Index

Common columnar databases typically perform pre-aggregation calculations on each column of data at certain granularity. The aggregated results are persisted together with the data. When a user queries column data, the database can filter data using the pre-aggregated data, greatly reducing data access overhead and unnecessary I/O consumption. In the columnar engine, skip index support has also been added. For each column, aggregation calculations are performed at multiple dimensions such as maximum value, minimum value, and total nulls at the microblock granularity. These values are then aggregated upwards layer by layer to obtain larger-granularity aggregate values for macroblocks, SSTables, etc. User queries can continuously drill down based on the scan range to select appropriate granularity aggregate values for filtering and aggregated output.
Feature 4: Query Pushdown

Starting from V4.x, OceanBase Database's storage layer operators and expressions have been fully adapted for vectorized execution and support query pushdown in some scenarios. In the columnar engine, the pushdown feature has been further enhanced and extended, specifically including:
- Pushdown of all query filters, which can also leverage skip index and encoding information for acceleration depending on the filter type.
- Pushdown of common aggregate functions. In non-group by scenarios, aggregate functions such as count/max/min/sum/avg can now be pushed down to the storage engine.
- Group by pushdown. On columns with low NDV, group by pushdown for storage computation is supported, leveraging dictionary information within microblocks for significant acceleration.

For detailed information and usage guidelines on columnar storage, see Columnar storage.

Shared Storage Mode

OceanBase Database supports two deployment modes: Shared-Nothing (SN) and Shared-Storage (SS) modes.

Shared-Storage (SS) mode is an architecture where data is centrally stored on shared storage devices, allowing multiple database nodes to access the same data set. It is applied in public clouds. This mode is mainly used to simplify storage management, improve resource utilization, and support more flexible high availability switchover in specific scenarios.

Reduce storage redundancy: In the traditional Shared-Nothing architecture, each node holds an independent data copy, which can lead to storage resource waste. Shared storage allows multiple compute nodes to read from the same data, reducing the number of replicas and thus lowering overall storage costs.
Fast fault recovery: When a compute node fails, other nodes can directly access the data in shared storage, eliminating the need for complex replica synchronization or migration processes, enabling faster service takeover.
Simplify operations and maintenance: A unified storage pool facilitates operations such as backup, snapshot, and monitoring, improving the maintainability of the database system.

The OceanBase-based storage-compute separation 2-replica architecture enables:

Low cost: Full data is stored in object storage, with hot data cached on local disks, reducing storage costs while ensuring P99 query performance. It has elastic scaling capabilities, allowing compute and storage to scale independently. Performance loss relative to the Shared-Nothing architecture averages 0.3% to 1.7%.
High availability: The 2F dual-replica deployment mode ensures high availability for compute nodes. Combined with a Paxos-based independent log storage service, the system guarantees RPO=0 and RTO<8s.

The OceanBase-based storage-compute separation single-replica architecture enables:

Ultimate cost-effectiveness: Full data is stored on low-cost object storage, with hot data cached on local disks. In single-replica form, compared to cluster forms like 2F1A or 3F, it reduces compute costs by 2-3 times. It also supports rapid addition or removal of compute nodes within a single zone to meet changing performance requirements.
High availability: Redundant object storage within the same city supports data center-level disaster recovery. In single-replica form, combined with a Paxos-based independent log service and shared storage, if a compute node fails, a new compute node can be quickly started to provide high availability.

Vectorized Execution Engine

Vectorized execution is an efficient batch-processing technique that can significantly improve performance in analytical queries. OceanBase Database introduced the vectorized execution engine in V3.2, but it was disabled by default. Starting from V4.0, the vectorized execution engine is enabled by default. In OceanBase Database V4.3, Vectorized Engine 2.0 was implemented, which greatly enhances execution performance through optimizations in data formats, operator implementations, and storage vectorization.

Introduction to Data Formats

In Vectorized Engine 2.0, a new columnar data format is introduced, storing data description information (null, len, ptr) separately by column to avoid redundant storage. Three data formats are designed based on different data types and usage scenarios: fixed-length format, variable-length discrete format, and variable-length contiguous format.

Fixed-length format: The length value only needs to be stored once, eliminating the need for redundant storage and enabling direct access, resulting in better data locality. Compared to version 1.0, this format saves space, improves efficiency, and omits pointer swizzling operations.
Variable-length discrete format: Each data item may not be contiguous in memory; each data item is described by an address pointer and its length. This format avoids deep copying during data encoding, making it suitable for short-circuit evaluation scenarios and avoiding data reorganization.
Variable-length contiguous format: Data is stored contiguously in memory, with length information and offset addresses described by an offset array. This format improves data access efficiency, but requires reorganization and deep copying during short-circuit evaluation and columnar encoding projection. It is primarily used for columnar materialization scenarios.

Optimization of Operator and Expression Performance

Vectorized Engine 2.0 comprehensively optimizes operators and expressions. The core approach is to leverage information from the new formats and specialized data structure implementations to reduce CPU cache misses and instruction overhead, thereby improving overall execution performance. Key optimizations include:

Utilizing batch data attribute information: Maintaining batch data characteristic information eliminates special handling and filtering judgments for NULL, optimizing SIMD computation.
Algorithm and data structure optimization: Optimized intermediate result materialization structures support row/column materialization. The Sort operator implements separate materialization of sort keys and non-sort keys, reducing cache misses during sorting and improving overall efficiency.
Specialized implementation optimization: Optimizations are tailored for specific scenarios, such as encoding multi-column fixed-length join keys into a single fixed-length column and specialized implementations for aggregate calculations, significantly improving execution efficiency.

Storage Vectorization Optimization

The storage layer fully supports the new vectorized format, accelerating projections via SIMD, pushing predicates, pushing aggregates, and pushing groupby operations. Projection uses customized templates based on column type and length to simplify computation; predicate evaluation is performed directly on the column encoding, enabling rapid computation of complex expressions. Pushing aggregates leverages pre-aggregated information from intermediate layers to efficiently handle statistical functions, while pushing groupby leverages encoded data information to significantly accelerate performance.

Overall, performance comparisons between the vectorized and non-vectorized engines in OceanBase Database show that vectorization can improve performance by an order of magnitude:

Real-time Writing

OceanBase Database adopts the Log-Structured Merge-tree (LSM-Tree) architecture, which is designed to ensure real-time writing capabilities. The following section provides a detailed introduction to the real-time writing capability of OceanBase Database.

Core Storage Mechanism

LSM-Tree

To support efficient real-time writing, OceanBase Database uses the LSM-Tree structure to store incremental data. The LSM-Tree is a specialized tree structure designed to optimize write operations. Its core idea is to first record write operations in an in-memory structure, and then asynchronously batch them to disk when a certain threshold is reached. This significantly reduces disk I/O and improves write performance. OceanBase Database periodically merges this incremental data with baseline data through minor and major compactions, ensuring data consistency and integrity.

Real-time Writing Capability

The design of the LSM-Tree architecture enables OceanBase Database to deliver exceptional performance in handling real-time data writes. Whether it's small-scale data updates or large-scale data imports, OceanBase Database responds quickly, ensuring real-time data writing. This is primarily reflected in the following aspects:

Efficient write processing: Through the LSM-Tree, OceanBase Database centralizes write operations, reducing disk operations and improving write efficiency.
Immediate data accessibility: Once data is written into the in-memory structure of the LSM-Tree, it becomes immediately available for queries, ensuring real-time access.
Optimized data merge process: Intelligent minor and major compaction strategies support efficient querying.
Strong concurrent processing capability: Leveraging its distributed architecture, OceanBase Database can process write operations in parallel across multiple nodes, greatly enhancing real-time data processing capabilities.

Optimizer

The query optimizer of OceanBase Database is designed for HTAP mixed-workload and real-time analytics scenarios. Generally, transactional workloads have the following typical characteristics: a small amount of data accessed per query, high RT requirements, and higher throughput; analytical workloads have the following typical characteristics: a large amount of data accessed per query, and relatively lower throughput. Different types of workloads also require different execution plan forms and tuning methods.

Transactional workloads typically require creating appropriate index structures for base table filter predicates and join predicates, requiring each table to select an appropriate index path whenever possible to significantly reduce data scanning.
Analytical workloads typically rely on columnar storage for full-table scans, using skipindex to quickly skip the scanning of some data blocks; they reduce query response time by increasing parallelism.

While fully inheriting the capabilities of the TP optimizer, the OceanBase AP optimizer has undergone specialized enhancements for various complex query scenarios.

More comprehensive query rewrite capability: The query rewrite module supports a rich set of rewrite algorithms. Different rewrite algorithms match different patterns and perform corresponding equivalence transformations to transform business SQL into a "better" form. Currently, it supports rule-based rewrite algorithms such as view merging, subquery lifting, inner join elimination, outer join elimination, and elimination of trivial conditions. It also supports cost-based rewrite algorithms such as OR Expansion, JA subquery lifting, Win Magic, and Group-By Placement.
Distributed plan generation: OceanBase adopts a phase-one distributed plan generation approach. This is distinctly different from the two-phase distributed plan generation used by common database systems. Common systems typically first generate an optimal single-machine plan and then distribute it to simplify optimizer design. However, this approach faces more problems in HTAP scenarios. For example, during connection enumeration, the optimal NEST-LOOP JOIN algorithm in a single-machine scenario may result in extensive cross-machine data access after distribution, making its overall execution efficiency significantly worse than the distributed HASH JOIN algorithm. OceanBase's phase-one plan generation framework considers factors like data distribution and parallelization while enumerating connection order and algorithms, making multi-factor decisions to select an overall better distributed execution plan.
Row-column path selection: To handle HTAP mixed workloads effectively, a core intelligent optimization capability of the query optimizer is automatic row-column path selection. This means that for a query, the optimizer intelligently determines whether to read data from row store or column store based on its data access characteristics, aiming for optimal execution performance. OceanBase designs a targeted cost model for columnar storage scans and introduces a new statistics mechanism to accurately assess the benefits of columnar storage skipindex for such scans, ensuring accurate decisions between row store indexing and columnar storage scanning.
Easier-to-use Auto DOP: Databases often use parallel execution to accelerate complex SQL execution, but in practical business scenarios, it is difficult to easily assess whether to enable parallelism or determine the appropriate degree of parallelism. OceanBase provides Auto DOP capability. When generating a plan, the optimizer evaluates the time required for query execution and automatically determines whether to enable parallelism and sets an appropriate degree of parallelism, so SQL can achieve optimal performance by default.
Execution plan management: OceanBase's SPM (SQL Plan Management) technology is a key technique for ensuring the stable long-term operation of real-time analytics workloads. It intelligently manages execution plan evolution to address situations where drastic data volume changes, statistics updates, or database version upgrades may cause the optimizer to choose a worse new execution plan. It prevents execution plan regression through gray-box validation with real traffic.

For more information, see Overview of statistics and estimation mechanisms.

Intelligent materialized views

The core value of intelligent materialized views lies in the fact that users only need to define the data processing result through an SQL statement, and OceanBase automatically manages data refresh and dependency computation. This means enterprises do not need to write complex ETL scripts or manage data pipelines; they only need to define the target data structure, and OceanBase will automatically handle data refresh, dependency management, and performance optimization. This approach greatly simplifies the data engineering process, reduces operational costs, and ensures data freshness and consistency.

OceanBase Database supports asynchronous materialized views. When the data in the base table changes, the materialized view is not updated immediately, which ensures the execution performance of DML operations on the base table. However, there is a delay between the data in the materialized view and the base table, and dependencies rely on timely refreshes to update the data. Two strategies are supported for updating materialized view data: complete refresh and incremental refresh.

A complete refresh directly re-executes the query statement corresponding to the materialized view, completely computes and overwrites the original view result data. It is suitable for scenarios with low latency requirements, infrequent base table data updates, or small data volumes, such as daily or weekly data summary reports.
An incremental refresh processes only the data changes since the last refresh. This method can significantly reduce the time and resources required for the refresh. To enable precise incremental refresh, OceanBase Database implements a materialized view log function similar to Oracle MLOG (Materialized View Log). By detailing the incremental update data of the base table, it ensures the materialized view can be refreshed incrementally quickly. The incremental refresh method is particularly suitable for business scenarios with high latency requirements, massive data volumes, and frequent changes, such as in real-time transaction systems where data may change every minute or even every second.

OceanBase Database also supports real-time materialized views, which enable real-time data analysis. Real-time materialized views utilize the materialized view log mechanism to capture and process changes in base table data, computing and integrating data changes online during queries. Even if the materialized view does not physically store the latest changed data, it ensures users receive query results consistent with directly querying the base table. At the same time, it leverages the query rewrite capability of materialized views to achieve transparent query acceleration.

OceanBase Database supports specifying a primary key or creating an index for a materialized view to optimize the performance of single-row lookup, range queries, or join scenarios based on the primary key or index. If a materialized view is a large wide table formed by multi-table JOINs, you can improve the performance of certain queries by creating a columnar storage materialized view. Furthermore, by creating a partitioned materialized view, you can leverage partition pruning to reduce the amount of data operated on.

In the latest OceanBase V4.3.5 version, the Nested Materialized View (Nested MV) feature was introduced, allowing you to easily build new materialized views based on existing ones. This is suitable for ETL processes in data warehouses, where nested materialized views can generate intermediate results during the data transformation and loading phases. These results can serve as input for subsequent processing, further optimizing the efficiency of the entire data processing pipeline.

For detailed information and usage instructions on materialized views, see Materialized view overview (MySQL-compatible mode) and Materialized view overview (Oracle-compatible mode).

Key features of intelligent materialized views:

Declarative SQL definition: Intelligent materialized views allow users to declaratively define data processing results without manually managing conversion steps. Users only need to write standard SQL queries to specify data conversion logic, and OceanBase is responsible for executing and maintaining these queries.
Automatic refresh mechanism: OceanBase automatically handles the cascading relationships of data refresh, including scheduling and execution, based on the user-specified target data freshness requirement. Users only need to define the latest state the data should maintain, such as a target lag of 30 seconds or 5 minutes, and OceanBase will automatically ensure the data meets this requirement.
Incremental processing optimization: Intelligent materialized views use automatic incremental view maintenance technology to compute only the changes since the last refresh, rather than performing a complete refresh. This greatly improves performance and reduces computational cost.
Dependency management: Intelligent materialized views automatically track data dependencies to ensure refreshes occur in the correct order when the underlying data changes. This eliminates the need to manually manage dependencies in complex data pipelines.

OceanBase Database is a distributed, multi-modal integrated database. In real-time analytics scenarios, it can simultaneously process structured, semi-structured, and unstructured data within a single database based on its SQL engine.

The storage foundation for multimodality is LOB (Large Object). LOB also plays a crucial role in AI storage and data processing. Multimodal data in AI scenarios (images, text, audio, video) are all large objects, and LOB performance is essential for computation efficiency in scenarios like AI pre-training. OceanBase provides a highly efficient LOB implementation at the storage layer, supporting 512 MB of storage per SQL statement. Using the DBMS.Lob package, it can support efficient access to terabyte-level large objects.

In data analytics scenarios, OceanBase natively supports multimodal data types such as Array, Roaring Bitmap, and Map. It can not only efficiently store and query structured data but also directly support semi-structured data and complex aggregation analysis.

Roaring Bitmap/Map: Suitable for advanced data mining scenarios such as large-scale label analysis, population segmentation, and deduplication aggregation.
Array: Provides great flexibility for complex businesses such as log retrieval, behavior trajectory analysis, and multi-dimensional labeling, truly achieving one database with multiple capabilities.

For detailed information and usage instructions on the highly efficient compressed bitmap data type, see Highly efficient compressed bitmap data type (RoaringBitmap).

For detailed information and usage instructions on the array type, see Array type.

JSON, as the most ubiquitous semi-structured type today, has a wide range of applications in transactional, analytical, and even AI scenarios.

In transactional scenarios, it serves as schemaless, elastic columnar storage and computation.
In analytical scenarios, JSON's multi-valued indexing is suitable for flexible computation of multi-dimensional labels.
In AI scenarios, JSON acts as a bridge connecting models and applications. Additionally, in data processing pipelines such as agents/workflows, JSON is also used as a standard input/output format.

To support JSON capabilities in multi-workload scenarios, OceanBase JSON implements a rich set of computation expressions and JSON-based multi-valued indexes. In its underlying storage format, it not only supports JSON Binary to optimize random read/write operations within JSON data but also supports structured encoding of JSON. By extracting structural information from similar JSON data, it deeply optimizes the storage compression ratio of JSON and the query performance based on JSON Path. For detailed information and usage instructions on the JSON data type, see JSON data type.

In the era of generative AI, processing multimodal data has become increasingly important. OceanBase has strengthened its vector capabilities and full-text search capabilities to better meet the demands for hybrid search in knowledge retrieval scenarios. With its fully self-developed path based on vector algorithms + database, OceanBase's vectors have clear advantages over open-source vector databases, outperforming typical open-source competitors in VectorDBbench.

Special indexes

In traditional relational databases, indexes such as B-Tree and Hash are mainly used for exact value queries on structured data (such as numbers, dates, etc.). However, in actual business scenarios, with the proliferation of semi-structured data (such as JSON documents), unstructured text (logs, long texts), and multi-dimensional analysis scenarios, OceanBase provides two special indexes:

Full-text index: Implemented based on an inverted index, it establishes keyword mappings for text content through tokenization techniques. It is suitable for scenarios such as log analysis and document retrieval.
Multi-valued index: Establishes element-level indexes for JSON array fields by unfolding the arrays into virtual row records and constructing a B-Tree index, significantly improving the query efficiency of set data.

Through their differentiated data structure designs, these two indexes provide targeted optimizations for different data types (text/JSON) and query patterns (fuzzy match), jointly forming an acceleration layer for complex data queries.

Full-text index

In relational databases, indexes are typically used to accelerate queries that match exact values. However, traditional B-Tree indexes often fail to meet performance requirements when processing large amounts of text data or performing fuzzy searches. In such cases, performing a full table scan to find matching data row by row leads to performance bottlenecks, especially in scenarios with large volumes of text and data. Moreover, complex query requirements, such as approximate match and relevance sorting, are difficult to achieve through simple SQL rewriting.

To address these issues, OceanBase currently supports full-text indexing compatible with MySQL. By preprocessing text content and establishing keyword indexes, full-text indexing significantly improves the efficiency of full-text searches.

The technology used by full-text indexing to quickly search text data includes the following main features:

Full-text search: By building a full-text index, entire documents or large segments of text can be comprehensively indexed, enabling more flexible and efficient searching.
Quick lookup: Users can quickly find matching text in the database based on entered keywords, greatly reducing search time.
Efficient handling of large amounts of text: Full-text indexing can effectively process various types of text data, including articles, reports, web pages, and emails, providing users with precise and fast search experiences.
Support for complex queries: In addition to basic keyword search, full-text indexing also supports complex query requirements, such as approximate match and relevance sorting, greatly enriching the database's search capabilities.

By introducing full-text indexing, OceanBase can significantly improve query performance when dealing with large-scale text data and complex retrieval requirements, enabling users to obtain the information they need more efficiently.

For detailed information and usage instructions on full-text indexing, see Full-text index (MySQL-compatible mode).

Multi-valued index

Multi-valued indexing is a special indexing feature in the MySQL-compatible mode of OceanBase Database. It is primarily used to handle JSON documents and set data types, and is suitable for scenarios where multiple values or attributes need to be queried. Its main characteristics are as follows:

Can create indexes on arrays or sets.
Currently applicable to JSON documents.
Can improve the query efficiency of searches based on JSON array elements.

Multi-valued indexing is very useful in AP scenarios. AP scenarios usually involve complex data analysis and report generation, and multi-valued indexing can accelerate these operations. For example, in a data warehouse, multi-valued indexing can be used to accelerate multi-dimensional data analysis and improve report generation efficiency. Specific applicable scenarios include:

Many-to-many association queries: Using multi-valued indexing can optimize queries for many-to-many relationships between entities. For example, in the relationship between actors and films, all actors for a film can be stored in a JSON array, and all films starring a particular actor can be quickly queried using a JSON multi-valued index.
Tag and category queries: When an entity has multiple tags or categories, a multi-valued index can accelerate related queries. For example, multiple tags of a product can be stored in a JSON array, and products containing a specific tag can be quickly found using a JSON multi-valued index.

For detailed information and usage instructions on multi-valued indexing, see Multi-valued index.

MySQL ecosystem compatibility

While striving for ultimate performance and scalability, OceanBase Database also aims to provide high compatibility with the MySQL ecosystem. This enables seamless migration of MySQL-based businesses to OceanBase Database and allows full utilization of existing OLAP ecosystem tools and technology stacks, enabling rapid iteration and innovation in data analysis and business insights.

Syntax compatibility: OceanBase Database fully supports the standard MySQL SQL syntax, including but not limited to Data Definition Language (DDL), Data Manipulation Language (DML), and Data Control Language (DCL). This means that statements you previously wrote in MySQL for data queries, table structure definitions, index creation, and privilege management can be run directly in OceanBase Database with minimal syntax adjustments, significantly reducing migration costs and the learning curve.

Seamless migration: Existing MySQL applications can be quickly migrated to OceanBase Database, reducing the workload of code modification during migration.
Skill reuse: MySQL developers and DBAs do not need to learn new database syntax, shortening the adaptation period.
Ecosystem integration: The MySQL-compatible syntax foundation enables OceanBase Database to better integrate into existing toolchains such as BI, ETL, and data visualization.

View compatibility: OceanBase is compatible with MySQL's information_schema views, for example:

Table information query: Supports views like TABLES and COLUMNS, allowing users to query the structure and column information of all tables in the database. This is crucial for data dictionary management and integration with third-party tools.
Privilege management: Supports views such as SCHEMATA and SCHEMA_PRIVILEGES, helping administrators conveniently view and manage database and table privilege settings.

Many database management, monitoring, and analysis tools rely on INFORMATION_SCHEMA to obtain database status and architecture information. This compatibility feature of OceanBase Database enables these tools to run directly on OceanBase Database without custom adaptation. OceanBase Database supports a variety of OLAP ecosystem tools, for example:

OceanBase implements a full-stack data integration from batch to streaming:

Supports seamless integration with mainstream tools such as Flink, Kafka, OMS, OBLOADER, Dataworks, and dbt, easily connecting to enterprise existing ETL, real-time synchronization, and data warehouse environments.
The database kernel supports direct load, file/ODPS/HDFS external tables, External Catalog, etc., greatly reducing the development and operation and maintenance barriers for multi-source data integration and data lake connectivity.
Highly compatible with MySQL/Oracle syntax, mainstream BI, analytics, and stream computing tools (such as Tableau, PowerBI, Flink, Quick BI, etc.) can achieve zero-code migration and access, truly achieving "out-of-the-box" use.

OceanBase innovates not only in its core engine but also adapts to many ecosystem products:

Automated orchestration and scheduling: Deeply integrates with scheduling platforms such as DolphinScheduler, n8n, and Airflow to achieve automatic orchestration and operation and maintenance of multi-stage data processing tasks, improving large-scale data governance and multi-system collaboration efficiency.
Observability: Seamlessly integrates with monitoring tools such as Prometheus and Grafana, supporting real-time performance monitoring, alerts, and intelligent operation and maintenance for multi-tenants, nodes, and clusters, facilitating stable operation of large-scale clusters and multiple business lines.
Data visualization empowerment: Fully compatible with mainstream BI platforms such as Superset, Tableau, QuickBI, and Guanduan BI, supports multi-role self-service analysis, interactive dashboards, and complex business reports, aiding business insight and data-driven decision-making.

For information on OceanBase's current OLAP ecosystem integrations, see Ecosystem integrations.

The key capabilities of AP are summarized as follows: