Columnar Engine
In large-scale complex data analysis and ad-hoc queries over massive data sets, columnar storage is one of the key capabilities of AP databases. Columnar storage organizes the data in a table by columns rather than by rows: data is physically stored column by column. During analysis, only the columns required by the query are scanned, avoiding full-row scans. This reduces I/O and memory usage and thus accelerates computation. Columnar storage also provides inherently better conditions for data compression, making high compression ratios easier to achieve, which reduces both storage space and network transmission bandwidth.
OceanBase Database adopts a log-structured merge-tree (LSM-tree) architecture. Years of optimizing high-concurrency transaction processing (TP) for random writes, real-time updates, and strong consistency have yielded substantial engineering experience and an independently developed storage engine. At the same time, the hierarchical compaction and static data organization of LSM-trees make them well suited to OLAP scenarios characterized by bulk writes and low-frequency updates. Through columnar data compression, hierarchical compaction strategies, and storage fragment optimization, the system meets the efficient scanning demands of analytical workloads while supporting a unified architecture for mixed TP and AP workloads.
Building on this foundation, OceanBase Database V4.3 introduces columnar storage to achieve a unified architecture for data storage and access. Within a single OBServer, columnar and row-based data are stored together, delivering good performance for both TP and AP queries.
Columnar engine architecture
As a native distributed database, OceanBase Database stores user data in multiple replicas by default. To leverage the advantages of multiple replicas, OceanBase Database offers enhanced features such as strong data verification and data reuse during migration, and its self-developed LSM-tree storage engine has been optimized for them. User data is divided into two parts: baseline data and incremental data.
Baseline Data
- Global version consistency control: Leveraging its distributed multi-replica architecture, OceanBase Database breaks through the traditional LSM-tree design paradigm with a "daily major compaction" mechanism. A tenant selects a global version number periodically or on user demand, and all replicas perform a major compaction based on that version, generating the baseline data for it. Baseline data of the same version is physically identical across all replicas.
- Polymorphic storage: Baseline data supports three physical forms: row-based, column-based, and hybrid row-column. Users can choose the appropriate form for their business scenario through table settings.
Incremental Data
- Dynamic multi-version management: All writes after the latest baseline version (data in MemTables or data that has been flushed to disk) are incremental data. Each replica manages its multi-version records independently, so incremental data is not kept physically identical across replicas. Incremental data includes all multi-version data; unlike baseline data, it is not generated against a specified version.
- Row-based priority strategy: Incremental data always uses the row-based model. This keeps it compatible with the native row-based TP path, sharing components such as transaction logs and lock mechanisms.
Columnar Storage
Leveraging these characteristics of baseline and incremental data, OceanBase Database implements columnar storage transparently:
- Decoupled storage models: Baseline data is organized in columnar format (each column is stored as an independent SSTable, logically grouped into a table), while incremental data remains row-based, keeping DML operations fully transparent and synchronized with upstream and downstream systems.
- Dynamically synchronized engine: A bidirectional row-column synchronization pipeline at the storage layer supports smooth OLAP system migration and row-based storage upgrades, without requiring applications to be aware of the underlying storage format.
- Intelligent routing mechanism: From the optimizer to the executor, the system automatically selects the optimal row or columnar access path based on workload characteristics. This fully leverages the performance advantages of columnar storage in AP scenarios while retaining native row-based support for TP transactions.
Core integrated capabilities
| Capability Dimension | Key Technical Implementation |
|---|---|
| SQL integration | The same SQL runs against row or columnar data; the optimizer and executor automatically choose the access path. |
| Integrated storage | Columnar baseline data and row-based incremental data are managed by one storage engine within a single OBServer. |
| Transactional consistency | Row-based incremental data shares the native transaction logs and lock mechanisms, so DML semantics are identical for row and columnar tables. |
Column store replicas
To support TP/AP resource isolation and the analytical performance requirements of HTAP mixed workloads, OceanBase Database provides column store replicas (C replicas). Column store replicas are read-only replicas that store the baseline data of all user tables in columnar format and are deployed in independent zones.
Column store replicas support two routing strategies:
- Access column store replicas through a dedicated ODP.
- Share an ODP with row store replicas, letting the optimizer select the replica based on rules or cost.

Two read consistency levels are supported:
- Weak consistency: data can be read as long as the replica delay does not exceed max_stale_time_for_weak_consistency.
- Strong consistency: the system obtains the globally consistent snapshot version through GTS, checks the replay status of the column store replica, and waits for replay to complete before returning data.
Column store replicas follow the general replica distribution rules for read-only replicas; the core difference lies in the storage structure of the baseline data. Like regular read-only replicas, column store replicas do not participate in leader election or log synchronization voting, but they still fully contain core components such as static data tables, committed logs, and in-memory data tables.
For more information about column store replicas, see Column store replicas and Deploy and use column store replicas.
Main features of columnar storage
Feature 1: Adaptive Compaction
With the introduction of columnar storage, compaction behavior for columnar data differs significantly from that for row-oriented data. Incremental data is stored in row format and must be merged with the baseline data and then split into one SSTable per column, so compaction takes noticeably more time and resources than for row-oriented data. To address this, the storage layer has adapted and optimized the compaction process. In addition to horizontal splitting and parallel compaction acceleration, as with row-oriented tables, columnar tables also support vertical splitting acceleration: the compaction tasks of multiple columns are consolidated into a single task, and the number of columns per task can be adjusted dynamically based on system resources, striking a better balance between overall compaction speed and memory overhead.
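The vertical-splitting idea above can be sketched as follows. This is an illustrative simplification, not OceanBase internals: row-format incremental data is merged into per-column baseline structures, with columns grouped into tasks whose width (`cols_per_task`, a hypothetical parameter) trades memory overhead against task count.

```python
# Illustrative sketch (not OceanBase code): merge row-format incremental
# data into per-column baseline "SSTables", grouping columns into
# compaction tasks (vertical splitting).

def compact(baseline_cols, incremental_rows, cols_per_task=2):
    """baseline_cols: {col_name: [values]}; incremental_rows: [{col: val}]."""
    col_names = list(baseline_cols)
    # Vertical splitting: one task handles several columns; the group size
    # balances overall compaction speed against memory overhead.
    tasks = [col_names[i:i + cols_per_task]
             for i in range(0, len(col_names), cols_per_task)]
    new_baseline = {}
    for task in tasks:  # in a real engine, tasks run in parallel
        for col in task:
            # Append incremental values to form the new column SSTable.
            new_baseline[col] = baseline_cols[col] + \
                [row[col] for row in incremental_rows]
    return new_baseline

base = {"id": [1, 2], "city": ["SH", "HZ"], "amt": [10, 20]}
delta = [{"id": 3, "city": "BJ", "amt": 30}]
print(compact(base, delta)["amt"])  # [10, 20, 30]
```

Real compaction also handles multi-version rows, deletes, and encoding; the sketch only shows how per-column outputs can be produced by column-grouped tasks.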
Feature 2: Columnar encoding algorithm
OceanBase Database adopts a two-stage compression scheme for data storage. The first stage is the proprietary hybrid row-column encoding compression implemented in OceanBase Database, and the second stage is a standard general-purpose compression. The hybrid row-column encoding, being an internal algorithm of the database, supports direct querying without decompression and query acceleration through encoding information. However, the existing hybrid row-column encoding algorithm is primarily row-based. Therefore, OceanBase Database introduces a new column-oriented encoding algorithm tailored for columnar tables. Compared to the existing algorithm, the new algorithm achieves comprehensive vectorization of queries, SIMD optimization across different instruction sets, and significantly improves compression ratios for numerical types. This results in a substantial enhancement of both performance and compression efficiency over the original algorithm.
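The two-stage scheme can be illustrated with a toy codec. This is an assumption-level sketch, not the actual OceanBase encoding: stage 1 applies a column-aware encoding (dictionary encoding here, one of several techniques such codecs use), and stage 2 applies a general-purpose compressor (zlib stands in for it).

```python
# Illustrative two-stage compression: stage 1 is a column-aware encoding
# (dictionary encoding), stage 2 is general-purpose compression.
import zlib

def dict_encode(values):
    """Dictionary-encode a column: small integer codes plus a dictionary."""
    dictionary = sorted(set(values))
    code = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code[v] for v in values]

def two_stage_compress(values):
    dictionary, codes = dict_encode(values)    # stage 1: encoding
    payload = bytes(codes)                     # codes fit in one byte here
    return dictionary, zlib.compress(payload)  # stage 2: general compression

col = ["cn", "us", "cn", "cn", "jp", "us"] * 100
dictionary, blob = two_stage_compress(col)
print(dictionary)                      # ['cn', 'jp', 'us']
print(len(blob) < len("".join(col)))   # True: far smaller than the raw text
```

A key property the text mentions is that stage-1 encoded data can be queried without full decompression: for example, an equality filter can be evaluated on the small integer codes, touching the dictionary only once.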
Feature 3: Skip Index
Column-oriented databases typically perform pre-aggregation calculations on each column of data at certain granularities. These pre-aggregated results are persisted along with the data. When users query column data, the database can filter the data using the pre-aggregated data, significantly reducing the data access overhead and unnecessary I/O consumption. In our columnar engine, we also support skip indexes. We perform aggregations on each column of data at the micro-block granularity, considering various dimensions such as maximum, minimum, and null values. These aggregated values are then aggregated and accumulated across micro-blocks to obtain higher granularity aggregated values at the macro-block and SSTable levels. When users perform queries, they can specify the scan scope and select the appropriate granularity of aggregated values for filtering and aggregating the output.
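The micro-block aggregates can be sketched as follows. This is an illustrative simplification (block size and the min/max/null dimensions are assumptions for the example): per-block aggregates let a scan skip blocks that cannot contain matching values.

```python
# Illustrative skip-index sketch: per micro-block min/max/null aggregates
# let a scan skip whole blocks during filtering.

def build_skip_index(column, block_size=4):
    blocks, index = [], []
    for i in range(0, len(column), block_size):
        blk = column[i:i + block_size]
        vals = [v for v in blk if v is not None]
        blocks.append(blk)
        index.append({"min": min(vals), "max": max(vals),
                      "nulls": blk.count(None)})
    return blocks, index

def scan_gt(blocks, index, threshold):
    """Return values > threshold, skipping blocks ruled out by their max."""
    out = []
    for blk, meta in zip(blocks, index):
        if meta["max"] <= threshold:  # the aggregate filters the whole block
            continue
        out.extend(v for v in blk if v is not None and v > threshold)
    return out

blocks, idx = build_skip_index([1, 3, 2, None, 9, 8, 7, 9, 2, 1, 0, 2])
print(scan_gt(blocks, idx, 5))  # [9, 8, 7, 9] -- only one block is scanned
```

The same aggregates can be rolled up across micro-blocks to macro-block and SSTable granularity, which is what allows the query to pick the coarsest level that still filters effectively.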
Feature 4: Query Pushdown
Starting from OceanBase Database V4.x, the execution layer has been fully adapted for vectorized execution, and some scenarios support query pushdown. The columnar engine further enhances and extends pushdown. Specifically, it supports:
- Pushdown of all filters, with skip indexes and encoding information used to accelerate filtering based on the filter type.
- Aggregate pushdown for common aggregate functions: in non-GROUP BY scenarios, COUNT, MAX, MIN, SUM, and AVG are pushed down to the storage engine.
- GROUP BY pushdown to storage when the NDV is low, using micro-block dictionary information to accelerate the group-by operation.
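The low-NDV GROUP BY pushdown can be sketched like this. It is an assumption-level illustration of the idea, not the engine's implementation: when a column is dictionary-encoded and has few distinct values, aggregation can run per dictionary code with a plain array instead of a hash table over raw values.

```python
# Illustrative GROUP BY pushdown over a dictionary-encoded column: with a
# low NDV, aggregate into one array slot per dictionary code.

def pushdown_group_sum(dictionary, codes, measures):
    """SELECT key, SUM(m) ... GROUP BY key, computed on encoded data."""
    sums = [0] * len(dictionary)  # one slot per distinct value (low NDV)
    for code, m in zip(codes, measures):
        sums[code] += m           # cheap array index, no hash lookup
    return {dictionary[c]: s for c, s in enumerate(sums)}

dictionary = ["beijing", "hangzhou", "shanghai"]
codes      = [0, 2, 2, 1, 0, 2]   # dictionary-encoded group keys
amounts    = [10, 5, 7, 3, 2, 1]
print(pushdown_group_sum(dictionary, codes, amounts))
# {'beijing': 12, 'hangzhou': 3, 'shanghai': 13}
```

Because the dictionary already exists in the micro-block, the storage layer can produce grouped partial aggregates almost for free, and the executor only merges per-block results.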
For more information about the columnstore engine, see Columnstore engine.
Vectorized Execution Engine
Vectorized execution is a technique for efficiently processing data in batches. Vectorized execution significantly improves the performance of analytical queries. The vectorized execution engine was introduced in OceanBase Database V3.2, but it is disabled by default. Starting from OceanBase Database V4.0, the vectorized execution engine is enabled by default and is further enhanced in OceanBase Database V4.3 with Vectorized Engine 2.0. This version optimizes data formats, operator implementations, and storage vectorization, significantly boosting the execution performance of the vectorized engine.
About data formats
In Vectorized Engine 2.0, a new columnar data format has been introduced, where null, len, and ptr data are stored separately per column to eliminate redundant storage. Three different data formats have been designed based on various data types and usage scenarios: fixed-length format, variable-length discrete format, and variable-length continuous format.
- Fixed-length format: Only one copy of the length value is stored, and values are accessed directly, which improves data locality. Compared with the 1.0 format, it saves space, increases efficiency, and avoids pointer swizzling.
- Variable-length discrete format: Data is not contiguous in memory, and each data element is described by an address pointer and length. This format can avoid deep copying during encoding and is suitable for shortcut computation scenarios, avoiding data reorganization.
- Variable-length continuous format: Data is stored contiguously in memory, with lengths and positions described by an offset array. This format improves data access efficiency, but reorganization and deep copying are required during short-circuit evaluation or when projecting encoded columnar data. It is mainly used for columnar materialization scenarios.
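The three formats can be modeled with plain Python structures. This is an assumption-level model of the layout, not the engine's actual memory representation: the point is that null flags, lengths, and data live in separate per-column arrays rather than one row-wise structure.

```python
# Illustrative model of the three vectorized batch formats.

# Fixed-length format: one shared length, values packed back to back.
fixed = {"null": [False, True, False], "len": 8, "data": [42, 0, 7]}

# Variable-length discrete format: each value addressed independently;
# Python strings stand in for the (ptr, len) pairs.
discrete = {"null": [False, False], "data": ["ab", "defg"]}

# Variable-length continuous format: one buffer plus an offset array.
continuous = {"null": [False, False], "buf": "abdefg", "offsets": [0, 2, 6]}

def get_continuous(col, i):
    """Read value i of a continuous-format column via the offset array."""
    return col["buf"][col["offsets"][i]:col["offsets"][i + 1]]

print(get_continuous(continuous, 1))  # 'defg'
```

The trade-off described above is visible here: the discrete form can reference values in place (no deep copy), while the continuous form keeps values adjacent for sequential access but must rebuild `buf` and `offsets` when rows are filtered out.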
Optimizing operator and expression performance
The Vectorized Engine 2.0 is optimized for operators and expressions, primarily through the use of new formats and specialized data structures to reduce CPU cache misses and instruction overhead, thereby improving overall execution performance. Key optimizations include:
- Use batch data attribute information: Maintain batch data feature information to eliminate the need for special handling and filtering of NULL values and optimize SIMD calculations.
- Algorithm and data structure optimizations: Optimized the intermediate result materialization structure, supporting row/column-based materialization. The Sort operator implements separate materialization for sort keys and non-sort keys, reducing cache misses during sorting and improving overall efficiency.
- Specialized Implementation Optimization: Optimized for specific scenarios, such as encoding multiple fixed-length join keys into a single fixed-length column and implementing specialized aggregation calculations, significantly improving execution efficiency.
Storage vectorization optimization
Our storage layer fully supports the new vectorized format and uses SIMD acceleration for operations such as projection, predicate pushdown, aggregation pushdown, and groupby pushdown. Projection templates are customized based on column types and lengths, simplifying calculations. Predicate evaluation is performed directly on encoded column data, enabling fast computation for complex expressions. Aggregation pushdown leverages intermediate pre-aggregation data to handle statistical functions efficiently. Groupby pushdown utilizes encoded data information to significantly accelerate performance.
In overall performance, OceanBase Database's vectorized engine outperforms its non-vectorized counterpart by a factor of 10 in benchmark tests.
Real-time writing
OceanBase Database uses the Log-Structured Merge-Tree (LSM-Tree) architecture, which guarantees the ability for real-time write operations. This topic describes the real-time write capability of OceanBase Database.
Core storage mechanism
LSM-Tree
To support efficient real-time writing, OceanBase uses an LSM-tree structure to store incremental data. An LSM-tree is a special type of tree structure designed for optimized write operations. The core idea is to first record write operations in a structure in memory, and then write them asynchronously to disk in batches. This approach significantly reduces disk I/O and improves write performance. OceanBase regularly merges incremental data with base data through dumping and merging processes, ensuring data consistency and integrity.
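The write path described above can be sketched with a toy structure. This is a minimal illustration of the LSM idea, not OceanBase code: writes land in an in-memory table, full tables are flushed to immutable runs, and reads consult memory first, then newer runs before older ones.

```python
# Minimal LSM-style write-path sketch: in-memory MemTable, flush to
# immutable runs, reads merge newest-first.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value            # fast in-memory write
        if len(self.memtable) >= self.limit:  # a real engine flushes asynchronously
            self.runs.append(dict(self.memtable))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # newest data wins
            return self.memtable[key]
        for run in reversed(self.runs):       # then newer flushed runs
            if key in run:
                return run[key]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 9)                   # newer version stays in memory
print(db.get("a"), db.get("b"))  # 9 2
```

The "dumping and merging" the text mentions corresponds to periodically folding these runs into the baseline data so reads touch fewer structures.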
Real-time Writing Capability
Built on the LSM-tree architecture, OceanBase Database delivers excellent real-time write performance. Whether for small data updates or large-scale data imports, OceanBase Database responds quickly to ensure data is written in real time. This capability is reflected in the following ways:
- Efficient write operations: By using the LSM-Tree structure, OceanBase Database processes write operations centrally, reducing disk operations and improving write efficiency.
- Real-time data accessibility: Data written to the LSM-Tree's in-memory structure becomes immediately accessible for queries, ensuring real-time data availability.
- Optimized data compaction process: Intelligent minor compaction (dumping) and major compaction strategies keep queries efficient.
- High concurrency: OceanBase Database uses a distributed architecture to process write operations in parallel across multiple nodes, significantly enhancing the system's ability to handle real-time data.
Optimizer
The query optimizer of OceanBase Database is designed for HTAP and real-time analysis scenarios. Generally, transactional workloads have the following characteristics: small data access per query, high response time requirements, and higher throughput. Analytical workloads, on the other hand, have the following characteristics: large data access per query and relatively lower throughput. The execution plans and optimization methods required for these different types of workloads vary significantly.
- Transactional workloads typically require the creation of appropriate index structures for base table filter predicates and join predicates. Each table should choose the most suitable index path to significantly reduce data scanning.
- Analytical workloads typically rely on columnar storage for full-table scans and use skip indexes to quickly skip some data blocks. They also increase parallelism to reduce query response time.
The AP optimizer of OceanBase Database, on the basis of fully inheriting the capabilities of the TP optimizer, has been specially strengthened for various complex query scenarios.
More comprehensive query rewriting capabilities: The query rewriting module supports a wide range of rewriting algorithms, each matched to different scenarios, performing equivalent transformations that make business SQL easier to optimize. It currently supports rule-based rewrites such as view merging, subquery lifting, inner join elimination, outer join elimination, and elimination of always-true or always-false conditions, as well as cost-based rewrites such as OR Expansion, JA Subquery Lifting, Win Magic, and Group-By Placement.
Distributed plan generation: OceanBase Database adopts a one-phase distributed plan generation approach, unlike the two-phase approach common in other database systems, where an optimal single-machine plan is generated first and then distributed. The two-phase approach faces more challenges in HTAP scenarios. For example, during join enumeration, the NESTED-LOOP JOIN algorithm that is optimal on a single machine may cause a large amount of cross-machine data access once distributed, making it far less efficient than a distributed HASH JOIN. OceanBase Database's one-phase framework considers data distribution characteristics and parallelism while enumerating join orders and algorithms, making a comprehensive decision to select the overall optimal distributed execution plan.
Row and column path selection: To effectively handle HTAP mixed workloads, the query optimizer has a core intelligent optimization capability: automatic row and column path selection. This means that for a query, the optimizer will decide whether to read data from row store (Row Store) or column store (Column Store) based on its data access characteristics, aiming for optimal execution performance. OceanBase Database has designed a specific cost model for columnar scans and introduced a new statistical information mechanism to accurately assess the benefits of columnar SkipIndex on columnar scans, ensuring accurate decisions between row store indexes and columnar scans.
Easier Auto DOP: Most databases use parallel execution to accelerate complex SQL execution. However, in real business scenarios, it is difficult to easily evaluate whether to enable parallel execution and the degree of parallelism. OceanBase Database provides the Auto DOP feature. When generating a plan, the optimizer will assess the query execution time and automatically determine whether to enable parallel execution and the appropriate degree of parallelism, ensuring that SQL can achieve optimal performance by default.
Execution plan management: OceanBase Database's SQL Plan Management (SPM) technology is crucial for ensuring the long-term stability of real-time analytical workloads. It intelligently manages the evolution of execution plans to address situations where the optimizer may select a worse new execution plan after data volume changes, statistical information updates, or database version upgrades. It prevents execution plan regression through gray-scale verification using real traffic.
For more information, see Overview of statistics and estimation mechanisms.
Smart materialized views
The core value of smart materialized views lies in the fact that users only need to define the data processing results through SQL, and OceanBase Database automatically manages data refreshes and computing dependencies. This means that enterprises do not need to write complex ETL scripts or manage data pipelines. They only need to define the target data structure, and OceanBase Database will automatically handle data refreshes, dependency management, and performance optimization. This approach greatly simplifies the data engineering process, reduces operational costs, and ensures the freshness and consistency of data.
OceanBase Database supports asynchronous materialized views. That is, when the data in the base table changes, the materialized view is not immediately updated. This ensures the execution performance of DML operations on the base table. However, the data in the materialized view will be delayed relative to the base table and will need to be refreshed and updated in a timely manner. The update of materialized view data supports two strategies: full refresh and incremental refresh.
- A full refresh re-executes the query statement that defines the materialized view, recomputes the full result, and overwrites the existing view data. This strategy suits scenarios with relaxed freshness requirements, infrequent base table updates, or small data volumes, such as aggregation reports updated once a day or once a week.
- An incremental refresh processes only the data changes since the last refresh, which can significantly reduce refresh time and resource consumption. To support accurate incremental refreshes, OceanBase Database implements a materialized view log feature similar to Oracle's MLOG (materialized view log): by recording incremental base table updates in detail, it ensures that materialized views can be incrementally refreshed quickly. This strategy is particularly suitable for scenarios with strict freshness requirements, large data volumes, and frequent changes, such as real-time transaction systems where data may change every minute or even every second.
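The difference between the two strategies can be sketched for a simple `SUM(amount) GROUP BY city` view. This is an illustrative simplification of the MLOG-based mechanism (the log format with `'+'`/`'-'` entries is an assumption for the example): the incremental path applies only the logged changes instead of recomputing from scratch.

```python
# Illustrative full vs incremental refresh for SUM(amount) GROUP BY city.

def full_refresh(base_rows):
    view = {}
    for city, amount in base_rows:
        view[city] = view.get(city, 0) + amount
    return view

def incremental_refresh(view, mlog):
    """mlog entries: ('+'|'-', city, amount) for inserted/deleted rows;
    an update is logged as a delete plus an insert."""
    for op, city, amount in mlog:
        delta = amount if op == "+" else -amount
        view[city] = view.get(city, 0) + delta
    return view

view = full_refresh([("SH", 10), ("BJ", 5), ("SH", 3)])   # {'SH': 13, 'BJ': 5}
mlog = [("+", "SH", 4), ("-", "BJ", 5), ("+", "HZ", 7)]   # changes since last refresh
print(incremental_refresh(view, mlog))  # {'SH': 17, 'BJ': 0, 'HZ': 7}
```

The incremental path touches three log entries rather than rescanning the whole base table, which is where the time and resource savings come from as the base table grows.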
OceanBase Database also supports real-time materialized views, which enable real-time data analysis. Real-time materialized views use the materialized view log mechanism to capture and process changes in the base table data. During queries, the system calculates and integrates data changes in real time, even if the materialized view does not physically store the latest changes. This ensures that users receive query results consistent with those obtained by directly querying the base table. Additionally, OceanBase Database leverages the query rewriting capabilities of materialized views to achieve transparent query acceleration.
OceanBase Database allows you to specify a primary key or create an index for a materialized view, thereby optimizing the performance of single-row lookups, range queries, and join operations based on the primary key or index. If a materialized view is a large wide table formed by joining multiple tables, you can create a columnar storage materialized view to improve the performance of certain queries. Furthermore, by creating partitioned materialized views, you can leverage partition pruning to reduce the amount of data processed.
OceanBase Database supports nested materialized views (Nested Materialized View, Nested MV) and cascading refresh capabilities, allowing you to build new materialized views on existing ones. This is particularly useful in data warehouse scenarios involving hierarchical processing and intermediate result reuse. Combined with incremental refresh, MLOG automated maintenance, and query rewriting capabilities, it reduces the complexity of maintaining the precomputation pipeline.
For more information about materialized views, including detailed descriptions and usage guidance, refer to Overview of materialized views (MySQL mode) and Overview of materialized views (Oracle mode).
Key features of smart materialized views:
- Declarative SQL Definition: Smart materialized views allow users to declaratively define data processing results without manually managing transformation steps. Users only need to write standard SQL queries to specify data transformation logic, and OceanBase Database handles the execution and maintenance of these queries.
- Automatic Refresh Mechanism: OceanBase Database automatically handles the cascading relationships of data refreshes based on the user-specified target data freshness requirements, including scheduling and execution. Users only need to define the desired level of data freshness, such as a 30-second or 5-minute target lag, and OceanBase Database ensures the data meets this requirement.
- Incremental Processing Optimization: Smart materialized views use automatic incremental view maintenance technology, calculating only the changes since the last refresh instead of performing a full refresh. This significantly improves performance and reduces computing costs.
- Dependency Management: Smart materialized views automatically track data dependencies, ensuring that refreshes occur in the correct order when base data changes. This eliminates the need to manually manage complex data pipeline dependencies.
Multi-modal types
OceanBase Database is a distributed multi-modal database that can process structured, semi-structured, and unstructured data simultaneously in a single database in real-time analytics scenarios.
The storage foundation for multi-modal data is the LOB (large object). LOB plays a crucial role in AI storage and data processing: in AI scenarios, multi-modal data such as images, text, audio, and video are all large objects, and LOB performance is essential to the computational efficiency of AI pre-training. OceanBase provides a highly efficient LOB implementation at the storage layer, supporting objects up to 512 MB through SQL; with the DBMS_LOB package, it can efficiently store and retrieve objects at TB scale.
In data analysis scenarios, OceanBase natively supports multi-modal data types such as Array, Roaring Bitmap, and Map. These data types not only enable efficient storage and querying of structured data but also support semi-structured data and complex aggregations.
- Roaring Bitmap/Map: Suitable for advanced data mining scenarios such as large-scale tag analysis, audience selection, and deduplication aggregation.
- Array: Provides great flexibility for complex business scenarios such as log retrieval, behavior tracking, and multi-dimensional tagging, truly achieving one database for multiple capabilities.
For detailed information about the highly compressed bitmap data type (RoaringBitmap), see Highly compressed bitmap data type (RoaringBitmap).
For detailed information about the array data type, see Array data type.
JSON is the most widely used semi-structured data type and has extensive applications in transactional, analytical, and AI scenarios.
- In transactional scenarios, JSON serves as a schemaless, flexible format for storing and computing over data.
- In analytical scenarios, multi-value indexing on JSON supports flexible computations for multi-dimensional tags.
- In AI scenarios, JSON acts as a bridge between models and applications. It also serves as a standard input/output format in data processing pipelines such as agents and workflows.
To support JSON across various workloads, OceanBase provides rich computation expressions and multi-value indexing based on JSON. At the storage layer, it supports JSON Binary for optimized random read/write performance and structured encoding for JSON. By extracting structured information from similar JSON data, it significantly optimizes the storage compression ratio and query performance based on JSON Path. For detailed information about the JSON data type, see JSON data type.
In the era of generative AI, the demand for multi-modal data processing is continuously growing. OceanBase continues to enhance its vector and full-text search capabilities. It also supports hybrid search, allowing the combination of full-text, vector, and structured filtering conditions in a single query to meet the needs of knowledge retrieval and sorting.
Special indexes
In traditional relational databases, indexes such as B-Tree and Hash are primarily used for precise value queries on structured data (such as numbers and dates). However, in practical business scenarios, with the increasing prevalence of semi-structured data (such as JSON documents), unstructured text (such as logs and long texts), and hybrid search scenarios, OceanBase Database provides the following special indexing capabilities:
- Full-text indexes: Based on inverted indexes, full-text indexes use tokenization technology to establish keyword mappings for text content, making them suitable for log analysis and document retrieval.
- Multi-value indexes: These indexes are created for JSON array fields, expanding the array into virtual rows and building B-Tree indexes on them, significantly improving query efficiency for set data.
- Search indexes: These indexes accelerate queries on JSON, single-level arrays, and various scalar types. They are commonly used for semi-structured filtering in hybrid search scenarios, and for multi-column indexing and index merging on analytical wide tables.
These capabilities, through differentiated data structures and collaboration with the optimizer, provide targeted acceleration for text, semi-structured, and hybrid search scenarios.
Full-text indexes
In relational databases, indexes are typically used to accelerate exact-match queries. However, traditional B-tree indexes often fail to meet performance requirements for large volumes of text data and fuzzy searches: scanning the full table to find matching rows one by one leads to performance bottlenecks, and complex requirements such as approximate matching and relevance ranking are difficult to achieve through simple SQL rewriting.
To address these issues, OceanBase Database currently supports full-text indexing functionality compatible with MySQL. Full-text indexes preprocess text content to build keyword indexes, significantly enhancing the efficiency of full-text searches.
Full-text indexes are used for quickly searching text data. Their main features include:
- Full-text search: By building full-text indexes, you can comprehensively index entire documents or large segments of text, enabling more flexible and efficient searches.
- Quick lookup: Users can quickly find matching text in the database based on input keywords, significantly reducing search time.
- Efficiently process large volumes of text: Full-text indexes can effectively handle various types of text data, including articles, reports, web pages, and emails, providing users with precise and fast search experiences.
- Support complex queries: In addition to basic keyword searches, full-text indexes can support complex query requirements such as approximate matching and relevance sorting, greatly enriching the database's search capabilities.
By introducing full-text indexing functionality, OceanBase Database can significantly improve query performance when dealing with large-scale text data and complex retrieval needs, allowing users to efficiently obtain the information they need.
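As an illustration of the MySQL-compatible syntax, a full-text index can be created and queried as follows. The `articles` table and its columns are hypothetical, chosen only for this sketch:

```sql
-- Create a table with a full-text index on the body column
CREATE TABLE articles (
  id    INT PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  FULLTEXT INDEX ft_body (body)
);

-- Keyword search served by the full-text index
-- instead of a row-by-row table scan
SELECT id, title
FROM articles
WHERE MATCH(body) AGAINST('distributed database' IN NATURAL LANGUAGE MODE);
```

`MATCH ... AGAINST` also supports Boolean mode for operator-based queries, following the same MySQL conventions.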
For more information about full-text indexes, see Full-text indexes (MySQL mode).
Multi-value indexes
Multi-value indexes are a special indexing feature in OceanBase Database's MySQL mode, primarily used for handling JSON documents and set data types. They are suitable for scenarios where multiple values or multiple attributes need to be queried. Their main features are as follows:
- You can create indexes on arrays or sets.
- Currently, they are applicable to JSON documents.
- They can improve the query efficiency of searches based on JSON array elements.
Multi-value indexes are particularly useful in AP (analytical processing) scenarios, which typically involve complex data analysis and report generation. For example, in a data warehouse, multi-value indexes can speed up multidimensional analysis and improve the efficiency of report generation. Specific applicable scenarios include:
- Many-to-many association queries: Multi-value indexes can optimize queries for many-to-many relationships between entities. For example, in the relationship between actors and movies, you can store all actors of a movie in a JSON array and use a JSON multi-value index to quickly query all movies an actor has participated in.
- Tag and category queries: When entities have multiple tags or categories, multi-value indexes can accelerate relevant queries. For example, multiple tags of a product can be stored in a JSON array, and a JSON multi-value index can be used to quickly find products with specific tags.
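The actor-and-movie example above can be sketched with MySQL-compatible multi-value index syntax, where the index is defined over a typed array cast of a JSON path. The table and index names are illustrative:

```sql
-- Store each movie's actors in a JSON array and index the array elements
CREATE TABLE movies (
  id    INT PRIMARY KEY,
  title VARCHAR(200),
  info  JSON,
  INDEX idx_actors ((CAST(info->'$.actors' AS CHAR(64) ARRAY)))
);

-- Find all movies a given actor appears in;
-- MEMBER OF lets the optimizer use the multi-value index idx_actors
SELECT title
FROM movies
WHERE 'Tony Leung' MEMBER OF (info->'$.actors');
```

`JSON_CONTAINS` and `JSON_OVERLAPS` predicates can likewise be served by a multi-value index under the same MySQL conventions.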
For more information about multi-value indexes, see Multi-value indexes.
Search indexes
A search index is built on the inverted index principle. It organizes the values of the indexed columns (and, for JSON columns, their paths and values) into index entries that can be quickly filtered. A single search index can cover multiple columns, and for JSON columns a WITH clause offers fine-grained control over which paths and types are indexed. This makes search indexes well suited to filtering semi-structured data and multi-column wide tables. They can also be combined with B-tree, full-text, and vector indexes to support intersection and union across multiple indexes, minimizing unnecessary table scans.
Search indexes are commonly used in two scenarios:
- Hybrid search scenarios: When a query on a table combines vector or full-text search with filtering conditions on scalar, JSON, or single-level array columns, search indexes can work together with vector and full-text indexes to improve overall hybrid search performance.
- Ad-hoc query scenarios: Search indexes support creating indexes on multiple columns, matching the common modeling habit of indexing many or even all columns.
In addition, when multiple indexes are available on the same table, you can use multi-index joint queries. The optimizer automatically identifies the optimal index combinations and selects the best joint scanning method. This reduces full-table scans when multiple column predicates are present, providing a query experience similar to that of analytical products.
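To illustrate the multi-index joint query described above in plain MySQL-compatible SQL (the table and index names are hypothetical): when predicates cover several indexed columns, the optimizer can intersect the qualifying indexes rather than fall back to a full-table scan.

```sql
-- Hypothetical wide table with several secondary indexes
CREATE TABLE events (
  id      BIGINT PRIMARY KEY,
  user_id BIGINT,
  region  VARCHAR(32),
  status  TINYINT,
  payload JSON,
  INDEX idx_user   (user_id),
  INDEX idx_region (region),
  INDEX idx_status (status)
);

-- With predicates on several indexed columns, the optimizer can
-- combine the matching indexes instead of scanning the whole table
SELECT id
FROM events
WHERE region = 'eu-west' AND status = 2 AND user_id = 42;
```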
For more information about search indexes, see Search index.
MySQL ecosystem compatibility
OceanBase Database is committed not only to delivering exceptional performance and scalability but also to ensuring high compatibility with the MySQL ecosystem. This enables seamless migration of MySQL applications to OceanBase Database and lets you leverage existing OLAP tools and technologies for rapid iteration and innovation in data analysis and business insights.
Syntax compatibility: OceanBase Database fully supports the SQL standard syntax of MySQL, including but not limited to Data Definition Language (DDL), Data Manipulation Language (DML), and Data Control Language (DCL). This means that most of the statements you used in MySQL for data queries, table structure definitions, index creation, and permission management can run directly in OceanBase Database with minimal syntax adjustments, significantly reducing migration costs and the learning curve.
- Seamless migration: Existing MySQL applications can be quickly migrated to OceanBase Database, minimizing code modifications during the process.
- Skill reuse: MySQL developers and DBAs do not need to learn new database syntax, shortening the adaptation period.
- Ecosystem integration: The compatibility with MySQL syntax allows OceanBase Database to integrate smoothly with existing BI, ETL, and data visualization tools.
View compatibility: OceanBase Database is compatible with MySQL's information_schema views, such as:
- Table information queries: Supports views like TABLES and COLUMNS, allowing users to query the structure and column information of all tables in the database. This is crucial for data dictionary management and integration with third-party tools.
- Permission management: Supports views like SCHEMATA and SCHEMA_PRIVILEGES, helping administrators easily view and manage database and table permissions.
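A typical metadata query of the kind these tools issue looks like the following; the schema name `mydb` and table name `orders` are placeholders:

```sql
-- List the columns of a table, as many management and BI tools
-- do through the MySQL-compatible information_schema views
SELECT column_name, data_type, is_nullable
FROM information_schema.COLUMNS
WHERE table_schema = 'mydb'
  AND table_name = 'orders'
ORDER BY ordinal_position;
```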
Many database management, monitoring, and analysis tools rely on INFORMATION_SCHEMA to retrieve database status and schema information. OceanBase Database's compatibility with this schema allows these tools to run directly on OceanBase Database without customization.
OceanBase Database also supports a broad range of OLAP ecosystem tools, providing full-stack data integration from batch to stream processing:
- Seamlessly integrates with mainstream tools like Flink, Kafka, OMS, OBLOADER, DataWorks, and dbt, making it easy to connect with existing enterprise ETL, real-time synchronization, and data warehouse environments.
- The database kernel supports features like direct load, external tables for files/ODPS/HDFS, and External Catalog, significantly lowering the development and operational barriers for integrating multi-source data and data lakes.
- Thanks to high compatibility with MySQL and Oracle syntax, mainstream BI, analytics, and stream processing tools (such as Tableau, Power BI, Flink, and Quick BI) can connect with zero code changes, achieving out-of-the-box functionality.
OceanBase Database not only innovates in its core engine but also integrates with numerous ecosystem products:
- Automated orchestration and scheduling: Deeply integrated with scheduling platforms like DolphinScheduler, n8n, and Airflow, enabling automatic orchestration and maintenance of multi-stage data processing tasks, enhancing large-scale data governance and multi-system collaboration efficiency.
- Observability: Seamlessly integrated with monitoring tools like Prometheus and Grafana, supporting real-time performance monitoring, alerts, and intelligent operations for multi-tenant, node, and cluster environments, facilitating stable operation of large-scale clusters and multi-business lines.
- Data visualization empowerment: Fully compatible with mainstream BI platforms like Superset, Tableau, Quick BI, and Guan Yuan BI, supporting self-service analysis by multiple roles, interactive dashboards, and complex business reports, driving business insights and data-driven decision-making.
For more information about the current OLAP ecosystem integration of OceanBase Database, see Ecosystem integration.
