This topic introduces the core concepts of vector databases and vector search.
OceanBase Database supports dense float vectors with up to 16,000 dimensions, as well as sparse vectors. It supports several vector distance calculations, including Manhattan distance, Euclidean distance, inner product, and cosine distance. OceanBase Database also supports creating HNSW-based and IVF-based vector indexes, along with incremental updates and deletions that do not affect the recall rate.
OceanBase vector search offers hybrid retrieval with scalar filtering. It also provides flexible access interfaces: you can use SQL over the MySQL protocol from clients in various programming languages, or use a Python SDK. In addition, OceanBase Database is adapted to AI application development frameworks such as LlamaIndex and DB-GPT, and to the AI application development platform Dify, which simplifies AI application development.
Key concepts
Unstructured data
Unstructured data is data that does not have a predefined data format or organizational structure. It typically includes data in forms such as text, images, audio, and video, as well as social media content, emails, and log files. Due to the complexity and diversity of unstructured data, processing it requires specific tools and techniques, such as natural language processing, image recognition, and machine learning.
Vector
A vector is the projection of an object in a high-dimensional space. Mathematically, a vector is a floating-point array with the following characteristics:
Each element in the array is a floating-point number that represents a dimension of the vector.
The size, namely, the number of elements, of the vector array indicates the dimensionality of the entire vector space.
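As a minimal sketch of the definition above, the following Python snippet treats a vector as a plain list of floats and computes the four distance measures named in the overview. The function names are illustrative and are not OceanBase APIs. Defining cosine distance as 1 minus cosine similarity is a common convention and is assumed here.

```python
import math

def manhattan_distance(a, b):
    # Sum of absolute differences per dimension (L1 distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Square root of the sum of squared differences (L2 distance).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # Sum of element-wise products; larger means more similar.
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # Assumed convention: 1 - cosine similarity. Undefined for zero vectors.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

# Two illustrative 3-dimensional vectors.
u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]

print(len(u))                    # dimensionality: 3
print(manhattan_distance(u, v))  # 3 + 4 + 0 = 7.0
print(euclidean_distance(u, v))  # sqrt(9 + 16) = 5.0
print(inner_product(u, v))       # 4 + 12 + 9 = 25.0
print(cosine_distance(u, v))
```

Manhattan, Euclidean, and cosine distance are smaller for closer vectors, whereas a larger inner product indicates greater similarity.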
Vector embedding (Embedding)
Vector embedding is the process of using a deep learning neural network to extract content and semantics from unstructured data, such as images and videos, and convert them into feature vectors. Embedding maps the original data from a high-dimensional (sparse) space to a low-dimensional (dense) space, turning feature-rich multimodal data into a multi-dimensional array (vector).
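The sparse-to-dense mapping can be illustrated with a toy example. The sketch below is not a neural network: a fixed random projection stands in for a trained embedding model, and the vocabulary, the dimension count, and the function names are all hypothetical. It only shows the shape of the transformation, from a sparse word-count vector to a short dense vector.

```python
import random

# Hypothetical vocabulary; real sparse spaces have far more dimensions.
VOCAB = ["database", "vector", "search", "image", "audio", "index", "query", "text"]
DENSE_DIM = 3

# Fixed seed so the stand-in "model" is deterministic.
rng = random.Random(42)
# One row of DENSE_DIM weights per vocabulary word.
PROJECTION = [[rng.uniform(-1.0, 1.0) for _ in range(DENSE_DIM)] for _ in VOCAB]

def sparse_counts(text):
    # High-dimensional sparse representation: one count per vocabulary word.
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def toy_embed(text):
    # Low-dimensional dense representation: counts multiplied by the projection.
    counts = sparse_counts(text)
    return [
        sum(counts[i] * PROJECTION[i][d] for i in range(len(VOCAB)))
        for d in range(DENSE_DIM)
    ]

sentence = "vector search over a vector index"
print(sparse_counts(sentence))   # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(len(toy_embed(sentence)))  # 3
```

A trained model learns its weights so that semantically similar inputs land close together; a random projection makes no such guarantee.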
Vector similarity search
Users often need to quickly retrieve specific information from massive datasets. Whether the source is an online literature database, an e-commerce product catalog, or a rapidly growing multimedia content library, an efficient search system is essential for locating content of interest. As data volumes continue to grow, traditional keyword-based search can no longer meet the demands for both accuracy and speed, which has given rise to vector search technology.
Vector similarity search uses feature extraction and vectorization to convert unstructured data, such as text, images, and audio, into vectors. By applying similarity measures to compare these vectors, it captures the deeper semantic meaning of the data and returns more precise results than keyword matching alone.
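A minimal sketch of similarity search is an exact scan that compares the query vector with every stored vector and keeps the k closest. The item names and 3-dimensional embeddings below are hypothetical. Vector indexes such as HNSW and IVF exist to avoid this full scan on large datasets by returning approximate nearest neighbors.

```python
import heapq
import math

def cosine_distance(a, b):
    # Assumed convention: 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, items, k):
    """Return the k (distance, item_id) pairs closest to query.

    This is an exact search that scans every stored vector.
    """
    scored = ((cosine_distance(query, vec), item_id) for item_id, vec in items.items())
    return heapq.nsmallest(k, scored)

# Hypothetical embeddings keyed by item ID.
items = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.8, 0.3, 0.1],
    "invoice_pdf": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

for distance, item_id in top_k(query, items, k=2):
    print(item_id, round(distance, 4))
```

With these values, the two photo embeddings rank ahead of the invoice embedding because they point in nearly the same direction as the query.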
For other related terms, see the OceanBase glossary.
Why OceanBase vector search?
OceanBase Database's vector search is built on its integrated multi-model capabilities and excels in areas such as hybrid query processing, scalability, high performance, high availability, cost efficiency, multi-tenancy, and data security.
Multi-model fusion queries
OceanBase Database supports fusion queries across multiple data types, including vector data, spatial data, document data, and scalar data. With support for various indexes such as vector indexes, spatial indexes, and full-text indexes, OceanBase Database delivers exceptional performance in multi-model fusion queries. It enables a single database to handle diverse storage and retrieval needs for applications.
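Conceptually, a query that combines scalar filtering with vector ranking can be sketched in memory as follows. This is an illustration only: the row layout, column names, and the filter-then-rank order are assumptions, and a real database engine may choose a different execution strategy.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_search(rows, query, predicate, k):
    # Step 1: apply the scalar filter.
    candidates = [row for row in rows if predicate(row)]
    # Step 2: rank the surviving rows by vector distance to the query.
    candidates.sort(key=lambda row: l2_distance(row["embedding"], query))
    return candidates[:k]

# Hypothetical product rows mixing scalar columns with an embedding column.
rows = [
    {"id": 1, "category": "shoes", "price": 59.0, "embedding": [0.1, 0.9]},
    {"id": 2, "category": "shoes", "price": 120.0, "embedding": [0.2, 0.8]},
    {"id": 3, "category": "bags", "price": 45.0, "embedding": [0.2, 0.85]},
    {"id": 4, "category": "shoes", "price": 80.0, "embedding": [0.9, 0.1]},
]
query = [0.2, 0.8]

hits = filtered_search(
    rows,
    query,
    predicate=lambda r: r["category"] == "shoes" and r["price"] < 100,
    k=2,
)
print([row["id"] for row in hits])  # [1, 4]
```

Row 2 is the closest vector to the query but is excluded by the price filter, which shows why the scalar condition and the vector ranking must be evaluated together.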
Distributed scalability
OceanBase Database is a natively distributed database. Its horizontal scalability and multi-partitioning capabilities allow it to support massive amounts of vector data.
High performance
OceanBase Database integrates the VSAG indexing algorithm library, which demonstrates outstanding performance on the 960-dimensional GIST dataset. In the ANN-Benchmarks tests, the VSAG library significantly outperformed other algorithms.
High availability
Built on Paxos and the data synchronization disaster recovery solution of OceanBase Database, vector search supports disaster recovery across primary/standby deployments, data centers, and geographic regions. Even in-memory HNSW indexes remain accessible in real time after a disaster recovery switchover.
Transactions
OceanBase Database's distributed transaction capabilities, based on the Multi-Paxos protocol, ensure the consistency and integrity of vector data. OceanBase Database also offers effective concurrency control and fault recovery mechanisms.
Cost efficiency
OceanBase Database's storage encoding and compression capabilities significantly reduce the storage space required for vectors, helping to lower application storage costs.
Data security
OceanBase Database provides comprehensive enterprise-grade security features, including identity authentication, access control, data encryption, monitoring and alerts, and security auditing. These features effectively ensure data security in vector search scenarios.
Ease of use
OceanBase Database offers flexible access interfaces, enabling SQL access through MySQL protocol clients across various programming languages, as well as seamless integration via a Python SDK. Furthermore, OceanBase has been optimized for AI application development frameworks like LangChain and LlamaIndex, significantly enhancing its capabilities to support AI-driven solutions.
Comprehensive toolset
OceanBase Database features a comprehensive database toolset, supporting data development, migration, operations, diagnostics, and full lifecycle data management, ensuring robust support for the development and maintenance of AI applications.
Scenarios
Retrieval-Augmented Generation (RAG): RAG is an artificial intelligence (AI) framework that retrieves facts from external knowledge bases so that Large Language Models (LLMs) can work with accurate, up-to-date information, and so that users can gain insight into how an LLM generates its answers. RAG is commonly used in intelligent Q&A systems and knowledge bases.
Personalized recommendation: A recommendation system recommends items that users may be interested in based on their historical behavior and preferences. When a recommendation request is initiated, the system calculates similarity based on the user's characteristics and returns the items the user is most likely to be interested in, such as restaurants and scenic spots.
Image/Text search: An image/text search task aims to find the results most similar to a specified image or text in a large-scale image/text database. The image/text features used in the search can be stored in a vector database, where high-performance indexes enable efficient similarity calculation and return the image/text results that match the search criteria. This applies to scenarios such as facial recognition.
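For the RAG scenario, the retrieval step can be sketched as follows: rank the knowledge base passages by similarity to the question, then place the best matches in the prompt passed to an LLM. The passages, embeddings, and prompt wording are hypothetical, and the sketch stops before the LLM call. In practice, the embeddings would come from an embedding model and the ranking from a vector database.

```python
def inner_product(a, b):
    # Similarity score; larger means more similar.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, knowledge_base, k):
    # Rank passages by similarity to the question and keep the top k.
    ranked = sorted(
        knowledge_base,
        key=lambda doc: inner_product(doc["embedding"], query_vec),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, passages):
    # Put the retrieved passages in front of the question.
    context = "\n".join(f"- {p['text']}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical knowledge base with hand-written 3-dimensional embeddings.
knowledge_base = [
    {"text": "HNSW is a graph-based vector index.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "IVF partitions vectors into clusters.", "embedding": [0.7, 0.3, 0.1]},
    {"text": "Paxos is a consensus protocol.", "embedding": [0.0, 0.1, 0.9]},
]
question = "Which index types organize vectors for search?"
question_vec = [0.8, 0.2, 0.0]  # hypothetical embedding of the question

passages = retrieve(question_vec, knowledge_base, k=2)
prompt = build_prompt(question, passages)
print(prompt)
```

Only the two index-related passages reach the prompt; the unrelated Paxos passage is left out.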