This topic introduces the core concepts of vector databases and vector search.
OceanBase Database supports dense float vectors with up to 16,000 dimensions, as well as sparse vectors. It supports several vector distance calculations, including Manhattan distance, Euclidean distance, inner product, and cosine distance. OceanBase also supports HNSW- and IVF-based vector indexes, along with incremental updates and deletions that do not affect the recall rate.
OceanBase vector search supports hybrid search and flexible access. You can use SQL over the MySQL protocol from clients in many languages, or use the Python or Java SDK. OceanBase is also integrated with the AI application frameworks LlamaIndex and DB-GPT and the AI platform Dify, so you can build AI applications more easily.
Key concepts
Unstructured data
Unstructured data is data that does not have a predefined data format or organizational structure. It typically includes data in forms such as text, images, audio, and video, as well as social media content, emails, and log files. Due to the complexity and diversity of unstructured data, processing it requires specific tools and techniques, such as natural language processing, image recognition, and machine learning.
Vector
A vector is the projection of an object in a high-dimensional space. Mathematically, a vector is a floating-point array with the following characteristics:
- Each element in the array is a floating-point number that represents one dimension of the vector.
- The number of elements in the array is the dimensionality of the vector space.
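As a minimal sketch of the definition above, the following Python snippet treats a vector as a plain list of floats and computes the four distance measures named in the introduction. The function names are illustrative, not part of any OceanBase API, and cosine distance is assumed here to follow the common convention of 1 minus cosine similarity.

```python
import math

def manhattan(a, b):
    # Sum of absolute per-dimension differences (L1 distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Square root of the summed squared differences (L2 distance).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # Sum of per-dimension products; larger usually means more similar.
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # Assumed convention: 1 - cosine similarity.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [4.0, 6.0, 3.0]

dim = len(v1)                  # dimensionality = number of elements
d_l1 = manhattan(v1, v2)
d_l2 = euclidean(v1, v2)
ip = inner_product(v1, v2)
d_cos = cosine_distance(v1, v2)
```

Note that inner product is a similarity score rather than a distance, so results are typically ranked in the opposite direction from the other three measures.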
Embedding
Embedding is the process of using a deep learning network to extract content and semantics from unstructured data (such as images and video) and turn them into feature vectors. Embedding maps data from a high-dimensional space into a lower-dimensional space and converts rich multimodal data into multi-dimensional vectors.
Vector similarity search
Users often need to retrieve specific information quickly from massive datasets, whether online literature databases, e-commerce product catalogs, or rapidly growing multimedia content libraries. As data volumes continue to grow, traditional keyword-based search can no longer meet the demands for both accuracy and speed, which has given rise to vector search. Vector similarity search uses feature extraction and vectorization to convert unstructured data (text, images, audio, and so on) into vectors, then compares those vectors with a similarity measure to capture semantic meaning. It delivers more accurate and efficient results than traditional keyword search.
The main difference between search and query is result accuracy: search returns approximate results and does not guarantee 100% accuracy; query returns exact results and does guarantee 100% accuracy.
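The distinction between exact and approximate results can be illustrated with a small sketch. The snippet below scans every vector to get an exact top-k answer, then measures how much of that answer a hypothetical approximate result recovers (its recall). The data, the `approx` list, and the function names are all made up for illustration; this is not how OceanBase implements its indexes.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_top_k(query, vectors, k):
    # Brute force: rank every stored vector by distance to the query.
    ranked = sorted(vectors, key=lambda item_id: euclidean(query, vectors[item_id]))
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    # Fraction of the true nearest neighbors that the approximate result found.
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 2.0],
    "d": [5.0, 5.0],
}
query = [0.1, 0.1]

exact = exact_top_k(query, vectors, k=2)   # ground truth from a full scan
approx = ["a", "c"]                        # hypothetical output of an approximate index
recall = recall_at_k(approx, exact)
```

A full scan is always exact but its cost grows with the number of vectors, which is why approximate indexes trade a small amount of recall for speed.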
For other related terms, see OceanBase glossary.
Why OceanBase vector search?
OceanBase vector search is built on OceanBase's multi-model capabilities and excels at hybrid search, scalability, high performance, high availability, cost efficiency, multi-tenancy, and data security.
Hybrid search
OceanBase supports hybrid search in two ways so one database can meet diverse storage and search needs:
- Vector and scalar data: Combine vector search with scalar filtering.
- Vector index and full-text index: Combine vector index search with full-text index search.
Full-text and vector hybrid search can also use scalar filter conditions.
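To make the vector-plus-scalar case concrete, here is a minimal in-memory sketch: rows are first filtered on scalar attributes and the survivors are ranked by vector distance. The `Item` class, its fields, and `hybrid_search` are hypothetical names, and filtering before ranking is an assumption of this sketch; the source does not say in which order OceanBase applies the filter and the index lookup.

```python
import math
from dataclasses import dataclass

@dataclass
class Item:
    item_id: int
    category: str     # scalar attribute
    price: float      # scalar attribute
    embedding: list   # vector attribute

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hybrid_search(items, query_vec, k, category, max_price):
    # Step 1: scalar filtering.
    candidates = [it for it in items
                  if it.category == category and it.price <= max_price]
    # Step 2: rank the remaining rows by vector distance to the query.
    candidates.sort(key=lambda it: euclidean(query_vec, it.embedding))
    return [it.item_id for it in candidates[:k]]

items = [
    Item(1, "book", 12.0, [0.1, 0.9]),
    Item(2, "book", 30.0, [0.2, 0.8]),   # close in vector space, but too expensive
    Item(3, "toy", 8.0, [0.1, 0.9]),     # identical vector, wrong category
    Item(4, "book", 9.0, [0.9, 0.1]),
]

result = hybrid_search(items, [0.1, 0.9], k=2, category="book", max_price=20.0)
```

Items 2 and 3 are nearer to the query than item 4 in vector space, yet they are excluded because they fail the scalar conditions.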
Distributed scalability
As a natively distributed database, OceanBase Database scales horizontally and supports multiple partitions, so it can handle massive volumes of vector data.
High performance
OceanBase Database integrates the VSAG indexing algorithm library, which demonstrates outstanding performance on the 960-dimensional GIST dataset. In the ANN-Benchmarks tests, the VSAG library significantly outperformed other algorithms.
High availability
Building on Paxos and its data synchronization-based disaster recovery solution, OceanBase vector search supports disaster recovery across primary/standby deployments, data centers, and geographic regions. Even in-memory HNSW indexes remain accessible in real time after a disaster recovery switchover.
Transactions
OceanBase Database's distributed transaction capabilities, based on the Multi-Paxos protocol, ensure the consistency and integrity of vector data. The database also provides effective concurrency control and fault recovery mechanisms.
Cost efficiency
OceanBase Database's storage encoding and compression capabilities significantly reduce the storage space required for vectors, helping to lower application storage costs.
Data security
OceanBase Database provides comprehensive enterprise-grade security features, including identity authentication, access control, data encryption, monitoring and alerts, and security auditing. These features effectively ensure data security in vector search scenarios.
Ease of use
OceanBase vector search offers flexible access: SQL over the MySQL protocol from clients in many languages, or the Python SDK. OceanBase is also integrated with AI application frameworks such as LangChain and LlamaIndex to support AI application development.
Comprehensive toolset
OceanBase Database features a comprehensive database toolset, supporting data development, migration, operations, diagnostics, and full lifecycle data management, ensuring robust support for the development and maintenance of AI applications.
Scenarios
Retrieval-Augmented Generation (RAG): RAG is an artificial intelligence (AI) framework that retrieves facts from external knowledge bases so that large language models (LLMs) can work with accurate, up-to-date information, and that gives users insight into how the LLM generates its output. RAG is commonly used in intelligent Q&A systems and knowledge bases.
Personalized recommendation: A recommendation system suggests items that users may be interested in, based on their historical behavior and preferences. When a recommendation request arrives, the system computes similarity based on the user's characteristics and returns the closest items as recommendations, such as restaurants and scenic spots.
Image search and text-to-image search: Search a large image or text database for items most similar to a query image or text. The image or text features used for search can be stored in a vector database; high-performance indexes enable fast similarity computation and return matching images or text. Typical use cases include facial recognition.
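The retrieval step of the RAG scenario above can be sketched in a few lines of Python. This is a toy under stated assumptions: the three-dimensional embeddings are hand-written rather than produced by a model, the knowledge base is an in-memory list rather than a vector database, and `retrieve` and `build_prompt` are hypothetical helper names. The call to an LLM is left out.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy knowledge base: (passage, hand-written embedding).
knowledge_base = [
    ("Paxos is a consensus protocol.", [0.9, 0.1, 0.0]),
    ("HNSW is a graph-based vector index.", [0.1, 0.9, 0.1]),
    ("IVF partitions vectors into clusters.", [0.0, 0.8, 0.3]),
]

def retrieve(query_vec, k):
    # Rank passages by similarity to the query, most similar first.
    ranked = sorted(knowledge_base,
                    key=lambda doc: cosine_similarity(query_vec, doc[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    # Put the retrieved facts in front of the question for the LLM.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n{context}\nQuestion: {question}"

question = "Which vector index types are available?"
query_vec = [0.05, 0.9, 0.2]   # hypothetical embedding of the question

passages = retrieve(query_vec, k=2)
prompt = build_prompt(question, passages)
```

In a real deployment, the embeddings would come from an embedding model and the ranking would be done by the database's vector index rather than a Python sort.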