This topic describes the core concepts of vector databases and vector search.
OceanBase Database supports dense vectors of the float type with up to 16,000 dimensions. It can calculate several vector distances, including Manhattan distance, Euclidean distance, inner product, and cosine distance. It supports HNSW-based vector indexes with incremental updates and deletions, and these operations do not affect the recall rate. It also supports integrated queries that combine vector search with scalar filtering. The access interfaces are flexible: you can use SQL from clients in various languages over the MySQL protocol, or use the Python SDK. OceanBase Database has also been adapted to AI application development frameworks such as LlamaIndex and DB-GPT, and to AI application development platforms such as Dify, to better serve AI application development.
Key concepts
Unstructured data
Unstructured data refers to data without a predefined data format or organizational structure. Unstructured data typically includes data in the forms of text, images, audio, and video, as well as social media content, emails, and log files. Due to the complexity and diversity of unstructured data, specific tools and techniques are required for processing it, such as natural language processing, image recognition, and machine learning.
Vectors
A vector represents an object as a point in a high-dimensional space. In a vector database, a vector is stored as an array of floating-point numbers with the following characteristics:
Each element in the array represents a dimension of the vector, and each element is a floating-point number.
The size of the vector array (number of elements) indicates the dimensionality of the entire vector space.
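As a minimal sketch of these characteristics, the following Python snippet models a vector as a plain list of floats and computes the four distance metrics named in the introduction. The function names are illustrative and are not OceanBase APIs. Cosine distance is assumed here to be 1 minus cosine similarity, which is a common convention; the exact definition used by the database may differ.

```python
import math

def manhattan_distance(a, b):
    # Sum of absolute per-dimension differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Square root of the sum of squared per-dimension differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # Sum of per-dimension products.
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # Assumption: cosine distance = 1 - cosine similarity.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

# Two 3-dimensional vectors: each element is one dimension.
u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]

print(len(u))                    # dimensionality: 3
print(manhattan_distance(u, v))  # 3 + 4 + 0 = 7.0
print(euclidean_distance(u, v))  # sqrt(9 + 16 + 0) = 5.0
print(inner_product(u, v))       # 4 + 12 + 9 = 25.0
print(cosine_distance(u, v))     # small positive value: the vectors point in similar directions
```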
Vector embedding
Vector embedding, also known as embedding, refers to the process of extracting content and semantics from unstructured data using deep learning neural networks, and converting images, videos, and other data into feature vectors. Embedding technology maps the original data from a high-dimensional space to a low-dimensional space, converting rich-feature multimodal data into multidimensional vector data.
Vector similarity search
Users often need to quickly retrieve information from large amounts of data. Online document databases, product catalogs of e-commerce platforms, and multimedia content libraries keep growing and require efficient search systems to locate content of interest. As data volumes surge, traditional keyword-based search can no longer meet users' demands for search accuracy and speed. Vector similarity search addresses these challenges. It uses feature extraction and vectorization techniques to convert structured and unstructured data such as text, images, and audio into vectors. It then compares these vectors by using a similarity measure to capture the deeper semantics of the data, which yields more accurate and efficient search results.
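The comparison step can be sketched as an exact, brute-force nearest-neighbor search. This is an illustration only: it scans every vector, whereas an index such as HNSW finds approximate neighbors without a full scan. The `catalog` data and the `top_k` helper are hypothetical.

```python
import heapq
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, items, k):
    """Return the k (item_id, distance) pairs closest to the query, nearest first.

    items maps an item ID to its vector. Every vector is compared with the
    query, so the cost grows linearly with the number of items.
    """
    scored = ((euclidean_distance(query, vec), item_id)
              for item_id, vec in items.items())
    return [(item_id, dist) for dist, item_id in heapq.nsmallest(k, scored)]

# Hypothetical 2-dimensional embeddings of four documents.
catalog = {
    "doc_a": [0.0, 0.0],
    "doc_b": [3.0, 4.0],
    "doc_c": [1.0, 0.0],
    "doc_d": [10.0, 10.0],
}

result = top_k([0.0, 0.0], catalog, 2)
print(result)  # [('doc_a', 0.0), ('doc_c', 1.0)]
```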
Why choose OceanBase vector search?
OceanBase Database's vector search capability is built on its multi-model integration capabilities and offers strengths in mixed queries, scalability, performance, high availability, cost, multi-tenancy, and data security.
Mixed queries
OceanBase Database supports mixed queries of vector data, spatial data, document data, and scalar data. Vector indexes, spatial indexes, and full-text indexes support these multimodal mixed queries and deliver high query performance.
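The idea of combining scalar conditions with vector ranking can be sketched in plain Python. This is a conceptual illustration, not how OceanBase Database executes such queries internally: the `products` rows, their column names, and the `filtered_search` helper are all hypothetical, and the sketch simply filters first and then ranks the survivors by Euclidean distance.

```python
import math

# Hypothetical rows: scalar columns plus a 2-dimensional embedding.
products = [
    {"id": 1, "category": "shoes", "price": 59.0, "embedding": [0.9, 0.1]},
    {"id": 2, "category": "shoes", "price": 120.0, "embedding": [0.8, 0.2]},
    {"id": 3, "category": "bags", "price": 45.0, "embedding": [0.9, 0.1]},
    {"id": 4, "category": "shoes", "price": 70.0, "embedding": [0.1, 0.9]},
]

def filtered_search(rows, query, category, max_price, k):
    # Step 1: apply the scalar conditions.
    candidates = [r for r in rows
                  if r["category"] == category and r["price"] <= max_price]
    # Step 2: rank the remaining rows by distance to the query vector.
    candidates.sort(key=lambda r: math.dist(query, r["embedding"]))
    return [r["id"] for r in candidates[:k]]

print(filtered_search(products, [1.0, 0.0], "shoes", 100.0, 2))  # [1, 4]
```

Row 2 is closer to the query than row 4 but is excluded by the price condition, which shows why the scalar filter and the vector ranking have to be evaluated together.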
Scalability
As a natively distributed database, OceanBase Database supports horizontal scaling and multi-partition capabilities, enabling it to handle massive vector data.
High performance
OceanBase Database's vector search capability incorporates the VSAG index algorithm library. VSAG demonstrates excellent performance on the 960-dimensional GIST dataset and significantly outperforms other algorithms in the ANN-Benchmarks test.
High availability
OceanBase Database supports primary/standby and cross-region disaster recovery for vector search, ensuring real-time access to the HNSW index after disaster recovery switching.
Transactions
OceanBase Database's distributed transaction capability, based on the Multi-Paxos protocol, ensures the consistency and integrity of vector data. It also provides effective concurrency control and fault recovery mechanisms.
Low costs
OceanBase Database's storage encoding and compression capabilities can significantly reduce vector storage space requirements, thereby lowering application storage costs.
Data security
OceanBase Database supports a comprehensive range of enterprise-level security features, including identity authentication, access control, data encryption, monitoring and alerting, and security auditing, ensuring data security in vector search scenarios.
Ease of use
OceanBase Database provides flexible access interfaces. You can access vector search services by using SQL from clients in various languages over the MySQL protocol, or by using the Python SDK. Additionally, OceanBase Database has been adapted to AI application development frameworks such as LangChain and LlamaIndex, to better serve AI application development.
Robust toolset
OceanBase Database offers a comprehensive suite of database tools for data development, migration, operations, and diagnostics, ensuring seamless AI application development and maintenance.
Application scenarios
Retrieval Augmented Generation (RAG): RAG is an AI framework that retrieves facts from external knowledge bases to provide large language models (LLMs) with the most accurate and up-to-date information. It also allows users to gain a deeper understanding of the LLM generation process. RAG is commonly used in intelligent Q&A and knowledge bases.
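The retrieval half of RAG can be sketched as follows. This is a toy illustration under stated assumptions: the embeddings are hand-written 3-dimensional vectors rather than the output of a real embedding model, the `retrieve` and `build_prompt` helpers are hypothetical, and the call to an LLM is left out.

```python
import math

# Hypothetical knowledge base: (text, hand-written embedding) pairs.
knowledge_base = [
    ("OceanBase Database supports HNSW vector indexes.", [1.0, 0.0, 0.0]),
    ("RAG retrieves facts from an external knowledge base.", [0.0, 1.0, 0.0]),
    ("Recommendation systems rank items by similarity.", [0.0, 0.0, 1.0]),
]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, k):
    # Rank all entries by similarity to the query and keep the top k texts.
    ranked = sorted(knowledge_base,
                    key=lambda entry: cosine_similarity(query_vec, entry[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, query_vec, k=1):
    # Place the retrieved facts in front of the question for the LLM.
    context = "\n".join(retrieve(query_vec, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Assumed embedding of the question; a real system would compute it with a model.
prompt = build_prompt("What index type is supported?", [0.9, 0.1, 0.0])
print(prompt)
```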
Personalized recommendation: Recommendation systems can recommend items that users may be interested in based on their historical behavior and preferences. When a recommendation request is initiated, the system calculates the similarity based on user characteristics and returns items that the user may be interested in as the recommendation results. Examples include restaurant recommendations and attraction recommendations.
Image search and text-based image search: An image search task involves searching a large-scale image database for the images most similar to a specified image. A text-based image search task involves searching a large-scale image database for the images that best match a specified text description. The text and image features used during the search can be stored in a vector database, and high-performance indexes enable efficient similarity search that returns the images or texts matching the search content. Examples include face recognition.