OceanBase Database V4.3.3 and later support vector storage, vector indexing, and embedding vector retrieval. You can store vectorized data in OceanBase Database for subsequent retrieval.
CamelAI lets teams query their data in natural language and get back precise SQL queries, intelligent analysis, and visualizations.
This tutorial demonstrates how to use the Jina AI API to integrate the vector retrieval feature of OceanBase Cloud with CamelAI for similarity search and retrieval tasks.
Prerequisites
A transactional instance is available in your environment. For instructions on how to create the instance, see Create a transactional instance.
You have created a MySQL-compatible tenant in the instance. For instructions on how to create the tenant, see Create a MySQL-compatible tenant.
You have a MySQL database and account available under the tenant, and you have granted read and write permissions to the database account. For more information, see Create an account and Create a database (MySQL only).
You are a project admin or instance admin and have the permissions required to read and write data in the instance. If not, contact your organization admin to grant the required permissions.
You have installed Python 3.11 or later.
You have installed the required dependencies.
python3 -m pip install "unstructured[pdf]" camel-ai pyobvector
Step 1: Obtain the database connection information
Log in to the OceanBase Cloud console.
On the instance list page, expand the information of the target instance.
Select Connect > Get Connection String under the target tenant.
In the pop-up window, select Public Network as the connection method.
Follow the prompts in the pop-up window to obtain the public endpoint and the connection string.
Step 2: Build your AI assistant
Set environment variables
Obtain a Jina AI API key, then set it and the OceanBase connection information as environment variables:
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
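Before running the later snippets, it can help to confirm that all five variables are actually set, because os.getenv returns None for a missing name instead of raising an error. The following is a minimal sketch of such a check. The helper name missing_env_vars is hypothetical and is not part of CamelAI or pyobvector.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_ENV_VARS = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "JINAAI_API_KEY",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running the check first turns a confusing connection or authentication failure later in the tutorial into a clear message about which variable to export.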
Load data
CamelAI supports various embedding models, such as OpenAIEmbedding, VisionLanguageEmbedding, and JinaEmbedding. Here, we use the jina-embeddings-v3 model of Jina Embedding as an example:
import os

from camel.embeddings import JinaEmbedding
from camel.retrievers import VectorRetriever
from camel.storages import OceanBaseStorage
from camel.types import EmbeddingModelType

# Sample documents to embed and store
documents = [
"""Artificial Intelligence (AI) is a branch of computer science that aims to create systems capable of performing tasks that typically require human intelligence. AI encompasses multiple subfields including machine learning, deep learning, natural language processing, and computer vision.""",
"""Machine Learning is a subset of artificial intelligence that enables computers to learn and improve without being explicitly programmed. The main types of machine learning include supervised learning, unsupervised learning, and reinforcement learning.""",
"""Deep Learning is a branch of machine learning that uses multi-layered neural networks to simulate how the human brain works. Deep learning has achieved breakthrough progress in areas such as image recognition, speech recognition, and natural language processing.""",
"""Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. NLP applications include machine translation, sentiment analysis, text summarization, and chatbots.""",
"""Computer Vision is a field of artificial intelligence that aims to enable computers to identify and understand content in digital images and videos. Applications include facial recognition, object detection, medical image analysis, and autonomous vehicles.""",
"""Reinforcement Learning is a machine learning method where an agent learns how to make decisions through interaction with an environment. The agent optimizes its behavioral strategy through trial and error and reward mechanisms.""",
"""Neural Networks are computational models inspired by biological neural systems, composed of interconnected nodes (neurons). Neural networks can learn complex patterns and relationships and serve as the foundation for deep learning.""",
"""Large Language Models (LLMs) are natural language processing models based on deep learning. These models are trained on vast amounts of text data and can generate human-like text and answer questions.""",
"""Transformer architecture is a neural network architecture that has revolutionized natural language processing. It uses attention mechanisms to process sequential data and forms the basis for models like GPT and BERT.""",
"""Generative AI refers to artificial intelligence systems that can create new content, including text, images, audio, and video. Examples include ChatGPT for text generation, DALL-E for image creation, and various AI tools for creative applications."""
]
JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
embedding = JinaEmbedding(
    api_key=JINAAI_API_KEY,
    model_type=EmbeddingModelType.JINA_EMBEDDINGS_V3,
)
Define the vector table structure and store the vectors in OceanBase
Create a table named my_ob_vector_table. The table has a fixed structure with three columns: id, embedding, and metadata. Then use the Jina AI Embeddings API to generate an embedding vector for each document and store it in OceanBase:
OB_URI = os.getenv('OCEANBASE_DATABASE_URL')
OB_USER = os.getenv('OCEANBASE_DATABASE_USER')
OB_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OB_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')
# Create the vector table
ob_storage = OceanBaseStorage(
    vector_dim=embedding.get_output_dim(),
    table_name="my_ob_vector_table",
    uri=OB_URI,
    user=OB_USER,
    password=OB_PASSWORD,
    db_name=OB_DB_NAME,
    distance="cosine",
)
vector_retriever = VectorRetriever(
    embedding_model=embedding, storage=ob_storage
)
# Embed each document and store the resulting vector in OceanBase
for i, doc in enumerate(documents):
    print(f"Processing document {i+1}/{len(documents)}")
    vector_retriever.process(content=doc)
Perform semantic search
Generate an embedding vector for the query text by using the Jina AI API. Then search for the most relevant documents, ranked by the cosine distance between the query vector and each embedding vector in the vector table:
retrieved_info = vector_retriever.query(query="What is generative AI?", top_k=1)
print(retrieved_info)
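To make the ranking idea concrete, here is a minimal pure-Python sketch of cosine-based top-k search. It only illustrates the general technique and is not how OceanBase or CamelAI implements it: the database performs the comparison itself, and real embedding vectors have far more dimensions. The function names and the toy vectors are made up for illustration. Cosine distance is commonly defined as 1 minus cosine similarity, so the smallest distance corresponds to the largest similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=1):
    """Return the k (similarity, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 3-dimensional vectors, made up for illustration only.
stored = [
    ("generative AI", [0.9, 0.1, 0.0]),
    ("computer vision", [0.0, 1.0, 0.0]),
    ("reinforcement learning", [0.0, 0.0, 1.0]),
]
best = top_k([1.0, 0.0, 0.0], stored, k=1)
print(best[0][1])  # generative AI
```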
Expected results
[{'similarity score': '0.8538218656447916', 'content path': 'Generative AI refers to artificial intelligence systems that can create new content, including text,', 'metadata': {'piece_num': 1}, 'extra_info': {}, 'text': 'Generative AI refers to artificial intelligence systems that can create new content, including text, images, audio, and video. Examples include ChatGPT for text generation, DALL-E for image creation, and various AI tools for creative applications.'}]
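In the output above, each result is a dictionary and the similarity score is a string rather than a number. The following sketch shows one way to post-process such a list, assuming the same keys as in the expected results. The helper filter_hits and the min_score threshold are hypothetical, and the sample data is a shortened copy of the output above.

```python
# Sample shaped like the expected results above (text shortened here).
retrieved_info = [
    {
        "similarity score": "0.8538218656447916",
        "content path": "Generative AI refers to ...",
        "metadata": {"piece_num": 1},
        "extra_info": {},
        "text": "Generative AI refers to artificial intelligence systems ...",
    }
]


def filter_hits(results, min_score=0.5):
    """Return (score, text) pairs whose similarity score is at least min_score."""
    hits = []
    for item in results:
        score = float(item["similarity score"])  # the score arrives as a string
        if score >= min_score:
            hits.append((score, item["text"]))
    return hits


for score, text in filter_hits(retrieved_info, min_score=0.8):
    print(f"{score:.3f}  {text}")
```

A threshold like this is a judgment call that depends on the embedding model and the data, so treat the value as a starting point to tune rather than a recommended setting.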