OceanBase Database offers features such as vector storage, vector indexing, and embedding-based vector search. You can store vectorized data in OceanBase Database, making it available for fast and efficient search.
CamelAI is transforming how teams interact with their data: ask a question in natural language and receive accurate SQL queries, intelligent analysis, and visualized insights.
Prerequisites
You have deployed OceanBase Database V4.4.0 or later, and created a MySQL-compatible tenant. After creating the tenant, continue with the steps below.
Your environment includes an active MySQL-compatible tenant, a MySQL database, and a user account with read and write privileges.
Python 3.11 or above is installed.
Required dependencies are installed:
python3 -m pip install "unstructured[pdf]" camel-ai pyobvector
Make sure you have set the ob_vector_memory_limit_percentage parameter in your tenant to enable vector search. A recommended value is 30. For details on configuring this parameter, refer to ob_vector_memory_limit_percentage.
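As a rough illustration of the ob_vector_memory_limit_percentage prerequisite above, the sketch below builds the statement you would run in the tenant. It assumes the usual ALTER SYSTEM SET syntax for tenant parameters and a valid range of 0 up to (but not including) 100; both are assumptions to confirm against the parameter reference. The helper name vector_memory_statement is hypothetical.

```python
def vector_memory_statement(percentage: int = 30) -> str:
    """Build the statement that sets ob_vector_memory_limit_percentage.

    Hypothetical helper: the ALTER SYSTEM SET syntax and the [0, 100)
    range are assumptions, not taken from this guide.
    """
    if not 0 <= percentage < 100:
        raise ValueError("percentage must be in the range [0, 100)")
    return f"ALTER SYSTEM SET ob_vector_memory_limit_percentage = {percentage};"


# Print the statement for the recommended value of 30.
print(vector_memory_statement())
```

Run the printed statement from a session connected to the tenant, for example through obclient.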
Step 1: Get your database connection information
Reach out to your OceanBase administrator or deployment team to obtain the database connection string, for example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
$host: The IP address for connecting to OceanBase Database. If you are using OceanBase Database Proxy (ODP), use the ODP address. For direct connections, use the OBServer node IP.
$port: The port number for connecting to OceanBase Database. The default for ODP is 2883 (can be customized during ODP deployment). For direct connections, the default is 2881 (customizable during OceanBase deployment).
$database_name: The name of the database you want to access.
Notice: The user connecting to the tenant must have CREATE, INSERT, DROP, and SELECT privileges on the database. For more details on user privileges, see privilege types in MySQL-compatible mode.
$user_name: The user account for connecting to the tenant. For ODP, common formats are username@tenant_name#cluster_name or cluster_name:tenant_name:username; for direct connections, use username@tenant_name.
$password: The password for the account.
For more details about connection strings, see Connect to an OceanBase tenant using OBClient.
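The user name formats described above can be sketched as a small helper. This is only an illustration: the function name format_user_name, its parameters, and the sample values (root, mysql_tenant, obcluster) are hypothetical and do not come from OceanBase tooling.

```python
from typing import Optional


def format_user_name(
    username: str,
    tenant_name: str,
    cluster_name: Optional[str] = None,
    colon_style: bool = False,
) -> str:
    """Compose the $user_name value for a connection.

    Hypothetical helper. Without a cluster name it returns the
    direct-connection form; with one it returns an ODP form.
    """
    if cluster_name is None:
        # Direct connection to an OBServer node.
        return f"{username}@{tenant_name}"
    if colon_style:
        # ODP form: cluster_name:tenant_name:username
        return f"{cluster_name}:{tenant_name}:{username}"
    # ODP form: username@tenant_name#cluster_name
    return f"{username}@{tenant_name}#{cluster_name}"


print(format_user_name("root", "mysql_tenant"))
print(format_user_name("root", "mysql_tenant", "obcluster"))
print(format_user_name("root", "mysql_tenant", "obcluster", colon_style=True))
```

The three calls print the direct-connection form followed by the two ODP forms.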
Step 2: Build your AI assistant
Set environment variables
Get your Jina AI API key, and set it along with your OceanBase connection details in your environment variables.
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
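Before running the rest of the example, it can help to confirm that the variables exported above are visible to Python. The following is a minimal sketch; the helper name missing_env_vars is hypothetical, and the variable names are the ones exported in this step.

```python
import os

# Environment variables exported in this step.
REQUIRED_VARS = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "JINAAI_API_KEY",
]


def missing_env_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```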
Load data
CamelAI supports various embedding models, such as OpenAIEmbedding, VisionLanguageEmbedding, and JinaEmbedding. In this example, we will use Jina Embedding's jina-embeddings-v3 model:
import os
from camel.embeddings import JinaEmbedding
from camel.retrievers import VectorRetriever
from camel.storages.vectordb_storages import (
    OceanBaseStorage,
    VectorDBQuery,
    VectorRecord,
)
from camel.types import EmbeddingModelType
documents = [
"""Artificial Intelligence (AI) is a branch of computer science that aims to create systems capable of performing tasks that typically require human intelligence. AI encompasses multiple subfields including machine learning, deep learning, natural language processing, and computer vision.""",
"""Machine Learning is a subset of artificial intelligence that enables computers to learn and improve without being explicitly programmed. The main types of machine learning include supervised learning, unsupervised learning, and reinforcement learning.""",
"""Deep Learning is a branch of machine learning that uses multi-layered neural networks to simulate how the human brain works. Deep learning has achieved breakthrough progress in areas such as image recognition, speech recognition, and natural language processing.""",
"""Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. NLP applications include machine translation, sentiment analysis, text summarization, and chatbots.""",
"""Computer Vision is a field of artificial intelligence that aims to enable computers to identify and understand content in digital images and videos. Applications include facial recognition, object detection, medical image analysis, and autonomous vehicles.""",
"""Reinforcement Learning is a machine learning method where an agent learns how to make decisions through interaction with an environment. The agent optimizes its behavioral strategy through trial and error and reward mechanisms.""",
"""Neural Networks are computational models inspired by biological neural systems, composed of interconnected nodes (neurons). Neural networks can learn complex patterns and relationships and serve as the foundation for deep learning.""",
"""Large Language Models (LLMs) are natural language processing models based on deep learning. These models are trained on vast amounts of text data and can generate human-like text and answer questions.""",
"""Transformer architecture is a neural network architecture that has revolutionized natural language processing. It uses attention mechanisms to process sequential data and forms the basis for models like GPT and BERT.""",
"""Generative AI refers to artificial intelligence systems that can create new content, including text, images, audio, and video. Examples include ChatGPT for text generation, DALL-E for image creation, and various AI tools for creative applications."""
]
JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
embedding = JinaEmbedding(
    api_key=JINAAI_API_KEY,
    model_type=EmbeddingModelType.JINA_EMBEDDINGS_V3,
)
Connect to the OceanBase cluster, define the vector table structure, and generate and store embedding vectors in OceanBase
Create a table named my_ob_vector_table with a fixed structure containing id, embedding, and metadata columns. Use the Jina AI Embeddings API to generate embedding vectors for each piece of text, and then store them in OceanBase:
OB_URI = os.getenv('OCEANBASE_DATABASE_URL')
OB_USER = os.getenv('OCEANBASE_DATABASE_USER')
OB_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OB_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')
# Create the table, with the vector dimension taken from the embedding model
ob_storage = OceanBaseStorage(
    vector_dim=embedding.get_output_dim(),
    table_name="my_ob_vector_table",
    uri=OB_URI,
    user=OB_USER,
    password=OB_PASSWORD,
    db_name=OB_DB_NAME,
    distance="cosine",
)
# Build a retriever that pairs the embedding model with the OceanBase storage
vector_retriever = VectorRetriever(
    embedding_model=embedding, storage=ob_storage
)
# Embed and store each document
for i, doc in enumerate(documents):
    print(f"Processing document {i+1}/{len(documents)}")
    vector_retriever.process(content=doc)
Semantic search
Use the Jina AI API to generate an embedding vector for your query text. Then, search for the most relevant documents by calculating the cosine distance between the query's embedding vector and each embedding vector in the vector table:
retrieved_info = vector_retriever.query(query="What is generative AI?", top_k=1)
print(retrieved_info)
Expected result
[{'similarity score': '0.8538218656447916', 'content path': 'Generative AI refers to artificial intelligence systems that can create new content, including text,', 'metadata': {'piece_num': 1}, 'extra_info': {}, 'text': 'Generative AI refers to artificial intelligence systems that can create new content, including text, images, audio, and video. Examples include ChatGPT for text generation, DALL-E for image creation, and various AI tools for creative applications.'}]
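To make the cosine distance used in the search step concrete, here is a minimal standard-library sketch on toy two-dimensional vectors. It is not how OceanBase or CamelAI compute the metric internally. It also assumes that the similarity score in the result above is a cosine similarity, which would relate to cosine distance as distance = 1 - similarity. The function names and sample vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance: 0 for the same direction, 2 for opposite directions."""
    return 1.0 - cosine_similarity(a, b)


# Toy data: rank candidate vectors by distance to the query, closest first.
query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}
ranked = sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Only the direction of a vector matters for this metric, so [2.0, 0.0] is at distance 0 from the query even though its length differs.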