OceanBase Database V4.3.3 and later support vector storage, vector indexing, and embedding vector retrieval. You can store vectorized data in OceanBase Database for subsequent retrieval.
Cloudflare Workers AI is a service provided by Cloudflare that allows developers to run machine learning models on its global network. Developers can easily integrate AI capabilities into their applications using RESTful APIs.
This tutorial demonstrates how to integrate the vector retrieval feature of OceanBase Cloud with Cloudflare Workers AI to perform similarity search and retrieval tasks.
Prerequisites
You have a transactional instance available in your environment. For instructions on how to create one, see Create a transactional instance.
You have created a MySQL-compatible tenant in the instance. For instructions on how to create the tenant, see Create a MySQL-compatible tenant.
You have a MySQL database and account available under the tenant, and you have granted read and write permissions to the database account. For more information, see Create an account and Create a database (MySQL only).
You are a project admin or instance admin and have the permissions required to read and write data in the instance. If not, contact your organization admin to grant the required permissions.
You have installed Python 3.11 or later.
You have installed the required dependencies.
python3 -m pip install pyobvector requests sqlalchemy httpx
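If you want to confirm that the dependencies installed correctly before continuing, an optional quick import check such as the following prints the installed versions of sqlalchemy and httpx:
python3 -c "import pyobvector, sqlalchemy, httpx; print(sqlalchemy.__version__, httpx.__version__)"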
Step 1: Obtain the database connection information
Log in to the OceanBase Cloud console.
On the instance list page, expand the information of the target instance.
Select Connect > Get Connection String under the target tenant.
In the pop-up window, select Public Network as the connection method.
Follow the prompts in the pop-up window to obtain the public endpoint and the connection string.
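If you want to verify the connection information before writing any application code, the optional sketch below opens a connection with pyobvector and runs a trivial query. The host, port, user, password, and database name are placeholders; replace them with the values from the connection string you just obtained. It uses pyobvector's perform_raw_text_sql helper to run a simple statement, which succeeds only if the connection information is valid.
# Optional connectivity check; all values below are placeholders.
from pyobvector import ObVecClient

client = ObVecClient(
    uri="YOUR_HOST:3306",          # public endpoint and port from the console
    user="YOUR_USER",              # database account under the MySQL-compatible tenant
    password="YOUR_PASSWORD",
    db_name="YOUR_DATABASE",
)
# Run a trivial statement to confirm connectivity.
print(client.perform_raw_text_sql("SELECT 1").fetchall())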
Step 2: Build your AI assistant
Set the Cloudflare API key environment variables
Obtain the Cloudflare API key and configure it in the environment variables along with the OceanBase connection information.
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export CLOUDFLARE_API_KEY=YOUR_CLOUDFLARE_API_KEY
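Before running the sample code, you can optionally confirm that all five variables are set. The snippet below simply reads them with os.getenv and fails fast if any are missing:
import os

# Fail fast if any required environment variable is missing.
required = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "CLOUDFLARE_API_KEY",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set")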
Sample code snippet
Here's an example using the bge-base-en-v1.5 model with the Cloudflare Workers AI Embedding API to generate vector data:
import os
import httpx
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance
documents = [
    "Machine learning is the core technology of artificial intelligence",
    "Python is the preferred programming language for data science",
    "Cloud computing provides elastic and scalable computing resources",
    "Blockchain technology ensures data security and transparency",
    "Natural language processing helps computers understand human language"
]
BASE_URL = "https://api.cloudflare.com/client/v4/accounts"
model_name = "@cf/baai/bge-base-en-v1.5"
account_id = "YOUR_CLOUDFLARE_ACCOUNT_ID"  # Replace with your Cloudflare account ID
CLOUDFLARE_API_KEY = os.getenv('CLOUDFLARE_API_KEY')
api_url = f"{BASE_URL}/{account_id}/ai/run/{model_name}"
# Create an HTTP client
httpclient = httpx.Client()
httpclient.headers.update({
    "Authorization": f"Bearer {CLOUDFLARE_API_KEY}",
    "Accept-Encoding": "identity"
})
payload = {"text": documents}
response = httpclient.post(api_url, json=payload)
embedding_response = response.json()["result"]["data"]
data = []
for i, text in enumerate(documents):
    data.append({
        'content': text,
        'content_vec': embedding_response[i]  # 768-dimensional embedding returned by Workers AI
    })
print(f"Successfully processed {len(data)} texts")
Define the vector table structure and store the vectors in OceanBase
Create a table named cloudflare_oceanbase_demo_documents with columns for the text (content) and the embedding vector (content_vec), build an HNSW vector index on the vector column, and then insert the vector data into OceanBase:
# Connect to the OceanBase instance.
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')
ob_client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME
)
# Create the vector table.
table_name = "cloudflare_oceanbase_demo_documents"
ob_client.drop_table_if_exist(table_name)
cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(500), nullable=False),
    Column("content_vec", VECTOR(768))
]
# Create vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)
ob_client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)
print('- Inserting Data to OceanBase...')
ob_client.insert(table_name, data=data)
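To confirm that the rows were written, you can optionally run a plain SQL count through the same client. This reuses the ob_client and table_name objects defined above:
# Optional: count the rows stored in the vector table.
count = ob_client.perform_raw_text_sql(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
print(f"{count} rows stored in {table_name}")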
Perform semantic search
Generate an embedding vector for the query text using the Cloudflare Workers AI Embedding API, and then search for the most relevant documents based on the cosine distance between the query vector and each vector in the table:
# Query the most relevant document based on the query.
query = "Programming languages for data analysis"
# Generate the embedding for the query via the Cloudflare Workers AI API.
payload = {"text": query}
response = httpclient.post(api_url, json=payload)
query_embedding = response.json()["result"]["data"]
res = ob_client.ann_search(
    table_name,
    vec_data=query_embedding[0],
    vec_column_name="content_vec",
    distance_func=cosine_distance,
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)
print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f' - ID: {row[0]}\n'
          f' content: {row[1]}\n'
          f' distance: {row[2]}')
Expected results
- ID: 2
content: Python is the preferred programming language for data science
distance: 0.139745337621493
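To see how the remaining documents rank against the same query, you can rerun the search with a larger topk; the cosine distance increases as relevance drops. When you are done experimenting, you can drop the demo table. This optional snippet reuses the objects defined in the earlier steps:
# Optional: retrieve the top 3 matches instead of only the best one.
res = ob_client.ann_search(
    table_name,
    vec_data=query_embedding[0],
    vec_column_name="content_vec",
    distance_func=cosine_distance,
    with_dist=True,
    topk=3,
    output_column_names=["id", "content"],
)
for row in res.fetchall():
    print(f"- ID: {row[0]}, distance: {row[2]:.4f}, content: {row[1]}")

# Clean up the demo table when you no longer need it.
ob_client.drop_table_if_exist(table_name)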