OceanBase offers features such as vector storage, vector indexing, and embedding-based vector search. You can store vectorized data in OceanBase Database, making it available for fast and efficient search.
Cloudflare Workers AI is a service provided by Cloudflare that allows developers to run machine learning models on its global network. Developers can easily integrate AI features into their applications using REST APIs.
Prerequisites
You have deployed OceanBase Database V4.4.0 or later and created a MySQL-compatible tenant.
The tenant contains a MySQL database and a user account with read and write privileges on that database.
Python 3.11 or above is installed.
Required dependencies are installed:
python3 -m pip install pyobvector requests sqlalchemy httpx
Make sure you have set the ob_vector_memory_limit_percentage parameter in your tenant to enable vector search. A recommended value is 30. For details on configuring this parameter, refer to ob_vector_memory_limit_percentage.
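If you want to confirm the current value of this parameter from Python before you start, the following is a minimal sketch. It assumes the pymysql driver is installed for SQLAlchemy and that your account is allowed to read tenant parameters; the connection details are placeholders.
from sqlalchemy import create_engine, text

# Placeholder connection details; %40 is the URL-encoded '@' in username@tenant_name.
engine = create_engine("mysql+pymysql://username%40tenant_name:password@127.0.0.1:2881/test")
with engine.connect() as conn:
    # Show the tenant-level vector memory parameter.
    rows = conn.execute(text("SHOW PARAMETERS LIKE 'ob_vector_memory_limit_percentage'"))
    for row in rows:
        print(row)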
Step 1: Get your database connection information
Reach out to your OceanBase administrator or deployment team to obtain the database connection string, for example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
$host: The IP address for connecting to OceanBase Database. If you are using OceanBase Database Proxy (ODP), use the ODP address. For direct connections, use the IP address of an OBServer node.
$port: The port number for connecting to OceanBase Database. The default for ODP is 2883 (can be customized during ODP deployment). For direct connections, the default is 2881 (customizable during OceanBase Database deployment).
$database_name: The name of the database you want to access.
Notice
The user connecting to the tenant must have the CREATE, INSERT, DROP, and SELECT privileges on the database. For more details on user privileges, see the privilege types in MySQL-compatible mode.
$user_name: The user account for connecting to the tenant. For ODP connections, common formats are username@tenant_name#cluster_name or cluster_name:tenant_name:username; for direct connections, use username@tenant_name.
$password: The password for the account.
For more details about connection strings, see Connect to an OceanBase tenant using OBClient.
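These parameters map directly onto the environment variables used in Step 2. The sketch below shows one plausible mapping; the host:port URI format follows the pyobvector client used later, and all values are placeholders to replace with the details from your administrator.
import os

# Placeholder values for illustration only.
os.environ["OCEANBASE_DATABASE_URL"] = "127.0.0.1:2881"        # $host:$port
os.environ["OCEANBASE_DATABASE_USER"] = "username@tenant_name" # $user_name
os.environ["OCEANBASE_DATABASE_DB_NAME"] = "test"              # $database_name
os.environ["OCEANBASE_DATABASE_PASSWORD"] = "******"           # $password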
Step 2: Build your AI assistant
Set up the Cloudflare API key environment variable
Obtain a Cloudflare API key and configure it, along with the OceanBase connection information, as environment variables:
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export CLOUDFLARE_API_KEY=YOUR_CLOUDFLARE_API_KEY
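Optionally, fail fast if any of these variables is missing before running the scripts below. This small check is not part of the original example:
import os

# Verify that every required environment variable is set.
required = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "CLOUDFLARE_API_KEY",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")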
Sample code snippets
Here is an example using the bge-base-en-v1.5 model with the Cloudflare Workers AI embedding API to generate vector data:
import os
import httpx
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance
documents = [
    "Machine learning is the core technology of artificial intelligence",
    "Python is the preferred programming language for data science",
    "Cloud computing provides elastic and scalable computing resources",
    "Blockchain technology ensures data security and transparency",
    "Natural language processing helps computers understand human language"
]
BASE_URL = "https://api.cloudflare.com/client/v4/accounts"
model_name = "@cf/baai/bge-base-en-v1.5"
account_id="0f390650bbe6ff23336badcf24e85c93"
CLOUDFLARE_API_KEY = os.getenv('CLOUDFLARE_API_KEY')
api_url = f"{BASE_URL}/{account_id}/ai/run/{model_name}"
# Create an HTTP client
httpclient = httpx.Client()
httpclient.headers.update({
    "Authorization": f"Bearer {CLOUDFLARE_API_KEY}",
    "Accept-Encoding": "identity"
})
payload = {"text": documents}
response = httpclient.post(api_url, json=payload)
embedding_response = response.json()["result"]["data"]
data = []
for i, text in enumerate(documents):
    data.append({
        'content': text,
        'content_vec': embedding_response[i]  # 768-dimensional embedding returned by Workers AI
    })
print(f"Successfully processed {len(data)} texts")
Define the vector table structure and store the vectors in OceanBase
Create a table named cloudflare_oceanbase_demo_documents with columns for the text (content) and the embedding vector (content_vec), plus an HNSW vector index on content_vec. Then store the vector data in OceanBase:
# Connect to OceanBase Database
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')
ob_client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME,
)
# Create the vector table.
table_name = "cloudflare_oceanbase_demo_documents"
ob_client.drop_table_if_exist(table_name)
cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(500), nullable=False),
    Column("content_vec", VECTOR(768))
]
# Create the vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)
ob_client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)
print('- Inserting Data to OceanBase...')
ob_client.insert(table_name, data=data)
Perform semantic search
Generate an embedding vector for the query text using the Cloudflare Workers AI embedding API, then search for the most relevant documents based on the cosine distance between the query vector and each vector in the table:
# Query the most relevant document for a natural-language question.
query = "Programming languages for data analysis"
# Generate the embedding for the query via the Cloudflare Workers AI API.
payload = {"text": query}
response = httpclient.post(api_url, json=payload)
query_embedding = response.json()["result"]["data"]
res = ob_client.ann_search(
    table_name,
    vec_data=query_embedding[0],
    vec_column_name="content_vec",
    distance_func=cosine_distance,
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)
print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f' - ID: {row[0]}\n'
          f' content: {row[1]}\n'
          f' distance: {row[2]}')
Expected result
- ID: 2
content: Python is the preferred programming language for data science
distance: 0.139745337621493
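If you are building an assistant that needs more context, you will usually retrieve several candidate documents rather than a single best match. The sketch below reuses the same ann_search call with a larger topk; the value 3 is an arbitrary choice for illustration.
# Retrieve the top 3 most similar documents instead of only the best match.
res = ob_client.ann_search(
    table_name,
    vec_data=query_embedding[0],
    vec_column_name="content_vec",
    distance_func=cosine_distance,
    with_dist=True,
    topk=3,
    output_column_names=["id", "content"],
)
for row in res.fetchall():
    print(f'ID: {row[0]}, content: {row[1]}, distance: {row[2]}')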