OceanBase Database V4.3.3 and later support vector storage, vector indexing, and embedding vector retrieval. You can store vectorized data in OceanBase Database for subsequent retrieval.
Jina AI is an AI platform focused on multimodal search and vector retrieval. It provides core components and tools for building enterprise-grade, search-enhanced generative AI applications, helping enterprises and developers build retrieval-augmented generation (RAG) applications based on multimodal search.
This tutorial demonstrates how to integrate the vector retrieval feature of OceanBase Cloud with Jina AI to perform similarity search and retrieval tasks.
Prerequisites
A transactional instance is available in your environment. For instructions on how to create the instance, see Create a transactional instance.
You have created a MySQL-compatible tenant in the instance. For instructions on how to create the tenant, see Create a MySQL-compatible tenant.
You have a MySQL database and account available under the tenant, and you have granted read and write permissions to the database account. For more information, see Create an account and Create a database (MySQL only).
You are a project admin or instance admin and have the permissions required to read and write data in the instance. If not, contact your organization admin to grant the required permissions.
You have installed Python 3.11 or later and pip. If the Python version on your server is earlier than 3.11, you can use Miniconda to create a Python 3.11 or later environment. For more information, see Miniconda installation guide.
You have installed the dependencies.
python3 -m pip install pyobvector requests sqlalchemy
Step 1: Obtain the database connection information
Log in to the OceanBase Cloud console.
On the instance list page, expand the information of the target instance.
Select Connect > Get Connection String under the target tenant.
In the pop-up window, select Public Network as the connection method.
Follow the prompts to obtain the public endpoint and the connection string.
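The sample code later in this tutorial passes the connection address to pyobvector's `ObVecClient` as a single `host:port` string, separate from the user, password, and database name. A minimal sketch of how the console values map onto that string (the host and port below are placeholders, not real endpoints):

```python
# Hypothetical values -- replace them with the public endpoint and port
# shown in the OceanBase Cloud console.
host = "obmt0example.oceanbase.cloud"  # placeholder public endpoint
port = 3306                            # placeholder port from the connection string

# This combined value is what OCEANBASE_DATABASE_URL should hold.
OCEANBASE_DATABASE_URL = f"{host}:{port}"
print(OCEANBASE_DATABASE_URL)
```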
Step 2: Build an AI assistant
Set the Jina AI API key environment variables
Obtain the Jina AI API key and configure it in the environment variables along with the OceanBase connection information.
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
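Before running the sample code, you can verify that all five variables are set. A minimal sketch, assuming nothing beyond the standard library (`check_env` is a helper introduced here, not part of pyobvector or Jina AI):

```python
import os

REQUIRED_VARS = [
    'OCEANBASE_DATABASE_URL',
    'OCEANBASE_DATABASE_USER',
    'OCEANBASE_DATABASE_DB_NAME',
    'OCEANBASE_DATABASE_PASSWORD',
    'JINAAI_API_KEY',
]

def check_env(env=os.environ):
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Demonstrate with an incomplete mapping; in practice, call check_env()
# with no argument to inspect the real environment.
sample = {'JINAAI_API_KEY': 'dummy'}
print(check_env(sample))  # the four OceanBase variables are reported missing
```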
Sample code snippets
Obtain the embedding vectors of Jina AI
Jina AI provides various embedding models. You can select the model that meets your requirements.
| Model | Parameter size | Embedding dimension | Description |
|---|---|---|---|
| jina-embeddings-v3 | 570M | Flexible embedding size (default: 1024) | Multilingual text embeddings, supporting 94 languages in total |
| jina-embeddings-v2-small-en | 33M | 512 | English monolingual embeddings |
| jina-embeddings-v2-base-en | 137M | 768 | English monolingual embeddings |
| jina-embeddings-v2-base-zh | 161M | 768 | Chinese-English bilingual embeddings |
| jina-embeddings-v2-base-de | 161M | 768 | German-English bilingual embeddings |
| jina-embeddings-v2-base-code | 161M | 768 | English and programming languages |
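Whichever model you pick, its embedding dimension must match the dimension of the `VECTOR` column you create later (1024 for jina-embeddings-v3 in this tutorial). A small lookup built from the table above, using each model's default dimension (jina-embeddings-v3 also supports other sizes through its API):

```python
# Default embedding dimensions per model, taken from the table above.
MODEL_DIMENSIONS = {
    'jina-embeddings-v3': 1024,  # default; this model supports flexible sizes
    'jina-embeddings-v2-small-en': 512,
    'jina-embeddings-v2-base-en': 768,
    'jina-embeddings-v2-base-zh': 768,
    'jina-embeddings-v2-base-de': 768,
    'jina-embeddings-v2-base-code': 768,
}

model = 'jina-embeddings-v3'
print(f"VECTOR column dimension for {model}: {MODEL_DIMENSIONS[model]}")
```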
Here is an example of defining the generate_embeddings function to call the Jina AI embedding API:
import os

import requests
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance

JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')

# Step 1. Text data vectorization.
def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    response_json = response.json()
    return response_json['data'][0]['embedding']

TEXTS = [
    'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.',
    'OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.',
    'OceanBase is a native distributed relational database that supports HTAP hybrid transaction analysis and processing. It features enterprise-level characteristics such as high availability, transparent scalability, and multi-tenancy, and is compatible with MySQL/Oracle protocols.'
]

data = []
for text in TEXTS:
    # Generate the embedding for the text via the Jina AI API.
    embedding = generate_embeddings(text)
    data.append({
        'content': text,
        'content_vec': embedding
    })
print(f"Successfully processed {len(data)} texts")
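The `generate_embeddings` function above assumes the HTTP call succeeds. A hedged variant that separates request construction (so it can be checked without network access) and fails fast on HTTP errors; `build_embedding_request` and `generate_embeddings_safe` are helpers introduced here, not part of the Jina AI SDK:

```python
import os

def build_embedding_request(text, model='jina-embeddings-v3'):
    """Assemble the URL, headers, and JSON body for the Jina AI embedding API."""
    url = 'https://api.jina.ai/v1/embeddings'
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f"Bearer {os.getenv('JINAAI_API_KEY', '')}",
    }
    body = {'input': [text], 'model': model}
    return url, headers, body

def generate_embeddings_safe(text):
    # Imported here so that build_embedding_request stays testable offline.
    import requests

    url, headers, body = build_embedding_request(text)
    response = requests.post(url, headers=headers, json=body)
    # Raise on 4xx/5xx instead of failing later with a KeyError on the JSON.
    response.raise_for_status()
    return response.json()['data'][0]['embedding']
```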
Define the vector table schema and store the vectors in OceanBase
Create a table named jinaai_oceanbase_demo_documents that contains the content column for storing text, the content_vec column for storing embedding vectors, and the vector index information. Store the vector data in OceanBase:
# Step 2. Connect to OceanBase.
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')

client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME,
)

# Step 3. Create the vector table.
table_name = "jinaai_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(500), nullable=False),
    Column("content_vec", VECTOR(1024))  # dimension must match the embedding model
]

# Create the vector index.
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)

client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)

print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data)
Perform semantic search
Generate an embedding vector for the query text using the Jina AI embedding API, and search for the most relevant documents based on the cosine distance between the query text's embedding vector and each embedding vector in the vector table:
# Step 4. Query the most relevant document based on the query.
query = 'What is OceanBase?'

# Generate the embedding for the query via the Jina AI API.
query_embedding = generate_embeddings(query)

res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use the cosine distance function.
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)

print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f'  - ID: {row[0]}\n'
          f'    content: {row[1]}\n'
          f'    distance: {row[2]}')
Expected output
- ID: 2
content: OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
distance: 0.14733879001870276
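The distance in the output is the cosine distance, commonly defined as 1 minus the cosine similarity of the two embedding vectors; smaller values mean the document is more similar to the query. A toy illustration with 3-dimensional vectors (a plain-Python sketch, not the implementation OceanBase uses internally):

```python
import math

def cosine_dist(a, b):
    """Cosine distance = 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_dist([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # identical direction -> 0.0
print(cosine_dist([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 1.0
```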