OceanBase Database provides capabilities for vector-type storage, vector indexing, and embedding vector search. You can store vector data in OceanBase Database for subsequent searches.
Jina AI is an AI platform focused on multimodal search and vector search. It provides the core components and tools needed to build enterprise-level, search-enhanced generative AI applications, helping enterprises and developers build retrieval-augmented generation (RAG) applications based on multimodal search.
Prerequisites
You have deployed OceanBase Database V4.4.0 or a later version and created a MySQL mode tenant. After you create a tenant, you can perform the following steps.
Your environment has a MySQL tenant, a MySQL database, and a database account, and the database account has read and write permissions.
You have installed Python 3.11 or later.
You have installed the required dependencies:
python3 -m pip install pyobvector requests sqlalchemy
You have set the ob_vector_memory_limit_percentage parameter to enable vector search. For OceanBase Database versions earlier than V4.3.5 BP3, we recommend that you set the value to 30. For OceanBase Database V4.3.5 BP3 and later, we recommend that you keep the default value 0. For more information about how to set this parameter, see ob_vector_memory_limit_percentage.
Step 1: Obtain database connection information
Obtain the database connection string from the OceanBase Database deployment engineer or administrator. Example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
$host: the IP address for connecting to OceanBase Database. For the OceanBase Database Proxy (ODP) connection method, use the ODP IP address. For the direct connection method, use the IP address of the OBServer node.
$port: the port for connecting to OceanBase Database. For the ODP connection method, the default port is 2883, which can be customized when you deploy ODP. For the direct connection method, the default port is 2881, which can be customized when you deploy OceanBase Database.
$database_name: the name of the database to be accessed.
Notice
The user for connecting to the tenant must have the CREATE, INSERT, DROP, and SELECT permissions on the database. For more information about user permissions, see Permissions in MySQL mode.
$user_name: the tenant connection account. For the ODP connection method, the common format is username@tenant name#cluster name or cluster name:tenant name:username. For the direct connection method, the format is username@tenant name.
$password: the account password.
For more information about the connection string, see Connect to an OceanBase tenant by using OBClient.
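The account formats described above can also be assembled in code. The following is a minimal sketch, not part of any OceanBase tooling: the helper name build_connection_settings and its arguments are hypothetical, and it assumes that the client accepts the address in host:port form. It covers only the username@tenant name and username@tenant name#cluster name formats.

```python
from typing import Optional


def build_connection_settings(host: str, port: int, username: str,
                              tenant: str, cluster: Optional[str] = None) -> dict:
    """Assemble connection values from their parts (hypothetical helper).

    With a cluster name, the ODP account format username@tenant#cluster is
    used; without one, the direct-connection format username@tenant is used.
    """
    user = f"{username}@{tenant}"
    if cluster:
        user = f"{user}#{cluster}"
    # Assumption: the client accepts the address as "host:port".
    return {"uri": f"{host}:{port}", "user": user}


# Direct connection to an OBServer node (default port 2881).
direct = build_connection_settings("127.0.0.1", 2881, "root", "test")
# Connection through ODP (default port 2883).
via_odp = build_connection_settings("127.0.0.1", 2883, "root", "test", "obcluster")
print(direct)
print(via_odp)
```

The host, tenant, and cluster values shown are placeholders; replace them with the values provided by your administrator.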
Step 2: Set the Jina AI API key environment variable
Obtain your Jina AI API key, then export it together with the OceanBase connection information as environment variables.
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
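Before running the sample code, it can help to confirm that these variables are actually visible to Python. The check below is a small sketch; the helper name missing_env_vars is hypothetical and is not part of pyobvector or the Jina AI API.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_ENV_VARS = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "JINAAI_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Note that this sketch treats an empty value as missing, so an account with an intentionally empty password would be reported; adjust the list if that applies to your setup.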
Sample code snippets
Obtain Jina AI embeddings
Jina AI provides various embedding models, and you can select the one that best suits your needs.
| Model | Parameter Size | Embedding Dimension | Supported Text |
|---|---|---|---|
| jina-embeddings-v3 | 570M | Flexible (default: 1024) | Multilingual text embeddings; supports 94 languages in total |
| jina-embeddings-v2-small-en | 33M | 512 | English monolingual embeddings |
| jina-embeddings-v2-base-en | 137M | 768 | English monolingual embeddings |
| jina-embeddings-v2-base-zh | 161M | 768 | Chinese-English bilingual embeddings |
| jina-embeddings-v2-base-de | 161M | 768 | German-English bilingual embeddings |
| jina-embeddings-v2-base-code | 161M | 768 | English and programming languages |
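Because the vector column in this guide is declared with a fixed size, the embedding dimension of the chosen model has to match that size. A minimal sketch of a lookup built from the table above follows; the names MODEL_DIMENSIONS and embedding_dimension are hypothetical, and jina-embeddings-v3 is listed with its default size only.

```python
# Dimensions copied from the table above.
# jina-embeddings-v3 has a flexible size; 1024 is its listed default.
MODEL_DIMENSIONS = {
    "jina-embeddings-v3": 1024,
    "jina-embeddings-v2-small-en": 512,
    "jina-embeddings-v2-base-en": 768,
    "jina-embeddings-v2-base-zh": 768,
    "jina-embeddings-v2-base-de": 768,
    "jina-embeddings-v2-base-code": 768,
}


def embedding_dimension(model: str) -> int:
    """Look up the vector size to declare for a model (hypothetical helper)."""
    try:
        return MODEL_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown model: {model}") from None


print(embedding_dimension("jina-embeddings-v3"))
print(embedding_dimension("jina-embeddings-v2-small-en"))
```

If you switch models, the same value would be used when declaring the vector column, for example in place of the 1024 used in this guide.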
Here, we use jina-embeddings-v3 as an example and define a generate_embeddings helper function to call the Jina AI embedding API:
import os
import requests
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance
JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
# Step 1. Text Data Vectorization
def generate_embeddings(text: str):
    """Return the embedding vector for a single text."""
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    # Fail early on HTTP errors (for example, an invalid API key).
    response.raise_for_status()
    response_json = response.json()
    return response_json['data'][0]['embedding']
TEXTS = [
    'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.',
    'OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.',
    'OceanBase is a native distributed relational database that supports HTAP hybrid transaction analysis and processing. It features enterprise-level characteristics such as high availability, transparent scalability, and multi-tenancy, and is compatible with MySQL/Oracle protocols.'
]
data = []
for text in TEXTS:
    # Generate the embedding for the text via Jina AI API.
    embedding = generate_embeddings(text)
    data.append({
        'content': text,
        'content_vec': embedding
    })
print(f"Successfully processed {len(data)} texts")
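The helper above sends one request per text. Because the request body carries the input as a list, several texts could in principle be sent in a single call. The sketch below only builds such a request body and parses a response; it makes no network call. The function names are hypothetical, and the 'index' field on each response item is an assumption about the response format rather than something this guide confirms.

```python
def build_embedding_request(texts, model="jina-embeddings-v3"):
    """Build one request body that carries several input texts."""
    return {"input": list(texts), "model": model}


def extract_embeddings(response_json):
    """Return the embeddings in input order.

    Assumption: each item in 'data' has an 'index' field giving its position
    in the request. Items without it keep their position in the response.
    """
    items = response_json["data"]
    ordered = sorted(enumerate(items),
                     key=lambda pair: pair[1].get("index", pair[0]))
    return [item["embedding"] for _, item in ordered]


# Hand-written stand-in for an API response, with tiny made-up vectors.
fake_response = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}
payload = build_embedding_request(["first text", "second text"])
vectors = extract_embeddings(fake_response)
print(payload["input"])
print(vectors)
```

Batching in this way would reduce the number of HTTP round trips, at the cost of one failed request affecting every text in the batch.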
Define the vector table schema and store the vectors in OceanBase
Create a table named jinaai_oceanbase_demo_documents with columns for storing text (content), embeddings (content_vec), and vector index information. Store the vector data in OceanBase:
# Step 2. Connect to OceanBase Database.
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')
client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME,
)
# Step 3. Create the vector table.
table_name = "jinaai_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)
cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(500), nullable=False),
    Column("content_vec", VECTOR(1024))
]
# Create vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)
client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)
print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data)
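The table above declares content as String(500) and content_vec as VECTOR(1024), so rows that exceed those limits would be expected to fail on insert. A minimal sketch of a pre-insert check follows; validate_rows is a hypothetical helper, not a pyobvector function, and it assumes rows shaped like the data list built earlier.

```python
def validate_rows(rows, dim=1024, max_content_len=500):
    """Return a list of problems found in rows shaped like the data list."""
    problems = []
    for i, row in enumerate(rows):
        if len(row["content"]) > max_content_len:
            problems.append(
                f"row {i}: content longer than {max_content_len} characters")
        if len(row["content_vec"]) != dim:
            problems.append(
                f"row {i}: vector has {len(row['content_vec'])} dimensions, "
                f"expected {dim}")
    return problems


# Made-up sample rows: the first is valid, the second breaks both limits.
sample = [
    {"content": "ok", "content_vec": [0.0] * 1024},
    {"content": "x" * 501, "content_vec": [0.0] * 768},
]
for problem in validate_rows(sample):
    print(problem)
```

Running such a check before calling the insert makes a dimension mismatch (for example, after switching embedding models) visible as a readable message rather than a database error.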
Semantic search
Generate an embedding for the query text using the Jina AI embedding API. Then, search for the most relevant document based on the cosine distance between the query embedding and the embeddings in the vector table:
# Step 4. Query the most relevant document based on the query.
query = 'What is OceanBase?'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)
res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use the cosine distance function
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)
print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f' - ID: {row[0]}\n'
          f' content: {row[1]}\n'
          f' distance: {row[2]}')
Expected result
- ID: 2
content: OceanBase Database is an enterprise-level, native distributed database independently developed by the OceanBase team. It is cloud-native, highly consistent, and highly compatible with Oracle and MySQL.
distance: 0.14733879001870276
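To help interpret the distance value, the sketch below computes cosine distance locally as 1 minus cosine similarity. This is an illustration only: the function name cosine_distance_local is hypothetical, and it is an assumption that the database's cosine distance follows the same definition. Under that definition, a smaller value means the two vectors point in more similar directions.

```python
import math


def cosine_distance_local(a, b):
    """Cosine distance as 1 - cosine similarity (local illustration only)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Tiny made-up vectors, not real embeddings.
print(cosine_distance_local([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_distance_local([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```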