This topic describes how to use pyobvector for vector storage and search in OceanBase Database. pyobvector is the Python SDK for the vector storage feature of OceanBase Database. It is built on SQLAlchemy and is largely compatible with the Milvus API.
Prerequisites
You have deployed an OceanBase cluster and created a MySQL tenant.
You have installed Python 3.9 or later in your environment.
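The Python requirement above can be checked from within the interpreter itself. The following is a minimal sketch that uses only the standard library; the helper name python_is_supported is illustrative and is not part of pyobvector.

```python
import sys

# Minimum interpreter version taken from the prerequisites above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if python_is_supported():
        print("Python version OK:", sys.version.split()[0])
    else:
        print("Python 3.9 or later is required, found", sys.version.split()[0])
```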
Quick start
Install pyobvector
First, install pyobvector in your local environment. You can run the following command:
pip install -U pyobvector
The -U option upgrades pyobvector to the latest version if it is already installed in your environment; otherwise, it installs the latest version.
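To confirm that the installation succeeded, you can look up the installed distribution from Python. This sketch relies only on the standard importlib.metadata module; the helper name installed_version is illustrative.

```python
from importlib import metadata


def installed_version(package="pyobvector"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print("pyobvector", version if version else "is not installed")
```

If the helper returns None, run the pip command above again in the same environment that your Python interpreter uses.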
Use pyobvector
You can use pyobvector in the following modes:
Milvus compatibility mode: Use the vector storage feature through the Milvus-compatible APIs of the MilvusLikeClient class.
SQLAlchemy hybrid mode: Use the vector storage feature provided by the ObVecClient class and execute relational database statements by using the SQLAlchemy library. In this mode, pyobvector can be considered an extension of SQLAlchemy.
Milvus compatibility mode
Connect to the client
You can use the Milvus-like client provided by pyobvector to access the vector storage and retrieval capabilities of OceanBase Database in a way compatible with Milvus. You can create a client object by running the following statement:
from pyobvector import *
# Modify the database connection information below to connect to your database instance.
client = MilvusLikeClient(uri="127.0.0.1:2881", user="root@test", db_name="test")
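Rather than editing the connection information in the script, you may prefer to read it from the environment. The sketch below assumes hypothetical environment variable names (OB_URI, OB_USER, OB_DB_NAME) and falls back to the values used in the example above. It only builds the keyword arguments and does not open a connection.

```python
import os


def connection_settings(env=os.environ):
    """Build keyword arguments for the client constructor from environment variables.

    The OB_* variable names are hypothetical; the defaults mirror the example above.
    """
    return {
        "uri": env.get("OB_URI", "127.0.0.1:2881"),
        "user": env.get("OB_USER", "root@test"),
        "db_name": env.get("OB_DB_NAME", "test"),
    }


if __name__ == "__main__":
    # With pyobvector installed and a reachable instance, the settings could be
    # passed on as: client = MilvusLikeClient(**connection_settings())
    print(connection_settings())
```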
Create a collection with a vector index
The MilvusLikeClient class of pyobvector provides a create_collection method that is compatible with the Milvus API. In OceanBase Database, however, the create_collection method creates a table. The following example creates a table named vector_search with the id, embedding, and metadata columns. The embedding column stores 64-dimensional vectors, and an HNSW vector index is created on it.
fields = [
    FieldSchema(
        name="id",
        dtype=DataType.INT64,
        is_primary=True,
        auto_id=True,
    ),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=64),
    FieldSchema(name="metadata", dtype=DataType.JSON),
]
index_params = MilvusLikeClient.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_name="embedding_idx",
    index_type=VecIndexType.HNSW,
    distance="l2",
    m=16,
    ef_construction=256,
)
schema = CollectionSchema(fields)
table_name = "vector_search"
client.create_collection(table_name, schema=schema, index_params=index_params)
Construct and write data
To simulate search over a large vector dataset, this step generates test data. The example uses the random module in Python to build 1,000 vectors of 64 random floating-point numbers each and inserts them in batches of 100.
import random
random.seed(20241023)
batch_size = 100
batch = []
for i in range(1000):
    batch.append(
        {
            "embedding": [random.uniform(-1, 1) for _ in range(64)],
            "metadata": {"idx": i},
        }
    )
    if len(batch) == batch_size:
        client.insert(collection_name=table_name, data=batch)
        batch = []
if len(batch) > 0:
    client.insert(collection_name=table_name, data=batch)
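The accumulate-and-flush loop above can also be written with a small reusable batching helper. The following sketch uses only the standard library; chunked and make_rows are illustrative names, not pyobvector functions, and the insert call is left as a comment because it needs a live client.

```python
import random
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def make_rows(count, dim, seed=20241023):
    """Generate `count` rows shaped like the ones inserted above."""
    rng = random.Random(seed)
    return [
        {"embedding": [rng.uniform(-1, 1) for _ in range(dim)], "metadata": {"idx": i}}
        for i in range(count)
    ]


if __name__ == "__main__":
    rows = make_rows(count=1000, dim=64)
    for batch in chunked(rows, 100):
        # With a live client this would be:
        # client.insert(collection_name=table_name, data=batch)
        print(len(batch))
```

Because the helper yields the final partial batch on its own, no separate flush is needed after the loop.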
Perform similarity vector search
Generate a target vector by using random.uniform and search the collection populated in the previous step for its five nearest neighbors:
target_data = [random.uniform(-1, 1) for _ in range(64)]
res = client.search(
    collection_name=table_name,
    data=target_data,
    anns_field="embedding",
    limit=5,
    output_fields=["id", "metadata"],
)
print(res)
# The expected return result is as follows:
# [{'id': 63, 'metadata': {'idx': 62}}, {'id': 796, 'metadata': {'idx': 795}}, {'id': 187, 'metadata': {'idx': 186}}, {'id': 784, 'metadata': {'idx': 783}}, {'id': 880, 'metadata': {'idx': 879}}]
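An HNSW index returns approximate nearest neighbors, so it can be useful to compare the result with an exact search on a small dataset. The sketch below is a brute-force L2 search in plain Python with a simple recall measure; it does not call pyobvector, and the function names are illustrative. To compare against the search result above, you would keep the generated vectors in a local list and match on the idx value in metadata (in the sample output, each id equals idx + 1).

```python
import math
import random


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(vectors, target, k):
    """Return the indexes of the k vectors closest to target, nearest first."""
    order = sorted(range(len(vectors)), key=lambda i: l2_distance(vectors[i], target))
    return order[:k]


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(1000)]
    target = [rng.uniform(-1, 1) for _ in range(64)]
    exact = exact_top_k(vectors, target, k=5)
    print(exact, recall(exact, exact))
```

A recall of 1.0 means the approximate search returned every exact neighbor; lower values indicate neighbors that the index missed.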
SQLAlchemy hybrid mode
Connect to the client
You can use the ObVecClient class provided by pyobvector to access the vector storage and retrieval capabilities of OceanBase Database in SQLAlchemy hybrid mode. You can create a client object by running the following statement:
from pyobvector import *
# Modify the database connection information below to connect to your database instance.
client = ObVecClient(uri="127.0.0.1:2881", user="root@test", db_name="test")
Create a table and a vector index
The following example creates a table named vector_test3 with the id, embedding, and metadata columns. The embedding column stores 64-dimensional vectors, and an HNSW vector index is created on it.
from sqlalchemy import Column, Integer, JSON
from sqlalchemy import func
cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("embedding", VECTOR(64)),
    Column("metadata", JSON),
]
table_name = "vector_test3"
client.create_table(table_name, columns=cols)
print(f"Table {table_name} created")
client.create_index(
    table_name,
    is_vec_index=True,
    index_name="embedding_idx",
    column_names=["embedding"],
    vidx_params="distance=l2, type=hnsw, lib=vsag",  # m=16, ef_construction=256
)
print(f"Index {table_name}.embedding_idx created")
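The vidx_params argument in the example is a comma-separated string of key=value pairs. If you assemble the options in code, a small helper can produce a string in that layout. This is a sketch under assumptions: build_vidx_params is an illustrative name, and whether additional keys such as m and ef_construction are accepted depends on your OceanBase Database version, so check the vector index documentation before relying on them.

```python
def build_vidx_params(distance="l2", index_type="hnsw", lib="vsag", **extra):
    """Join index options into a 'key=value, key=value' string."""
    options = {"distance": distance, "type": index_type, "lib": lib}
    options.update(extra)
    return ", ".join(f"{key}={value}" for key, value in options.items())


if __name__ == "__main__":
    print(build_vidx_params())
    print(build_vidx_params(m=16, ef_construction=256))
```

With the defaults, the helper reproduces the string used in the create_index call above.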
Construct and write data
To simulate search over a large vector dataset, this step generates test data. The example uses the random module in Python to build 1,000 vectors of 64 random floating-point numbers each and inserts them in batches of 100.
import random
random.seed(20241023)
batch_size = 100
batch = []
for i in range(1000):
    batch.append(
        {
            "embedding": [random.uniform(-1, 1) for _ in range(64)],
            "metadata": {"idx": i},
        }
    )
    if len(batch) == batch_size:
        client.insert(table_name, data=batch)
        batch = []
if len(batch) > 0:
    client.insert(table_name, data=batch)
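Because the embedding column is declared with a fixed dimension of 64, it can help to check the rows locally before sending them to the database. The following is a minimal sketch in plain Python; validate_rows is an illustrative helper and not part of pyobvector.

```python
def validate_rows(rows, dim):
    """Raise ValueError if any row's embedding is not a list of `dim` numbers."""
    for position, row in enumerate(rows):
        embedding = row.get("embedding")
        if not isinstance(embedding, (list, tuple)) or len(embedding) != dim:
            raise ValueError(f"row {position}: expected {dim} values in 'embedding'")
        # bool is a subclass of int, so exclude it explicitly.
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in embedding):
            raise ValueError(f"row {position}: 'embedding' must contain only numbers")
    return len(rows)


if __name__ == "__main__":
    sample = [{"embedding": [0.0] * 64, "metadata": {"idx": 0}}]
    print(validate_rows(sample, dim=64), "row(s) ready to insert")
```

Calling such a check on each batch before client.insert surfaces a malformed vector with its position in the batch instead of a database error.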
Perform similarity vector search
Generate a target vector by using random.uniform and search the table populated in the previous step for its five nearest neighbors:
target_data = [random.uniform(-1, 1) for _ in range(64)]
res = client.ann_search(
    table_name,
    vec_data=target_data,
    vec_column_name="embedding",
    distance_func=func.l2_distance,
    topk=5,
    output_column_names=["id", "metadata"],
)
for r in res:
    print(r)
# The expected return result is as follows:
# (63, '{"idx": 62}')
# (796, '{"idx": 795}')
# (187, '{"idx": 186}')
# (784, '{"idx": 783}')
# (880, '{"idx": 879}')
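In the expected output above, each row is a tuple and the metadata column is a JSON string rather than a dictionary. If you want records shaped like the Milvus compatibility mode result, you can decode them yourself. The sketch below uses only the standard json module and operates on sample tuples copied from the output above; rows_to_dicts is an illustrative name.

```python
import json


def rows_to_dicts(rows, column_names):
    """Pair each result tuple with its column names and decode JSON strings."""
    decoded = []
    for row in rows:
        record = dict(zip(column_names, row))
        metadata = record.get("metadata")
        if isinstance(metadata, str):
            record["metadata"] = json.loads(metadata)
        decoded.append(record)
    return decoded


if __name__ == "__main__":
    sample = [(63, '{"idx": 62}'), (796, '{"idx": 795}')]
    print(rows_to_dicts(sample, ["id", "metadata"]))
```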
References
For more information about vector data, see Vector data type.
For more information about how to create and drop a vector index after table creation, see Vector indexes.