OceanBase Database supports vector storage and retrieval through pyobvector, a Python SDK for vector storage in OceanBase Database. pyobvector is built on SQLAlchemy and is largely compatible with the Milvus API. This topic describes how to get started with pyobvector.
Prerequisites
Make sure that you have deployed an OceanBase cluster and created a MySQL tenant.
Make sure that Python 3.9 or later is installed in your environment.
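The Python requirement can be checked from a script before anything is installed. The following is a minimal sketch; the helper name meets_minimum_python is hypothetical and is not part of pyobvector.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the prerequisites


def meets_minimum_python(version_info=None):
    """Return True if the given (or the running) interpreter is Python 3.9 or later."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the major and minor components.
    return tuple(version_info[:2]) >= MIN_PYTHON


print("Python version OK:", meets_minimum_python())
```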
Get started
Install pyobvector
Install pyobvector in your local environment by running the following command:
pip install -U pyobvector
The -U option upgrades pyobvector to the latest version if it is already installed in your environment. Otherwise, the latest version is installed.
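To confirm which version ended up in your environment, you can query the package metadata from Python. This sketch uses only the standard library module importlib.metadata; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if pyobvector is not installed here.
print("pyobvector version:", installed_version("pyobvector"))
```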
Usage
pyobvector supports the following two modes:
Milvus-compatible mode: You can use the vector storage and retrieval capabilities of OceanBase Database through APIs that are similar to those of Milvus.
Hybrid SQLAlchemy mode: You can use the vector storage feature provided by the ObVecClient class and execute relational database statements by using the SQLAlchemy library. In this mode, you can treat pyobvector as an extension of SQLAlchemy.
Milvus-compatible mode
Connect to the client
pyobvector provides the MilvusLikeClient class that allows you to use the vector storage and retrieval capabilities of OceanBase Database in a way that is compatible with Milvus APIs. You can create a client object by using the following statement:
from pyobvector import *
# Replace the connection information below with that of your database instance.
client = MilvusLikeClient(uri="127.0.0.1:2881", user="root@test", db_name="test")
Create a collection with vector indexes
To be compatible with the APIs of Milvus, pyobvector's MilvusLikeClient also provides methods such as create_collection. In OceanBase Database, however, a collection maps to a table, so create_collection actually creates a table. The following example creates a table with three columns named id, embedding, and metadata, and creates an HNSW vector index on the embedding column. The embedding column is a vector of 64 dimensions.
fields = [
    FieldSchema(
        name="id",
        dtype=DataType.INT64,
        is_primary=True,
        auto_id=True,
    ),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=64),
    FieldSchema(name="metadata", dtype=DataType.JSON),
]
# Call prepare_index_params on the client instance created above.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_name="embedding_idx",
    index_type=VecIndexType.HNSW,
    distance="l2",
    m=16,
    ef_construction=256,
)
schema = CollectionSchema(fields)
table_name = "vector_search"
client.create_collection(table_name, schema=schema, index_params=index_params)
Construct and write data
To simulate searching across a large volume of vector data, this step constructs 1,000 random 64-dimensional vectors and writes them in batches of 100. Each vector is a list of random floating-point numbers generated with the random module in Python.
import random

random.seed(20241023)
batch_size = 100
batch = []
for i in range(1000):
    batch.append(
        {
            "embedding": [random.uniform(-1, 1) for _ in range(64)],
            "metadata": {"idx": i},
        }
    )
    if len(batch) == batch_size:
        client.insert(collection_name=table_name, data=batch)
        batch = []
if len(batch) > 0:
    client.insert(collection_name=table_name, data=batch)
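The batching logic in the loop above can also be factored into a small generator, which keeps the final partial batch from needing a separate check. This is a sketch only; make_rows and iter_batches are hypothetical helper names, not part of pyobvector, and the example does not connect to a database.

```python
import random


def make_rows(count, dim, seed=20241023):
    """Build `count` rows shaped like the ones in the example above."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [
        {"embedding": [rng.uniform(-1, 1) for _ in range(dim)], "metadata": {"idx": i}}
        for i in range(count)
    ]


def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most `batch_size` rows; the last may be shorter."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = make_rows(count=250, dim=64)
batch_sizes = [len(batch) for batch in iter_batches(rows, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
```

Each batch yielded by iter_batches could then be passed to client.insert(collection_name=table_name, data=batch), as in the original loop.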
Perform similar vector search
Use random.uniform to construct a target vector, then search the collection into which the data was inserted:
target_data = [random.uniform(-1, 1) for _ in range(64)]
res = client.search(
    collection_name=table_name,
    data=target_data,
    anns_field="embedding",
    limit=5,
    output_fields=["id", "metadata"],
)
print(res)
# The expected return result is as follows:
# [{'id': 63, 'metadata': {'idx': 62}}, {'id': 796, 'metadata': {'idx': 795}}, {'id': 187, 'metadata': {'idx': 186}}, {'id': 784, 'metadata': {'idx': 783}}, {'id': 880, 'metadata': {'idx': 879}}]
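To see what the search above ranks by, it can help to compute L2 (Euclidean) distances by hand on a few vectors. The following pure-Python sketch performs an exhaustive scan; it is not how OceanBase Database executes the query, and because HNSW is an approximate index, an exact scan over the same data may occasionally return a slightly different result. The names l2_distance and brute_force_top_k are hypothetical.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_top_k(rows, target, k):
    """Return the k rows closest to `target`, nearest first, by exhaustive scan."""
    return heapq.nsmallest(k, rows, key=lambda row: l2_distance(row["embedding"], target))


# Tiny 2-dimensional data set, small enough to verify by hand.
rows = [
    {"id": 1, "embedding": [0.0, 0.0]},
    {"id": 2, "embedding": [1.0, 1.0]},
    {"id": 3, "embedding": [0.1, -0.1]},
    {"id": 4, "embedding": [-2.0, 0.5]},
]
nearest = brute_force_top_k(rows, target=[0.0, 0.1], k=2)
print([row["id"] for row in nearest])  # [1, 3]
```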
Hybrid SQLAlchemy mode
Connect to the client
pyobvector provides the ObVecClient class, which lets you use the vector storage and retrieval capabilities of OceanBase Database together with SQLAlchemy. You can create a client object by using the following statement:
from pyobvector import *
# Replace the connection information below with that of your database instance.
client = ObVecClient(uri="127.0.0.1:2881", user="root@test", db_name="test")
Create a table and a vector index
The following example creates a table with three columns named id, embedding, and metadata, and creates an HNSW vector index on the embedding column. The embedding column is a vector of 64 dimensions.
from sqlalchemy import Column, Integer, JSON
from sqlalchemy import func

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("embedding", VECTOR(64)),
    Column("metadata", JSON),
]
table_name = "vector_test3"
client.create_table(table_name, columns=cols)
print(f"Table {table_name} created")
client.create_index(
    table_name,
    is_vec_index=True,
    index_name="embedding_idx",
    column_names=["embedding"],
    vidx_params="distance=l2, type=hnsw, lib=vsag",  # m=16, ef_construction=256
)
print(f"Index {table_name}.embedding_idx created")
Construct and write data
To simulate searching across a large volume of vector data, this step constructs 1,000 random 64-dimensional vectors and writes them in batches of 100. Each vector is a list of random floating-point numbers generated with the random module in Python.
import random

random.seed(20241023)
batch_size = 100
batch = []
for i in range(1000):
    batch.append(
        {
            "embedding": [random.uniform(-1, 1) for _ in range(64)],
            "metadata": {"idx": i},
        }
    )
    if len(batch) == batch_size:
        client.insert(table_name, data=batch)
        batch = []
if len(batch) > 0:
    client.insert(table_name, data=batch)
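Because the embedding column is declared with 64 dimensions, it can be useful to validate rows on the client side before writing them, so that a mismatch is reported with the offending row position. This is a sketch under that assumption; check_rows is a hypothetical helper, not part of pyobvector, and it makes no claim about how the database itself reports such errors.

```python
EXPECTED_DIM = 64  # dimension of the embedding column in the example table


def check_rows(rows, dim=EXPECTED_DIM):
    """Raise ValueError if any row's embedding is not a sequence of `dim` numbers."""
    for position, row in enumerate(rows):
        embedding = row.get("embedding")
        if not isinstance(embedding, (list, tuple)) or len(embedding) != dim:
            raise ValueError(f"row {position}: expected a {dim}-dimensional embedding")
        if not all(isinstance(x, (int, float)) for x in embedding):
            raise ValueError(f"row {position}: embedding contains a non-numeric value")
    return rows


good = [{"embedding": [0.0] * 64, "metadata": {"idx": 0}}]
bad = [{"embedding": [0.0] * 63, "metadata": {"idx": 1}}]

check_rows(good)  # passes silently
try:
    check_rows(bad)
except ValueError as exc:
    print(exc)  # row 0: expected a 64-dimensional embedding
```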
Perform similar vector search
Use random.uniform to generate a target vector, then search the table into which the data was inserted:
target_data = [random.uniform(-1, 1) for _ in range(64)]
res = client.ann_search(
    table_name,
    vec_data=target_data,
    vec_column_name="embedding",
    distance_func=func.l2_distance,
    topk=5,
    output_column_names=["id", "metadata"],
)
for r in res:
    print(r)
# The expected return result is as follows:
# (63, '{"idx": 62}')
# (796, '{"idx": 795}')
# (187, '{"idx": 186}')
# (784, '{"idx": 783}')
# (880, '{"idx": 879}')
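In the expected output above, each row is a tuple and the metadata value appears as a JSON string rather than a Python dictionary. If you prefer the dictionary shape returned in Milvus-compatible mode, the rows can be post-processed. The following sketch assumes rows behave like tuples in the order given by output_column_names; rows_to_dicts is a hypothetical helper.

```python
import json


def rows_to_dicts(rows, column_names):
    """Pair each row tuple with its column names and decode a JSON string in `metadata`."""
    decoded = []
    for row in rows:
        record = dict(zip(column_names, row))
        if isinstance(record.get("metadata"), str):
            record["metadata"] = json.loads(record["metadata"])
        decoded.append(record)
    return decoded


# Sample rows copied from the expected output above.
sample = [(63, '{"idx": 62}'), (796, '{"idx": 795}')]
print(rows_to_dicts(sample, ["id", "metadata"]))
# [{'id': 63, 'metadata': {'idx': 62}}, {'id': 796, 'metadata': {'idx': 795}}]
```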
References
For more information about vector data, see Vector data.
For information about how to create a vector index and drop an index after a table is created, see Create a vector index.