Use Python for vector retrieval (V4.3.3)

Last updated: 2025-11-27 06:23:27

OceanBase Database supports vector storage and retrieval through pyobvector, a Python SDK for vector storage in OceanBase Database. pyobvector is built on SQLAlchemy and is largely compatible with the Milvus API. This topic describes how to get started with pyobvector.

Prerequisites

Make sure that you have deployed an OceanBase cluster and created a MySQL tenant.
Make sure that Python 3.9 or later is installed in your environment.
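The Python requirement can be checked programmatically before you install anything. The following is a minimal sketch; the helper name meets_minimum is hypothetical and is not part of pyobvector.

```python
import sys

# Minimum interpreter version stated in the prerequisites.
MIN_PYTHON = (3, 9)


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor, ...) version satisfies the minimum."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the major and minor components.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {current}: {status}")
```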

Get started

Install pyobvector

First, install pyobvector in your local environment by running the following command:

pip install -U pyobvector

The -U option upgrades pyobvector to the latest version if it is already installed in your environment. Otherwise, the latest version is installed.
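To confirm that the installation succeeded, you can query the installed package metadata. This sketch uses only the standard library module importlib.metadata; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package="pyobvector"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("pyobvector is not installed; run: pip install -U pyobvector")
    else:
        print(f"pyobvector {version} is installed")
```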

Usage

pyobvector supports the following two modes:

Milvus-compatible mode: You can use the vector storage and retrieval capabilities of OceanBase Database in a way that is similar to the APIs of Milvus.
Hybrid SQLAlchemy mode: You can use the vector storage feature provided by the ObVecClient class and execute relational database statements by using the SQLAlchemy library. In this mode, you can treat pyobvector as an extension of SQLAlchemy.

Milvus-compatible mode

Connect to the client

pyobvector provides the MilvusLikeClient class that allows you to use the vector storage and retrieval capabilities of OceanBase Database in a way that is compatible with Milvus APIs. You can create a client object by using the following statement:

from pyobvector import *

# Replace the following connection information with that of your database instance.
client = MilvusLikeClient(uri="127.0.0.1:2881", user="root@test", db_name="test")
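Rather than editing connection values in the source code, you may prefer to read them from the environment. The following is a sketch under stated assumptions: the environment variable names OB_URI, OB_USER, and OB_DB_NAME and the helper connection_kwargs are hypothetical, and the defaults mirror the values used in this topic.

```python
import os


def connection_kwargs(env=None):
    """Build the keyword arguments used by the client constructor.

    The OB_URI, OB_USER, and OB_DB_NAME variable names are hypothetical;
    the defaults are the example values from this topic.
    """
    if env is None:
        env = os.environ
    return {
        "uri": env.get("OB_URI", "127.0.0.1:2881"),
        "user": env.get("OB_USER", "root@test"),
        "db_name": env.get("OB_DB_NAME", "test"),
    }


if __name__ == "__main__":
    print(connection_kwargs())
```

With pyobvector installed, the result could then be unpacked into the constructor, for example MilvusLikeClient(**connection_kwargs()).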

Create a collection with vector indexes

To remain compatible with the Milvus APIs, the MilvusLikeClient class of pyobvector also provides methods such as create_collection. In OceanBase Database, a collection maps to a table, so this method actually creates a table. The following example creates a table with three columns named id, embedding, and metadata, and creates an HNSW vector index on the embedding column. The embedding column is a vector of 64 dimensions.

fields = [
    FieldSchema(
        name="id",
        dtype=DataType.INT64,
        is_primary=True,
        auto_id=True,
    ),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=64),
    FieldSchema(name="metadata", dtype=DataType.JSON),
]

index_params = MilvusLikeClient.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_name="embedding_idx",
    index_type=VecIndexType.HNSW,
    distance="l2",
    m=16,
    ef_construction=256,
)

schema = CollectionSchema(fields)
table_name = "vector_search"
client.create_collection(table_name, schema=schema, index_params=index_params)

Construct and write data

To simulate the scenario of searching for vectors in a large amount of vector data, you can construct some vector data in this step. You can use the random module in Python to construct a list of random floating-point numbers.

import random

random.seed(20241023)

batch_size = 100
batch = []
for i in range(1000):
    batch.append(
        {
            "embedding": [random.uniform(-1, 1) for _ in range(64)],
            "metadata": {"idx": i},
        }
    )
    if len(batch) == batch_size:
        client.insert(collection_name=table_name, data=batch)
        batch = []

if len(batch) > 0:
    client.insert(collection_name=table_name, data=batch)
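The batching loop above can also be factored into a reusable generator. This is a minimal sketch that needs no database connection; the helper names chunked and make_rows are hypothetical and are not part of pyobvector.

```python
import random


def chunked(rows, size):
    """Yield consecutive lists of at most `size` rows from an iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def make_rows(count, dim, seed=20241023):
    """Generate rows shaped like the ones inserted in this topic."""
    rng = random.Random(seed)
    for i in range(count):
        yield {
            "embedding": [rng.uniform(-1, 1) for _ in range(dim)],
            "metadata": {"idx": i},
        }


if __name__ == "__main__":
    # Print the size of each batch produced for 250 rows in batches of 100.
    print([len(batch) for batch in chunked(make_rows(250, 64), 100)])
```

With a client in place, each batch yielded by chunked could be passed to client.insert(collection_name=table_name, data=batch), which also covers the final partial batch without a separate check.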

Perform similar vector search

You can use random.uniform to construct a target vector and search the collection into which the data was inserted:

target_data = [random.uniform(-1, 1) for _ in range(64)]
res = client.search(
    collection_name=table_name,
    data=target_data,
    anns_field="embedding",
    limit=5,
    output_fields=["id", "metadata"],
)
print(res)
# The expected return result is as follows:
# [{'id': 63, 'metadata': {'idx': 62}}, {'id': 796, 'metadata': {'idx': 795}}, {'id': 187, 'metadata': {'idx': 186}}, {'id': 784, 'metadata': {'idx': 783}}, {'id': 880, 'metadata': {'idx': 879}}]
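An HNSW index performs approximate nearest neighbor search, so it can help to know what the exact answer looks like. The following sketch computes exact L2 nearest neighbors by scanning every row in plain Python. It illustrates the distance metric only and is not how OceanBase Database or pyobvector executes the search; the helper names l2_distance and exact_top_k are hypothetical.

```python
import heapq
import math
import random


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(rows, target, k):
    """Return the k rows closest to target, nearest first, by full scan."""
    return heapq.nsmallest(
        k, rows, key=lambda row: l2_distance(row["embedding"], target)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    rows = [
        {"id": i + 1, "embedding": [rng.uniform(-1, 1) for _ in range(64)]}
        for i in range(1000)
    ]
    target = [rng.uniform(-1, 1) for _ in range(64)]
    print([row["id"] for row in exact_top_k(rows, target, 5)])
```

Comparing the IDs from a full scan like this with the IDs returned by client.search on the same data is one way to estimate the recall of the index.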

Hybrid SQLAlchemy mode

Connect to the client

pyobvector provides the ObVecClient class that allows you to use the vector storage and retrieval capabilities of OceanBase Database together with SQLAlchemy. You can create a client object by using the following statement:

from pyobvector import *

# Replace the following connection information with that of your database instance.
client = ObVecClient(uri="127.0.0.1:2881", user="root@test", db_name="test")

Create a table and a vector index

The following example creates a table with three columns named id, embedding, and metadata, and creates an HNSW vector index on the embedding column. The embedding column is a vector of 64 dimensions.

from sqlalchemy import Column, Integer, JSON
from sqlalchemy import func

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("embedding", VECTOR(64)),
    Column("metadata", JSON),
]
table_name = "vector_test3"
client.create_table(table_name, columns=cols)
print(f"Table {table_name} created")
client.create_index(
    table_name,
    is_vec_index=True,
    index_name="embedding_idx",
    column_names=["embedding"],
    vidx_params="distance=l2, type=hnsw, lib=vsag",  # m=16, ef_construction=256
)
print(f"Index {table_name}.embedding_idx created")
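The vidx_params argument above is a comma-separated string of key=value pairs. If you assemble it from configuration, a small formatter keeps the string consistent. This is only a sketch: build_vidx_params is a hypothetical helper, it does not validate keys or values, and which parameters are accepted depends on your OceanBase Database version.

```python
def build_vidx_params(**params):
    """Join keyword arguments into a 'key=value, key=value' string.

    Keyword order is preserved, so the output follows the call order.
    """
    return ", ".join(f"{key}={value}" for key, value in params.items())


if __name__ == "__main__":
    # Reproduces the parameter string used in this topic.
    print(build_vidx_params(distance="l2", type="hnsw", lib="vsag"))
```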

Construct and write data

To simulate the scenario of vector search in a large amount of vector data, construct some vector data in this step. You can use the random module in Python to construct a list of random floating-point numbers.

import random

random.seed(20241023)

batch_size = 100
batch = []
for i in range(1000):
    batch.append(
        {
            "embedding": [random.uniform(-1, 1) for _ in range(64)],
            "metadata": {"idx": i},
        }
    )
    if len(batch) == batch_size:
        client.insert(table_name, data=batch)
        batch = []

if len(batch) > 0:
    client.insert(table_name, data=batch)
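Because the embedding column is declared with 64 dimensions, it can be useful to check each batch on the client side before inserting it. The following is a minimal sketch; validate_rows is a hypothetical helper and is not part of pyobvector.

```python
import math


def validate_rows(rows, dim, column="embedding"):
    """Raise ValueError if a vector has the wrong length or non-finite values.

    Returns the same list unchanged when every row passes.
    """
    for position, row in enumerate(rows):
        vector = row[column]
        if len(vector) != dim:
            raise ValueError(
                f"row {position}: expected {dim} dimensions, got {len(vector)}"
            )
        if not all(math.isfinite(value) for value in vector):
            raise ValueError(f"row {position}: vector contains NaN or infinity")
    return rows


if __name__ == "__main__":
    batch = [{"embedding": [0.5] * 64, "metadata": {"idx": 0}}]
    validate_rows(batch, dim=64)
    print("batch is valid")
```

A call such as client.insert(table_name, data=validate_rows(batch, 64)) would then fail early with a readable message instead of a database error.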

Perform similar vector search

Use random.uniform to generate a target vector, and search the table into which the data was inserted:

target_data = [random.uniform(-1, 1) for _ in range(64)]
res = client.ann_search(
    table_name,
    vec_data=target_data,
    vec_column_name="embedding",
    distance_func=func.l2_distance,
    topk=5,
    output_column_names=["id", "metadata"],
)
for r in res:
    print(r)
# The expected return result is as follows:
# (63, '{"idx": 62}')
# (796, '{"idx": 795}')
# (187, '{"idx": 186}')
# (784, '{"idx": 783}')
# (880, '{"idx": 879}')
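In the expected output above, the metadata column is returned as JSON text rather than as a Python dictionary. The following sketch decodes such rows with the standard json module. It assumes each row unpacks into an (id, metadata) pair, as in the output shown; decode_rows is a hypothetical helper.

```python
import json


def decode_rows(rows):
    """Convert (id, metadata) pairs into dicts, decoding JSON text if needed."""
    decoded = []
    for row_id, metadata in rows:
        if isinstance(metadata, (str, bytes)):
            metadata = json.loads(metadata)
        decoded.append({"id": row_id, "metadata": metadata})
    return decoded


if __name__ == "__main__":
    # Sample rows copied from the expected output in this topic.
    sample = [(63, '{"idx": 62}'), (796, '{"idx": 795}')]
    print(decode_rows(sample))
```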

References

For more information about vector data, see Vector data.
For information about how to create a vector index and drop an index after a table is created, see Create a vector index.