OceanBase offers features such as vector storage, vector indexing, and embedding-based vector search. You can store vectorized data in OceanBase Database, making it available for fast and efficient search.
Hugging Face is an open-source machine learning platform that provides pre-trained models, datasets, and tools for developers to easily use and deploy AI models.
Prerequisites
You have deployed OceanBase Database V4.4.0 or later and created a MySQL-compatible tenant.
Your environment includes a MySQL database in that tenant and a user account with read and write privileges on it.
Python 3.11 or above is installed.
Required dependencies are installed:
python3 -m pip install pyobvector sqlalchemy datasets transformers torch
Make sure you have set the ob_vector_memory_limit_percentage parameter in your instance to enable vector search. A recommended value is 30. For details on configuring this parameter, refer to ob_vector_memory_limit_percentage.
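For example, an administrator can enable it with a statement along the following lines (a sketch; verify the exact procedure for your deployment in the parameter reference):

-- Allow vector indexes to use up to 30% of tenant memory.
ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;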
Step 1: Get your database connection information
Reach out to your OceanBase administrator or deployment team to obtain the database connection string, for example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
$host: The IP address for connecting to OceanBase Database. If you are using OceanBase Database Proxy (ODP), use the ODP address. For direct connections, use the IP address of an OBServer node.
$port: The port number for connecting to OceanBase Database. The default for ODP is 2883, which can be customized during ODP deployment. For direct connections, the default is 2881, which can be customized during OceanBase Database deployment.
$database_name: The name of the database you want to access.
Notice
The user connecting to the tenant must have the CREATE, INSERT, DROP, and SELECT privileges on the database. For more details on user privileges, see privilege types in MySQL-compatible mode.
$user_name: The user account for connecting to the tenant. For ODP connections, common formats are username@tenant_name#cluster_name or cluster_name:tenant_name:username. For direct connections, use username@tenant_name.
$password: The password for the account.
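For illustration only, a direct connection with hypothetical values (your host, port, user, and database will differ) might look like this; omitting the password after -p prompts for it interactively:

obclient -h10.10.10.1 -P2881 -uhf_user@mysql_tenant -p -Dtest_db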
For more details about connection strings, see Connect to an OceanBase tenant using OBClient.
Step 2: Build your AI assistant
Set environment variables
Obtain a Hugging Face API key, and set the OceanBase connection information as environment variables:
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export HUGGING_FACE_API_KEY=YOUR_HUGGING_FACE_API_KEY
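Optionally, you can confirm the variables are visible from Python before running the sample code. This short check is not part of the original sample; it simply fails fast if a variable is missing:

import os

# Environment variables the sample code expects (names from the exports above).
required = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "HUGGING_FACE_API_KEY",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
print("All connection variables are set.")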
Sample code snippets
Prepare data
Hugging Face provides a variety of embedding models, and you can select the one that meets your needs. This example downloads the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face and runs it locally to generate embeddings and prepare the data:
import os
import shutil

# Use the HF mirror for dataset and model downloads; set before importing datasets.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

from datasets import load_dataset
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, l2_distance

# Delete the local cache directory to start from a clean state.
if os.path.exists("./cache"):
    shutil.rmtree("./cache")

# Read the Hugging Face API key (not used by the local-inference path below).
HUGGING_FACE_API_KEY = os.getenv('HUGGING_FACE_API_KEY')

DATASET = "squad"     # Name of the dataset from Hugging Face Datasets
INSERT_RATIO = 0.001  # Ratio of the example dataset to be inserted

data = load_dataset(DATASET, split="validation", cache_dir="./cache")
# Generates a fixed subset. To generate a random subset, remove the seed.
data = data.train_test_split(test_size=INSERT_RATIO, seed=42)["test"]
# Clean up the data structure in the dataset.
data = data.map(
    lambda val: {"answer": val["answers"]["text"][0]},
    remove_columns=["id", "answers", "context"],
)
# Load the embedding model locally from Hugging Face.
from sentence_transformers import SentenceTransformer

print("Downloading the model...")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print("Model download completed!")
def encode_text(batch):
    questions = batch["question"]
    # Perform inference using the local model.
    embeddings = model.encode(questions)
    # Format the embeddings: round to 6 decimal places as plain Python floats.
    formatted_embeddings = []
    for embedding in embeddings:
        formatted_embedding = [round(float(val), 6) for val in embedding]
        formatted_embeddings.append(formatted_embedding)
    batch["embedding"] = formatted_embeddings
    return batch

INFERENCE_BATCH_SIZE = 64  # Batch size of model inference

data = data.map(encode_text, batched=True, batch_size=INFERENCE_BATCH_SIZE)
data_list = data.to_list()
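Before writing the records to the database, you can optionally inspect one entry and confirm that the embedding dimension matches the VECTOR(384) column defined in the next step (this check is an addition, not part of the original sample):

# Optional sanity check on the prepared records.
print(f"Prepared {len(data_list)} records.")
sample = data_list[0]
print(f"Fields: {sorted(sample.keys())}")                  # Expect: answer, embedding, question, title
print(f"Embedding dimension: {len(sample['embedding'])}")  # Expect 384 for all-MiniLM-L6-v2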
Define the vector table structure and store the vector data in OceanBase
Create a table named huggingface_oceanbase_demo_documents with title, question, and answer columns for the text fields, and an embedding column for the embedding vectors. Define an HNSW vector index on the embedding column, then store the vector data in OceanBase:
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')

client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME,
)

table_name = "huggingface_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("title", String(255), nullable=False),
    Column("question", String(255), nullable=False),
    Column("answer", String(255), nullable=False),
    Column("embedding", VECTOR(384)),
]

# Create vector index
vector_index_params = IndexParam(
    index_name="idx_question_embedding",
    field_name="embedding",
    index_type="HNSW",
    distance_metric="l2",
)

client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params],
)

print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data_list)
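To confirm the rows landed, you can run a plain SQL count through the client. This assumes your pyobvector version exposes the perform_raw_text_sql helper; if it does not, run the same query through obclient instead:

# Optional: verify the inserted row count with plain SQL.
res = client.perform_raw_text_sql(f"SELECT COUNT(*) FROM {table_name}")
print(f"Rows in {table_name}: {res.fetchone()[0]}")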
Perform semantic search
Generate an embedding vector for each query by reusing the same local embedding model. Then compute the L2 distance between the query vector and each embedding vector in the table to find the most relevant documents:
# Query the most relevant documents based on the questions.
questions = {
    "question": [
        "What is LGM?",
        "When did Massachusetts first mandate that children be educated in schools?",
    ]
}

# Generate question embeddings with the same model used for the documents.
question_embeddings = encode_text(questions)["embedding"]

for i, question in enumerate(questions["question"]):
    print(f"Question: {question}")
    # Search across OceanBase using approximate nearest neighbor search.
    search_results = client.ann_search(
        table_name,
        vec_data=question_embeddings[i],
        vec_column_name="embedding",
        distance_func=l2_distance,
        with_dist=True,
        topk=3,
        output_column_names=["id", "answer", "question"],
    )
    # Print out results
    results_list = list(search_results)
    for r in results_list:
        print({
            "answer": r[1],
            "score": r[3] if len(r) > 3 else "N/A",
            "original question": r[2],
            "id": r[0],
        })
    print("\n")
Expected result
- Inserting Data to OceanBase...
Question: What is LGM?
{'answer': 'Last Glacial Maximum', 'score': 0.29572604605808755, 'original question': 'What does LGM stands for?', 'id': 10}
{'answer': 'coordinate the response to the embargo', 'score': 1.2553772660960183, 'original question': 'Why was this short termed organization created?', 'id': 9}
{'answer': '"Reducibility Among Combinatorial Problems"', 'score': 1.2691888905109625, 'original question': 'What is the paper written by Richard Karp in 1972 that ushered in a new era of understanding between intractability and NP-complete problems?', 'id': 11}
Question: When did Massachusetts first mandate that children be educated in schools?
{'answer': '1852', 'score': 0.2408329167590669, 'original question': 'In what year did Massachusetts first require children to be educated in schools?', 'id': 1}
{'answer': 'several regional colleges and universities', 'score': 1.1474774558319025, 'original question': 'In 1890, who did the university decide to team up with?', 'id': 4}
{'answer': '1962', 'score': 1.2703532682776688, 'original question': 'When were stromules discovered?', 'id': 2}