OceanBase Database V4.3.3 and later support vector storage, vector indexing, and embedding vector retrieval. You can store vectorized data in OceanBase Database for subsequent retrieval.
Hugging Face is an open-source machine learning platform that provides pretrained models, datasets, and tools for developers to easily use and deploy AI models.
This tutorial demonstrates how to integrate the vector retrieval feature of OceanBase Cloud with Hugging Face to perform similarity search and retrieval tasks.
Prerequisites
A transactional instance is available in your environment. For instructions on how to create the instance, see Create a transactional instance.
You have created a MySQL-compatible tenant in the instance. For instructions on how to create the tenant, see Create a MySQL-compatible tenant.
You have a MySQL database and account available under the tenant, and you have granted read and write permissions to the database account. For more information, see Create an account and Create a database (MySQL only).
You are a project admin or instance admin and have the permissions required to read and write data in the instance. If not, contact your organization admin to grant the required permissions.
You have installed Python 3.11 or later.
You have installed the required dependencies.
python3 -m pip install pyobvector sqlalchemy datasets transformers sentence-transformers torch
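To confirm that the packages are importable in your Python environment, you can run a quick check (an optional step, not part of the original instructions):
python3 -c "import pyobvector, sqlalchemy, datasets, sentence_transformers, torch; print('dependencies OK')"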
Step 1: Obtain the database connection information
Log in to the OceanBase Cloud console.
On the instance list page, expand the information of the target instance.
Select Connect > Get Connection String under the target tenant.
In the pop-up window, select Public Network as the connection method.
Follow the prompts in the pop-up window to obtain the public endpoint and the connection string.
Step 2: Build your similarity search application
Set environment variables
Obtain your Hugging Face API key and configure it, along with your OceanBase connection information, as environment variables.
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export HUGGING_FACE_API_KEY=YOUR_HUGGING_FACE_API_KEY
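If you prefer to fail fast on missing configuration, the following small check (not part of the original sample; it only uses the variable names defined above) verifies that every variable is set before the sample code runs:
import os

REQUIRED_VARS = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "HUGGING_FACE_API_KEY",
]
# Report all unset variables at once so you can fix them in a single pass.
missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")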
Sample code snippet
Prepare data
Hugging Face provides various embedding models, and you can choose one that meets your requirements. Here, we download the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face and run it locally to generate embeddings for the sample data:
import os, shutil, torch, requests
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
from datasets import load_dataset
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, l2_distance
# delete cache directory
if os.path.exists("./cache"):
    shutil.rmtree("./cache")
HUGGING_FACE_API_KEY = os.getenv('HUGGING_FACE_API_KEY')
DATASET = "squad" # Name of dataset from HuggingFace Datasets
INSERT_RATIO = 0.001 # Ratio of example dataset to be inserted
data = load_dataset(DATASET, split="validation", cache_dir="./cache")
# Generates a fixed subset. To generate a random subset, remove the seed.
data = data.train_test_split(test_size=INSERT_RATIO, seed=42)["test"]
# Clean up the data structure in the dataset.
data = data.map(
    lambda val: {"answer": val["answers"]["text"][0]},
    remove_columns=["id", "answers", "context"],
)
# Load the embedding model locally with sentence-transformers
# (HF_ENDPOINT was already pointed at the mirror above)
from sentence_transformers import SentenceTransformer

print("Downloading model...")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print("Model downloaded!")
def encode_text(batch):
    questions = batch["question"]
    # Perform inference using the local model
    embeddings = model.encode(questions)
    # Format the embeddings
    formatted_embeddings = []
    for embedding in embeddings:
        formatted_embedding = [round(float(val), 6) for val in embedding]
        formatted_embeddings.append(formatted_embedding)
    batch["embedding"] = formatted_embeddings
    return batch
INFERENCE_BATCH_SIZE = 64 # Batch size of model inference
data = data.map(encode_text, batched=True, batch_size=INFERENCE_BATCH_SIZE)
data_list = data.to_list()
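To sanity-check the prepared data before writing it to the database, you can print one record and the dimension of its embedding (an optional check; the field names match the data prepared above):
sample = data_list[0]
print(sample["question"])
print(sample["answer"])
# all-MiniLM-L6-v2 produces 384-dimensional embeddings
print(len(sample["embedding"]))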
Define the vector table structure and store the vectors in OceanBase
Create a table named huggingface_oceanbase_demo_documents that contains columns for storing text (title, question, and answer) and a column for storing the embedding vectors (embedding), create an HNSW vector index on the embedding column, and then store the vector data in OceanBase:
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')
client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME,
)
table_name = "huggingface_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)
cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("title", String(255), nullable=False),
    Column("question", String(255), nullable=False),
    Column("answer", String(255), nullable=False),
    Column("embedding", VECTOR(384)),
]
# Create vector index
vector_index_params = IndexParam(
    index_name="idx_question_embedding",
    field_name="embedding",
    index_type="HNSW",
    distance_metric="l2",
)
client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params],
)
print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data_list)
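Optionally, you can confirm that the rows were written by running a plain SQL count through the same client (this sketch assumes pyobvector's raw SQL helper is available in your version; any MySQL client connected to the tenant works just as well):
# Count the rows that were just inserted into the demo table
result = client.perform_raw_text_sql(f"SELECT COUNT(*) FROM {table_name}")
print("Rows in table:", result.fetchone()[0])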
Perform semantic search
Generate an embedding vector for each query text with the same embedding model, and then compute the L2 (Euclidean) distance between the query vector and the embedding vectors stored in the vector table to retrieve the most relevant documents:
# Query the most relevant documents for each question.
questions = {
    "question": [
        "What is LGM?",
        "When did Massachusetts first mandate that children be educated in schools?",
    ]
}
# Generate question embeddings
question_embeddings = encode_text(questions)["embedding"]
for i, question in enumerate(questions["question"]):
    print(f"Question: {question}")
    # Search across OceanBase
    search_results = client.ann_search(
        table_name,
        vec_data=question_embeddings[i],
        vec_column_name="embedding",
        distance_func=l2_distance,
        with_dist=True,
        topk=3,
        output_column_names=["id", "answer", "question"],
    )
    # Print out results
    results_list = list(search_results)
    for r in results_list:
        print({
            "answer": r[1],
            "score": r[3] if len(r) > 3 else "N/A",
            "original question": r[2],
            "id": r[0],
        })
    print("\n")
Expected results
- Inserting Data to OceanBase...
Question: What is LGM?
{'answer': 'Last Glacial Maximum', 'score': 0.29572604605808755, 'original question': 'What does LGM stands for?', 'id': 10}
{'answer': 'coordinate the response to the embargo', 'score': 1.2553772660960183, 'original question': 'Why was this short termed organization created?', 'id': 9}
{'answer': '"Reducibility Among Combinatorial Problems"', 'score': 1.2691888905109625, 'original question': 'What is the paper written by Richard Karp in 1972 that ushered in a new era of understanding between intractability and NP-complete problems?', 'id': 11}
Question: When did Massachusetts first mandate that children be educated in schools?
{'answer': '1852', 'score': 0.2408329167590669, 'original question': 'In what year did Massachusetts first require children to be educated in schools?', 'id': 1}
{'answer': 'several regional colleges and universities', 'score': 1.1474774558319025, 'original question': 'In 1890, who did the university decide to team up with?', 'id': 4}
{'answer': '1962', 'score': 1.2703532682776688, 'original question': 'When were stromules discovered?', 'id': 2}
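When you are done experimenting, you can drop the demo table to release storage, using the same helper the sample calls before creating the table:
# Remove the demo table created by this tutorial
client.drop_table_if_exist(table_name)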