OceanBase Database supports vector data storage, vector indexes, and embedding vector retrieval in V4.3.3 and later. You can store vectorized data in OceanBase Database for further retrieval.
Tongyi Qianwen (Qwen) is a large language model (LLM) developed by Alibaba Cloud for interpreting and analyzing user inputs. You can try the API service of the Qwen model in the Alibaba Cloud model experience center.
This topic describes how to use the Qwen API to store vectorized data in OceanBase Cloud and query data by using the vector retrieval feature of OceanBase Cloud.
Prerequisites
A transactional cluster instance in MySQL-compatible mode is available in your environment.
To use a cluster instance, you first need to create a tenant. For more information, see Create a tenant.
You have created a MySQL-compatible tenant, a database, and an account, and granted read and write privileges on the database to the account. For more information, see Create an account and Create a database (MySQL compatible mode).
You have been granted the project admin or instance admin role to perform read and write operations on instances in the project. If you do not have the required permissions, contact the organization admin.
You have installed Python 3.9 or later and pip. If the Python version installed on your server is earlier than 3.9, you can use Miniconda to set up a Python 3.9 or later environment. For more information, see Installing Miniconda.
You have installed Poetry, Pyobvector, and DashScope. The installation commands are as follows:
python3 -m pip install poetry
python3 -m pip install pyobvector
python3 -m pip install dashscope
Step 1: Obtain the database connection information
Log in to the OceanBase Cloud console.
On the instance list page, expand the information of the target instance.
Select Connect > Get Connection String under the target tenant.
In the pop-up window, select Public Network as the connection method.
Follow the prompts in the pop-up window to obtain the public endpoint and the connection string.
Step 2: Register an LLM account
Register an Alibaba Cloud Model Studio account, activate the model service, and obtain an API key.
Notice
- The Qwen LLM provides a free quota. Charges will be incurred after the free quota is used up.
Set the API key as an environment variable.
- For a Unix-based system such as Ubuntu or macOS, you can run the following command in a terminal:
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
- For a Windows system, you can run the following command in Command Prompt:
set DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
You must replace YOUR_DASHSCOPE_API_KEY with the actual Qwen API key.
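If the environment variable is missing, the DashScope calls later in this topic fail with a less obvious authentication error. You can check for the variable at startup; the following is a minimal sketch (the require_api_key helper is an illustration for this topic, not part of the official scripts):

```python
import os

def require_api_key(env=None):
    """Return the DashScope API key, or raise a clear error if it is not set.

    `env` defaults to the process environment; a dict can be passed for testing.
    """
    env = os.environ if env is None else env
    key = env.get("DASHSCOPE_API_KEY")
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set; export it before running the scripts."
        )
    return key
```

Calling require_api_key() at the top of a script surfaces a misconfigured environment immediately instead of midway through an embedding run.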
Step 3: Store and query vectorized data
Store vectorized data in OceanBase Cloud
- Prepare test data.
Download the CSV file of precalculated vector data. The file is an open dataset of 1,000 food reviews, with the vector values stored in the last column, so no vector calculation is required. You can also run the following script to recalculate the values in the embedding column (vector column) and generate a new CSV file:
import dashscope
import pandas as pd

input_datapath = "./fine_food_reviews.csv"

# Here the text_embedding_v1 model is used. You can change the model as needed.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(
        model=dashscope.TextEmbedding.Models.text_embedding_v1, input=text
    )
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

df = pd.read_csv(input_datapath, index_col=0)
# It takes a few minutes to generate the CSV file by calling the Qwen embedding API row by row.
df["embedding"] = df.combined.apply(generate_embeddings)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
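Because the script above calls the embedding API once per row, regenerating the file takes a few minutes. The generate_embeddings function already accepts a list of texts, so one way to reduce the number of API calls is to send texts in batches; a sketch follows (the batch size of 25 is an assumption to verify against the DashScope documentation, and the commented usage is hypothetical):

```python
def chunks(items, size=25):
    """Split `items` into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical usage with the generate_embeddings function defined above:
#   texts = df.combined.tolist()
#   embeddings = []
#   for batch in chunks(texts):
#       embeddings.extend(generate_embeddings(batch))
#   df["embedding"] = embeddings
```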
- Run the following script to insert the test data into OceanBase Cloud. The script must be located in the same directory as the test data.
import os
import sys
import csv
import json

from pyobvector import *
from sqlalchemy import Column, Integer, String

# Connect to OceanBase Cloud by using pyobvector. If the username or password
# contains an at sign (@), replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# The test dataset has been vectorized and is stored in the same directory as the
# Python script by default. If you vectorize the dataset again, specify the new file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)

# Define the columns. The last column is a vector column.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536)),
]

# Define the table name.
table_name = 'fine_food_reviews'

# If the table does not exist, create it.
if not client.check_table_exists(table_name):
    client.create_table(table_name, columns=cols)
    # Create an index on the vector column.
    client.create_index(
        table_name=table_name,
        is_vec_index=True,
        index_name='vidx',
        column_names=['embedding'],
        vidx_params='distance=l2, type=hnsw, lib=vsag',
    )

# Open and read the CSV file.
with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header line.
    headers = next(csvreader)
    print("Headers:", headers)
    batch = []  # Buffer rows and insert them into the database 10 at a time.
    for i, row in enumerate(csvreader):
        # The CSV file contains nine columns: `id`, `product_id`, `user_id`, `score`,
        # `summary`, `text`, `combined`, `n_tokens`, and `embedding`.
        if not row:
            break
        food_review_line = {
            'id': row[0], 'product_id': row[1], 'user_id': row[2], 'score': row[3],
            'summary': row[4], 'text': row[5], 'combined': row[6],
            'n_tokens': row[7], 'embedding': json.loads(row[8]),
        }
        batch.append(food_review_line)
        # Insert 10 rows at a time.
        if (i + 1) % 10 == 0:
            client.insert(table_name, batch)
            batch = []  # Clear the buffer.
    # Insert the remaining rows, if any.
    if batch:
        client.insert(table_name, batch)

# Check the data in the table and make sure that all rows have been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of inserted rows: {result[0]}")
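As the connection comment above notes, an at sign (@) in the username or password must be written as %40 in the connection URI. Rather than replacing characters by hand, you can percent-encode credentials with the Python standard library; a small sketch:

```python
from urllib.parse import quote

def encode_credential(value):
    """Percent-encode a username or password for safe use in a connection URI."""
    # safe="" ensures every reserved character, including '@', is encoded.
    return quote(value, safe="")
```

For example, encode_credential("user@tenant") returns "user%40tenant", which can then be passed as the user argument when constructing the client.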
Query the vector data
- Save the following Python script and name it query.py.
import os
import sys
import csv
import json

from pyobvector import *
from sqlalchemy import func
import dashscope

# Obtain the command-line argument.
if len(sys.argv) != 2:
    print("Enter a query statement.")
    sys.exit()
queryStatement = sys.argv[1]

# Connect to OceanBase Cloud by using pyobvector. If the username or password
# contains an at sign (@), replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# Define the function for generating text vectors.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(
        model=dashscope.TextEmbedding.Models.text_embedding_v1, input=text
    )
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

def query_ob(query, tableName, vector_name="embedding", top_k=1):
    embedding = generate_embeddings(query)
    # Perform an approximate nearest neighbor search (ANNS).
    res = client.ann_search(
        table_name=tableName,
        vec_data=embedding,
        vec_column_name=vector_name,
        distance_func=func.l2_distance,
        topk=top_k,
        output_column_names=['combined'],
    )
    for row in res:
        print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))

# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 'embedding', 1)
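The func.l2_distance metric used in the ANN search above is the Euclidean distance, matching the distance=l2 setting of the HNSW index. A minimal illustration of the value the database minimizes when ranking neighbors:

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

The nearest neighbor returned by the search is the row whose embedding yields the smallest such distance from the query embedding.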
- Run the script with a question to query an answer:
python3 query.py 'pet food'
The expected result is as follows:
This is so good!: I purchased this after my sister sent a small bag to me in a gift box. I loved it so much I wanted to find it to buy for myself and keep it around. I always look on Amazon because you can find everything here and true enough, I found this wonderful candy. It is nice to keep in your purse for when you are out and about and get a dry throat or a tickle in the back of your throat. It is also nice to have in a candy dish at home for guests to try.