OceanBase Database supports vector data storage, vector indexes, and embedding vector retrieval in V4.3.3 and later. You can store vectorized data in OceanBase Database for further retrieval.
As an artificial intelligence company, OpenAI has developed several large language models (LLMs) that are excellent at natural language understanding and generation, and capable of generating text, answering questions, and conducting conversations. You can access these models through the OpenAI API.
This topic describes how to use the OpenAI API to store vectorized data in OceanBase Cloud and query data by using the vector retrieval feature of OceanBase Cloud.
Prerequisites
A transactional cluster instance in MySQL-compatible mode is available in your environment.
To use a cluster instance, you must first create a tenant. For more information, see Create a tenant.
You have created a MySQL-compatible tenant, a database, and an account, and granted the read and write permissions to the database account. For more information, see Create an account and Create a database (MySQL compatible mode).
You have been granted the project admin or instance admin role to perform read and write operations on instances in the project. If you do not have the required permissions, contact the organization admin.
You have installed Python 3.9 or later and pip. If the Python version installed on your server is earlier than 3.9, you can use Miniconda to build a new environment with Python 3.9 or later. For more information, see Installing Miniconda.
You have installed Poetry, Pyobvector, and OpenAI. The installation commands are as follows:
python3 -m pip install poetry
python3 -m pip install pyobvector
python3 -m pip install openai
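If you want to confirm that the packages are installed correctly, you can print their versions with a short check like the one below. This is an optional sketch that is not part of the installation steps; it only reads the version metadata of the packages installed above.
# Optional sanity check: print the installed versions of the packages used in this topic.
from importlib.metadata import version

for pkg in ("poetry", "pyobvector", "openai"):
    print(pkg, version(pkg))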
Step 1: Obtain the database connection information
Log in to the OceanBase Cloud console.
On the instance list page, expand the information of the target instance.
Select Connect > Get Connection String under the target tenant.
In the pop-up window, select Public Network as the connection method.
Follow the prompts in the pop-up window to obtain the public endpoint and the connection string.
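Before moving on, you can optionally verify that the connection information works. The following is a minimal sketch that assumes placeholder values for the host, port, username, and password; replace them with the public endpoint and credentials you just obtained.
# Optional connectivity check by using pyobvector. Replace the placeholders with your own connection information.
from pyobvector import ObVecClient

client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")
cursor = client.perform_raw_text_sql("SELECT 1;")
print(cursor.fetchone())  # Expected output: (1,)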
Step 2: Register an LLM account
Obtain an OpenAI API key.
Log in to the OpenAI platform.
Click API Keys in the upper-right corner.
Click Create API Key.
Specify the required information and click Create API Key.
Set the API key as an environment variable.
- For a Unix-based system such as Ubuntu or macOS, you can run the following command in a terminal:
export OPENAI_API_KEY='your-api-key'
- For a Windows system, you can run the following command in Command Prompt:
set OPENAI_API_KEY=your-api-key
Replace your-api-key with your actual OpenAI API key.
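To confirm that the key is picked up from the environment, you can optionally run a short test such as the following. This is a sketch only; the OpenAI client reads OPENAI_API_KEY automatically, and the model name matches the one used later in this topic.
# Optional check: make sure the OpenAI client can authenticate with the key from the environment.
import os
from openai import OpenAI

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set."
client = OpenAI()  # Reads OPENAI_API_KEY from the environment.
res = client.embeddings.create(input="hello", model="text-embedding-ada-002")
print(len(res.data[0].embedding))  # text-embedding-ada-002 returns 1536-dimensional vectors.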
Step 3: Store and query vectorized data
Store vectorized data in OceanBase Cloud
- Prepare test data.
Download the CSV file of precalculated vectorized data. This file is an open dataset of 1,000 food comments. The last column stores the vectorized values. Therefore, no vector calculation is required. You can also run the following command to recalculate values in the embedding column (vector column) to generate a new CSV file:
from openai import OpenAI
import pandas as pd
input_datapath = "./fine_food_reviews.csv"
client = OpenAI()
# Here the text-embedding-ada-002 model is used. You can change the model as needed.
def embedding_text(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = client.embeddings.create(input=text, model=model)
    return res.data[0].embedding
df = pd.read_csv(input_datapath, index_col=0)
# It takes a few minutes to generate the CSV file by calling the OpenAI Embedding API row by row.
df["embedding"] = df.combined.apply(embedding_text)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
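The script above calls the Embedding API once per row, which is why it takes a few minutes. The embeddings endpoint also accepts a list of inputs, so you can batch the calls if you want to shorten the wait. The following is a sketch based on that assumption rather than part of the original example; the batch size of 100 is arbitrary and must stay within the model's token limits.
# Hypothetical batched variant: send multiple texts per API call instead of one row at a time.
def embedding_texts(texts, model="text-embedding-ada-002"):
    res = client.embeddings.create(input=texts, model=model)
    # The API returns one embedding per input, in the same order as the inputs.
    return [item.embedding for item in res.data]

# Example: embed the "combined" column in chunks of 100 rows.
embeddings = []
for start in range(0, len(df), 100):
    chunk = df.combined.iloc[start:start + 100].tolist()
    embeddings.extend(embedding_texts(chunk))
df["embedding"] = embeddings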
- Run the following script to insert the test data into OceanBase Cloud. The script must be located in the same directory as the test data.
import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import Column, Integer, String
# Connect to OceanBase Cloud by using pyobvector. If the username or password contains an at sign (@), replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")
# The test dataset has been vectorized and is stored in the same directory as the Python script by default. If you vectorize the dataset again, specify the new file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)
# Define columns. The last column is a vector column.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536))
]
# Define the table name.
table_name = 'fine_food_reviews'
# If the table does not exist, create it.
if not client.check_table_exists(table_name):
    client.create_table(table_name, columns=cols)
# Create an index on the vector column.
client.create_index(
    table_name=table_name,
    is_vec_index=True,
    index_name='vidx',
    column_names=['embedding'],
    vidx_params='distance=l2, type=hnsw, lib=vsag',
)
# Open and read the CSV file.
with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header line.
    headers = next(csvreader)
    print("Headers:", headers)
    batch = []  # Buffer rows so that they are inserted into the database 10 rows at a time.
    for i, row in enumerate(csvreader):
        # The CSV file contains nine columns: `id`, `product_id`, `user_id`, `score`, `summary`, `text`, `combined`, `n_tokens`, and `embedding`.
        if not row:
            break
        food_review_line = {'id': row[0], 'product_id': row[1], 'user_id': row[2], 'score': row[3], 'summary': row[4],
                            'text': row[5], 'combined': row[6], 'n_tokens': row[7], 'embedding': json.loads(row[8])}
        batch.append(food_review_line)
        # Insert 10 rows each time.
        if (i + 1) % 10 == 0:
            client.insert(table_name, batch)
            batch = []  # Clear the buffer.
    # Insert the remaining rows, if any.
    if batch:
        client.insert(table_name, batch)
# Check the data in the table and make sure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of inserted rows:{result[0]}")
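If you want to confirm that the vector index was created along with the table, you can also query the index metadata through the same client. This optional check uses standard MySQL-mode syntax; the index name vidx comes from the script above.
# Optional check: list the indexes on the table, including the vector index 'vidx' created above.
cursor = client.perform_raw_text_sql(f"SHOW INDEX FROM {table_name};")
for row in cursor.fetchall():
    print(row)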
Query the vector data
- Save the following Python script as openAIQuery.py.
import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import func
from openai import OpenAI
# Obtain command-line options.
if len(sys.argv) != 2:
    print("Enter a query statement.")
    sys.exit()
queryStatement = sys.argv[1]
# Connect to OceanBase Cloud by using pyobvector. If the username or password contains an at sign (@), replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")
openAIclient = OpenAI()
# Define the function for generating text vectors.
def generate_embeddings(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = openAIclient.embeddings.create(input=text, model=model)
    return res.data[0].embedding
def query_ob(query, tableName, vector_name="embedding", top_k=1):
    embedding = generate_embeddings(query)
    # Perform an approximate nearest neighbor search (ANNS).
    res = client.ann_search(
        table_name=tableName,
        vec_data=embedding,
        vec_column_name=vector_name,
        distance_func=func.l2_distance,
        topk=top_k,
        output_column_names=['combined']
    )
    for row in res:
        print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))
# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 'embedding', 1)
- Run the script with a query to get an answer.
python3 openAIQuery.py 'pet food'
The expected result is as follows:
Crack for dogs.: These thing are like crack for dogs. I am not sure of the make-up but the doggies sure love them.
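The script prints only the single nearest comment because top_k is set to 1 in the last line of openAIQuery.py. To return more candidates, pass a larger value when calling query_ob, for example:
# Return the three most similar comments instead of one.
query_ob(queryStatement, table_name, 'embedding', 3)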