OpenAI is an artificial intelligence company that has developed advanced large language models. These models excel in natural language understanding and generation, enabling functionalities such as text generation, question answering, and conversational interactions, all accessible via APIs.
OceanBase Database provides vector storage, vector indexing, and embedding vector search capabilities. You can leverage OpenAI's APIs to store vectorized data in OceanBase Database and use OceanBase Database's vector search capabilities to query related data.
Prerequisites
You have deployed OceanBase Database V4.4.0 or later and created a MySQL-compatible tenant. For more information, see Create a tenant.
You have created a database and an account in the MySQL-compatible tenant, and the account has read and write privileges on the database.
You have installed Python 3.9 or later and the corresponding pip.
You have installed Poetry, PyObVector, and the OpenAI SDK:
python3 -m pip install poetry
python3 -m pip install pyobvector
python3 -m pip install openai
You have obtained an OpenAI API key.
You have set the ob_vector_memory_limit_percentage parameter in the tenant to enable vector search. We recommend that you set the value to 30 for OceanBase Database versions earlier than V4.4.1, and keep the default value 0 for V4.4.1 and later. For more information about this parameter, see ob_vector_memory_limit_percentage.
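For a deployment earlier than V4.4.1, the parameter can be set in the tenant with a statement like the following (a sketch only; the parameter name comes from the prerequisite above, and the value should be adjusted to your environment):

```sql
-- Reserve 30% of tenant memory for vector indexes (pre-V4.4.1 deployments).
ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;
```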
Step 1: Obtain the connection string of OceanBase Database
Contact the deployment personnel or administrator of OceanBase Database to obtain the database connection string. For example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameter description:
$host: the IP address for connecting to OceanBase Database. For connection through OceanBase Database Proxy (ODP), use the IP address of an ODP. For direct connection, use the IP address of an OBServer node.
$port: the port for connecting to OceanBase Database. For connection through ODP, the default value is 2883, which can be customized when ODP is deployed. For direct connection, the default value is 2881, which can be customized when OceanBase Database is deployed.
$database_name: the name of the database to be accessed.
Notice
The user for connecting to a tenant must have the CREATE, INSERT, DROP, and SELECT privileges on the database. For more information about user privileges, see Privilege types in MySQL-compatible mode.
$user_name: the tenant account. For connection through ODP, the tenant account can be in the username@tenant name#cluster name or cluster name:tenant name:username format. For direct connection, the tenant account is in the username@tenant name format.
$password: the account password.
For more information about connection strings, see Connect to an OceanBase tenant by using OBClient.
Here is an example:
obclient -hxxx.xxx.xxx.xxx -P2881 -utest_user001@mysql001 -p****** -Dtest
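The scripts later in this topic connect through PyObVector rather than OBClient, and their comments note that @ in the username or password must be written as %40. That is standard percent-encoding; a minimal sketch, using a hypothetical tenant account:

```python
from urllib.parse import quote

# Hypothetical tenant account in the username@tenant_name format.
user = "test_user001@mysql001"
encoded = quote(user, safe="")
print(encoded)  # test_user001%40mysql01 has @ replaced by %40
```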
Step 2: Register an account on the LLM platform
Obtain an OpenAI API key:
Log in to the OpenAI console.
In the upper-right corner, click API Keys.
Click Create API Key.
Fill in the required information and click Create API Key.
Configure the OpenAI API key as an environment variable:
For a Unix-based system (such as Ubuntu or macOS), run the following command in the terminal:
export OPENAI_API_KEY='your-api-key'
For a Windows system, run the following command in the command prompt:
set OPENAI_API_KEY=your-api-key
Make sure to replace your-api-key with your actual OpenAI API key.
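The OpenAI SDK reads OPENAI_API_KEY from the environment, so it is worth confirming the variable is visible to Python before running the scripts below. A small illustrative sketch (the check_api_key helper is hypothetical, not part of any SDK):

```python
import os

def check_api_key(env=None):
    """Return True if OPENAI_API_KEY is set and non-empty in the given mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY"))

# Stand-in mappings so the check can be shown without a real key:
print(check_api_key({"OPENAI_API_KEY": "sk-test"}))  # True
print(check_api_key({}))                             # False
```

Calling check_api_key() with no argument checks the real environment.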
Step 3: Store vector data in OceanBase Database
Store vector data in OceanBase Database
Prepare test data.
Download the CSV file that contains pre-computed vectorized data. The file contains 1,000 food reviews, with the vectorized values in the last column, so you do not need to compute the vectors again. You can also use the following code to recalculate the embedding column (that is, the vector column) and generate a new CSV file.
from openai import OpenAI
import pandas as pd

input_datapath = "./fine_food_reviews.csv"
client = OpenAI()

# You can change the text-embedding-ada-002 model to another one as needed.
def embedding_text(text, model="text-embedding-ada-002"):
    # For more information about how to generate embedding vectors, see: https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663
    res = client.embeddings.create(input=text, model=model)
    return res.data[0].embedding

df = pd.read_csv(input_datapath, index_col=0)
# It takes a few minutes to generate vectors. The OpenAI Embedding API is called for each row.
df["embedding"] = df.combined.apply(embedding_text)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
Run the following script to insert the test data into OceanBase Database. The script must be in the same directory as the test data.
import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import Column, Integer, String
# Connect to OceanBase Database by using pyobvector. Replace @ in the username and password with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")
# The prepared test dataset has been vectorized. By default, the dataset is stored in the same directory as the Python script. If you have vectorized the dataset yourself, replace the file name with the name of the corresponding file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)
# Define columns. The vectorized column is the last field.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536))
]
# Specify the table name.
table_name = 'fine_food_reviews'
# Create the table if it does not exist.
if not client.check_table_exists(table_name):
    client.create_table(table_name, columns=cols)
# Create an index for the vector column.
client.create_index(
    table_name=table_name,
    is_vec_index=True,
    index_name='vidx',
    column_names=['embedding'],
    vidx_params='distance=l2, type=hnsw, lib=vsag',
)
# Open and read the CSV file.
with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header row.
    headers = next(csvreader)
    print("Headers:", headers)
    batch = []  # Store data. Insert the data into the database in batches of 10 rows.
    for i, row in enumerate(csvreader):
        # The CSV file has nine fields: id, product_id, user_id, score, summary, text, combined, n_tokens, and embedding.
        if not row:
            continue
        food_review_line = {'id': row[0], 'product_id': row[1], 'user_id': row[2], 'score': row[3],
                            'summary': row[4], 'text': row[5], 'combined': row[6], 'n_tokens': row[7],
                            'embedding': json.loads(row[8])}
        batch.append(food_review_line)
        # Insert data in batches of 10 rows.
        if (i + 1) % 10 == 0:
            client.insert(table_name, batch)
            batch = []  # Clear the batch.
    # Insert the remaining rows (if any).
    if batch:
        client.insert(table_name, batch)
# Check the data in the table to make sure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of imported rows: {result[0]}")
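The batch-insert loop in the script above follows a simple accumulate-and-flush pattern. Isolated from the database, it can be sketched as follows, with insert_fn standing in for client.insert:

```python
def insert_in_batches(rows, insert_fn, batch_size=10):
    """Accumulate rows and flush them to insert_fn in fixed-size batches."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            insert_fn(batch)
            batch = []
    if batch:  # flush the remainder
        insert_fn(batch)

# Record each flushed batch instead of writing to a database.
calls = []
insert_in_batches(list(range(23)), calls.append, batch_size=10)
print([len(b) for b in calls])  # [10, 10, 3]
```

Flushing in batches keeps memory bounded and reduces round trips compared with inserting one row at a time.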
Query data in OceanBase Database
Save the following Python script as openAIQuery.py.
import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import func
from openai import OpenAI

# Get the command-line arguments.
if len(sys.argv) != 2:
    print("Please enter a query statement.")
    sys.exit()
queryStatement = sys.argv[1]

# Connect to OceanBase Database by using pyobvector. Replace @ in the username or password with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")
openAIclient = OpenAI()

# Define a function to generate text vectors.
def generate_embeddings(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see: https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663
    res = openAIclient.embeddings.create(input=text, model=model)
    return res.data[0].embedding

def query_ob(query, tableName, vector_name="embedding", top_k=1):
    embedding = generate_embeddings(query)
    # Perform approximate nearest neighbor search.
    res = client.ann_search(
        table_name=tableName,
        vec_data=embedding,
        vec_column_name=vector_name,
        distance_func=func.l2_distance,
        topk=top_k,
        output_column_names=['combined']
    )
    for row in res:
        print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))

# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 'embedding', 1)
Enter a question and get the relevant answer.
python3 openAIQuery.py 'pet food'
The expected result is as follows:
Crack for dogs. These thing are like crack for dogs. I am not sure of the make-up but the doggies sure love them.
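The search ranks rows by L2 (Euclidean) distance, the metric chosen in the index definition above (distance=l2). As a quick illustration of what is computed between the query vector and each candidate vector (a toy two-dimensional example, not the 1,536-dimensional embeddings):

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(l2_distance([0.0, 3.0], [4.0, 0.0]))  # 5.0
```

Smaller distances mean more similar texts, so the row with the smallest distance to the query embedding is returned first.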