OpenAI is an artificial intelligence company that has developed several large language models. These models excel at understanding and generating natural language, making them highly effective for tasks such as text generation, answering questions, and engaging in conversations. Access to these models is available through an API.
OceanBase offers features such as vector storage, vector indexing, and embedding-based vector search. By using OpenAI's API, you can convert data into vectors, store these vectors in OceanBase, and then take advantage of OceanBase's vector search capabilities to find relevant data.
Prerequisites
You have installed OceanBase Database V4.3.3 or later and created a MySQL-compatible tenant. After you create a tenant, follow the steps below. For more information, see Create a tenant.
You have a MySQL-compatible tenant, database, and account in your environment, and the database account has read and write privileges.
You have installed Python 3.9 or later and pip.
You have installed Poetry, Pyobvector, and OpenAI SDK. The installation commands are as follows:
python3 pip install poetry python3 pip install pyobvector python3 pip install openaiYou have obtained an OpenAI API key.
You have set the
ob_vector_memory_limit_percentageparameter in the tenant to enable vector search. We recommend that you set the value to30for OceanBase Database versions earlier than V4.3.5 BP3, and to0for V4.3.5 BP3 and later. For more information about this parameter, see ob_vector_memory_limit_percentage.
Step 1: Obtain the connection string of OceanBase
Contact the OceanBase deployment engineer or administrator to obtain the connection string of OceanBase, for example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
$host: the IP address for connecting to OceanBase Database. For connection through OceanBase Database Proxy (ODP), use the IP address of an ODP. For direct connection, use the IP address of an OBServer node.$port: the port for connecting to OceanBase Database. For connection through ODP, the default value is2883, which can be customized when ODP is deployed. For direct connection, the default value is2881, which can be customized when OceanBase Database is deployed.$database_name: the name of the database to be accessed.Notice
The user for connecting to a tenant must have the
CREATE,INSERT,DROP, andSELECTprivileges on the database. For more information about user privileges, see Privilege types in MySQL mode.$user_name: the tenant account. For connection through ODP, the tenant account can be in theusername@tenant name#cluster nameorcluster name:tenant name:usernameformat. For direct connection, the tenant account is in theusername@tenant nameformat.$password: the account password.
For more information about connection strings, see Connect to an OceanBase tenant by using OBClient.
Here is an example:
obclient -hxxx.xxx.xxx.xxx -P2881 -utest_user001@mysql001 -p****** -Dtest
Step 2: Register an account on the LLM platform
Obtain an OpenAI API key.
Log in to the OpenAI platform.
Click API Keys in the upper-right corner.
Click Create API Key.
Specify the required information and click Create API Key.
Specify the API key for the relevant environment variable.
For a Unix-based system such as Ubuntu or macOS, you can run the following command in a terminal:
export OPENAI_API_KEY='your-api-key'For a Windows system, you can run the following command in Command Prompt:
set OPENAI_API_KEY=your-api-key
You must replace your-api-key with the actual OpenAI API key.
Step 3: Store vector data in OceanBase
Store vector data in OceanBase
Prepare test data.
Download the CSV file that already contains the vectorized data. This CSV file includes 1,000 food review entries, and the last column contains the vector values. Therefore, you do not need to calculate the vectors yourself. If you want to recalculate the embeddings for the "embedding" column (the vector column), you can use the following code to generate a new CSV file:
from openai import OpenAI import pandas as pd input_datapath = "./fine_food_reviews.csv" client = OpenAI() # Here the text-embedding-ada-002 model is used. You can change the model as needed. def embedding_text(text, model="text-embedding-ada-002"): # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663. res = client.embeddings.create(input=text, model=model) return res.data[0].embedding df = pd.read_csv(input_datapath, index_col=0) # It takes a few minutes to generate the CSV file by calling the OpenAI Embedding API row by row. df["embedding"] = df.combined.apply(embedding_text) output_datapath = './fine_food_reviews_self_embeddings.csv' df.to_csv(output_datapath)Run the following script to insert the test data into OceanBase. The script must be located in the same directory as the test data.
import os import sys import csv import json from pyobvector import * from sqlalchemy import Column, Integer, String # Connect to OceanBase by using pyobvector and replace the at (@) sign in the username and password with %40, if any. client = ObVecClient(uri="host:port", user="username",password="****",db_name="test") # The test dataset has been vectorized and is stored in the same directory as the Python script by default. If you vectorize the dataset again, specify the new file. file_name = "fine_food_reviews.csv" file_path = os.path.join("./", file_name) # Define columns. The last column is a vector column. cols = [ Column('id', Integer, primary_key=True, autoincrement=False), Column('product_id', String(256), nullable=True), Column('user_id', String(256), nullable=True), Column('score', Integer, nullable=True), Column('summary', String(2048), nullable=True), Column('text', String(8192), nullable=True), Column('combined', String(8192), nullable=True), Column('n_tokens', Integer, nullable=True), Column('embedding', VECTOR(1536)) ] # Define the table name. table_name = 'fine_food_reviews' # If the table does not exist, create it. if not client.check_table_exists(table_name): client.create_table(table_name,columns=cols) # Create an index on the vector column. client.create_index( table_name=table_name, is_vec_index=True, index_name='vidx', column_names=['embedding'], vidx_params='distance=l2, type=hnsw, lib=vsag', ) # Open and read the CSV file. with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile: csvreader = csv.reader(csvfile) # Read the header line. headers = next(csvreader) print("Headers:", headers) batch = [] # Store data by inserting 10 rows into the database each time. for i, row in enumerate(csvreader): # The CSV file contains nine columns: `id`, `product_id`, `user_id`, `score`, `summary`, `text`, `combined`, `n_tokens`, and `embedding`. if not row: break food_review_line= {'id':row[0],'product_id':row[1],'user_id':row[2],'score':row[3],'summary':row[4],'text':row[5],\ 'combined':row[6],'n_tokens':row[7],'embedding':json.loads(row[8])} batch.append(food_review_line) # Insert 10 rows each time. if (i + 1) % 10 == 0: client.insert(table_name,batch) batch = [] # Clear the cache. # Insert the rest rows, if any. if batch: client.insert(table_name,batch) # Check the data in the table and make sure that all data has been inserted. count_sql = f"select count(*) from {table_name};" cursor = client.perform_raw_text_sql(count_sql) result = cursor.fetchone() print(f"Total number of inserted rows:{result[0]}")
Query OceanBase data
Save the following Python script and name it as
openAIQuery.py.import os import sys import csv import json from pyobvector import * from sqlalchemy import func from openai import OpenAI # Obtain command-line options. if len(sys.argv) != 2: print("Enter a query statement." ) sys.exit() queryStatement = sys.argv[1] # Connect to OceanBase by using pyobvector and replace the at (@) sign in the username and password with %40, if any. client = ObVecClient(uri="host:port", user="usename",password="****",db_name="test") openAIclient = OpenAI() # Define the function for generating text vectors. def generate_embeddings(text, model="text-embedding-ada-002"): # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663. res = openAIclient.embeddings.create(input=text, model=model) return res.data[0].embedding def query_ob(query, tableName, vector_name="embedding", top_k=1): embedding = generate_embeddings(query) # Perform an approximate nearest neighbor search (ANNS). res = client.ann_search( table_name=tableName, vec_data=embedding, vec_column_name=vector_name, distance_func=func.l2_distance, topk=top_k, output_column_names=['combined'] ) for row in res: print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": ")) # Specify the table name. table_name = 'fine_food_reviews' query_ob(queryStatement,table_name,'embedding',1)Enter a question for an answer.
python3 openAIQuery.py 'pet food'The expected result is as follows:
Crack for dogs.: These thing are like crack for dogs. I am not sure of the make-up but the doggies sure love them.