Integrate OceanBase vector search with OpenAI|V4.3.5| docs|Distributed Database

Integrate OceanBase vector search with OpenAI

Last Updated：2025-11-19 12:43:31 Updated

OpenAI is an artificial intelligence company that has developed several large language models. These models excel at understanding and generating natural language, making them highly effective for tasks such as text generation, answering questions, and engaging in conversations. Access to these models is available through an API.

OceanBase offers features such as vector storage, vector indexing, and embedding-based vector search. By using OpenAI's API, you can convert data into vectors, store these vectors in OceanBase, and then take advantage of OceanBase's vector search capabilities to find relevant data.

Prerequisites

You have installed OceanBase Database V4.3.3 or later and created a MySQL-compatible tenant. After you create a tenant, follow the steps below. For more information, see Create a tenant.
You have a MySQL-compatible tenant, database, and account in your environment, and the database account has read and write privileges.
You have installed Python 3.9 or later and pip.
You have installed Poetry, Pyobvector, and OpenAI SDK. The installation commands are as follows:
```
python3 pip install poetry
python3 pip install pyobvector
python3 pip install openai
```
You have obtained an OpenAI API key.
You have set the ob_vector_memory_limit_percentage parameter in the tenant to enable vector search. We recommend that you set the value to 30 for OceanBase Database versions earlier than V4.3.5 BP3, and to 0 for V4.3.5 BP3 and later. For more information about this parameter, see ob_vector_memory_limit_percentage.

Step 1: Obtain the connection string of OceanBase

Contact the OceanBase deployment engineer or administrator to obtain the connection string of OceanBase, for example:

obclient -h$host -P$port -u$user_name -p$password -D$database_name

Parameters:

$host: the IP address for connecting to OceanBase Database. For connection through OceanBase Database Proxy (ODP), use the IP address of an ODP. For direct connection, use the IP address of an OBServer node.
$port: the port for connecting to OceanBase Database. For connection through ODP, the default value is 2883, which can be customized when ODP is deployed. For direct connection, the default value is 2881, which can be customized when OceanBase Database is deployed.
$database_name: the name of the database to be accessed.

Notice

The user for connecting to a tenant must have the CREATE, INSERT, DROP, and SELECT privileges on the database. For more information about user privileges, see Privilege types in MySQL mode.
$user_name: the tenant account. For connection through ODP, the tenant account can be in the username@tenant name#cluster name or cluster name:tenant name:username format. For direct connection, the tenant account is in the username@tenant name format.
$password: the account password.

For more information about connection strings, see Connect to an OceanBase tenant by using OBClient.

Here is an example:

obclient -hxxx.xxx.xxx.xxx -P2881 -utest_user001@mysql001 -p****** -Dtest

Step 2: Register an account on the LLM platform

Obtain an OpenAI API key.

Log in to the OpenAI platform.
Click API Keys in the upper-right corner.
Click Create API Key.
Specify the required information and click Create API Key.

Specify the API key for the relevant environment variable.

For a Unix-based system such as Ubuntu or macOS, you can run the following command in a terminal:
```
export OPENAI_API_KEY='your-api-key'
```
For a Windows system, you can run the following command in Command Prompt:
```
set OPENAI_API_KEY=your-api-key
```

You must replace your-api-key with the actual OpenAI API key.

Step 3: Store vector data in OceanBase

Store vector data in OceanBase

Prepare test data.

Download the CSV file that already contains the vectorized data. This CSV file includes 1,000 food review entries, and the last column contains the vector values. Therefore, you do not need to calculate the vectors yourself. If you want to recalculate the embeddings for the "embedding" column (the vector column), you can use the following code to generate a new CSV file:

from openai import OpenAI
import pandas as pd
input_datapath = "./fine_food_reviews.csv"
client = OpenAI()
# Here the text-embedding-ada-002 model is used. You can change the model as needed.
def embedding_text(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = client.embeddings.create(input=text, model=model)
    return res.data[0].embedding
df = pd.read_csv(input_datapath, index_col=0)
# It takes a few minutes to generate the CSV file by calling the OpenAI Embedding API row by row.
df["embedding"] = df.combined.apply(embedding_text)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)

Run the following script to insert the test data into OceanBase. The script must be located in the same directory as the test data.

import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import Column, Integer, String
# Connect to OceanBase by using pyobvector and replace the at (@) sign in the username and password with %40, if any.
client = ObVecClient(uri="host:port", user="username",password="****",db_name="test")
# The test dataset has been vectorized and is stored in the same directory as the Python script by default. If you vectorize the dataset again, specify the new file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)
# Define columns. The last column is a vector column.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536))
]
# Define the table name.
table_name = 'fine_food_reviews'
# If the table does not exist, create it.
if not client.check_table_exists(table_name):
    client.create_table(table_name,columns=cols)
    # Create an index on the vector column.
    client.create_index(
        table_name=table_name,
        is_vec_index=True,
        index_name='vidx',
        column_names=['embedding'],
        vidx_params='distance=l2, type=hnsw, lib=vsag',
    )
# Open and read the CSV file.
with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header line.
    headers = next(csvreader)
    print("Headers:", headers)
    batch = [] # Store data by inserting 10 rows into the database each time.
    for i, row in enumerate(csvreader):
        # The CSV file contains nine columns: `id`, `product_id`, `user_id`, `score`, `summary`, `text`, `combined`, `n_tokens`, and `embedding`.
        if not row:
            break
        food_review_line= {'id':row[0],'product_id':row[1],'user_id':row[2],'score':row[3],'summary':row[4],'text':row[5],\
        'combined':row[6],'n_tokens':row[7],'embedding':json.loads(row[8])}
        batch.append(food_review_line)
        # Insert 10 rows each time.
        if (i + 1) % 10 == 0:
            client.insert(table_name,batch)
            batch = []  # Clear the cache.
    # Insert the rest rows, if any.
    if batch:
        client.insert(table_name,batch)
# Check the data in the table and make sure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of inserted rows:{result[0]}")

Query OceanBase data

Save the following Python script and name it as openAIQuery.py.

    import os
    import sys
    import csv
    import json
    from pyobvector import *
    from sqlalchemy import func
    from openai import OpenAI
    # Obtain command-line options.
    if len(sys.argv) != 2:
        print("Enter a query statement." )
        sys.exit()
    queryStatement = sys.argv[1]
    # Connect to OceanBase by using pyobvector and replace the at (@) sign in the username and password with %40, if any.
    client = ObVecClient(uri="host:port", user="usename",password="****",db_name="test")
    openAIclient = OpenAI()
    # Define the function for generating text vectors.
    def generate_embeddings(text, model="text-embedding-ada-002"):
        # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
        res = openAIclient.embeddings.create(input=text, model=model)
        return res.data[0].embedding

    def query_ob(query, tableName, vector_name="embedding", top_k=1):
        embedding = generate_embeddings(query)
        # Perform an approximate nearest neighbor search (ANNS).
        res = client.ann_search(
            table_name=tableName,
            vec_data=embedding,
            vec_column_name=vector_name,
            distance_func=func.l2_distance,
            topk=top_k,
            output_column_names=['combined']
        )
        for row in res:
            print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))
    # Specify the table name.
    table_name = 'fine_food_reviews'
    query_ob(queryStatement,table_name,'embedding',1)

Enter a question for an answer.

python3 openAIQuery.py 'pet food'

The expected result is as follows:

Crack for dogs.: These thing are like crack for dogs. I am not sure of the make-up but the doggies sure love them.