Tongyi Qianwen (Qwen) is a large language model developed by Alibaba Cloud. It is used to understand and analyze user inputs. You can use the API of Qwen in the Alibaba Cloud Model Studio of Alibaba Cloud.
Starting from version V4.3.3, OceanBase Database introduces support for vector storage, vector indexing, and embedding vector search.
By leveraging Qwen's APIs, you can store vectorized data in OceanBase Database and utilize its vector search capabilities to efficiently query related data.
Prerequisites
You have installed OceanBase Database V4.3.3 or later and created a MySQL tenant. After you create a tenant, follow the steps below.
You have a MySQL tenant, database, and account in your environment that you can use, and the database account has been granted read and write privileges.
You have installed Python 3.9 or later and the corresponding pip.
You have installed poetry, pyobvector, and the DashScope SDK.
pip install poetry pip install pyobvector pip install dashscopeYou have obtained an API key for Qwen.
You have set the
ob_vector_memory_limit_percentageparameter in the tenant to enable vector search. We recommend that you set the value to30for OceanBase Database versions earlier than V4.3.5 BP3, and to0for V4.3.5 BP3 and later. For more information about this parameter, see ob_vector_memory_limit_percentage.
Step 1: Obtain the connection string of OceanBase Database
Obtain the database connection string from the deployment personnel or administrator of OceanBase Database, for example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
The parameters are described as follows:
$host: the IP address for connecting to OceanBase Database. For connection through OceanBase Database Proxy (ODP), use the IP address of an ODP. For direct connection, use the IP address of an OBServer node.$port: the port for connecting to OceanBase Database. For connection through ODP, the default value is2883, which can be customized when ODP is deployed. For direct connection, the default value is2881, which can be customized when OceanBase Database is deployed.$database_name: the name of the database to be accessed.Notice
The user for connecting to a tenant must have the
CREATE,INSERT,DROP, andSELECTprivileges on the database. For more information about user privileges, see Privilege types in MySQL mode.$user_name: the tenant account. For connection through ODP, the tenant account can be in theusername@tenant name#cluster nameorcluster name:tenant name:usernameformat. For direct connection, the tenant account is in theusername@tenant nameformat.$password: the account password.
For more information about connection strings, see Connect to an OceanBase tenant by using OBClient.
Step 2: Configure the environment variable for the Qwen API key
For a Unix-based system (such as Ubuntu or MacOS), run the following command in the terminal:
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
For Windows, run the following command in the command prompt:
set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
Make sure to replace YOUR_DASHSCOPE_API_KEY with your actual Qwen API key.
Step 3: Store vector data to OceanBase Database
Prepare test data. Download the CSV file that contains pre-computed vectorized data. The CSV file contains 1,000 reviews of food and the last column contains the vectorized values. Therefore, you do not need to compute the vectors again. You can also use the following code to recalculate the embedding column (i.e., the vector column) and generate a new CSV file.
import dashscope import pandas as pd input_datapath = "./fine_food_reviews.csv" # You can change the text_embedding_v1 model to another one as needed. def generate_embeddings(text): rsp = dashscope.TextEmbedding.call(model=TextEmbedding.Models.text_embedding_v1, input=text) embeddings = [record['embedding'] for record in rsp.output['embeddings']] return embeddings if isinstance(text, list) else embeddings[0] df = pd.read_csv(input_datapath, index_col=0) # It takes a few minutes to generate the vectors. This example calls the Qwen Embedding API for each row. df["embedding"] = df.combined.apply(generate_embeddings) output_datapath = './fine_food_reviews_self_embeddings.csv' df.to_csv(output_datapath)Run the following script to insert the test data into OceanBase Database. The directory where the script is located must be the same as that where the test data is located.
import os import sys import csv import json from pyobvector import * from sqlalchemy import Column, Integer, String # Connect to OceanBase Database by using pyobvector. Replace @ in the username and password with %40. client = ObVecClient(uri="host:port", user="username",password="****",db_name="test") # The prepared test dataset has been vectorized. By default, the dataset is stored in the same directory as the Python script. If you have vectorized the dataset yourself, replace the file name with the name of the vectorized dataset. file_name = "fine_food_reviews.csv" file_path = os.path.join("./", file_name) # Define the columns. The vectorized column is the last field. cols = [ Column('id', Integer, primary_key=True, autoincrement=False), Column('product_id', String(256), nullable=True), Column('user_id', String(256), nullable=True), Column('score', Integer, nullable=True), Column('summary', String(2048), nullable=True), Column('text', String(8192), nullable=True), Column('combined', String(8192), nullable=True), Column('n_tokens', Integer, nullable=True), Column('embedding', VECTOR(1536)) ] # Specify the table name. table_name = 'fine_food_reviews' # Create the table if it does not exist. if not client.check_table_exists(table_name): client.create_table(table_name,columns=cols) # Create an index for the vector column. client.create_index( table_name=table_name, is_vec_index=True, index_name='vidx', column_names=['embedding'], vidx_params='distance=l2, type=hnsw, lib=vsag', ) # Open and read the CSV file. with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile: csvreader = csv.reader(csvfile) # Read the header row. headers = next(csvreader) print("Headers:", headers) batch = [] # Store the data. Insert the data to the database in batches of 10 rows. for i, row in enumerate(csvreader): # The CSV file has nine fields: id, product_id, user_id, score, summary, text, combined, n_tokens, and embedding. if not row: break food_review_line= {'id':row[0],'product_id':row[1],'user_id':row[2],'score':row[3],'summary':row[4],'text':row[5],\ 'combined':row[6],'n_tokens':row[7],'embedding':json.loads(row[8])} batch.append(food_review_line) # Insert the data in batches of 10 rows. if (i + 1) % 10 == 0: client.insert(table_name,batch) batch = [] # Clear the cache. # Insert the remaining rows (if any). if batch: client.insert(table_name,batch) # Check the data in the table to make sure that all the data has been inserted. count_sql = f"select count(*) from {table_name};" cursor = client.perform_raw_text_sql(count_sql) result = cursor.fetchone() print(f"Total number of imported rows:{result[0]}")
Step 4: Query OceanBase data
Save the following Python script as
query.py.import os import sys import csv import json from pyobvector import * from sqlalchemy import func import dashscope # Get the command-line parameters. if len(sys.argv) != 2: print("Please enter a query statement.") sys.exit() queryStatement = sys.argv[1] # Connect to OceanBase Database by using pyobvector. Replace @ with %40 in the username or password. client = ObVecClient(uri="host:port", user="usename",password="****",db_name="test") # Define a function to generate text vectors. def generate_embeddings(text): rsp = dashscope.TextEmbedding.call(model=TextEmbedding.Models.text_embedding_v1, input=text) embeddings = [record['embedding'] for record in rsp.output['embeddings']] return embeddings if isinstance(text, list) else embeddings[0] def query_ob(query, tableName, vector_name="embedding", top_k=1): embedding = generate_embeddings(query) # Perform approximate nearest neighbor search. res = client.ann_search( table_name=tableName, vec_data=embedding, vec_column_name=vector_name, distance_func=func.l2_distance, topk=top_k, output_column_names=['combined'] ) for row in res: print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": ")) # Specify the table name. table_name = 'fine_food_reviews' query_ob(queryStatement,table_name,'embedding',1)Enter a question and output the relevant answer.
python3 query.py 'pet food'The expected result is as follows:
This is so good!: I purchased this after my sister sent a small bag to me in a gift box. I loved it so much I wanted to find it to buy for myself and keep it around. I always look on Amazon because you can find everything here and true enough, I found this wonderful candy. It is nice to keep in your purse for when you are out and about and get a dry throat or a tickle in the back of your throat. It is also nice to have in a candy dish at home for guests to try.