GPT4All is an open-source ecosystem designed to enable large language models (LLMs) to run efficiently on consumer-grade hardware. This topic describes how to integrate GPT4All with the vector search feature of OceanBase Database to perform text vectorization, vector storage, and semantic search. By using the Embed4All model in GPT4All to generate text vectors and combining them with the vector index and similarity search capabilities of OceanBase Database, you can build a retrieval-augmented generation (RAG) pipeline and implement AI applications such as similarity search.
Compatibility
| Component | Description |
|---|---|
| OceanBase Database | ≥ V4.3.3 |
Prerequisites
Before you use GPT4All, make sure that:
- You have deployed OceanBase Database and created a MySQL user tenant. For more information, see Create a tenant.
- You have a transactional (MySQL) cluster instance or a shared instance available. If you do not have an available instance, apply for a free trial of OceanBase Cloud. For more information, see Free trial rules and activation method.
- You have a MySQL tenant, a MySQL database, and a MySQL account available, and the database account has read and write permissions. For more information about how to create a MySQL account and a MySQL database, see Create an account and Create a database (MySQL only).
- You have the project administrator or instance administrator permission to read and write to the instance. If you do not have the permission, contact the organization administrator to grant the permission.
- You have installed Python 3.11 or later.
- You have set the ob_vector_memory_limit_percentage parameter to enable vector search. We recommend that you set this parameter to 30. For more information about how to calculate this parameter, see ob_vector_memory_limit_percentage.
Procedure
Step 1: Obtain the connection string of OceanBase Database
Contact the OceanBase Database deployment personnel to obtain the connection string. For example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameter description:
- $host: the IP address for connection. For ODP connection, use the ODP address. For direct connection, use the OBServer IP address.
- $port: the connection port. For ODP connection, the default value is 2883. For direct connection, the default value is 2881.
- $database_name: the name of the database.
  Notice: The user for connecting to the tenant must have the CREATE, INSERT, DROP, and SELECT privileges on the database. For more information about user privileges, see Privilege types in MySQL mode.
- $user_name: the connection account. For ODP connection, the format is user@tenant#cluster or cluster:tenant:user. For direct connection, the format is user@tenant.
- $password: the password of the account.
For more information about the connection string, see Connect to an OceanBase tenant by using OBClient.
Example:
obclient -hxxx.xxx.xxx.xxx -P2881 -utest_user001@mysql001 -p****** -Dtest
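The parameter formats described above can also be assembled in Python before connecting. The following is a minimal sketch, not part of the official procedure: build_connection_args is a hypothetical helper, and the host:port shape of the uri value is an assumption about what ObVecClient expects later in this topic.

```python
def build_connection_args(host, port, user, tenant, password, database, cluster=None):
    """Hypothetical helper: map obclient-style parameters to ObVecClient-style keyword arguments."""
    if cluster:
        # ODP connection: the account format is user@tenant#cluster.
        full_user = f"{user}@{tenant}#{cluster}"
    else:
        # Direct connection: the account format is user@tenant.
        full_user = f"{user}@{tenant}"
    return {
        "uri": f"{host}:{port}",  # Assumption: ObVecClient takes host:port here.
        "user": full_user,
        "password": password,
        "db_name": database,
    }


# Direct connection on the default port 2881, with placeholder values.
args = build_connection_args("127.0.0.1", 2881, "test_user001", "mysql001", "******", "test")
print(args["uri"], args["user"])
```

For an ODP connection, pass the cluster name and the ODP port (2883 by default) instead.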
Step 2: Install the required Python dependencies
Install the required Python dependencies:
python3 -m pip install gpt4all pyobvector sqlalchemy
Step 3: Set environment variables
Store the OceanBase connection information in environment variables:
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
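Because the later code reads these variables with os.getenv, a missing variable silently becomes None and only fails at connection time. The following sketch shows one way to fail early instead. load_oceanbase_settings and REQUIRED_VARS are hypothetical names introduced for illustration, not part of GPT4All or pyobvector.

```python
import os

# The four variables exported in this step.
REQUIRED_VARS = (
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
)


def load_oceanbase_settings(environ=None):
    """Hypothetical helper: return the connection settings or raise if any variable is unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if name not in environ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Demonstration with placeholder values instead of the real environment.
demo_env = {name: "placeholder" for name in REQUIRED_VARS}
settings = load_oceanbase_settings(demo_env)
print(sorted(settings))
```

In a real script, calling load_oceanbase_settings() with no argument checks the process environment.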
Step 4: Enable vector search
Text vectorization
The vector dimension of the default model all-MiniLM-L6-v2 of GPT4All Embed4All is 384. Define a generate_embeddings helper function to call the Embed4All API:

import os
from gpt4all import Embed4All
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance

embedder = Embed4All()

# Step 1. Text Data Vectorization
def generate_embeddings(text: str):
    vec = embedder.embed(text)
    if isinstance(vec[0], (int, float)):
        return list(vec)
    return list(vec[0])

TEXTS = [
    'Vector search stores embeddings in a database column and retrieves nearest neighbors by distance, often as part of a RAG pipeline.',
    'OceanBase is a distributed relational database built for high availability and horizontal scale; it speaks MySQL and Oracle compatible protocols in many deployments.',
    'For analytics and transactions together, OceanBase supports HTAP workloads so operational and reporting queries can share one cluster with strong consistency.',
]

data = []
for text in TEXTS:
    # Generate the embedding for the text via GPT4All Embed4All.
    embedding = generate_embeddings(text)
    data.append({
        'content': text,
        'content_vec': embedding
    })

print(f"Successfully processed {len(data)} texts")

Create a vector table
Create a table named gpt4all_oceanbase_demo_documents that contains the content column for storing text, the content_vec column for storing vectors, and vector index information. Store the vector data in OceanBase:

# Step 2. Connect OceanBase Serverless
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')

client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME,
)

# Step 3. Create the vector table.
table_name = "gpt4all_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(500), nullable=False),
    Column("content_vec", VECTOR(384))
]

# Create vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    metric_type="cosine",
)

client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)

print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data)

Semantic search
Generate a vector for the query text by using the GPT4All API. Then, search for the most relevant document based on the cosine distance between the query vector and each vector in the vector table:
# Step 4. Query the most relevant document based on the query.
query = 'What is OceanBase?'

# Generate the embedding for the query via GPT4All Embed4All.
query_embedding = generate_embeddings(query)

res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use the cosine distance function.
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)

print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f' - ID: {row[0]}\n'
          f' content: {row[1]}\n'
          f' distance: {row[2]}')
Verify the result
After you execute the preceding code, the expected output is as follows:
- ID: 2
content: OceanBase is a distributed relational database built for high availability and horizontal scale; it speaks MySQL and Oracle compatible protocols in many deployments.
distance: 0.23738205432891846
This indicates that the system has successfully found the most relevant document to the query "What is OceanBase?" and calculated the cosine distance. A smaller distance value indicates a higher similarity.
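To make the relationship between distance and similarity concrete, the following sketch computes cosine distance in plain Python and ranks a few toy vectors by it. It assumes the common definition of cosine distance as 1 minus cosine similarity. The cosine_distance_py and rank_by_distance functions and the toy data are hypothetical illustrations, not the implementation used by OceanBase or pyobvector, and this exact brute-force scan is only a reference point for what an approximate index search estimates.

```python
import math


def cosine_distance_py(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, rows, topk=1):
    """Exact scan: score every row and return the topk smallest distances."""
    scored = [(cosine_distance_py(query_vec, row["content_vec"]), row["content"]) for row in rows]
    return sorted(scored, key=lambda pair: pair[0])[:topk]


# Toy two-dimensional vectors; real Embed4All vectors have 384 dimensions.
toy_rows = [
    {"content": "same direction", "content_vec": [2.0, 0.0]},
    {"content": "orthogonal", "content_vec": [0.0, 1.0]},
    {"content": "opposite", "content_vec": [-1.0, 0.0]},
]

best = rank_by_distance([1.0, 0.0], toy_rows, topk=1)
print(best)  # [(0.0, 'same direction')]
```

Under this definition, vectors pointing in the same direction have distance 0, orthogonal vectors have distance 1, and opposite vectors have distance 2, regardless of their lengths.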
