Starting from V4.3.3, OceanBase Database supports vector data storage, vector indexes, and embedding vector search. You can store vectorized data in OceanBase Database for subsequent retrieval and search.
LangChain is a framework for developing applications powered by language models. It enables applications to:
- Be context-aware: Connect language models to context sources (such as prompt instructions, a few examples, or content on which to ground a response).
- Have reasoning capabilities: Rely on language models to perform reasoning (such as determining how to answer based on the provided context, or deciding what actions to take).
This topic describes how to integrate the vector search functionality of OceanBase Database, Qwen, and LangChain to implement document-based question answering.
Prerequisites
You have deployed OceanBase Database V4.3.3 or later and created a MySQL-compatible tenant. For more information, see Create a tenant.
You have a MySQL-compatible tenant, a MySQL database, and an account in your environment, and you have granted the account read and write privileges.
You have installed Python 3.9 or later.
You have installed the dependencies:
python3 -m pip install -U langchain-oceanbase
python3 -m pip install langchain_community
python3 -m pip install dashscope
You have set the ob_vector_memory_limit_percentage parameter in the tenant to enable vector search. We recommend that you set the value to 30 for OceanBase Database versions earlier than V4.3.5 BP3, and to 0 for V4.3.5 BP3 and later. For more information about this parameter, see ob_vector_memory_limit_percentage.
Step 1: Obtain the database connection information
Obtain the database connection string from OceanBase Database deployment personnel or the administrator. For example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
The parameters are described as follows:
- $host: the IP address for connecting to OceanBase Database. For connection through OceanBase Database Proxy (ODP), use the IP address of an ODP. For direct connection, use the IP address of an OBServer node.
- $port: the port for connecting to OceanBase Database. For connection through ODP, the default value is 2883, which can be customized when ODP is deployed. For direct connection, the default value is 2881, which can be customized when OceanBase Database is deployed.
- $database_name: the name of the database to be accessed.
  Notice: The user for connecting to a tenant must have the CREATE, INSERT, DROP, and SELECT privileges on the database. For more information about user privileges, see Privilege types in MySQL mode.
- $user_name: the tenant account. For connection through ODP, the format is username@tenant name#cluster name or cluster name:tenant name:username. For direct connection, the format is username@tenant name.
- $password: the password of the account.
For more information about the connection string, see Connect to an OceanBase tenant by using OBClient.
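The account-name formats described for $user_name can be assembled in code before you fill in the connection settings. The following is a minimal sketch: `format_user_name` is a hypothetical helper written for illustration (it is not part of OBClient or any OceanBase library), and the tenant and cluster names are placeholders.

```python
from typing import Optional


def format_user_name(username: str, tenant: str, cluster: Optional[str] = None) -> str:
    """Build the account string for a connection (hypothetical helper).

    - Direct connection:       username@tenant
    - Connection through ODP:  username@tenant#cluster
    """
    if cluster:
        return f"{username}@{tenant}#{cluster}"
    return f"{username}@{tenant}"


# Placeholder values; replace them with your own account details.
direct_user = format_user_name("root", "mytenant")
odp_user = format_user_name("root", "mytenant", cluster="mycluster")
print(direct_user)
print(odp_user)
```

The sketch covers only the `username@tenant name#cluster name` form for ODP; the alternative `cluster name:tenant name:username` form is not handled.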
Step 2: Build your AI assistant
Set the environment variable for the Qwen API key
Obtain a Qwen API key and set it as an environment variable.
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
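Because the later code reads the key from the environment, it can help to fail early with a clear message when the variable is missing. The following is a minimal sketch; `get_dashscope_key` is a hypothetical helper name, not part of the DashScope SDK or LangChain.

```python
import os


def get_dashscope_key(env=os.environ) -> str:
    """Return the DashScope API key, or raise a clear error if it is unset.

    Hypothetical helper: `env` defaults to the process environment and can be
    replaced with a plain dict for testing.
    """
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set; export it before running the examples."
        )
    return key
```

Calling `get_dashscope_key()` at the top of a script surfaces a missing key before any embedding request is attempted.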
Load and split the document
Download the sample data and split it into chunks of approximately 1,000 characters each using CharacterTextSplitter.
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_oceanbase.vectorstores import OceanbaseVectorStore
import os
import requests
# Read the Qwen (DashScope) API key from the environment.
DASHSCOPE_API = os.environ.get("DASHSCOPE_API_KEY", "")
embeddings = DashScopeEmbeddings(
    model="text-embedding-v1", dashscope_api_key=DASHSCOPE_API
)
# Download the sample document and save it locally as UTF-8.
url = "https://raw.githubusercontent.com/GITHUBear/langchain/refs/heads/master/docs/docs/how_to/state_of_the_union.txt"
res = requests.get(url)
res.raise_for_status()
with open("state_of_the_union.txt", "w", encoding="utf-8") as f:
    f.write(res.text)
# Load the document and split it into chunks of about 1,000 characters.
loader = TextLoader("./state_of_the_union.txt", encoding="utf-8")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
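To show why the chunks are only approximately 1,000 characters long, the following is a simplified sketch of separator-based splitting. It is not the CharacterTextSplitter implementation: `split_into_chunks` is a hypothetical function that assumes a blank-line separator and no overlap, and it merges adjacent paragraphs until adding the next one would exceed the chunk size.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> list[str]:
    """Split text on a separator and merge pieces up to chunk_size characters.

    Simplified illustration: a single piece longer than chunk_size is kept
    whole, so chunk lengths are approximate rather than exact.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Start a new chunk or keep growing the current one.
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Three paragraphs of 600, 300, and 500 characters.
sample = "a" * 600 + "\n\n" + "b" * 300 + "\n\n" + "c" * 500
chunks = split_into_chunks(sample)
print([len(c) for c in chunks])
```

Here the first two paragraphs fit in one chunk together, while the third starts a new chunk because merging it would exceed the limit.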
Insert the data into OceanBase Database
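# Sample connection settings; replace them with the values obtained in Step 1.
connection_args = {
    "host": "127.0.0.1",
    "port": "2881",
    "user": "root@sun",
    "password": "",
    "db_name": "test",
}
DEMO_TABLE_NAME = "demo_ann"
# Create the vector store backed by the demo table.
ob = OceanbaseVectorStore(
    embedding_function=embeddings,
    table_name=DEMO_TABLE_NAME,
    connection_args=connection_args,
    drop_old=True,
    normalize=True,
)
# Embed the chunks and insert them into the table.
res = ob.add_documents(documents=docs)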
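The store above is created with `normalize=True`. Assuming this option scales each embedding to unit length before it is stored (an assumption about the library's behavior, not something this topic states), the operation can be sketched in plain Python. `l2_normalize` is a hypothetical function for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


# A 3-4-5 triangle makes the result easy to check by hand.
unit = l2_normalize([3.0, 4.0])
print(unit)
```

With unit-length vectors, distance-based comparisons depend only on the direction of the embeddings, not on their magnitude.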
Vector search
This step demonstrates how to search the state_of_the_union.txt document for the chunks most relevant to the question "What did the president say about Ketanji Brown Jackson".
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = ob.similarity_search_with_score(query, k=3)
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
The expected output is as follows:
--------------------------------------------------------------------------------
Score: 1.204783671324283
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score: 1.2146663629717394
It is going to transform America and put us on a path to win the economic competition of the 21st Century that we face with the rest of the world—particularly with China.
As I've told Xi Jinping, it is never a good bet to bet against the American people.
We’ll create good jobs for millions of Americans, modernizing roads, airports, ports, and waterways all across America.
And we'll do it all to withstand the devastating effects of the climate crisis and promote environmental justice.
We'll build a national network of 500,000 electric vehicle charging stations, begin to replace poisonous lead pipes—so every child—and every American—has clean water to drink at home and at school, provide affordable high-speed internet for every American—urban, suburban, rural, and tribal communities.
4,000 projects have already been announced.
And tonight, I'm announcing that this year we will start fixing over 65,000 miles of highway and 1,500 bridges in disrepair.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score: 1.2193955178945004
Vice President Harris and I ran for office with a new economic vision for America.
Invest in America. Educate Americans. Grow the workforce. Build the economy from the bottom up
and the middle out, not from the top down.
Because we know that when the middle class grows, the poor have a ladder up and the wealthy do very well.
America used to have the best roads, bridges, and airports on Earth.
Now our infrastructure is ranked 13th in the world.
We won't be able to compete for the jobs of the 21st Century if we don’t fix that.
That's why it was so important to pass the Bipartisan Infrastructure Law—the most sweeping investment to rebuild America in history.
This was a bipartisan effort, and I want to thank the members of both parties who worked to make it happen.
We're done talking about infrastructure weeks.
We're going to have an infrastructure decade.
--------------------------------------------------------------------------------
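In the output above, the scores increase from the first result to the last, which suggests that they are distances: a smaller score means a closer match. The following is a minimal sketch of that ranking, assuming an L2 (Euclidean) distance and an exhaustive scan over toy two-dimensional vectors. The names `l2_distance`, `top_k`, and `toy_vectors` are hypothetical, and a real vector index performs an approximate search rather than a full scan.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, vectors, k=3):
    """Return the k (name, distance) pairs closest to the query, nearest first."""
    scored = [(name, l2_distance(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


# Toy unit-length vectors standing in for document embeddings.
toy_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.6, 0.8],
}
results = top_k([1.0, 0.0], toy_vectors, k=2)
for name, distance in results:
    print(name, distance)
```

In this toy data, `doc_a` matches the query exactly and ranks first, followed by `doc_c`, while `doc_b` is the farthest and is left out of the top two.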