Background information
In the information explosion era, users often need to quickly retrieve necessary information from massive amounts of data. Efficient retrieval systems are required to quickly locate content of interest in online literature databases, e-commerce product catalogs, and rapidly growing multimedia content libraries. As the amount of data continues to increase, traditional keyword-based search methods cannot meet users' needs for both accuracy and speed. This is where vector search technology comes in. It encodes different types of data, such as text, images, and audio, into mathematical vectors and performs search operations in the vector space. This allows the system to capture the deep semantic information of data and provide more accurate and efficient search results.
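As a toy illustration of the idea, the snippet below scores two documents against a query by cosine similarity over made-up three-dimensional vectors; real embeddings (such as BGE-M3 output) have hundreds or thousands of dimensions.
# Toy illustration of vector search: items are compared by the distance between
# their embedding vectors. The numbers below are invented for demonstration only.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_vector = [0.1, 0.8, 0.3]                     # e.g. "How do I build a vector index?"
documents = {
    "doc_on_vector_indexes": [0.2, 0.7, 0.4],      # semantically close to the query
    "doc_on_billing_rules":  [0.9, 0.1, 0.0],      # unrelated topic
}
for name, vector in documents.items():
    print(name, round(cosine_similarity(query_vector, vector), 3))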
This topic will show you how to build an intelligent document Q&A assistant using OceanBase's vector search capability.
Architecture
The intelligent Q&A assistant stores documents as vectors in an OceanBase database. When a user asks a question through the user interface (UI), the application converts the question into a vector using the BGE-M3 embedding model and retrieves similar vectors from the database. After obtaining the documents that correspond to the similar vectors, the application sends them, together with the user's question, to a large language model (LLM). The LLM then generates a more accurate answer based on the provided documents.
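To make the flow concrete, here is a minimal sketch of the three stages in Python. The function names are illustrative stubs, not the application's actual code.
# Minimal sketch of the retrieval-augmented generation (RAG) flow described above.
# All helper names are illustrative stubs; the real application wires this up differently.

def embed(question: str) -> list[float]:
    """Stub: the real application calls the BGE-M3 embedding model here."""
    return [0.0, 0.0, 0.0]

def search_similar_chunks(query_vector: list[float], top_k: int = 5) -> list[str]:
    """Stub: the real application runs a vector similarity query against OceanBase here."""
    return ["<retrieved document chunk>"]

def ask_llm(question: str, context_chunks: list[str]) -> str:
    """Stub: the real application sends the question plus retrieved chunks to the LLM here."""
    return "<answer grounded in the retrieved chunks>"

def answer(question: str) -> str:
    query_vector = embed(question)                # 1. embed the user question
    chunks = search_similar_chunks(query_vector)  # 2. retrieve similar document chunks from OceanBase
    return ask_llm(question, chunks)              # 3. let the LLM answer using the chunks as context

print(answer("How does OceanBase vector search work?"))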

Prerequisites
- You have deployed OceanBase Database V4.3.3 or later and created a MySQL tenant. For more information about how to deploy an OceanBase cluster, see Deployment overview.
- The MySQL tenant you created has the INSERT and SELECT privileges. For more information about how to configure privileges, see Grant direct privileges.
- You have created a database. For more information about how to create a database, see Create a database.
- The vector search feature is enabled for the database. For more information about the vector search feature, see Perform fast vector search by using SQL.
  obclient> ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;
- You have installed Python 3.9 or later.
- You have installed Poetry.
  python3 -m ensurepip
  python3 -m pip install poetry
Step 1: Register for an LLM platform account
- Register for an account with Alibaba Cloud Model Studio, activate the model service, and obtain an API key.
Notice
- Tongyi Qwen LLM provides a certain amount of free usage. Please monitor your usage during operation, as exceeding the free quota will incur charges.
- This topic uses Tongyi Qwen LLM as an example to demonstrate how to build a Q&A chatbot. You can also use other LLMs. If you do, remember to update the API_KEY, LLM_BASE_URL, and LLM_MODEL fields in the .env file accordingly (see the sketch after this notice for how these fields are typically used).
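The sketch below shows how those three fields are typically consumed through an OpenAI-compatible client. It assumes the openai and python-dotenv packages and an OpenAI-compatible endpoint; the project's actual wiring may differ, and the model name shown is only an example.
# Sketch: how API_KEY, LLM_BASE_URL, and LLM_MODEL are typically consumed.
# Assumes the `openai` and `python-dotenv` packages; the project's actual code may differ.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # read the .env file

client = OpenAI(
    api_key=os.environ["API_KEY"],
    base_url=os.environ["LLM_BASE_URL"],   # the OpenAI-compatible endpoint of your LLM provider
)
resp = client.chat.completions.create(
    model=os.environ["LLM_MODEL"],         # e.g. a Tongyi Qwen model name
    messages=[{"role": "user", "content": "What is OceanBase vector search?"}],
)
print(resp.choices[0].message.content)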



Step 2: Build your AI assistant
Clone the code repository
git clone https://gitee.com/oceanbase-devhub/ai-workshop-2024
cd ai-workshop-2024
Install the dependencies
poetry install
Set environment variables
cp .env.example .env
# If you are using the LLM capabilities provided by Tongyi Qwen, update API_KEY and
# OPENAI_EMBEDDING_API_KEY with the API key you obtained from the Alibaba Cloud Model
# Studio console. Also update the variables starting with DB_ with your database
# connection information, then save the file.
vi .env
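Before moving on, you can optionally sanity-check the file with a small script. This sketch assumes the python-dotenv package and simply lists entries that are still empty:
# Sketch: list .env entries that are still empty before you continue.
# Assumes the python-dotenv package; the variable names come from your own .env file.
from dotenv import dotenv_values

values = dotenv_values(".env")
missing = [name for name, value in values.items() if not value]
print("Entries still empty:", missing or "none")

db_settings = sorted(name for name in values if name.startswith("DB_"))
print("DB_* settings found:", db_settings)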
Connect to the database
You can use the script we have prepared to test the database connection and ensure that the related environment variables are set correctly:
bash utils/connect_db.sh
# If you reach the MySQL prompt, the environment variables are set correctly.
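If you prefer to check the connection from Python instead, the sketch below does roughly what utils/connect_db.sh does. It assumes the pymysql and python-dotenv packages; the DB_* variable names are placeholders that should match the names in your .env file, and the optional SHOW PARAMETERS check uses an OceanBase-specific statement that your tenant user may or may not be allowed to run.
# Sketch: test the database connection from Python, roughly mirroring utils/connect_db.sh.
# Assumes the pymysql and python-dotenv packages; the DB_* names below are placeholders.
import os
import pymysql
from dotenv import load_dotenv

load_dotenv()

conn = pymysql.connect(
    host=os.getenv("DB_HOST", "127.0.0.1"),
    port=int(os.getenv("DB_PORT", "2881")),
    user=os.getenv("DB_USER", "root@test"),      # user@tenant format for OceanBase
    password=os.getenv("DB_PASSWORD", ""),
    database=os.getenv("DB_NAME", "test"),
)
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print("Connected, server version:", cur.fetchone()[0])
    # Optional: confirm the vector memory parameter set in the prerequisites.
    cur.execute("SHOW PARAMETERS LIKE 'ob_vector_memory_limit_percentage'")
    print("ob_vector_memory_limit_percentage rows:", cur.fetchall())
conn.close()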
Prepare document corpus
This step involves cloning OceanBase's open-source documentation repository, processing the documentation, and converting the documents into vector data, which is then stored in an OceanBase database.
Clone and process the documentation repository.
This step involves downloading and processing a large number of OceanBase documents, which will take some time.
git clone --single-branch --branch V4.3.3 https://github.com/oceanbase/oceanbase-doc.git doc_repos/oceanbase-doc
# If your access to GitHub is slow, you can use the following command to clone the Gitee mirror version.
git clone --single-branch --branch V4.3.4 https://gitee.com/oceanbase-devhub/oceanbase-doc.git doc_repos/oceanbase-doc
Standardize the document formatting.
Since some files in OceanBase's documentation use ==== and ---- to indicate first-level and second-level headings, this step converts them to the standard # and ## notation (a short sketch after the command illustrates the idea).
# Convert headings to standard Markdown format.
poetry run python convert_headings.py \
  doc_repos/oceanbase-doc/en-US \
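For intuition, the conversion amounts to rewriting setext-style headings as ATX-style ones. The following is a minimal standalone sketch of that idea, not the project's convert_headings.py:
# Minimal sketch of setext-to-ATX heading conversion (not the project's convert_headings.py).
# A line underlined with ==== becomes "# Heading"; one underlined with ---- becomes "## Heading".
def convert_headings(markdown_text: str) -> str:
    lines = markdown_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        if i + 1 < len(lines) and lines[i].strip() and set(lines[i + 1].strip()) == {"="}:
            out.append("# " + lines[i].strip())
            i += 2
        elif i + 1 < len(lines) and lines[i].strip() and set(lines[i + 1].strip()) == {"-"}:
            out.append("## " + lines[i].strip())
            i += 2
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)

print(convert_headings("Overview\n========\n\nDetails\n-------\ntext"))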
Convert the documents to vectors and insert them into the OceanBase database.
We provide the embed_docs.py script. After you specify the document directory and the corresponding component, it scans all Markdown files in that directory, splits long documents into smaller chunks, converts the chunks into vectors using an embedding model, and inserts the chunk content, the embedded vectors, and the chunk metadata (in JSON format, including the document title, relative path, component name, chunk title, and hierarchical headings) into a single table in OceanBase as reference data. A simplified sketch of this pipeline follows the command below.
To save time, we only process the few documents related to vector search out of the many available OceanBase documents. After you open the chat interface in the next step, your questions about OceanBase's vector search capabilities will therefore receive more accurate answers.
# Generate document vectors and metadata.
poetry run python embed_docs.py --doc_base doc_repos/oceanbase-doc/en-US/640.ob-vector-search
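For reference, the sketch below outlines the chunk, embed, and insert stages in simplified form. It is not the project's embed_docs.py: the chunker is naive, the embedding call assumes the FlagEmbedding package's BGE-M3 model, and insert_chunk is a hypothetical placeholder for the actual write into the OceanBase table.
# Simplified sketch of the chunk -> embed -> insert pipeline (not the real embed_docs.py).
# Assumes the FlagEmbedding package for BGE-M3; insert_chunk is a hypothetical placeholder
# standing in for the INSERT into the OceanBase reference table.
import json
from pathlib import Path
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3")

def split_into_chunks(text: str, max_chars: int = 1500) -> list[str]:
    """Naive chunker: cut the document into fixed-size pieces."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def insert_chunk(content: str, vector: list[float], metadata: dict) -> None:
    """Hypothetical placeholder for writing one row into the OceanBase table."""
    print("would insert chunk:", json.dumps(metadata))

doc_base = Path("doc_repos/oceanbase-doc/en-US/640.ob-vector-search")
for md_file in doc_base.rglob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    chunks = split_into_chunks(text)
    vectors = model.encode(chunks)["dense_vecs"]          # one dense vector per chunk
    for chunk, vector in zip(chunks, vectors):
        metadata = {"doc_path": str(md_file), "component": "oceanbase"}
        insert_chunk(chunk, list(map(float, vector)), metadata)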
Start the UI chat interface
Run the following command to start the chat interface:
poetry run streamlit run --server.runOnSave false chat_ui.py
Access the URL provided in the terminal to open the chatbot application.
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://172.xxx.xxx.xxx:8501
External URL: http://xxx.xxx.xxx.xxx:8501 # This is the URL you can access from your browser
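For orientation, a Streamlit chat page of this kind typically follows the skeleton below. This is an illustrative sketch built on Streamlit's chat widgets, not the project's chat_ui.py, and generate_answer is a hypothetical placeholder for the retrieval-plus-LLM call.
# Minimal sketch of a Streamlit chat page (not the project's chat_ui.py).
# generate_answer is a hypothetical placeholder for the retrieval + LLM call.
import streamlit as st

def generate_answer(question: str) -> str:
    return "Placeholder answer about OceanBase vector search."

st.title("OceanBase document Q&A assistant")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# Read a new question, answer it, and append both turns to the history.
if question := st.chat_input("Ask a question about OceanBase"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.write(question)
    answer = generate_answer(question)
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.write(answer)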
Example
Notice
Since this application is built using OceanBase documentation, please ask questions related to OceanBase.
