OceanBase offers features such as vector storage, vector indexing, and embedding-based vector search. You can store vectorized data in OceanBase Database, making it available for fast and efficient search.
Firecrawl enables developers to scrape high-quality data from any website for building AI applications. This tool offers advanced web scraping, crawling, and data extraction capabilities, efficiently transforming website content into clean markup or structured data to meet the needs of downstream AI workflows.
In this tutorial, we will show you how to build a Retrieval-Augmented Generation (RAG) pipeline using OceanBase and Firecrawl. This pipeline integrates Firecrawl for web data scraping, OceanBase for vector storage and search, and Jina AI for generating the text embeddings.
Prerequisites
You have deployed OceanBase Database V4.4.0 or later, and created a MySQL-compatible tenant. After creating the tenant, continue with the steps below.
Your environment includes an active MySQL-compatible tenant, a MySQL database, and a user account with read and write privileges.
Python 3.11 or above is installed.
Required dependencies are installed:
python3 -m pip install firecrawl-py pyobvector requests tqdm
Make sure you have set the ob_vector_memory_limit_percentage parameter in your instance to enable vector search. A recommended value is 30. For details on configuring this parameter, refer to ob_vector_memory_limit_percentage.
Step 1: Get your database connection information
Reach out to your OceanBase administrator or deployment team to obtain the database connection string, for example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
$host: The IP address for connecting to OceanBase Database. If you are using OceanBase Database Proxy (ODP), use the ODP address. For direct connections, use the OBServer node IP.
$port: The port number for connecting to OceanBase Database. The default for ODP is 2883 (can be customized during ODP deployment). For direct connections, the default is 2881 (customizable during OceanBase deployment).
$database_name: The name of the database you want to access.
Notice
The user connecting to the tenant must have CREATE, INSERT, DROP, and SELECT privileges on the database. For more details on user privileges, see privilege types in MySQL-compatible mode.
$user_name: The user account for connecting to the tenant. For ODP, common formats are username@tenant_name#cluster_name or cluster_name:tenant_name:username; for direct connections, use username@tenant_name.
$password: The password for the account.
For more details about connection strings, see Connect to an OceanBase tenant using OBClient.
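The user name formats described above can be sketched as a small helper. This is only an illustration: build_connection_args is a hypothetical name, and the "host:port" form of the uri value is an assumption about what ObVecClient expects, so check it against your pyobvector version.

```python
def build_connection_args(host, port, user_name, tenant_name, cluster_name=None):
    """Return uri and user strings for a connection (hypothetical helper).

    Direct connection: username@tenant_name
    Through ODP:       username@tenant_name#cluster_name
    """
    if not 0 < int(port) < 65536:
        raise ValueError(f"invalid port: {port}")
    user = f"{user_name}@{tenant_name}"
    if cluster_name:
        user = f"{user}#{cluster_name}"
    # Assumption: the client takes the address as "host:port".
    return {"uri": f"{host}:{port}", "user": user}


# Example values only; replace them with your own deployment details.
direct = build_connection_args("127.0.0.1", 2881, "root", "test")
via_odp = build_connection_args("127.0.0.1", 2883, "root", "test", "obcluster")
print(direct)
print(via_odp)
```

The two calls differ only in the port and the optional cluster name, which mirrors the direct and ODP cases in the parameter list.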
Step 2: Build your AI assistant
Use Firecrawl to crawl web pages, then store the content as vectors in OceanBase so that it can be searched.
Set environment variables
Obtain your Firecrawl and Jina AI API keys, then set them as environment variables together with the OceanBase connection information.
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export FIRECRAWL_API_KEY=YOUR_FIRECRAWL_API_KEY
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
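Before running the sample code, it may help to confirm that all six variables are visible to Python. The following is a minimal sketch; missing_env_vars is a hypothetical helper, and it only checks that each variable is set, not that its value is valid.

```python
import os

# The variable names exported above.
REQUIRED_VARS = [
    "OCEANBASE_DATABASE_URL",
    "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME",
    "OCEANBASE_DATABASE_PASSWORD",
    "FIRECRAWL_API_KEY",
    "JINAAI_API_KEY",
]


def missing_env_vars(env=os.environ):
    """Return the required variable names that are not set in `env`."""
    return [name for name in REQUIRED_VARS if env.get(name) is None]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary as env makes the check easy to try without touching the real environment.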
Sample code
Use Firecrawl to crawl OceanBase Database Overview
import os
import requests
from firecrawl import FirecrawlApp
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance
from tqdm import tqdm
def split_markdown_content(content):
    return [section.strip() for section in content.split("# ") if section.strip()]
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
# Scrape a website:
scrape_status = app.scrape(
    url="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001970957",
    formats=["markdown"]
)
markdown_content = scrape_status.markdown
# Process the scraped markdown content
sections = split_markdown_content(markdown_content)
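Splitting on the literal string "# " is simple, but it also cuts inside deeper headings such as "## " and inside ordinary text that happens to contain "# ", and it drops the heading marker from each section. As an alternative, here is a minimal sketch that splits only at lines starting with a markdown heading. The name split_markdown_by_heading is hypothetical and not part of the tutorial code, and the sketch does not skip "#" lines inside fenced code blocks.

```python
import re


def split_markdown_by_heading(content):
    """Split markdown text into sections that each start at a heading line."""
    # Zero-width split before any line that begins with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", content)
    return [part.strip() for part in parts if part.strip()]


sample = (
    "# Overview\nIntro text.\n\n"
    "## High compatibility\nWorks with C# clients too.\n\n"
    "## Other\nMore text."
)
demo_sections = split_markdown_by_heading(sample)
for section in demo_sections:
    print(repr(section))
```

Each section keeps its heading line, which can give the embedding model a little more context about what the text covers.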
Obtain the vector from Jina AI
Jina AI provides various models. You can choose a suitable model based on your needs. For more information, see Model usage. Here, we use jina-embeddings-v3 as an example. We define a generate_embeddings helper function to call the Jina AI API:
JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'  # Returns 1024-dimensional vectors.
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    response.raise_for_status()  # Fail on HTTP errors instead of on a missing 'data' key.
    return response.json()['data'][0]['embedding']
data = []
for i, section in enumerate(tqdm(sections, desc="Processing sections")):
    try:
        embedding = generate_embeddings(section)
        # Keep the stored text within the String(5000) content column.
        truncated_content = section[:4900]
        data.append({"content": truncated_content, "content_vec": embedding})
    except Exception as e:
        print(f"Error processing section {i}: {e}")
        continue
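The loop above sends one request per section. Because the request body passes input as a list, several sections could in principle be sent per request. The sketch below shows only the batching logic; embed_batch is a hypothetical callable standing in for the API call, and it assumes one vector is returned per input text, in input order.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts, embed_batch, batch_size=8):
    """Embed `texts` batch by batch.

    `embed_batch` is a hypothetical callable that takes a list of strings
    and returns one vector per string, in the same order.
    """
    vectors = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors


# Stand-in for a real API call: maps each text to a 1-dimensional "vector".
def fake_embed(batch):
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

Injecting the callable keeps the batching logic testable without network access; a real implementation would wrap the HTTP request in embed_batch.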
Define the vector table structure and store the vector in OceanBase
Create a table named firecrawl_oceanbase_demo_documents that contains a content column for storing text and a content_vec column for storing vectors and vector index information:
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')
client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME,
)
table_name = "firecrawl_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)
cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(5000), nullable=False),
    Column("content_vec", VECTOR(1024))
]
# Create vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)
client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)
print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data)
Perform semantic search
Generate a vector for the query text by using the Jina AI API. Then, search for the most relevant documents based on the cosine distance between the query vector and each vector in the vector table:
query = 'what is OceanBase High compatibility'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)
res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use the cosine distance function.
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)
print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f'- ID: {row[0]}\n'
          f' content: {row[1]}\n'
          f' distance: {row[2]}')
Expected result
- ID: 5
content: High compatibility
OceanBase Database is highly compatible with most general features of Oracle and MySQL, and supports advanced features such as procedural language and triggers. OceanBase Migration Service (OMS), an automatic migration tool, is provided to support migration assessment and reverse synchronization to ensure data migration security when a core system is migrated to OceanBase Database in key industries such as finance, public governance, and communication service.
##
distance: 0.2341035693195166
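To help interpret the distance value, the sketch below computes cosine distance in plain Python, assuming the common definition of 1 minus cosine similarity. Under that definition, 0 means the vectors point in the same direction and larger values mean less similar text. The function name cosine_distance_py is hypothetical and separate from the cosine_distance imported from pyobvector.

```python
import math


def cosine_distance_py(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance_py([1.0, 0.0], [1.0, 0.0])        # same direction
orthogonal = cosine_distance_py([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = cosine_distance_py([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Only the direction of the vectors matters here, not their magnitude, which is why cosine distance is a common choice for comparing text embeddings.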