OceanBase Database V4.3.3 and later support vector storage, vector indexing, and embedding vector retrieval. You can store vectorized data in OceanBase Database for subsequent retrieval.
Firecrawl allows developers to crawl high-quality data from any website for building AI applications. It provides advanced web scraping, crawling, and data extraction capabilities, efficiently converting website content into clean Markdown or structured data for downstream AI workflows.
This tutorial demonstrates how to use OceanBase Cloud and Firecrawl to build a retrieval-augmented generation (RAG) pipeline. This pipeline integrates Firecrawl for web data crawling, OceanBase Cloud for vector storage, and Jina AI for generating insightful, context-aware responses.
Prerequisites
A transactional instance is available in your environment. For instructions on how to create the instance, see Create a transactional instance.
You have created a MySQL-compatible tenant in the instance. For instructions on how to create the tenant, see Create a MySQL-compatible tenant.
You have a MySQL database and account available under the tenant, and you have granted read and write permissions to the database account. For more information, see Create an account and Create a database (MySQL only).
You are a project admin or instance admin and have the permissions required to read and write data in the instance. If not, contact your organization admin to grant the required permissions.
You have installed Python 3.11 or later.
You have installed the required dependencies.
python3 -m pip install firecrawl-py pyobvector requests tqdm
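To confirm the installation succeeded, you can optionally try importing the packages. This quick check is not part of the tutorial code:

# Quick sanity check: all four packages should import without errors.
import firecrawl
import pyobvector
import requests
import tqdm
print("All dependencies are importable.")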
Step 1: Obtain the database connection information
Log in to the OceanBase Cloud console.
On the instance list page, expand the information of the target instance.
Select Connect > Get Connection String under the target tenant.
In the pop-up window, select Public Network as the connection method.
Follow the prompts in the pop-up window to obtain the public endpoint and the connection string.
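Optionally, you can verify the connection information before moving on. The following is a minimal sketch using pyobvector's ObVecClient; the placeholder values must be replaced with the public endpoint, database account, password, and database name obtained above, and the SELECT 1 check assumes the perform_raw_text_sql helper available in recent pyobvector versions:

from pyobvector import ObVecClient

# Placeholder values: replace with the connection information from the console.
client = ObVecClient(
    uri="<host>:<port>",        # public endpoint of the instance
    user="<user>@<tenant>",     # database account under the MySQL-compatible tenant
    password="<password>",
    db_name="<database>",
)
# A trivial query to confirm that the connection works (assumes perform_raw_text_sql exists).
client.perform_raw_text_sql("SELECT 1")
print("Connected to OceanBase successfully.")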
Step 2: Build your AI assistant
Use Firecrawl to crawl website content, store it as vectors in OceanBase Database, and then perform a semantic search.
Set environment variables
Obtain the Firecrawl API key and configure it along with the OceanBase connection information in the environment variables.
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export FIRECRAWL_API_KEY=YOUR_FIRECRAWL_API_KEY
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
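Before running the sample code, you can confirm that all required environment variables are set. This short check is optional:

import os

# Fail fast if any required environment variable is missing.
required = [
    "OCEANBASE_DATABASE_URL", "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME", "OCEANBASE_DATABASE_PASSWORD",
    "FIRECRAWL_API_KEY", "JINAAI_API_KEY",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")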
Sample code
Use Firecrawl to crawl the OceanBase Database Overview page
import os
import requests
from firecrawl import FirecrawlApp
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance
from tqdm import tqdm

def split_markdown_content(content):
    return [section.strip() for section in content.split("# ") if section.strip()]
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
# Scrape a website:
scrape_status = app.scrape(
    url="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001970957",
    formats=["markdown"]
)
markdown_content = scrape_status.markdown
# Process the scraped markdown content
sections = split_markdown_content(markdown_content)
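To confirm that the page was scraped and split as expected, you can print a short summary of the resulting sections (optional):

# Each section corresponds to a "# " heading in the scraped Markdown.
print(f"Scraped {len(sections)} sections.")
for i, section in enumerate(sections[:3]):
    print(f"Section {i}: {section[:60]!r}...")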
Obtain Jina AI vectors
Jina AI provides various embedding models. You can choose an appropriate model as needed. Here, we use jina-embeddings-v3 as an example and define a generate_embeddings helper function that calls the Jina AI embedding API.
JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')

def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'  # with dimension 1024.
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    return response.json()['data'][0]['embedding']
data = []
for i, section in enumerate(tqdm(sections, desc="Processing sections")):
    try:
        embedding = generate_embeddings(section)
        truncated_content = section[:4900] if len(section) > 4900 else section
        data.append({"content": truncated_content, "content_vec": embedding})
    except Exception as e:
        print(f"Error processing section {i}: {e}")
        continue
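jina-embeddings-v3 returns 1024-dimensional vectors, which must match the dimension of the VECTOR column defined in the next step. A quick check (optional):

# The embedding dimension must match VECTOR(1024) in the table definition below.
print(f"Prepared {len(data)} rows; embedding dimension: {len(data[0]['content_vec'])}")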
Define the vector table structure and store the vectors in OceanBase
Create a table named firecrawl_oceanbase_demo_documents with columns for storing text (content) and vectors (content_vec), and create a vector index on the vector column:
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')

client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME
)

table_name = "firecrawl_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(5000), nullable=False),
    Column("content_vec", VECTOR(1024))
]

# Create vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)

client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)
print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data)
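Optionally, verify that the rows were inserted. The check below assumes that your pyobvector version provides the perform_raw_text_sql helper for running raw SQL:

# Count the inserted rows (assumes perform_raw_text_sql is available).
count = client.perform_raw_text_sql(f"SELECT COUNT(*) FROM {table_name}").fetchall()[0][0]
print(f"- Inserted {count} rows into {table_name}.")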
Perform semantic search
Generate the vector for the query text via the Jina AI API. Then, search for the most relevant documents based on the cosine distance between the query vector and each vector in the vector table:
query = 'what is OceanBase High compatibility'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)
res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use the cosine distance function
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)
print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f'- ID: {row[0]}\n'
          f' content: {row[1]}\n'
          f' distance: {row[2]}')
Expected results
- ID: 5
content: High compatibility
OceanBase Database is highly compatible with most general features of Oracle and MySQL, and supports advanced features such as procedural language and triggers. OceanBase Migration Service (OMS), an automatic migration tool, is provided to support migration assessment and reverse synchronization to ensure data migration security when a core system is migrated to OceanBase Database in key industries such as finance, public governance, and communication service.
##
distance: 0.2341035693195166
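To complete the RAG flow described at the beginning of this tutorial, you can feed the retrieved content, together with the question, into a language model of your choice. The sketch below only assembles such a prompt from the search results; the model call itself is omitted because it depends on the provider you use:

# Retrieve the top matching sections again and assemble a context-aware prompt (illustrative only).
rows = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,
    topk=3,
    output_column_names=["content"],
).fetchall()
context = "\n\n".join(row[0] for row in rows)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
)
print(prompt)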