OceanBase Database V4.3.3 and later support vector storage, vector indexing, and embedding vector retrieval. You can store vectorized data in OceanBase Database for subsequent retrieval.
Firecrawl allows developers to crawl high-quality data from any website for building AI applications. It provides advanced web scraping, crawling, and data extraction capabilities, efficiently converting website content into clean Markdown or structured data for downstream AI workflows.
This tutorial demonstrates how to use OceanBase Cloud and Firecrawl to build a retrieval-augmented generation (RAG) pipeline. This pipeline integrates Firecrawl for web data crawling, OceanBase Cloud for vector storage, and Jina AI for generating insightful, context-aware responses.
Prerequisites
A transactional instance is available in your environment. For instructions on how to create the instance, see Create a transactional instance.
You have created a MySQL-compatible tenant in the instance. For instructions on how to create the tenant, see Create a MySQL-compatible tenant.
You have a MySQL database and account available under the tenant, and you have granted read and write permissions to the database account. For more information, see Create an account and Create a database (MySQL only).
You are a project admin or instance admin and have the permissions required to read and write data in the instance. If not, contact your organization admin to grant the required permissions.
You have installed Python 3.11 or later.
You have installed the required dependencies.
python3 -m pip install firecrawl-py pyobvector requests tqdm
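To confirm the installation succeeded, you can optionally try importing the packages. This quick check is not part of the tutorial code:

# Quick sanity check: all four packages should import without errors.
import firecrawl
import pyobvector
import requests
import tqdm
print("All dependencies are importable.")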
Step 1: Obtain the database connection information
Log in to the OceanBase Cloud console.
On the instance list page, expand the information of the target instance.
Select Connect > Get Connection String under the target tenant.
In the pop-up window, select Public Network as the connection method.
Follow the prompts in the pop-up window to obtain the public endpoint and the connection string.
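Optionally, you can verify the connection information before moving on. The following is a minimal sketch using pyobvector's ObVecClient; the placeholder values must be replaced with the public endpoint, database account, password, and database name obtained above, and the SELECT 1 check assumes the perform_raw_text_sql helper available in recent pyobvector versions:

from pyobvector import ObVecClient

# Placeholder values: replace with the connection information from the console.
client = ObVecClient(
    uri="<host>:<port>",        # public endpoint of the instance
    user="<user>@<tenant>",     # database account under the MySQL-compatible tenant
    password="<password>",
    db_name="<database>",
)
# A trivial query to confirm that the connection works (assumes perform_raw_text_sql exists).
client.perform_raw_text_sql("SELECT 1")
print("Connected to OceanBase successfully.")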
Step 2: Build your AI assistant
Use Firecrawl to crawl website content, store it as vectors in OceanBase Database, and then perform a semantic search.
Set environment variables
Obtain the Firecrawl API key and configure it along with the OceanBase connection information in the environment variables.
export OCEANBASE_DATABASE_URL=YOUR_OCEANBASE_DATABASE_URL
export OCEANBASE_DATABASE_USER=YOUR_OCEANBASE_DATABASE_USER
export OCEANBASE_DATABASE_DB_NAME=YOUR_OCEANBASE_DATABASE_DB_NAME
export OCEANBASE_DATABASE_PASSWORD=YOUR_OCEANBASE_DATABASE_PASSWORD
export FIRECRAWL_API_KEY=YOUR_FIRECRAWL_API_KEY
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
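Before running the sample code, you can confirm that all required environment variables are set. This short check is optional:

import os

# Fail fast if any required environment variable is missing.
required = [
    "OCEANBASE_DATABASE_URL", "OCEANBASE_DATABASE_USER",
    "OCEANBASE_DATABASE_DB_NAME", "OCEANBASE_DATABASE_PASSWORD",
    "FIRECRAWL_API_KEY", "JINAAI_API_KEY",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")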
Sample code
Use Firecrawl to crawl the OceanBase Database Overview page
import os
import requests
from firecrawl import FirecrawlApp
from sqlalchemy import Column, Integer, String
from pyobvector import ObVecClient, VECTOR, IndexParam, cosine_distance
from tqdm import tqdm

def split_markdown_content(content):
    return [section.strip() for section in content.split("# ") if section.strip()]
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
# Scrape a website:
scrape_status = app.scrape(
    url="https://en.oceanbase.com/docs/common-oceanbase-database-10000000001970957",
    formats=["markdown"]
)
markdown_content = scrape_status.markdown
# Process the scraped markdown content
sections = split_markdown_content(markdown_content)
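To confirm that the page was scraped and split as expected, you can print a short summary of the resulting sections (optional):

# Each section corresponds to a "# " heading in the scraped Markdown.
print(f"Scraped {len(sections)} sections.")
for i, section in enumerate(sections[:3]):
    print(f"Section {i}: {section[:60]!r}...")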
Obtain Jina AI vectors
Jina AI provides various embedding models. You can choose an appropriate model as needed. Here, we use jina-embeddings-v3 as an example and define a generate_embeddings helper function that calls the Jina AI embedding API.
JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')

def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'  # with dimension 1024.
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    return response.json()['data'][0]['embedding']
data = []
for i, section in enumerate(tqdm(sections, desc="Processing sections")):
    try:
        embedding = generate_embeddings(section)
        truncated_content = section[:4900] if len(section) > 4900 else section
        data.append({"content": truncated_content, "content_vec": embedding})
    except Exception as e:
        print(f"Error processing section {i}: {e}")
        continue
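jina-embeddings-v3 returns 1024-dimensional vectors, which must match the dimension of the VECTOR column defined in the next step. A quick check (optional):

# The embedding dimension must match VECTOR(1024) in the table definition below.
print(f"Prepared {len(data)} rows; embedding dimension: {len(data[0]['content_vec'])}")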
Define the vector table structure and store the vectors in OceanBase
Create a table named firecrawl_oceanbase_demo_documents with columns for storing text (content) and vectors (content_vec), and create a vector index on the vector column:
OCEANBASE_DATABASE_URL = os.getenv('OCEANBASE_DATABASE_URL')
OCEANBASE_DATABASE_USER = os.getenv('OCEANBASE_DATABASE_USER')
OCEANBASE_DATABASE_DB_NAME = os.getenv('OCEANBASE_DATABASE_DB_NAME')
OCEANBASE_DATABASE_PASSWORD = os.getenv('OCEANBASE_DATABASE_PASSWORD')

client = ObVecClient(
    uri=OCEANBASE_DATABASE_URL,
    user=OCEANBASE_DATABASE_USER,
    password=OCEANBASE_DATABASE_PASSWORD,
    db_name=OCEANBASE_DATABASE_DB_NAME
)

table_name = "firecrawl_oceanbase_demo_documents"
client.drop_table_if_exist(table_name)

cols = [
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("content", String(5000), nullable=False),
    Column("content_vec", VECTOR(1024))
]

# Create vector index
vector_index_params = IndexParam(
    index_name="idx_content_vec",
    field_name="content_vec",
    index_type="HNSW",
    distance_metric="cosine"
)

client.create_table_with_index_params(
    table_name=table_name,
    columns=cols,
    vidxs=[vector_index_params]
)
print('- Inserting Data to OceanBase...')
client.insert(table_name, data=data)
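Optionally, verify that the rows were inserted. The check below assumes that your pyobvector version provides the perform_raw_text_sql helper for running raw SQL:

# Count the inserted rows (assumes perform_raw_text_sql is available).
count = client.perform_raw_text_sql(f"SELECT COUNT(*) FROM {table_name}").fetchall()[0][0]
print(f"- Inserted {count} rows into {table_name}.")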
Perform semantic search
Generate the vector for the query text via the Jina AI API. Then, search for the most relevant documents based on the cosine distance between the query vector and each vector in the vector table:
query = 'what is OceanBase High compatibility'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)
res = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,  # Use the cosine distance function
    with_dist=True,
    topk=1,
    output_column_names=["id", "content"],
)
print('- The Most Relevant Document and Its Distance to the Query:')
for row in res.fetchall():
    print(f'- ID: {row[0]}\n'
          f' content: {row[1]}\n'
          f' distance: {row[2]}')
Expected results
- ID: 5
content: High compatibility
OceanBase Database is highly compatible with most general features of Oracle and MySQL, and supports advanced features such as procedural language and triggers. OceanBase Migration Service (OMS), an automatic migration tool, is provided to support migration assessment and reverse synchronization to ensure data migration security when a core system is migrated to OceanBase Database in key industries such as finance, public governance, and communication service.
##
distance: 0.2341035693195166
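To complete the RAG flow described at the beginning of this tutorial, you can feed the retrieved content, together with the question, into a language model of your choice. The sketch below only assembles such a prompt from the search results; the model call itself is omitted because it depends on the provider you use:

# Retrieve the top matching sections again and assemble a context-aware prompt (illustrative only).
rows = client.ann_search(
    table_name,
    vec_data=query_embedding,
    vec_column_name="content_vec",
    distance_func=cosine_distance,
    topk=3,
    output_column_names=["content"],
).fetchall()
context = "\n\n".join(row[0] for row in rows)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
)
print(prompt)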