Stop Stitching, Start Building: Get Started with OceanBase seekdb


As AI applications scale, most teams hit the same wall: the Frankenstein stack.

To build a modern AI application, you are often forced to stitch together MySQL for metadata, Pinecone or Milvus for vectors, and Elasticsearch for keywords. Then comes the “Glue Code Tax”: hundreds of lines of brittle Python just to sync data, keep systems consistent, and merge query results. The result is fragile, complex, and expensive to operate.

Enter OceanBase seekdb, an AI-native search database that unifies relational, vector, full-text, JSON, and GIS data in a single ACID-compliant, MySQL-compatible engine.

In this post, we will skip the architecture talk and go straight to code. You will go from a blank editor to a fully operational, ACID-compliant hybrid search engine running entirely in user space.

Core Advantages of seekdb

Before diving into the code, it helps to understand what you gain by replacing the stitched stack with a unified engine.

  • Hybrid Search: Stop choosing between accuracy and recall. seekdb combines Vector Search (semantic meaning) with Full-Text Search (keyword precision) and Relational Filtering in a single SQL query path.
  • True Multi-Model: seekdb treats vectors, text, JSON, and scalars as a unified whole. You can perform complex joins across relational tables and vector indices without moving data.
  • AI Inside: Move the "brain" closer to the data. seekdb includes built-in functions for embedding generation, reranking, and LLM inference directly within the database engine.
  • SQL + ACID: seekdb uses an OceanBase-derived storage engine with full transactions and durability. It is compatible with native MySQL drivers and tools, and exposes a unified SQL query language for multi-model data. Writes are immediately queryable after commit with full ACID guarantees.

Flexible Deployment

seekdb supports two deployment modes:

  • Embedded Mode: The engine runs in-process inside your Python app as a lightweight library. This fits local agents, tools, and prototypes.
  • Server Mode: The same client APIs can talk to a seekdb server instance, or to a full OceanBase cluster when you need high concurrency and distributed scale. This is the recommended mode for testing and production.

In this guide, we use the Embedded mode to get you started. It is the fastest path to trying seekdb: no Docker, no sidecars, no network config—just Python.

Deploy seekdb in Embedded Mode

Before installation, ensure that your environment meets the following requirements:

  • Operating system: Linux (glibc >= 2.28)
  • Python version: Python 3.11 or later, with pip installed.
  • System architecture: x86_64, aarch64

Run the following command to install seekdb in embedded mode:

pip install -U pyseekdb
Note:
If your pip version is too old, upgrade pip first as prompted before installing.

This script demonstrates the “zero infrastructure” path. It initializes the database, ingests text with automatic embedding generation, and runs a semantic search.

Create a file named hello_seekdb.py and execute it:

import pyseekdb

# ==================== Step 1: Create Client Connection ====================
# Start in embedded mode (local SeekDB)
client = pyseekdb.Client()

# ==================== Step 2: Create a Collection with Embedding Function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"

# Create collection with default embedding function
# The embedding function will automatically convert documents to embeddings
collection = client.create_collection(
    name=collection_name,
)

print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")

# ==================== Step 3: Add Data to Collection ====================
# With embedding function, you can add documents directly without providing embeddings
# The embedding function will automatically generate embeddings from documents

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text"
]

ids = ["id1", "id2", "id3", "id4", "id5"]

# Add data with documents only - embeddings will be auto-generated by embedding function
collection.add(
    ids=ids,
    documents=documents,  # embeddings will be automatically generated
    metadatas=[
        {"category": "AI", "index": 0},
        {"category": "Programming", "index": 1},
        {"category": "Database", "index": 2},
        {"category": "AI", "index": 3},
        {"category": "NLP", "index": 4}
    ]
)

print(f"\nAdded {len(documents)} documents to collection")
print("Note: Embeddings were automatically generated from documents using the embedding function")

# ==================== Step 4: Query the Collection ====================
# The embedding function will automatically convert query text to query vector

# Query using text - query vector will be auto-generated by embedding function
query_text = "artificial intelligence and machine learning"

results = collection.query(
    query_texts=query_text,  # Query text - will be embedded automatically
    n_results=3  # Return top 3 most similar documents
)

print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results['ids'][0])} items found")

# ==================== Step 5: Print Query Results ====================
for i in range(len(results['ids'][0])):
    print(f"\nResult {i+1}:")
    print(f"  ID: {results['ids'][0][i]}")
    print(f"  Distance: {results['distances'][0][i]:.4f}")
    if results.get('documents'):
        print(f"  Document: {results['documents'][0][i]}")
    if results.get('metadatas'):
        print(f"  Metadata: {results['metadatas'][0][i]}")

# ==================== Step 6: Cleanup ====================
# Delete the collection
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")

Sample output:

Created collection 'my_simple_collection' with dimension: 384
Embedding function: DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')

Added 5 documents to collection
Note: Embeddings were automatically generated from documents using the embedding function

Query: 'artificial intelligence and machine learning'
Query results: 3 items found

Result 1:
  ID: id1
  Distance: 0.3008
  Document: Machine learning is a subset of artificial intelligence
  Metadata: {'index': 0, 'category': 'AI'}

Result 2:
  ID: id4
  Distance: 0.5983
  Document: Neural networks are inspired by the human brain
  Metadata: {'index': 3, 'category': 'AI'}

Wrap-up:

You now have a working semantic search engine with zero infrastructure setup and just a few lines of logic. pyseekdb handled the embeddings internally, so no external API key or separate model service was needed.

Pure vector search is excellent at understanding intent but weak on exact keywords, numbers, and proper nouns. Full-text search has the opposite tradeoff. Hybrid search combines both signals and ranks the merged result set.

seekdb exposes hybrid search in embedded mode through the hybrid_search() method. This uses the same embedded engine as the first example; no extra services are required.

Create hybrid_seekdb.py and execute it:

import pyseekdb
from pprint import pprint
import time

print(">>> Initializing SeekDB Client...")

# Initialize Embedded Engine
client = pyseekdb.Client()
collection_name = "quickstart_demo"

# Idempotency: Clean up previous run
try:
    client.delete_collection(collection_name)
except Exception:
    pass

print(f">>> Creating collection '{collection_name}'...")
collection = client.create_collection(name=collection_name)

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text",
]

print(">>> Indexing documents...")
# The engine performs Vectorization + Inverted Index building here
collection.add(
    ids=["id1", "id2", "id3", "id4", "id5"],
    documents=documents,
    metadatas=[
        {"tag": "ai"}, {"tag": "code"}, {"tag": "db"}, {"tag": "ai"}, {"tag": "ai"},
    ],
)
print(f"    - Indexed {len(documents)} documents.")

# ---------------------------------------------------------
# Hybrid Search: Vector('AI') + Keyword('learning') + RRF
# ---------------------------------------------------------
print("\n>>> Executing Hybrid Search...")
print("    Criteria: Near 'artificial intelligence' (Vector) AND contains 'learning' (Text)")

try:
    hybrid_results = collection.hybrid_search(
        query={
            # Full-text condition (lexical match)
            "where_document": {"$contains": "learning"},
            "n_results": 5,
        },
        knn={
            # Semantic condition (vector match)
            "query_texts": ["artificial intelligence"],
            "n_results": 5,
        },
        # Reciprocal Rank Fusion (RRF)
        rank={"rrf": {}},
        n_results=3,
        include=["documents", "metadatas"],
    )

    print("\n--- RRF Re-ranked Results ---")
    pprint(hybrid_results)

except AttributeError:
    print("\n[NOTE] Feature unavailable. Your current pyseekdb version or embedded binary")
    print("       may not support 'hybrid_search'. This is a standard feature in OceanBase Server.")
except Exception as e:
    print(f"\n[ERROR] Hybrid search failed: {e}")

# Cleanup
client.delete_collection(collection_name)
print("\n>>> Cleanup complete.")

Sample output:

...
- Indexed 5 documents.

>>> Executing Hybrid Search...
    Criteria: Near 'artificial intelligence' (Vector) AND contains 'learning' (Text)

--- RRF Re-ranked Results ---
{'distances': [[Decimal('0.0328'), Decimal('0.0161'), Decimal('0.0159')]],
 'documents': [['Machine learning is a subset of artificial intelligence',
                'Natural language processing helps computers understand text',
                'Neural networks are inspired by the human brain']],
 'ids': [['id1', 'id5', 'id4']],
 'metadatas': [[{'tag': 'ai'}, {'tag': 'ai'}, {'tag': 'ai'}]]}

What this script does:

  • The query block: Runs a full-text search on the document body, retrieving candidates that contain the specific word "learning."
  • The knn block: Runs a vector search using the collection’s embedding function on the concept "artificial intelligence."
  • The rank block: Fuses both rankings using Reciprocal Rank Fusion. This algorithm normalizes the scores from the text search and vector search to create a unified ranking without manual weighting.
  • n_results=3: Returns the top three hits after fusion.
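To make the fusion step concrete, here is a minimal, illustrative sketch of Reciprocal Rank Fusion in plain Python. This is not seekdb's internal implementation; the constant k = 60 is the value commonly used in the RRF literature and is an assumption here, as seekdb's default is not documented above.

```python
# Illustrative RRF sketch -- not seekdb's internal implementation.
# Each document scores 1/(k + rank) in every ranking it appears in;
# the fused score is the sum. k (commonly 60) softens the gap
# between the very top ranks and the rest.
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort document IDs by fused score, best first
    return sorted(scores, key=scores.get, reverse=True)

# Rankings mirroring the demo: full-text search for "learning" matches
# only id1; the vector search ranks id1, id5, id4 as semantically closest.
text_hits = ["id1"]
vector_hits = ["id1", "id5", "id4"]
print(rrf_fuse([text_hits, vector_hits]))  # ['id1', 'id5', 'id4']
```

Under these assumptions, id1 scores 2/61 ≈ 0.0328 while id5 and id4 score 1/62 ≈ 0.0161 and 1/63 ≈ 0.0159, which is consistent with the ordering and distances in the sample output above.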

Wrap-up:

You now have a single embedded engine that handles:

  • Combined keyword plus semantic retrieval with hybrid_search().
  • Indexing, storage, and ranking—all inside one process, with no extra services and no glue code tax.

From Prototype to Production

You’ve just experienced the core capabilities of OceanBase seekdb using its lightweight embedded mode. It allowed you to build a hybrid search engine in seconds without setting up a server.

Note that the code you just wrote is compatible with the Server mode. When your application demands high concurrency, massive storage, or multi-tenant isolation, you can switch to the distributed server deployment without rewriting your query logic.

Prototype: client = pyseekdb.Client()
Production: client = pyseekdb.Client(host="...", port=2881)

Ready to build?

