Stop Stitching, Start Building: Get Started with OceanBase seekdb

Mike Liu
Published on December 3, 2025
10 minute read
Key Takeaways
  • seekdb is an AI-native search database that unifies relational, vector, full-text, JSON, and GIS data in one MySQL-compatible, ACID-compliant engine — replacing the typical Frankenstein stack of MySQL + Pinecone + Elasticsearch.
  • Two hands-on scripts in this guide: pure vector search (semantic recall in ~10 lines) and hybrid search (vector + full-text + RRF fusion in a single hybrid_search() call), both running in embedded mode with zero infrastructure.
  • The same code runs in production: swap Client() for Client(host="...", port=2881) to move from an in-process prototype to a distributed OceanBase cluster.

As AI applications scale, most teams hit the same wall: the Frankenstein stack.

To build a modern AI application, you are often forced to stitch together MySQL for metadata, Pinecone or Milvus for vectors, and Elasticsearch for keywords. Then comes the “Glue Code Tax”: hundreds of lines of brittle Python just to sync data, keep systems consistent, and merge query results. The result is fragile, complex, and expensive to operate.

Enter OceanBase seekdb, an AI-native search database that unifies relational, vector, full-text, JSON, and GIS data in a single ACID-compliant, MySQL-compatible engine.

In this post, we will skip the architecture talk and go straight to code. You'll go from a blank editor to a fully operational, ACID-compliant hybrid search engine running entirely in user space.

Core Advantages of seekdb

Before diving into the code, it helps to understand what you gain by replacing the stitched stack with a unified engine.

  • Hybrid Search: Stop choosing between accuracy and recall. seekdb combines Vector Search (semantic meaning) with Full-Text Search (keyword precision) and Relational Filtering in a single SQL query path.
  • True Multi-Model: seekdb treats vectors, text, JSON, and scalars as a unified whole. You can perform complex joins across relational tables and vector indices without moving data.
  • AI Inside: Move the "brain" closer to the data. seekdb includes built-in functions for embedding generation, reranking, and LLM inference directly within the database engine.
  • SQL + ACID: seekdb uses an OceanBase-derived storage engine with full transactions and durability. It is compatible with native MySQL drivers and tools, and exposes a unified SQL query language for multi-model data. Writes are immediately queryable after commit with full ACID guarantees.
  • Flexible Deployment

    seekdb supports two deployment modes:

  • Embedded Mode: The engine runs in-process inside your Python app as a lightweight library. This fits local agents, tools, and prototypes.
  • Server Mode: The same client APIs can talk to a seekdb server instance, or to a full OceanBase cluster when you need high concurrency and distributed scale. Recommended for both testing and production.
  • In this guide, we use embedded mode to get you started. It is the fastest path to trying seekdb: no Docker, no sidecars, no network config—just Python.

    Deploy seekdb in Embedded Mode

    Before installation, ensure that your environment meets the following requirements:

  • Operating system: Linux (glibc >= 2.28)
  • Python version: Python 3.11 or later, with pip installed.
  • System architecture: x86_64, aarch64
  • Run the following command to install seekdb in embedded mode:

    pip install -U pyseekdb
    Note:
    If your pip version is outdated, upgrade it first as prompted before installing.

    This script demonstrates the “zero infrastructure” path. It initializes the database, ingests text with automatic embedding generation, and runs a semantic search.

    Create a file named hello_seekdb.py and execute it:

    import pyseekdb

    # ==================== Step 1: Create Client Connection ====================
    # Start in embedded mode (local seekdb)
    client = pyseekdb.Client()

    # ==================== Step 2: Create a Collection with Embedding Function ====================
    # A collection is like a table that stores documents with vector embeddings
    collection_name = "my_simple_collection"

    # Create the collection with the default embedding function
    # The embedding function will automatically convert documents to embeddings
    collection = client.create_collection(
        name=collection_name,
    )
    print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
    print(f"Embedding function: {collection.embedding_function}")

    # ==================== Step 3: Add Data to Collection ====================
    # With an embedding function, you can add documents directly without
    # providing embeddings; they are generated from the documents automatically
    documents = [
        "Machine learning is a subset of artificial intelligence",
        "Python is a popular programming language",
        "Vector databases enable semantic search",
        "Neural networks are inspired by the human brain",
        "Natural language processing helps computers understand text"
    ]
    ids = ["id1", "id2", "id3", "id4", "id5"]

    # Add documents only - embeddings are auto-generated by the embedding function
    collection.add(
        ids=ids,
        documents=documents,
        metadatas=[
            {"category": "AI", "index": 0},
            {"category": "Programming", "index": 1},
            {"category": "Database", "index": 2},
            {"category": "AI", "index": 3},
            {"category": "NLP", "index": 4}
        ]
    )
    print(f"\nAdded {len(documents)} documents to collection")
    print("Note: Embeddings were automatically generated from documents using the embedding function")

    # ==================== Step 4: Query the Collection ====================
    # Query using text - the query vector is auto-generated by the embedding function
    query_text = "artificial intelligence and machine learning"
    results = collection.query(
        query_texts=query_text,  # Query text - will be embedded automatically
        n_results=3              # Return top 3 most similar documents
    )
    print(f"\nQuery: '{query_text}'")
    print(f"Query results: {len(results['ids'][0])} items found")

    # ==================== Step 5: Print Query Results ====================
    for i in range(len(results['ids'][0])):
        print(f"\nResult {i+1}:")
        print(f"  ID: {results['ids'][0][i]}")
        print(f"  Distance: {results['distances'][0][i]:.4f}")
        if results.get('documents'):
            print(f"  Document: {results['documents'][0][i]}")
        if results.get('metadatas'):
            print(f"  Metadata: {results['metadatas'][0][i]}")

    # ==================== Step 6: Cleanup ====================
    # Delete the collection
    client.delete_collection(collection_name)
    print(f"\nDeleted collection '{collection_name}'")

    Sample output:

    >>> Creating collection: my_simple_collection
        - Dimension: 384
        - Embedding function: DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')
    >>> Ingesting data (Auto-Embedding)...
        - Added 5 documents in 5.31s
    >>> Querying for: 'artificial intelligence and machine learning'
        - Found 3 matches.
    Result 1:
      ID: id1
      Distance: 0.3008
      Document: Machine learning is a subset of artificial intelligence
      Metadata: {'index': 0, 'category': 'AI'}
    Result 2:
      ID: id4
      Distance: 0.5983
      Document: Neural networks are inspired by the human brain
      Metadata: {'index': 3, 'category': 'AI'}

    Wrap Up:

    You now have a working semantic search engine with zero infrastructure setup and just a few lines of logic. pyseekdb handled the embeddings internally, so no external API key or separate model service was needed.

    Pure vector search is excellent at understanding intent but weak on exact keywords, numbers, and proper nouns. Full-text search has the opposite tradeoff. Hybrid search combines both signals and ranks the merged result set.
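To make that tradeoff concrete, here is a toy illustration in plain Python (no seekdb involved; the documents, the synonym table, and both helper functions are invented for this sketch — the synonym table merely stands in for a real embedding model):

```python
# Toy illustration (plain Python, no seekdb): keyword search and
# "semantic" search fail in opposite ways. A tiny synonym table
# stands in for a real embedding model here.
DOCS = {
    "d1": "our automobile insurance covers collisions",
    "d2": "error code x-42 means the car battery is dead",
}
SYNONYMS = {"car": {"car", "automobile", "vehicle"}}

def keyword_hits(query):
    """Exact-token match: precise, but blind to synonyms."""
    terms = set(query.lower().split())
    return sorted(d for d, text in DOCS.items() if terms & set(text.split()))

def semantic_hits(query):
    """Synonym expansion: recalls meaning, but gains nothing on
    rare literals such as the error code 'x-42'."""
    terms = set()
    for t in query.lower().split():
        terms |= SYNONYMS.get(t, {t})
    return sorted(d for d, text in DOCS.items() if terms & set(text.split()))

print(keyword_hits("car problems"))   # ['d2'] - misses the 'automobile' doc
print(semantic_hits("car problems"))  # ['d1', 'd2'] - synonym recalled
print(keyword_hits("x-42"))           # ['d2'] - exact match nails the code
```

A hybrid engine runs both signals and then fuses the two rankings, which is exactly what hybrid_search() automates.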

    seekdb exposes hybrid search in embedded mode through the hybrid_search() method. This uses the same embedded engine as the first script—no extra services required.

    Create hybrid_seekdb.py and execute it:

    import pyseekdb
    from pprint import pprint

    print(">>> Initializing seekdb client...")
    # Initialize the embedded engine
    client = pyseekdb.Client()

    collection_name = "quickstart_demo"

    # Idempotency: clean up any previous run
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass

    print(f">>> Creating collection '{collection_name}'...")
    collection = client.create_collection(name=collection_name)

    documents = [
        "Machine learning is a subset of artificial intelligence",
        "Python is a popular programming language",
        "Vector databases enable semantic search",
        "Neural networks are inspired by the human brain",
        "Natural language processing helps computers understand text",
    ]

    print(">>> Indexing documents...")
    # The engine performs vectorization + inverted-index building here
    collection.add(
        ids=["id1", "id2", "id3", "id4", "id5"],
        documents=documents,
        metadatas=[
            {"tag": "ai"}, {"tag": "code"}, {"tag": "db"}, {"tag": "ai"}, {"tag": "ai"},
        ],
    )
    print(f"    - Indexed {len(documents)} documents.")

    # ---------------------------------------------------------
    # Hybrid Search: Vector('AI') + Keyword('learning') + RRF
    # ---------------------------------------------------------
    print("\n>>> Executing Hybrid Search...")
    print("    Criteria: Near 'artificial intelligence' (Vector) AND contains 'learning' (Text)")
    try:
        hybrid_results = collection.hybrid_search(
            query={
                # Full-text condition (lexical match)
                "where_document": {"$contains": "learning"},
                "n_results": 5,
            },
            knn={
                # Semantic condition (vector match)
                "query_texts": ["artificial intelligence"],
                "n_results": 5,
            },
            # Reciprocal Rank Fusion (RRF)
            rank={"rrf": {}},
            n_results=3,
            include=["documents", "metadatas"],
        )
        print("\n--- RRF Re-ranked Results ---")
        pprint(hybrid_results)
    except AttributeError:
        print("\n[NOTE] Feature unavailable. Your current pyseekdb version or embedded binary")
        print("       may not support 'hybrid_search'. This is a standard feature in OceanBase Server.")
    except Exception as e:
        print(f"\n[ERROR] Hybrid search failed: {e}")

    # Cleanup
    client.delete_collection(collection_name)
    print("\n>>> Cleanup complete.")

    Sample output:

    ...
        - Indexed 5 documents.

    >>> Executing Hybrid Search...
        Criteria: Near 'artificial intelligence' (Vector) AND contains 'learning' (Text)

    --- RRF Re-ranked Results ---
    {'distances': [[Decimal('0.0328'), Decimal('0.0161'), Decimal('0.0159')]],
     'documents': [['Machine learning is a subset of artificial intelligence',
                    'Natural language processing helps computers understand text',
                    'Neural networks are inspired by the human brain']],
     'ids': [['id1', 'id5', 'id4']],
     'metadatas': [[{'tag': 'ai'}, {'tag': 'ai'}, {'tag': 'ai'}]]}

    What this script does:

  • The query block: Runs a full-text search on the document body, retrieving candidates that contain the specific word "learning."
  • The knn block: Runs a vector search using the collection’s embedding function on the concept "artificial intelligence."
  • The rank block: Fuses both rankings using Reciprocal Rank Fusion. This algorithm normalizes the scores from the text search and vector search to create a unified ranking without manual weighting.
  • n_results=3: Returns the top three hits after fusion.
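The fusion step itself is simple enough to sketch. In RRF, a document's fused score is the sum of 1/(k + rank) over every result list it appears in, where k is a smoothing constant (60 is a common default). The code below is our own illustration, not seekdb's internal implementation, and the two candidate lists are assumed inputs modeled on the demo:

```python
# Reciprocal Rank Fusion sketched in plain Python (illustration only,
# not seekdb's internal code).
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

vector_hits = ["id1", "id5", "id4"]  # assumed semantic ranking for 'artificial intelligence'
fulltext_hits = ["id1"]              # only id1 literally contains 'learning'

for doc_id, score in rrf_fuse([vector_hits, fulltext_hits]):
    print(f"{doc_id}: {score:.4f}")
# id1: 0.0328  (rank 1 in both lists: 1/61 + 1/61)
# id5: 0.0161  (vector-only, rank 2: 1/62)
# id4: 0.0159  (vector-only, rank 3: 1/63)
```

With k=60 these toy scores happen to line up with the Decimal distances in the sample output above, which suggests (but does not confirm) a similar constant inside seekdb. The appeal of RRF is that it works purely on ranks, so the incomparable raw scores of vector and full-text search never need manual weighting.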
    Wrap-up:

    You now have a single embedded engine that handles:

  • Combined keyword plus semantic retrieval with hybrid_search().
  • Indexing, storage, and ranking—all inside one process, with no extra services and no glue code tax.
    From Prototype to Production

    You’ve just experienced the core capabilities of OceanBase seekdb using its lightweight embedded mode. It allowed you to build a hybrid search engine in seconds without setting up a server.

    Note that the code you just wrote is compatible with the Server mode. When your application demands high concurrency, massive storage, or multi-tenant isolation, you can switch to the distributed server deployment without rewriting your query logic.

    Prototype:  client = pyseekdb.Client()
    Production: client = pyseekdb.Client(host="...", port=2881)
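If you prefer to keep that switch in configuration rather than code, one pattern is to build the Client keyword arguments from the environment. This is purely our own convention — the SEEKDB_HOST and SEEKDB_PORT variable names are invented for this sketch and are not read by pyseekdb itself:

```python
import os

# Sketch: choose embedded vs. server mode from the environment.
# SEEKDB_HOST / SEEKDB_PORT are our own naming convention, not
# variables that pyseekdb reads on its own.
def client_kwargs(env=os.environ):
    host = env.get("SEEKDB_HOST")
    if host is None:
        return {}  # embedded mode: pyseekdb.Client()
    # server mode: pyseekdb.Client(host=..., port=...)
    return {"host": host, "port": int(env.get("SEEKDB_PORT", "2881"))}

# client = pyseekdb.Client(**client_kwargs())
print(client_kwargs({}))                           # {}
print(client_kwargs({"SEEKDB_HOST": "10.0.0.5"}))  # {'host': '10.0.0.5', 'port': 2881}
```

The rest of your collection and query code stays byte-for-byte identical in both modes.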

    Ready to build?

  • Start Building: pip install pyseekdb
  • Follow the Repo: https://github.com/oceanbase/seekdb
  • Learn more in the doc: https://www.oceanbase.ai/docs/seekdb-overview
