Stop Stitching, Start Building: Get Started with OceanBase seekdb

As AI applications scale, most teams hit the same wall: the Frankenstein stack.

To build a modern AI application, you are often forced to stitch together MySQL for metadata, Pinecone or Milvus for vectors, and Elasticsearch for keywords. Then comes the “Glue Code Tax”: hundreds of lines of brittle Python just to sync data, keep systems consistent, and merge query results. The result is fragile, complex, and expensive to operate.

Enter OceanBase seekdb, an AI-native search database that unifies relational, vector, full-text, JSON, and GIS data in a single ACID-compliant, MySQL-compatible engine.

In this post, we will skip the architecture talk and go straight to code. You will go from a blank editor to a fully operational, ACID-compliant hybrid search engine running entirely in user space.

Core Advantages of seekdb

Before diving into the code, it helps to understand what you gain by replacing the stitched stack with a unified engine.

  • Hybrid Search: Stop choosing between accuracy and recall. seekdb combines Vector Search (semantic meaning) with Full-Text Search (keyword precision) and Relational Filtering in a single SQL query path.
  • True Multi-Model: seekdb treats vectors, text, JSON, and scalars as a unified whole. You can perform complex joins across relational tables and vector indices without moving data.
  • AI Inside: Move the "brain" closer to the data. seekdb includes built-in functions for embedding generation, reranking, and LLM inference directly within the database engine.
  • SQL + ACID: seekdb uses an OceanBase-derived storage engine with full transactions and durability. It is compatible with native MySQL drivers and tools, and exposes a unified SQL query language for multi-model data. Writes are immediately queryable after commit, with full ACID guarantees.

Flexible Deployment

seekdb supports two deployment modes:

  • Embedded Mode: The engine runs in-process inside your Python app as a lightweight library. This fits local agents, tools, and prototypes.
  • Server Mode: The same client APIs can talk to a seekdb server instance, or to a full OceanBase cluster when you need high concurrency and distributed scale. Recommended for both testing and production.

In this guide, we use the Embedded mode to get you started. It is the fastest path to trying seekdb: no Docker, no sidecars, no network config—just Python.

Deploy seekdb in Embedded Mode

Before installation, ensure that your environment meets the following requirements:

  • Operating system: Linux (glibc >= 2.28)
  • Python version: Python 3.11 or later, with pip installed.
  • System architecture: x86_64, aarch64

Run the following command to install seekdb in embedded mode:

pip install -U pyseekdb
Note: If your pip version is outdated, upgrade pip first as prompted, then install.

This script demonstrates the “zero infrastructure” path. It initializes the database, ingests text with automatic embedding generation, and runs a semantic search.

Create a file named hello_seekdb.py, paste in the following script, and execute it:

import pyseekdb

# ==================== Step 1: Create Client Connection ====================
# Start in embedded mode (local SeekDB)
client = pyseekdb.Client()

# ==================== Step 2: Create a Collection with Embedding Function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"

# Create collection with default embedding function
# The embedding function will automatically convert documents to embeddings
collection = client.create_collection(
    name=collection_name,
)

print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")

# ==================== Step 3: Add Data to Collection ====================
# With embedding function, you can add documents directly without providing embeddings
# The embedding function will automatically generate embeddings from documents

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text"
]

ids = ["id1", "id2", "id3", "id4", "id5"]

# Add data with documents only - embeddings will be auto-generated by embedding function
collection.add(
    ids=ids,
    documents=documents,  # embeddings will be automatically generated
    metadatas=[
        {"category": "AI", "index": 0},
        {"category": "Programming", "index": 1},
        {"category": "Database", "index": 2},
        {"category": "AI", "index": 3},
        {"category": "NLP", "index": 4}
    ]
)

print(f"\nAdded {len(documents)} documents to collection")
print("Note: Embeddings were automatically generated from documents using the embedding function")

# ==================== Step 4: Query the Collection ====================
# The embedding function will automatically convert query text to query vector

# Query using text - query vector will be auto-generated by embedding function
query_text = "artificial intelligence and machine learning"

results = collection.query(
    query_texts=query_text,  # Query text - will be embedded automatically
    n_results=3  # Return top 3 most similar documents
)

print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results['ids'][0])} items found")

# ==================== Step 5: Print Query Results ====================
for i in range(len(results['ids'][0])):
    print(f"\nResult {i+1}:")
    print(f"  ID: {results['ids'][0][i]}")
    print(f"  Distance: {results['distances'][0][i]:.4f}")
    if results.get('documents'):
        print(f"  Document: {results['documents'][0][i]}")
    if results.get('metadatas'):
        print(f"  Metadata: {results['metadatas'][0][i]}")

# ==================== Step 6: Cleanup ====================
# Delete the collection
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")

Sample output:

Created collection 'my_simple_collection' with dimension: 384
Embedding function: DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')

Added 5 documents to collection
Note: Embeddings were automatically generated from documents using the embedding function

Query: 'artificial intelligence and machine learning'
Query results: 3 items found

Result 1:
  ID: id1
  Distance: 0.3008
  Document: Machine learning is a subset of artificial intelligence
  Metadata: {'index': 0, 'category': 'AI'}

Result 2:
  ID: id4
  Distance: 0.5983
  Document: Neural networks are inspired by the human brain
  Metadata: {'index': 3, 'category': 'AI'}
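Note the Chroma-style result shape in the script above: every field (`ids`, `distances`, `documents`, `metadatas`) is a list of lists, with one inner list per query. If you prefer working with one dict per hit, a small client-side helper can flatten it. This is an illustrative sketch (the `flatten_results` name is ours, and the mock dict mirrors the sample output rather than a live query):

```python
def flatten_results(results, query_index=0):
    """Flatten Chroma-style nested query results into one dict per hit."""
    fields = ("ids", "distances", "documents", "metadatas")
    # Pick the inner list for the requested query; missing fields become empty
    columns = [results.get(f) or [[]] for f in fields]
    columns = [col[query_index] for col in columns]
    return [dict(zip(fields, row)) for row in zip(*columns)]

# Mock results shaped like the sample output above
results = {
    "ids": [["id1", "id4"]],
    "distances": [[0.3008, 0.5983]],
    "documents": [["Machine learning is a subset of artificial intelligence",
                   "Neural networks are inspired by the human brain"]],
    "metadatas": [[{"category": "AI", "index": 0},
                   {"category": "AI", "index": 3}]],
}

for hit in flatten_results(results):
    print(hit["ids"], hit["distances"], hit["metadatas"]["category"])
```

Each `hit` is now a flat record, which is usually easier to hand off to application code than four parallel nested lists.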

Wrap Up

You now have a working semantic search engine with zero infrastructure setup and just a few lines of logic. pyseekdb handled the embeddings internally, so no external API key or separate model service was needed.

Pure vector search is excellent at understanding intent but weak on exact keywords, numbers, and proper nouns. Full-text search has the opposite tradeoff. Hybrid search combines both signals and ranks the merged result set.
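seekdb performs this merging inside the engine, but as a mental model it helps to see one common fusion strategy in client code: reciprocal rank fusion (RRF), which scores each document by its rank in every ranking and sums the scores. This sketch is illustrative only (the function name and example IDs are ours, and seekdb's internal fusion may differ):

```python
def rrf_merge(vector_ranking, keyword_ranking, k=60):
    """Merge two best-first ID lists with Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) per ranking it appears in;
    k dampens the dominance of the very top ranks.
    """
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search captures intent; keyword search nails exact terms.
vector_hits = ["id1", "id4", "id3"]    # semantic neighbours
keyword_hits = ["id3", "id1", "id5"]   # exact-term matches
print(rrf_merge(vector_hits, keyword_hits))  # → ['id1', 'id3', 'id4', 'id5']
```

Documents that appear in both rankings (`id1`, `id3`) float to the top, which is exactly the behavior you want from hybrid search: agreement between the semantic and keyword signals beats a high rank in either one alone.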

