As AI applications scale, most teams hit the same wall: the Frankenstein stack.
To build a modern AI application, you are often forced to stitch together MySQL for metadata, Pinecone or Milvus for vectors, and Elasticsearch for keywords. Then comes the “Glue Code Tax”: hundreds of lines of brittle Python just to sync data, keep systems consistent, and merge query results. The result is fragile, complex, and expensive to operate.
Enter OceanBase seekdb, an AI-native search database that unifies relational, vector, full-text, JSON, and GIS data in a single ACID-compliant, MySQL-compatible engine.
In this post, we will skip the architecture talk and go straight to code. You will go from a blank editor to a fully operational, ACID-compliant hybrid search engine running entirely in user space.
Before diving into the code, it helps to understand what you gain by replacing the stitched stack with a unified engine.
seekdb supports two deployment modes: an embedded mode that runs in-process inside your Python application, and a server mode for standalone or distributed deployments.
In this guide, we use the Embedded mode to get you started. It is the fastest path to trying seekdb: no Docker, no sidecars, no network config—just Python.
Before installation, ensure that your environment meets the following requirements:
Run the following command to install seekdb in embedded mode:

```shell
pip install -U pyseekdb
```

Note: If your pip version is too old, upgrade pip first as prompted (for example, `python -m pip install --upgrade pip`) before installing.
This script demonstrates the “zero infrastructure” path. It initializes the database, ingests text with automatic embedding generation, and runs a semantic search.
Create a file named hello_seekdb.py and execute it:
```python
import pyseekdb

# ==================== Step 1: Create Client Connection ====================
# Start in embedded mode (local seekdb)
client = pyseekdb.Client()

# ==================== Step 2: Create a Collection with an Embedding Function ====================
# A collection is like a table that stores documents with vector embeddings
collection_name = "my_simple_collection"

# Create the collection with the default embedding function, which
# automatically converts documents to embeddings
collection = client.create_collection(
    name=collection_name,
)
print(f"Created collection '{collection_name}' with dimension: {collection.dimension}")
print(f"Embedding function: {collection.embedding_function}")

# ==================== Step 3: Add Data to the Collection ====================
# With an embedding function attached, you can add documents directly,
# without providing embeddings yourself
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text",
]
ids = ["id1", "id2", "id3", "id4", "id5"]

collection.add(
    ids=ids,
    documents=documents,  # embeddings are generated automatically
    metadatas=[
        {"category": "AI", "index": 0},
        {"category": "Programming", "index": 1},
        {"category": "Database", "index": 2},
        {"category": "AI", "index": 3},
        {"category": "NLP", "index": 4},
    ],
)
print(f"\nAdded {len(documents)} documents to collection")
print("Note: Embeddings were automatically generated from documents using the embedding function")

# ==================== Step 4: Query the Collection ====================
# The query text is embedded automatically as well
query_text = "artificial intelligence and machine learning"
results = collection.query(
    query_texts=query_text,  # embedded automatically
    n_results=3,             # return the top 3 most similar documents
)
print(f"\nQuery: '{query_text}'")
print(f"Query results: {len(results['ids'][0])} items found")

# ==================== Step 5: Print Query Results ====================
for i in range(len(results['ids'][0])):
    print(f"\nResult {i + 1}:")
    print(f"  ID: {results['ids'][0][i]}")
    print(f"  Distance: {results['distances'][0][i]:.4f}")
    if results.get('documents'):
        print(f"  Document: {results['documents'][0][i]}")
    if results.get('metadatas'):
        print(f"  Metadata: {results['metadatas'][0][i]}")

# ==================== Step 6: Cleanup ====================
client.delete_collection(collection_name)
print(f"\nDeleted collection '{collection_name}'")
```

Sample output (truncated):

```
Created collection 'my_simple_collection' with dimension: 384
Embedding function: DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')

Added 5 documents to collection
Note: Embeddings were automatically generated from documents using the embedding function

Query: 'artificial intelligence and machine learning'
Query results: 3 items found

Result 1:
  ID: id1
  Distance: 0.3008
  Document: Machine learning is a subset of artificial intelligence
  Metadata: {'index': 0, 'category': 'AI'}

Result 2:
  ID: id4
  Distance: 0.5983
  Document: Neural networks are inspired by the human brain
  Metadata: {'index': 3, 'category': 'AI'}
...
```
Wrap-up:
You now have a working semantic search engine with zero infrastructure setup and just a few lines of logic. pyseekdb handled the embeddings internally, so no external API key or separate model service was needed.
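Under the hood, `query()` embeds the query text and ranks the stored vectors by distance to it; smaller means more similar. This post doesn't specify which metric the default collection uses, so purely as an illustration, here is a cosine-style distance (1 − cosine similarity) over made-up 3-dimensional vectors:

```python
import math

# Toy sketch of how a vector search scores documents, assuming a
# cosine-style distance. Real embeddings come from a model such as
# all-MiniLM-L6-v2; these tiny vectors are invented for illustration.

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
docs = {
    "id1": [1.0, 0.0, 0.0],  # points in nearly the same direction -> small distance
    "id2": [0.0, 1.0, 0.0],  # unrelated direction -> large distance
}

# Rank documents by ascending distance, nearest first
ranked = sorted(docs, key=lambda d: cosine_distance(query, docs[d]))
print(ranked)
```

The same "embed, compare, sort" loop is what the engine runs for you, just over 384-dimensional model outputs and with a proper vector index instead of a linear scan.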
Pure vector search is excellent at understanding intent but weak on exact keywords, numbers, and proper nouns. Full-text search has the opposite tradeoff. Hybrid search combines both signals and ranks the merged result set.
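The merge step in seekdb's hybrid search uses Reciprocal Rank Fusion (RRF), and the idea fits in a few lines: each document earns a score of 1/(k + rank) in every result list it appears in, and the summed scores determine the final order. A minimal sketch, with invented ID lists and the commonly used constant k = 60:

```python
# Toy Reciprocal Rank Fusion (RRF): fuse several ranked ID lists into one.
# A document appearing high in multiple lists accumulates the highest score.

def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of the two searches over the demo corpus
vector_hits = ["id1", "id4", "id5", "id3", "id2"]  # semantic ranking
keyword_hits = ["id1", "id3"]                      # lexical matches for "learning"

print(rrf_fuse([vector_hits, keyword_hits]))
```

Note how id3, ranked fourth by the vector search alone, jumps to second place because the keyword search also found it; documents matched by both signals win.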
seekdb exposes hybrid search in embedded mode through the hybrid_search() method. This uses the same embedded engine as Part 1—no extra services required.
Create hybrid_seekdb.py and execute it:
```python
import pyseekdb
from pprint import pprint

print(">>> Initializing seekdb client...")
# Initialize the embedded engine
client = pyseekdb.Client()
collection_name = "quickstart_demo"

# Idempotency: clean up any previous run
try:
    client.delete_collection(collection_name)
except Exception:
    pass

print(f">>> Creating collection '{collection_name}'...")
collection = client.create_collection(name=collection_name)

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language",
    "Vector databases enable semantic search",
    "Neural networks are inspired by the human brain",
    "Natural language processing helps computers understand text",
]

print(">>> Indexing documents...")
# The engine performs vectorization + inverted-index building here
collection.add(
    ids=["id1", "id2", "id3", "id4", "id5"],
    documents=documents,
    metadatas=[
        {"tag": "ai"}, {"tag": "code"}, {"tag": "db"}, {"tag": "ai"}, {"tag": "ai"},
    ],
)
print(f" - Indexed {len(documents)} documents.")

# ---------------------------------------------------------
# Hybrid Search: Vector('AI') + Keyword('learning') + RRF
# ---------------------------------------------------------
print("\n>>> Executing Hybrid Search...")
print("    Criteria: Near 'artificial intelligence' (Vector) AND contains 'learning' (Text)")
try:
    hybrid_results = collection.hybrid_search(
        query={
            # Full-text condition (lexical match)
            "where_document": {"$contains": "learning"},
            "n_results": 5,
        },
        knn={
            # Semantic condition (vector match)
            "query_texts": ["artificial intelligence"],
            "n_results": 5,
        },
        # Reciprocal Rank Fusion (RRF)
        rank={"rrf": {}},
        n_results=3,
        include=["documents", "metadatas"],
    )
    print("\n--- RRF Re-ranked Results ---")
    pprint(hybrid_results)
except AttributeError:
    print("\n[NOTE] Feature unavailable. Your current pyseekdb version or embedded binary")
    print("       may not support 'hybrid_search'. This is a standard feature in OceanBase Server.")
except Exception as e:
    print(f"\n[ERROR] Hybrid search failed: {e}")

# Cleanup
client.delete_collection(collection_name)
print("\n>>> Cleanup complete.")
```
Sample output:

```
...
 - Indexed 5 documents.

>>> Executing Hybrid Search...
    Criteria: Near 'artificial intelligence' (Vector) AND contains 'learning' (Text)

--- RRF Re-ranked Results ---
{'distances': [[Decimal('0.0328'), Decimal('0.0161'), Decimal('0.0159')]],
 'documents': [['Machine learning is a subset of artificial intelligence',
                'Natural language processing helps computers understand text',
                'Neural networks are inspired by the human brain']],
 'ids': [['id1', 'id5', 'id4']],
 'metadatas': [[{'tag': 'ai'}, {'tag': 'ai'}, {'tag': 'ai'}]]}
```

What this script does:

- Builds both a vector index and a full-text inverted index when `add()` is called
- Runs a lexical condition (`$contains: "learning"`) and a semantic condition ("artificial intelligence") in a single `hybrid_search()` call
- Merges the two candidate lists with Reciprocal Rank Fusion and returns the top 3 re-ranked results
Wrap-up:
You now have a single embedded engine that handles vector search, full-text search, metadata, and RRF-based hybrid ranking in one ACID-compliant, MySQL-compatible store.
You’ve just experienced the core capabilities of OceanBase seekdb using its lightweight embedded mode. It allowed you to build a hybrid search engine in seconds without setting up a server.
Note that the code you just wrote is compatible with Server mode. When your application demands high concurrency, massive storage, or multi-tenant isolation, you can switch to the distributed server deployment without rewriting your query logic:

```python
# Prototype: embedded, in-process
client = pyseekdb.Client()

# Production: server mode, pointing at your deployment
client = pyseekdb.Client(host="...", port=2881)
```

Ready to build?