seekdb hit GitHub's C++ trending list today. It was also recently recommended by the GitHub Projects Community on X. We started seeing developers in the comments asking about hybrid search, streaming QPS, and the Fork/Merge sandbox — so it felt like a good time to write up what's actually under the hood.
Here's what seekdb is, and the architecture decisions behind it.
seekdb is a MySQL-compatible state store purpose-built for AI agents. It combines vector search, full-text search, and scalar filtering in a single SQL query — no stitching across systems. It ships with kernel-level Copy-on-Write sandboxes (Fork → Diff → Merge) for safe agent exploration, full ACID transactions, and works as both an embedded library and a standalone server. LangChain, LlamaIndex, Dify, and Coze connect out of the box via MySQL protocol.
But the feature list doesn't explain why developers are interested. The workload does.
An AI agent isn't a human user running queries one at a time. It's a loop — observing, reasoning, acting, writing results back, and immediately retrieving again for the next step:
for step in agent.run():
    memory.write(step.observation)          # continuous writes
    relevant = memory.search(step.query)    # millisecond-later reads
This loop creates a workload that's fundamentally different from traditional database usage. Writes and reads are interleaved at millisecond intervals. Retrievals are multi-modal, combining semantic similarity, keyword matching, and structured filters in a single request. And the agent often needs to explore speculatively: try an action, inspect the result, and decide whether to keep it or throw it away.
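These characteristics translate into three concrete requirements for the database underneath: hybrid retrieval in a single request, stable read latency while writes keep arriving, and a cheap way to try a change and discard it.
These three requirements shaped the architecture of seekdb.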
The most common question we saw in the comments: how does hybrid search actually work here?
Agent retrieval is rarely pure vector similarity. A typical query looks like: show me the top 10 documents authored by user 42 since January, matching "quarterly report", ranked by embedding similarity.
In most architectures, that requires three round trips — one to the vector index, one to the full-text engine, one to the relational store — then client-side merging and re-ranking. Each hop adds latency and each system has its own consistency model.
In seekdb, that's one SQL statement:
SELECT id, title, l2_distance(emb, '[0.12,0.34,...]') AS dist
FROM docs
WHERE MATCH(content) AGAINST ('quarterly report')
AND author_id = 42
AND created_at > '2026-01-01'
ORDER BY dist APPROXIMATE LIMIT 10;
Vector distance, full-text matching, and scalar filters are pushed down into a single execution plan. No client-side merging, no multiple round trips, no consistency gaps between systems.
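To make the semantics of that statement concrete, here is a brute-force Python sketch of what the query computes: filter by keyword and scalar predicates, then rank the survivors by L2 distance. It is not how seekdb executes the plan. The `hybrid_search` function, the sample rows, and the "all words must appear" keyword rule are assumptions made for illustration; real full-text matching uses BM25 ranking and the vector search is approximate.

```python
import math
from datetime import date


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_search(docs, query_vec, phrase, author_id, since, k=10):
    """Filter by keyword and scalar predicates, then rank by vector distance."""
    words = phrase.lower().split()
    hits = [
        d for d in docs
        if all(w in d["content"].lower() for w in words)  # stand-in for MATCH ... AGAINST
        and d["author_id"] == author_id                    # scalar filter
        and d["created_at"] > since                        # temporal filter
    ]
    ranked = sorted(hits, key=lambda d: l2_distance(d["emb"], query_vec))
    return [(d["id"], d["title"]) for d in ranked[:k]]


# Hypothetical rows: only ids 1 and 3 satisfy every predicate.
docs = [
    {"id": 1, "title": "Q1 report", "content": "Quarterly report for Q1",
     "author_id": 42, "created_at": date(2026, 2, 1), "emb": [0.10, 0.30]},
    {"id": 2, "title": "Old report", "content": "Quarterly report archive",
     "author_id": 42, "created_at": date(2025, 6, 1), "emb": [0.12, 0.34]},
    {"id": 3, "title": "Q2 report", "content": "quarterly report draft",
     "author_id": 42, "created_at": date(2026, 3, 1), "emb": [0.12, 0.33]},
    {"id": 4, "title": "Notes", "content": "meeting notes",
     "author_id": 42, "created_at": date(2026, 3, 2), "emb": [0.12, 0.34]},
]

results = hybrid_search(docs, [0.12, 0.34], "quarterly report", 42, date(2026, 1, 1))
print(results)
```

Row 2 is the closest vector but fails the date filter, and row 4 fails the keyword filter, which is the kind of interaction a client has to reconcile by hand when the three predicates live in separate systems.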
This works because seekdb isn't a vector database with relational features bolted on, or a relational database with a vector plugin. The storage engine natively maintains vector indexes (HNSW, IVF), inverted indexes (full-text with BM25 ranking, CJK tokenizers), and B-tree indexes (scalar) — and the query optimizer can combine them in a single plan. DML operations update all index types transactionally, so query results are always consistent with the latest committed state.
Why does this matter for agents specifically? Because an agent's retrieval context is almost never a single modality. When a coding agent searches its memory for "the authentication module we discussed last Tuesday," that's simultaneously a semantic query (authentication concept), a keyword query ("authentication module"), and a temporal filter (last Tuesday). Forcing the agent framework to orchestrate three separate calls, deduplicate, and re-rank adds 50–200ms of overhead per retrieval — overhead that compounds across the dozens of retrieval calls in a single agent run.
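As a rough worked example of how that overhead compounds, the snippet below multiplies the 50–200 ms range by a retrieval count. The figure of 24 calls per run is an assumption chosen to stand in for "dozens"; substitute your own agent's number.

```python
def orchestration_overhead_ms(calls, per_call_ms=(50, 200)):
    """Total client-side orchestration overhead for one agent run, as (low, high)."""
    low, high = per_call_ms
    return calls * low, calls * high


# 24 retrieval calls per run is a hypothetical figure, not a measurement.
low_ms, high_ms = orchestration_overhead_ms(calls=24)
print(f"{low_ms / 1000:.1f}s to {high_ms / 1000:.1f}s of overhead per agent run")
```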
With seekdb, the agent framework issues one SQL query. The database handles the rest internally, returning a single ranked result set in one network round trip.
seekdb speaks the MySQL wire protocol natively. LangChain, LlamaIndex, Dify, Coze, and any MySQL client connect without an adapter or custom driver. Your existing MySQL tooling — ORMs, migration scripts, monitoring — works unchanged. If your agent framework can talk to MySQL, it can talk to seekdb.
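Because the interface is plain SQL over the MySQL protocol, a client can build the hybrid query as an ordinary parameterized statement. The helper below is a sketch: `build_hybrid_query` is a hypothetical name, the `docs` table and its columns come from the example earlier in this post, and the `%s` placeholder style is the one used by common Python MySQL drivers. Whether seekdb accepts a parameter in every position shown is an assumption to check against its documentation.

```python
def build_hybrid_query(query_vec, phrase, author_id, since, limit=10):
    """Return (sql, params) for a DB-API driver that uses %s placeholders."""
    # Serialize the embedding in the bracketed text form used in the example query.
    vec_literal = "[" + ",".join(f"{x:g}" for x in query_vec) + "]"
    sql = (
        "SELECT id, title, l2_distance(emb, %s) AS dist "
        "FROM docs "
        "WHERE MATCH(content) AGAINST (%s) "
        "AND author_id = %s "
        "AND created_at > %s "
        "ORDER BY dist APPROXIMATE LIMIT %s"
    )
    return sql, (vec_literal, phrase, author_id, since, int(limit))


sql, params = build_hybrid_query([0.12, 0.34], "quarterly report", 42, "2026-01-01")
print(sql)
print(params)
```

With a driver such as PyMySQL, the pair would be passed to `cursor.execute(sql, params)` on a connection opened against the seekdb endpoint.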
The second thing people noticed in the README: streaming benchmark numbers. Here's what's behind them.
Traditional vector databases such as Milvus, Elasticsearch, and Qdrant perform well in the workloads they were designed for: bulk ingestion followed by read-only queries.
But streaming writes expose a structural assumption baked into all of them: every batch of new data produces a new index segment. At query time, the engine fans the request out to N segments, runs a k-NN search against each one, and merges the results. With a single query thread, that's manageable. But once you run M concurrent query threads against N segments, you have N×M units of work contending for CPU.
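The N×M argument can be written down as a simple counting model. This is only a count of per-segment searches in flight, not a latency prediction, and the segment and thread counts below are illustrative values rather than measurements from any system.

```python
def knn_tasks(segments, query_threads):
    """Per-segment k-NN searches in flight when every query fans out to every segment."""
    return segments * query_threads


# Illustrative numbers: 16 concurrent query threads, growing segment count.
for segments in (2, 10, 50):
    print(f"{segments:>3} segments x 16 threads = {knn_tasks(segments, 16)} searches")
```

With a fixed thread count, the work contending for CPU grows linearly with the number of segments, so a system whose segment count grows with ingested data sees its contention grow with it.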
P99 doesn't degrade gracefully. It explodes.
This isn't a theoretical concern. We hit it ourselves. seekdb v1.2.0 used a conventional approach — synchronous index building on the write path, growing segment count over time. Under the same streaming benchmark we'll show below, it managed 69 QPS with a concurrent P99 of 410ms. That was unacceptable for agent workloads where retrieval latency directly gates the agent's next reasoning step.
seekdb v1.3.0 introduced two mechanisms specifically for streaming workloads:
The write path never touches the index. When a transaction commits, all that happens synchronously is a write to the redo log. A separate Change Stream pipeline asynchronously consumes the log in the background and applies vectors to an in-memory delta HNSW index. Writes and index construction are physically decoupled — writes never block on indexing, and indexing never blocks on writes.
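The decoupling described above can be sketched with a queue and a background thread. This is a toy model under stated assumptions, not seekdb's implementation: `StreamingStore` is a hypothetical class, a `queue.Queue` stands in for the redo log, and a plain dict stands in for the in-memory delta HNSW.

```python
import queue
import threading


class StreamingStore:
    def __init__(self):
        self.redo_log = queue.Queue()   # stand-in for the durable redo log
        self.delta_index = {}           # stand-in for the in-memory delta HNSW
        self._worker = threading.Thread(target=self._consume, daemon=True)
        self._worker.start()

    def commit(self, row_id, vector):
        """Write path: append to the log and return. No index work happens here."""
        self.redo_log.put((row_id, vector))

    def _consume(self):
        """Background pipeline: apply logged vectors to the delta index."""
        while True:
            row_id, vector = self.redo_log.get()
            self.delta_index[row_id] = vector
            self.redo_log.task_done()

    def wait_until_indexed(self):
        """Block until every logged write has been applied (used here for the demo)."""
        self.redo_log.join()


store = StreamingStore()
for i in range(1000):
    store.commit(i, [float(i), 0.0])    # returns immediately
store.wait_until_indexed()
print(len(store.delta_index))
```

The trade-off this shape implies is a short window in which a committed row is in the log but not yet in the delta index; the sketch makes that window visible through `wait_until_indexed`.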
The query path always hits exactly two indexes. seekdb maintains a delta HNSW (the incremental layer absorbing new writes) and a snapshot HNSW (the steady-state main index), modeled after the LSM-tree pattern from KV stores. A query runs k-NN against both and merges the result. The number of indexes is fixed regardless of how much data has been written — so concurrent queries don't contend on a growing fanout.
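A minimal sketch of the two-index query path follows. Exact brute-force search over a dict stands in for each HNSW index, `search` and `brute_force_knn` are hypothetical names, and the sketch assumes a row id lives in only one of the two layers.

```python
import heapq
import math


def brute_force_knn(index, query, k):
    """Stand-in for an HNSW search: exact k-NN over a dict of id -> vector."""
    scored = ((math.dist(vec, query), row_id) for row_id, vec in index.items())
    return heapq.nsmallest(k, scored)


def search(delta, snapshot, query, k):
    """Query both layers, then keep the k closest candidates overall."""
    candidates = brute_force_knn(delta, query, k) + brute_force_knn(snapshot, query, k)
    return [row_id for _, row_id in heapq.nsmallest(k, candidates)]


snapshot = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}   # steady-state layer
delta = {4: [0.1, 0.1], 5: [9.0, 9.0]}                     # recent writes
top = search(delta, snapshot, query=[0.0, 0.0], k=3)
print(top)
```

Each query does exactly two searches and one merge of at most 2k candidates, no matter how many rows have been written.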
We tested this against five other vector databases using VectorDBBench's StreamingPerformanceCase — a third-party open-source benchmark maintained by Zilliz (the company behind Milvus). We used it specifically because it isn't something we built to make ourselves look good.
Setup: Cohere 10M dataset (768-dim), 16 vCPU / 64 GiB, identical HNSW parameters across all systems (M=16, ef_construction=256, ef_search=200), sustained write rate of 500 rows/sec.
The metric that matters isn't raw QPS or serial latency. It's how much your P99 moves when you add concurrency — because your agent doesn't run single-threaded in production.
| Database | QPS | Serial P99 | Concurrent P99 | P99 Jitter (concurrent / serial) |
| --- | --- | --- | --- | --- |
| seekdb v1.3.0 | 1,523 | 19.7 ms | 21.7 ms | 1.1× |
| Milvus | 153 | 15.9 ms | 153.6 ms | 9.7× |
| Elasticsearch | 487 | 5.2 ms | 53.6 ms | 10.3× |
Elasticsearch actually has a faster serial P99 than seekdb. But the moment you add concurrency — which is what production looks like — it climbs 10× while seekdb barely moves.
That's the v1.2.0 → v1.3.0 delta we mentioned earlier: from 69 QPS / 410ms concurrent P99 to 1,523 QPS / 21.7ms. 22× QPS, 19× P99 — same hardware, same dataset, purely an architectural change.
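The ratios quoted above follow directly from the numbers in the table and the v1.2.0 figures. The snippet recomputes them, assuming the jitter column is concurrent P99 divided by serial P99.

```python
# (serial P99 ms, concurrent P99 ms) taken from the table above.
results = {
    "seekdb v1.3.0": (19.7, 21.7),
    "Milvus": (15.9, 153.6),
    "Elasticsearch": (5.2, 53.6),
}

jitter = {name: round(conc / serial, 1) for name, (serial, conc) in results.items()}

# v1.2.0 -> v1.3.0 on the same benchmark.
qps_gain = round(1523 / 69)      # throughput improvement
p99_gain = round(410 / 21.7)     # concurrent P99 improvement

print(jitter, qps_gain, p99_gain)
```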
Full benchmark scripts and configs are reproducible: github.com/oceanbase/vdb-streambench. PRs welcome to add more systems.
Performance is one half of the agent problem. The other half is something benchmarks don't even try to measure: agents need to make speculative changes to their data, and they need a clean way to roll back.
Consider a coding agent that writes business logic and produces a 500-row result table. How do you verify it matches the expected output? You could write LEFT JOIN + CASE WHEN + UNION ALL, handle NULLs, fix the sort order, copy-paste the diff to the agent, wait for it to fix the code, then run the whole comparison again.
Code has git diff — one command. Data had nothing equivalent. Until now.
seekdb implements Copy-on-Write directly in the storage engine. FORK DATABASE creates an instant, full-database clone at a single atomic snapshot point — all tables share the same snapshot version, so foreign keys and multi-table joins remain consistent. The operation completes in seconds regardless of data size (1 GB or 100 GB) because nothing is physically copied until a write occurs.
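The reason a fork does not depend on data size can be shown with a generic layered copy-on-write structure. This is a textbook-style sketch, not seekdb's storage engine: the `Database` class and its methods are hypothetical, and a dict per layer stands in for on-disk data.

```python
_TOMBSTONE = object()   # marks a key deleted in an upper layer


class Database:
    """A stack of layers: reads walk downward, writes touch only the top layer."""

    def __init__(self, frozen_layers=()):
        self._frozen = tuple(frozen_layers)   # shared with forks, never mutated
        self._top = {}                        # private writes since the last fork

    def fork(self):
        """Freeze the current top layer; both sides continue on fresh empty layers."""
        shared = self._frozen + (self._top,)
        self._frozen, self._top = shared, {}
        return Database(shared)               # no rows are copied

    def put(self, key, row):
        self._top[key] = row

    def delete(self, key):
        self._top[key] = _TOMBSTONE

    def get(self, key):
        for layer in (self._top, *reversed(self._frozen)):
            if key in layer:
                row = layer[key]
                return None if row is _TOMBSTONE else row
        return None


prod = Database()
prod.put(1, "a")
sandbox = prod.fork()          # snapshot point
sandbox.put(1, "b")            # visible only in the sandbox
sandbox.put(2, "c")
prod.put(3, "d")               # visible only in production
print(prod.get(1), sandbox.get(1), prod.get(2), sandbox.get(3))
```

The cost of `fork` here is proportional to the number of layers, not the number of rows, which is the property the paragraph above relies on.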
This works in both embedded mode (your agent runs seekdb in-process) and server mode (shared instance). Either way, forking is instant and each sandbox is a fully writable database with its own vector indexes, schemas, and auto-increment state.
One SQL statement to see exactly what changed:
DIFF TABLE sandbox.result AGAINST production.result;
The output tells you precisely which rows conflict (same primary key, different values) and which rows are unique to each side (missing or extra), with the actual values. No ambiguity, no hand-written JOINs.
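The three row categories described above can be modeled in a few lines of Python. This sketch only illustrates the semantics: `diff_tables` is a hypothetical function, tables are represented as dicts keyed by primary key, and the shape of the result is not seekdb's actual output format.

```python
def diff_tables(left, right):
    """Compare two tables given as {primary_key: row} dicts."""
    shared = left.keys() & right.keys()
    return {
        # Same primary key, different values.
        "conflicts": {k: (left[k], right[k]) for k in shared if left[k] != right[k]},
        # Rows present on one side only.
        "only_left": {k: left[k] for k in left.keys() - right.keys()},
        "only_right": {k: right[k] for k in right.keys() - left.keys()},
    }


sandbox_result = {1: ("alice", 100), 2: ("bob", 250), 4: ("dana", 75)}
production_result = {1: ("alice", 100), 2: ("bob", 200), 3: ("carol", 50)}

report = diff_tables(sandbox_result, production_result)
print(report)
```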
Three conflict resolution strategies in a single transactional operation:
| Strategy | On Conflict | Use Case |
| --- | --- | --- |
| FAIL | Abort and rollback | Strict audit — no surprises |
| THEIRS | Overwrite with branch data | Trust the agent's work |
| OURS | Keep mainline, only add new rows | Conservative merge |
-- Agent gets an isolated sandbox (instant, no data copy)
FORK DATABASE agent_state TO sandbox_42;
-- Agent does whatever it wants — writes, updates, deletes
USE sandbox_42;
INSERT INTO memory (embedding, content)
VALUES ('[0.1,...]', 'new observation');
-- Review what changed
DIFF TABLE sandbox_42.memory AGAINST agent_state.memory;
-- Speculation succeeded → merge back
MERGE TABLE sandbox_42.memory INTO agent_state.memory
STRATEGY THEIRS;
-- Speculation failed → drop it, mainline untouched
DROP DATABASE sandbox_42;
This maps directly to Git: Fork is git branch, Diff is git diff, Merge is git merge. The difference is that it operates on structured data with full ACID guarantees, not text files.
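The FAIL, THEIRS, and OURS strategies from the table above can be modeled as follows. This is a sketch of the conflict rules only: `merge_table` and `MergeConflict` are hypothetical names, tables are dicts keyed by primary key, and rows deleted in the branch are not modeled.

```python
class MergeConflict(Exception):
    """Raised by the FAIL strategy when any primary key has diverged."""


def merge_table(branch, mainline, strategy):
    """Return a new mainline dict; the inputs are left untouched (all or nothing)."""
    if strategy not in {"FAIL", "THEIRS", "OURS"}:
        raise ValueError(f"unknown strategy: {strategy}")
    conflicts = [k for k in branch.keys() & mainline.keys() if branch[k] != mainline[k]]
    if strategy == "FAIL" and conflicts:
        raise MergeConflict(sorted(conflicts))
    merged = dict(mainline)
    for key, row in branch.items():
        # New rows are always added; existing rows are overwritten only for THEIRS.
        if key not in mainline or strategy == "THEIRS":
            merged[key] = row
    return merged


mainline = {1: "a", 2: "b"}
branch = {2: "B", 3: "c"}      # key 2 conflicts, key 3 is new

theirs = merge_table(branch, mainline, "THEIRS")
ours = merge_table(branch, mainline, "OURS")
print(theirs, ours)
```

Building the merged result separately and returning it only on success mirrors the single-transaction behavior: a FAIL merge that hits a conflict leaves the mainline exactly as it was.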
With DIFF TABLE, verifying an agent's output becomes a one-liner instead of 30+ lines of hand-crafted SQL. No output means the agent passed; any output pinpoints exactly which rows differ and how.
This turns data verification from a manual bottleneck into something that fits inside an automated test loop. The agent writes code → runs it → DIFFs the result against expected → if differences exist, feeds them back as context for the next iteration. The entire write-test-debug cycle becomes programmatic.
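The write, run, diff, and feed-back loop can be sketched as a small driver function. Everything here is hypothetical scaffolding: `iterate_until_match` and `diff_rows` are illustrative names, tables are dicts keyed by primary key, and the fake agent simply repairs whatever rows the diff reports, standing in for a real code-generating agent.

```python
def diff_rows(actual, expected):
    """Rows that differ, as {key: (actual_value, expected_value)}; empty means pass."""
    keys = actual.keys() | expected.keys()
    return {k: (actual.get(k), expected.get(k))
            for k in keys if actual.get(k) != expected.get(k)}


def iterate_until_match(generate, expected, max_rounds=5):
    """Call generate(feedback) until its output matches; return the round number."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        actual = generate(feedback)
        feedback = diff_rows(actual, expected)
        if not feedback:
            return round_no
    return None   # gave up without converging


def make_fake_agent():
    table = {1: 10, 2: 99}               # first attempt: one wrong row, one missing
    def generate(feedback):
        for key, (_, want) in (feedback or {}).items():
            table[key] = want            # "fix" exactly the rows the diff reported
        return dict(table)
    return generate


rounds = iterate_until_match(make_fake_agent(), expected={1: 10, 2: 20, 3: 30})
print(rounds)
```

In a real setup the `diff_rows` call would be replaced by a `DIFF TABLE` statement, and its output rows would be passed back to the agent as context.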
The same primitives solve problems beyond agents: multi-team parallel development (each team forks, works independently, merges back), A/B experiment analysis (DIFF the experiment group against control — difference rows carry dimension columns for instant segmentation), regression testing (DIFF expected vs. actual — no output means pass, any output pinpoints the failure).
A new generation of agent development frameworks — often called Agent-Driven Development (Agent DDD) — is letting AI agents autonomously produce code while developers define goals. This changes what a database needs to do.
In traditional development, a human writes code, manually prepares test data, and eyeballs the result. In Agent DDD, dozens of agents work in parallel, each generating code, running it against data, and iterating autonomously. The database is no longer a passive store that humans query occasionally; it's an active participant in every agent's reasoning loop. That means it has to provide the three capabilities covered above: hybrid retrieval in a single query, stable latency under continuous writes, and Fork/Diff/Merge sandboxes for speculative changes.
Together, these three capabilities shift agent-era data development from serial and manual to parallel and programmatic. The write-code → auto-evaluate → diff-feedback → iterate loop that Agent DDD demands now has native database infrastructure behind it.
If you're adopting next-generation agent frameworks and want your data layer to keep up, give seekdb a look — and let us know what you're building.
seekdb is fully open source (Apache 2.0), developed by the OceanBase team. You may already be using OceanBase — it runs in production at Alipay, Taobao, DiDi, Xiaomi, and more. seekdb inherits the same storage engine and SQL executor, focused on the vector + relational hybrid workload for Agent scenarios — with 2,500+ GitHub stars since launch and integrations with LangChain / LlamaIndex / Dify / Coze and other major frameworks.
If you're choosing a database for your Agent — take 30 seconds and try it.
⭐ github.com/oceanbase/seekdb — a star helps more people discover this project and motivates us to keep investing in it.
Questions or want to discuss your Agent use case: GitHub Issues · GitHub Discussions
