Why a new AI-native database trended on GitHub

Mike Liu

Published on June 12, 2026 · Updated on 2026-07-22

10 minute read

On this page

Why Agents Need a Different Database

Hybrid Search in One SQL

Streaming-First Indexing: 1,523 QPS Under Continuous Writes

Fork/Merge Sandboxes for Safe Agent Exploration

Putting It Together: A Database for the Agent DDD Era

About seekdb

Key Takeaways

seekdb is a MySQL-compatible state store for AI agents — hybrid vector, full-text, and scalar search in a single SQL query, with Git-like data branching built into the kernel.
Under a third-party streaming benchmark (VectorDBBench), seekdb sustains 1,523 QPS on streaming write+search — 10× Milvus, 3× Elasticsearch — with P99 jitter of just 1.1× under concurrency.
Fork/Diff/Merge primitives let agents safely explore in Copy-on-Write sandboxes, then merge results back with ACID guarantees.

seekdb hit GitHub's C++ trending list today. It was also recently recommended by the GitHub Projects Community on X. We started seeing developers in the comments asking about hybrid search, streaming QPS, and the Fork/Merge sandbox — so it felt like a good time to write up what's actually under the hood.

Here's what seekdb is, and the architecture decisions behind it.

seekdb is a MySQL-compatible state store purpose-built for AI agents. It combines vector search, full-text search, and scalar filtering in a single SQL query — no stitching across systems. It ships with kernel-level Copy-on-Write sandboxes (Fork → Diff → Merge) for safe agent exploration, full ACID transactions, and works as both an embedded library and a standalone server. LangChain, LlamaIndex, Dify, and Coze connect out of the box via MySQL protocol.

But the feature list doesn't explain why developers are interested. The workload does.

Why Agents Need a Different Database

An AI agent isn't a human user running queries one at a time. It's a loop — observing, reasoning, acting, writing results back, and immediately retrieving again for the next step:

for step in agent.run():
    memory.write(step.observation)        # continuous writes
    relevant = memory.search(step.query)  # millisecond-later reads

This loop creates a workload that's fundamentally different from traditional database usage. Writes and reads are interleaved at millisecond intervals. Retrievals are multi-modal — combining semantic similarity, keyword matching, and structured filters in a single request. And the agent often needs to explore speculatively: try an action, inspect the result, and decide whether to keep it or throw it away.

These characteristics translate into three concrete requirements for the database underneath:

Hybrid retrieval in one round trip. Agent queries are rarely pure vector similarity — they combine embedding distance with structured filters and keyword matching. If that requires calling three systems and merging client-side, you're adding latency at exactly the wrong layer.
Streaming writes that don't blow up P99. The agent writes continuously. If each write batch creates a new index segment, query fanout grows unboundedly. Under concurrency, P99 latency doesn't degrade gracefully — it explodes.
A sandbox for speculative changes. An agent might update memory with a hypothesis it isn't sure about, run an A/B test on its own state, or execute a tool call that could write garbage. You need a way to branch, inspect, and selectively merge — most vector databases have no primitive for this.

These three requirements shaped the architecture of seekdb.

Hybrid Search in One SQL

The most common question we saw in the comments: how does hybrid search actually work here?

Agent retrieval is rarely pure vector similarity. A typical query looks like: show me the top 10 documents authored by user 42 since January, matching "quarterly report", ranked by embedding similarity.

In most architectures, that requires three round trips — one to the vector index, one to the full-text engine, one to the relational store — then client-side merging and re-ranking. Each hop adds latency and each system has its own consistency model.
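For comparison, here is a minimal sketch of what that client-side merge tends to look like. The three lookup functions are hypothetical stand-ins backed by a toy in-memory corpus, not real client APIs, and all data is made up for illustration.

```python
import math

# Toy corpus standing in for three separate systems (all values are invented).
DOCS = {
    1: {"author_id": 42, "created_at": "2026-02-01", "content": "quarterly report draft", "emb": [0.1, 0.3]},
    2: {"author_id": 42, "created_at": "2025-11-15", "content": "quarterly report final", "emb": [0.1, 0.3]},
    3: {"author_id": 7, "created_at": "2026-03-01", "content": "quarterly report notes", "emb": [0.1, 0.3]},
    4: {"author_id": 42, "created_at": "2026-03-10", "content": "quarterly report summary", "emb": [0.9, 0.9]},
    5: {"author_id": 42, "created_at": "2026-04-02", "content": "holiday schedule", "emb": [0.1, 0.3]},
}

def vector_search(query_emb, k):
    """Hop 1: hypothetical vector index; returns (doc_id, distance) pairs."""
    scored = [(doc_id, math.dist(query_emb, d["emb"])) for doc_id, d in DOCS.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

def fulltext_search(phrase):
    """Hop 2: hypothetical full-text engine; returns matching ids."""
    return {doc_id for doc_id, d in DOCS.items() if phrase in d["content"]}

def relational_filter(author_id, since):
    """Hop 3: hypothetical relational store; returns ids passing the scalar filters."""
    return {
        doc_id
        for doc_id, d in DOCS.items()
        if d["author_id"] == author_id and d["created_at"] > since
    }

def hybrid_search_client_side(query_emb, phrase, author_id, since, k):
    # Over-fetch from the vector index because the later filters discard candidates.
    candidates = vector_search(query_emb, k=len(DOCS))
    keep = fulltext_search(phrase) & relational_filter(author_id, since)
    return [(doc_id, dist) for doc_id, dist in candidates if doc_id in keep][:k]
```

Note the over-fetch in the first hop: the client cannot know how many vector candidates survive the other two filters, so it has to pull more than it needs and trim afterwards.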

In seekdb, that's one SQL statement:

SELECT id, title, l2_distance(emb, '[0.12,0.34,...]') AS dist
FROM docs
WHERE MATCH(content) AGAINST ('quarterly report')
  AND author_id = 42
  AND created_at > '2026-01-01'
ORDER BY dist APPROXIMATE LIMIT 10;

Vector distance, full-text matching, and scalar filters are pushed down into a single execution plan. No client-side merging, no multiple round trips, no consistency gaps between systems.

This works because seekdb isn't a vector database with relational features bolted on, or a relational database with a vector plugin. The storage engine natively maintains vector indexes (HNSW, IVF), inverted indexes (full-text with BM25 ranking, CJK tokenizers), and B-tree indexes (scalar) — and the query optimizer can combine them in a single plan. DML operations update all index types transactionally, so query results are always consistent with the latest committed state.

Why does this matter for agents specifically? Because an agent's retrieval context is almost never a single modality. When a coding agent searches its memory for "the authentication module we discussed last Tuesday," that's simultaneously a semantic query (authentication concept), a keyword query ("authentication module"), and a temporal filter (last Tuesday). Forcing the agent framework to orchestrate three separate calls, deduplicate, and re-rank adds 50–200ms of overhead per retrieval — overhead that compounds across the dozens of retrieval calls in a single agent run.

With seekdb, the agent framework issues one SQL query. The database handles the rest internally, returning a single ranked result set in one network round trip.

seekdb speaks the MySQL wire protocol natively. LangChain, LlamaIndex, Dify, Coze, and any MySQL client connect without an adapter or custom driver. Your existing MySQL tooling — ORMs, migration scripts, monitoring — works unchanged. If your agent framework can talk to MySQL, it can talk to seekdb.
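As a rough sketch of what "talking MySQL" could look like from Python, the helper below builds the hybrid query from the example above as a parameterized statement. `build_hybrid_query` is a hypothetical name, the `docs` table and its columns come from the earlier example, and passing the vector literal as a bound string parameter is an assumption that should be checked against the seekdb documentation. The `%s` placeholders follow the style used by common Python MySQL drivers such as PyMySQL.

```python
def build_hybrid_query(embedding, phrase, author_id, since, k=10):
    """Return (sql, params) for a DB-API driver that uses %s placeholders."""
    # Bracketed text form of the vector, as in the article's SQL example.
    vector_literal = "[" + ",".join(f"{x:g}" for x in embedding) + "]"
    sql = (
        "SELECT id, title, l2_distance(emb, %s) AS dist "
        "FROM docs "
        "WHERE MATCH(content) AGAINST (%s) "
        "AND author_id = %s "
        "AND created_at > %s "
        # LIMIT is inlined after an int() cast rather than bound as a parameter.
        f"ORDER BY dist APPROXIMATE LIMIT {int(k)}"
    )
    return sql, (vector_literal, phrase, author_id, since)
```

With a connected driver, the result would typically be passed straight to `cursor.execute(sql, params)`.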

Streaming-First Indexing: 1,523 QPS Under Continuous Writes

The second thing people noticed in the README: streaming benchmark numbers. Here's what's behind them.

Traditional vector databases such as Milvus, Elasticsearch, and Qdrant perform well in the workloads they were designed for: bulk ingestion followed by read-only queries.

But streaming writes expose a structural assumption baked into all of them: every batch of new data produces a new index segment. At query time, the engine fans the request out to N segments, runs a k-NN search against each one, and merges the results. With a single query thread, that's manageable. But once you run M concurrent query threads against N segments, you have N×M units of work contending for CPU.

P99 doesn't degrade gracefully. It explodes.
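A back-of-the-envelope version of the N×M argument, as a simplified model: it assumes one new segment per write batch and no compaction, which real engines mitigate to varying degrees. Both function names are hypothetical.

```python
def segments_after(batches_written, batches_per_segment=1):
    """Segment count under a hypothetical segment-per-batch policy with no compaction."""
    return batches_written // batches_per_segment

def knn_work_units(index_count, query_threads):
    """Each concurrent query runs one k-NN search per index it has to visit."""
    return index_count * query_threads

# 16 query threads against 100 accumulated segments, versus a fixed pair of indexes.
growing = knn_work_units(segments_after(100), query_threads=16)
fixed = knn_work_units(2, query_threads=16)
print(growing, fixed)
```

The first number keeps growing with every batch written; the second depends only on the number of query threads.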

This isn't a theoretical concern. We hit it ourselves. seekdb v1.2.0 used a conventional approach — synchronous index building on the write path, growing segment count over time. Under the same streaming benchmark we'll show below, it managed 69 QPS with a concurrent P99 of 410ms. That was unacceptable for agent workloads where retrieval latency directly gates the agent's next reasoning step.

What we changed

seekdb v1.3.0 introduced two mechanisms specifically for streaming workloads:

The write path never touches the index. When a transaction commits, all that happens synchronously is a write to the redo log. A separate Change Stream pipeline asynchronously consumes the log in the background and applies vectors to an in-memory delta HNSW index. Writes and index construction are physically decoupled — writes never block on indexing, and indexing never blocks on writes.

The query path always hits exactly two indexes. seekdb maintains a delta HNSW (the incremental layer absorbing new writes) and a snapshot HNSW (the steady-state main index), modeled after the LSM-tree pattern from KV stores. A query runs k-NN against both and merges the results. The number of indexes is fixed regardless of how much data has been written, so concurrent queries don't contend on a growing fanout.
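The two mechanisms can be pictured with a toy model like the one below. It is a sketch of the general pattern only, not seekdb's implementation: plain dicts and brute-force search stand in for the HNSW indexes, a Python list stands in for the redo log, a `queue.Queue` stands in for the change stream, and row ids in the two layers are assumed to be disjoint. `StreamingIndex` and its methods are hypothetical names.

```python
import heapq
import math
import queue
import threading

class StreamingIndex:
    """Toy model of a decoupled write path plus a fixed two-index query path."""

    def __init__(self, snapshot):
        self.snapshot = dict(snapshot)   # stand-in for the snapshot HNSW
        self.delta = {}                  # stand-in for the in-memory delta HNSW
        self.redo_log = []               # stand-in for the durable redo log
        self._changes = queue.Queue()    # stand-in for the change stream
        self._lock = threading.Lock()    # guards self.delta across threads
        threading.Thread(target=self._apply_changes, daemon=True).start()

    def commit(self, row_id, vector):
        # Synchronous part of a commit: append to the log, do no index work.
        self.redo_log.append((row_id, vector))
        self._changes.put((row_id, vector))

    def _apply_changes(self):
        # Background consumer: applies logged vectors to the delta index.
        while True:
            row_id, vector = self._changes.get()
            with self._lock:
                self.delta[row_id] = vector
            self._changes.task_done()

    def wait_until_indexed(self):
        """Block until every committed row is visible in the delta index."""
        self._changes.join()

    def search(self, query, k):
        # Exactly two k-NN searches, however many rows have been written.
        with self._lock:
            delta_view = dict(self.delta)
        hits = _knn(self.snapshot, query, k) + _knn(delta_view, query, k)
        return [row_id for _, row_id in heapq.nsmallest(k, hits)]

def _knn(index, query, k):
    """Brute-force k-NN over id -> vector (stands in for an HNSW search)."""
    scored = [(math.dist(query, vec), row_id) for row_id, vec in index.items()]
    return heapq.nsmallest(k, scored)
```

In this model a freshly committed row becomes searchable only once the background consumer has applied it, which is why the sketch exposes `wait_until_indexed` for deterministic use in tests.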

Benchmark evidence

We tested this against five other vector databases using VectorDBBench's StreamingPerformanceCase — a third-party open-source benchmark maintained by Zilliz (the company behind Milvus). We used it specifically because it isn't something we built to make ourselves look good.

Setup: Cohere 10M dataset (768-dim), 16 vCPU / 64 GiB, identical HNSW parameters across all systems (M=16, ef_construction=256, ef_search=200), sustained write rate of 500 rows/sec.

The metric that matters isn't raw QPS or serial latency. It's how much your P99 moves when you add concurrency — because your agent doesn't run single-threaded in production.

Database	QPS	Serial P99	Concurrent P99	P99 Jitter (concurrent ÷ serial)
seekdb v1.3.0	1,523	19.7 ms	21.7 ms	1.1×
Milvus	153	15.9 ms	153.6 ms	9.7×
Elasticsearch	487	5.2 ms	53.6 ms	10.3×

Elasticsearch actually has a faster serial P99 than seekdb. But the moment you add concurrency — which is what production looks like — it climbs 10× while seekdb barely moves.

That's the v1.2.0 → v1.3.0 delta we mentioned earlier: from 69 QPS / 410ms concurrent P99 to 1,523 QPS / 21.7ms. That is 22× the QPS and a 19× lower concurrent P99 on the same hardware and dataset, purely from an architectural change.
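The ratios quoted here can be re-derived from the published figures. The snippet below assumes P99 jitter means concurrent P99 divided by serial P99, which matches the values in the table; the numbers are copied from the table and from the v1.2.0 figures quoted earlier.

```python
RESULTS = {
    # name: (qps, serial_p99_ms, concurrent_p99_ms), copied from the table
    "seekdb v1.3.0": (1523, 19.7, 21.7),
    "Milvus": (153, 15.9, 153.6),
    "Elasticsearch": (487, 5.2, 53.6),
}

def p99_jitter(serial_ms, concurrent_ms):
    """How many times P99 latency grows when concurrency is added."""
    return concurrent_ms / serial_ms

for name, (qps, serial, concurrent) in RESULTS.items():
    print(f"{name}: jitter {p99_jitter(serial, concurrent):.1f}x")

# v1.2.0 figures quoted in the text: 69 QPS, 410 ms concurrent P99.
qps_gain = 1523 / 69
p99_gain = 410 / 21.7
print(f"QPS gain {qps_gain:.0f}x, concurrent P99 gain {p99_gain:.0f}x")
```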

Full benchmark scripts and configs are reproducible: github.com/oceanbase/vdb-streambench. PRs welcome to add more systems.

Fork/Merge Sandboxes for Safe Agent Exploration

Performance is one half of the agent problem. The other half is something benchmarks don't even try to measure: agents need to make speculative changes to their data, and they need a clean way to roll back.

Consider a coding agent that writes business logic and produces a 500-row result table. How do you verify it matches the expected output? You could write LEFT JOIN + CASE WHEN + UNION ALL, handle NULLs, fix the sort order, copy-paste the diff to the agent, wait for it to fix the code, then run the whole comparison again.

Code has git diff — one command. Data had nothing equivalent. Until now.

Fork Database

seekdb implements Copy-on-Write directly in the storage engine. FORK DATABASE creates an instant, full-database clone at a single atomic snapshot point — all tables share the same snapshot version, so foreign keys and multi-table joins remain consistent. The operation completes in seconds regardless of data size (1 GB or 100 GB) because nothing is physically copied until a write occurs.

This works in both embedded mode (your agent runs seekdb in-process) and server mode (shared instance). Either way, forking is instant and each sandbox is a fully writable database with its own vector indexes, schemas, and auto-increment state.
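To illustrate why a copy-on-write fork does not depend on data size, here is a toy key-value sketch. It shows the general technique only and says nothing about how seekdb's storage engine lays out data; `CowDatabase` and its methods are hypothetical names.

```python
class CowDatabase:
    """Toy copy-on-write store: a fork shares existing rows until someone writes."""

    _TOMBSTONE = object()  # marks a row deleted in a newer layer

    def __init__(self, layers=()):
        self._frozen = tuple(layers)  # shared read-only layers, newest first
        self._writes = {}             # private layer for this database

    def fork(self):
        # Freeze the current private writes into a shared layer. No rows are
        # copied, so the cost does not grow with the amount of data stored.
        shared = (self._writes,) + self._frozen
        self._frozen, self._writes = shared, {}
        return CowDatabase(shared)

    def put(self, key, row):
        self._writes[key] = row

    def delete(self, key):
        self._writes[key] = self._TOMBSTONE

    def get(self, key):
        # Newest layer wins; fall through to older shared layers.
        for layer in (self._writes,) + self._frozen:
            if key in layer:
                row = layer[key]
                return None if row is self._TOMBSTONE else row
        return None
```

After a fork, both sides start with an empty private layer on top of the same frozen layers, so writes on either side stay invisible to the other.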

Diff Table

One SQL statement to see exactly what changed:

DIFF TABLE sandbox.result AGAINST production.result;

Output tells you precisely: conflict rows (same primary key, different values), rows unique to each side (missing or extra), with the actual values. No ambiguity, no hand-written JOINs.
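The three output categories can be modeled in a few lines. This is a sketch of the semantics described above, assuming each table is a dict keyed by primary key; `diff_table` is a hypothetical helper, not the seekdb output format.

```python
def diff_table(left, right):
    """Compare two tables given as dicts of primary key -> row tuple."""
    shared = left.keys() & right.keys()
    return {
        # Same primary key on both sides, different values.
        "conflicts": {pk: (left[pk], right[pk]) for pk in shared if left[pk] != right[pk]},
        # Rows that exist on only one side.
        "only_left": {pk: left[pk] for pk in left.keys() - right.keys()},
        "only_right": {pk: right[pk] for pk in right.keys() - left.keys()},
    }
```

An all-empty result corresponds to the "no output" case: the two tables hold the same rows.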

Merge Table

Three conflict resolution strategies in a single transactional operation:

Strategy	On Conflict	Use Case
FAIL	Abort and rollback	Strict audit — no surprises
THEIRS	Overwrite with branch data	Trust the agent's work
OURS	Keep mainline, only add new rows	Conservative merge
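A minimal sketch of how the three strategies behave on conflicting rows, with tables modeled as dicts keyed by primary key. `merge_table` and `MergeConflict` are hypothetical names, and rows deleted on the branch are deliberately left out because the text above does not describe how deletes are merged.

```python
class MergeConflict(Exception):
    """Raised by the FAIL strategy when both sides changed the same row."""

def merge_table(branch, mainline, strategy):
    """Return a new mainline dict (pk -> row); the inputs are left untouched."""
    if strategy not in {"FAIL", "THEIRS", "OURS"}:
        raise ValueError(f"unknown strategy: {strategy}")
    merged = dict(mainline)  # work on a copy so an aborted merge changes nothing
    for pk, row in branch.items():
        if pk not in mainline:
            merged[pk] = row  # new row: added under every strategy
        elif mainline[pk] != row:
            if strategy == "FAIL":
                raise MergeConflict(f"conflict on primary key {pk!r}")
            if strategy == "THEIRS":
                merged[pk] = row  # branch data overwrites mainline
            # OURS: keep the mainline row as it is
    return merged
```

Building the result in a copy mirrors the all-or-nothing behavior in the table: when FAIL raises, the caller's mainline is exactly as it was.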

The full cycle

-- Agent gets an isolated sandbox (instant, no data copy)
FORK DATABASE agent_state TO sandbox_42;

-- Agent does whatever it wants — writes, updates, deletes
USE sandbox_42;
INSERT INTO memory (embedding, content)
VALUES ('[0.1,...]', 'new observation');

-- Review what changed
DIFF TABLE sandbox_42.memory AGAINST agent_state.memory;

-- Speculation succeeded → merge back
MERGE TABLE sandbox_42.memory INTO agent_state.memory
      STRATEGY THEIRS;

-- Speculation failed → drop it, mainline untouched
DROP DATABASE sandbox_42;

This maps directly to Git: Fork is git branch, Diff is git diff, Merge is git merge. The difference is that it operates on structured data with full ACID guarantees, not text files.

Why this matters for agent development loops

With DIFF TABLE, verifying an agent's output becomes a one-liner instead of 30+ lines of hand-crafted SQL. No output means the agent passed; any output pinpoints exactly which rows differ and how.

This turns data verification from a manual bottleneck into something that fits inside an automated test loop. The agent writes code → runs it → DIFFs the result against expected → if differences exist, feeds them back as context for the next iteration. The entire write-test-debug cycle becomes programmatic.
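One way to picture that loop is the sketch below. It assumes results are dicts keyed by primary key and uses a hypothetical `generate` callback to stand in for the agent's code-and-run step; in a real setup the diff would come from the database rather than from Python.

```python
def diff_rows(actual, expected):
    """Return primary keys whose rows differ or exist on only one side."""
    return sorted(
        pk for pk in actual.keys() | expected.keys()
        if actual.get(pk) != expected.get(pk)
    )

def iterate_until_clean(generate, expected, max_rounds=5):
    """Call generate(feedback) until its output matches expected.

    Returns the round number that produced a clean diff, or None if the
    round budget ran out first.
    """
    feedback = []
    for round_no in range(1, max_rounds + 1):
        actual = generate(feedback)            # agent writes and runs code
        feedback = diff_rows(actual, expected)  # differences become context
        if not feedback:
            return round_no
    return None
```

The `max_rounds` cap is a design choice for the sketch: without it, an agent that never converges would loop forever.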

The same primitives solve problems beyond agents: multi-team parallel development (each team forks, works independently, merges back), A/B experiment analysis (DIFF the experiment group against control — difference rows carry dimension columns for instant segmentation), regression testing (DIFF expected vs. actual — no output means pass, any output pinpoints the failure).

Putting It Together: A Database for the Agent DDD Era

A new generation of agent development frameworks — often called Agent-Driven Development (Agent DDD) — is letting AI agents autonomously produce code while developers define goals. This changes what a database needs to do.

In traditional development, a human writes code, manually prepares test data, and eyeballs the result. In Agent DDD, dozens of agents work in parallel, each generating code, running it against data, and iterating autonomously. The database is no longer a passive store that humans query occasionally — it's an active participant in every agent's reasoning loop. That means:

Every reasoning step issues a retrieval — and if that retrieval requires stitching three systems together, the latency tax compounds across hundreds of iterations per agent run. Hybrid Search in one SQL eliminates that overhead.
Agents write continuously as they observe and act — if the index can't keep up without blowing P99, the agent's feedback loop slows to a crawl. Streaming-First Indexing keeps latency predictable under concurrent write+read.
When 20+ agents develop features in parallel, they each need an isolated environment to run speculative logic without polluting each other's state — and a way to verify results and merge back. Fork/Diff/Merge provides that Git-like data workflow at the kernel level.

Together, these three capabilities shift agent-era data development from serial and manual to parallel and programmatic. The write-code → auto-evaluate → diff-feedback → iterate loop that Agent DDD demands now has native database infrastructure behind it.

If you're adopting next-generation agent frameworks and want your data layer to keep up, give seekdb a look — and let us know what you're building.

About seekdb

seekdb is fully open source (Apache 2.0), developed by the OceanBase team. You may already be using OceanBase — it runs in production at Alipay, Taobao, DiDi, Xiaomi, and more. seekdb inherits the same storage engine and SQL executor, focused on the vector + relational hybrid workload for Agent scenarios — with 2,500+ GitHub stars since launch and integrations with LangChain / LlamaIndex / Dify / Coze and other major frameworks.

If you're choosing a database for your Agent — take 30 seconds and try it.

⭐ github.com/oceanbase/seekdb — a star helps more people discover this project and motivates us to keep investing in it.

Questions or want to discuss your Agent use case: GitHub Issues · GitHub Discussions
