It's 3 a.m. A node in your database cluster goes dark. The on-call engineer has two questions, in this order: Did we lose any committed transactions? And how long until the system is back?
For a distributed database, the honest answer to both questions depends on one thing: how the system reaches agreement when machines fail. OceanBase's answer is Multi-Paxos — a consensus protocol that, combined with log streams, replica types, and a lease-based election mechanism, lets the database guarantee zero data loss (RPO = 0) and recover in under 8 seconds (RTO < 8s) without human intervention.
This post walks through how those guarantees are actually built — starting from the consensus problem, then the protocol choices, the replica topology, and finally the failover path.
Before diving in, here's the layered design this post will unpack:
| Layer | Mechanism | What It Buys You |
| --- | --- | --- |
| Consensus | Multi-Paxos with quorum commit | RPO = 0 against single-node and single-DC failure |
| Replication unit | Log streams (multi-partition shared replication) | Lower replication overhead at scale |
| Replica topology | Full + read-only + columnstore + arbitration | Read scalability and cost-efficient cross-DC DR |
| Failover | Lease-based election at log-stream granularity | RTO < 8s, partial failure stays partial |
| Client routing | OBProxy with leader feedback | Failover transparent to applications |
Each layer below addresses a specific failure mode — we'll work through them in order.
Paxos solves a deceptively simple problem: a group of machines needs to agree on a value, even when some of them are slow, unreachable, or crash outright. A useful mental model is a roomful of people trying to agree on one outcome, except that some people may arrive late, go offline, or miss messages. Classic Paxos assumes participants can fail or fall silent but do not lie; it does not tolerate Byzantine (deliberately wrong) responses. The system still has to reach one decision, and that decision must not be overturned later.
That is the problem distributed consensus is designed to solve, and the Paxos protocol, proposed by Turing Award winner Leslie Lamport, is a classic solution. Its core idea is simple but powerful: a decision is considered valid only after it has been accepted by a majority of participants, also known as a quorum.
Why a majority? Because mathematically, any two majorities must have at least one member in common. In other words, the system cannot end up in a permanent split where "one half says A" and "the other half says B." This prevents split-brain at the root — a dangerous scenario in which two replicas both believe they are the leader and independently accept write requests at the same time.
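The intersection property is easy to check by brute force. The sketch below is illustrative only (the replica names and helper functions are invented for this example, not OceanBase code): it enumerates every majority of a small replica group and confirms that each pair overlaps.

```python
from itertools import combinations

def majorities(replicas):
    """Return every subset of `replicas` that forms a strict majority."""
    need = len(replicas) // 2 + 1
    return [set(c)
            for k in range(need, len(replicas) + 1)
            for c in combinations(replicas, k)]

def all_majorities_intersect(replicas):
    """True if every pair of majority quorums shares at least one member."""
    quorums = majorities(replicas)
    return all(a & b for a in quorums for b in quorums)

three = ["r1", "r2", "r3"]
five = ["r1", "r2", "r3", "r4", "r5"]
print(all_majorities_intersect(three))  # True
print(all_majorities_intersect(five))   # True
```

Because any new quorum must include at least one member of the quorum that accepted an earlier decision, that decision cannot be silently replaced.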
In OceanBase, data is replicated and distributed across different physical machines, or nodes, in the form of replicas. Database modifications generate commit logs. These logs must not only be written to the current leader replica, but also be durably persisted on a majority of replicas before a successful commit response can be returned to the client.
So even if a single disk fails, or even if the physical node where the leader replica resides goes down, as long as a majority of replicas are still alive on other machines, the data has already been accepted by a quorum and persisted on more than one machine. It will not be lost because of a single-machine failure.
This is an important reason why OceanBase can achieve zero data loss in disaster recovery scenarios — that is, Recovery Point Objective (RPO) = 0. Its reliability is not built on the assumption that one special machine will never fail. Instead, it is built on the majority mechanism: once a change is accepted by a quorum, it becomes part of the durable history of the system.
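As a rough illustration of the commit rule described above, here is a minimal in-memory sketch. The `Replica` class and `commit` function are hypothetical and ignore networking, retries, and real disk persistence; they only show that success is reported once a majority has acknowledged the log entry.

```python
class Replica:
    """Toy replica: `persist` stands in for a durable log write."""
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.log = []

    def persist(self, entry):
        """Append to the local log and return True as the durability ack."""
        if not self.alive:
            return False
        self.log.append(entry)
        return True

def commit(entry, leader, followers):
    """Return True only if a majority of the whole group persisted the entry."""
    group_size = 1 + len(followers)
    quorum = group_size // 2 + 1
    acks = sum(r.persist(entry) for r in [leader] + followers)
    return acks >= quorum

# One follower down: 2 of 3 replicas ack, so the commit succeeds.
ok = commit("txn-1", Replica("r1"), [Replica("r2"), Replica("r3", alive=False)])
# Both followers down: only the leader acks, so the client gets no success.
lost_quorum = commit("txn-2", Replica("r1"),
                     [Replica("r2", alive=False), Replica("r3", alive=False)])
print(ok, lost_quorum)  # True False
```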
There's a scaling problem, though. If every partition ran its own independent Paxos instance, a busy cluster would generate thousands of separate log synchronizations and quorum acknowledgments per second — each with its own RPC round trip. The network and coordination overhead would grow quickly.
OceanBase collapses that traffic with log streams — an abstraction that groups many partitions under a single replication channel. Partitions sharing a log stream share one replica group, one leader, and one synchronization path. Changes from different partitions flow along the same log replication path, which reduces replication overhead and improves overall throughput.
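To make the saving concrete, the sketch below counts consensus groups with and without log streams. The round-robin placement and the numbers are assumptions chosen for illustration; they are not how OceanBase actually assigns partitions.

```python
from collections import defaultdict

def assign_to_log_streams(partitions, num_streams):
    """Hypothetical placement: spread partitions round-robin over log streams."""
    streams = defaultdict(list)
    for i, partition in enumerate(partitions):
        streams[i % num_streams].append(partition)
    return dict(streams)

partitions = [f"p{i}" for i in range(1000)]
streams = assign_to_log_streams(partitions, num_streams=4)

# Consensus groups needed: one per partition versus one per log stream.
print(len(partitions), len(streams))  # 1000 4
```

Every partition in a stream reuses that stream's leader and replication path, so the number of groups that must elect leaders and exchange acknowledgments tracks the number of log streams, not the number of partitions.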
On top of this, OceanBase uses the Multi-Paxos optimization. Classic Paxos requires two RPC rounds per log entry (prepare, then accept). Once a stable leader is elected, the prepare phase is effectively done once for that leader's tenure, so steady-state log replication needs only one round trip to a quorum per entry. The result is majority-based durability without the latency cost most people associate with consensus protocols.
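A back-of-the-envelope count shows the difference. This is a simplified model (it ignores batching, pipelining, and retries, and the function names are made up for the example):

```python
def basic_paxos_round_trips(entries):
    """Basic Paxos: a prepare and an accept round trip for every entry."""
    return 2 * entries

def multi_paxos_round_trips(entries, leader_changes=0):
    """Multi-Paxos: one prepare per leader tenure, then one accept per entry."""
    tenures = 1 + leader_changes
    return tenures + entries

print(basic_paxos_round_trips(1000))   # 2000
print(multi_paxos_round_trips(1000))   # 1001
```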
Traditional primary-standby architectures often use strong synchronous replication to avoid data loss. In this model, the primary database must wait until the standby database has also persisted the log before returning success to the application.
This ensures that the standby has complete logs if the primary fails. But the cost is also clear: if the primary, the standby, or the network between them has a problem, the primary may become stalled or even unavailable. In other words, the system is often forced into a difficult trade-off between data reliability and service availability. Achieving both at the same time is hard.
Multi-Paxos takes a different approach. Its multi-replica model — typically three or five replicas — is naturally based on quorum decisions. As long as a majority of replicas are alive and can communicate with one another, the system can continue accepting writes and reaching consensus.
Therefore, when a minority of replicas fail — for example, one replica fails in a three-replica deployment — the system can still prevent data loss while keeping the service available. This is difficult for a strongly synchronized primary-standby architecture to achieve cleanly.
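The arithmetic behind this comparison is short enough to write down. The sketch below is illustrative; the primary-standby function is a deliberate simplification of strong synchronous replication with no fallback mode.

```python
def quorum_size(n):
    """Smallest strict majority of n replicas."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many replicas can fail while a majority still exists."""
    return n - quorum_size(n)

def paxos_can_write(n, failed):
    return n - failed >= quorum_size(n)

def sync_primary_standby_can_write(failed):
    """Strong sync needs both primary and standby to persist every log."""
    return failed == 0

for n in (3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 5 3 2
```

With three replicas, one failure leaves writes available; with a strictly synchronous primary-standby pair, one failure on either side blocks them.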
Multi-Paxos and Raft share the same goal — majority-based agreement — but optimize for different things.
Raft optimizes for understandability and operational simplicity. Logs are strictly continuous: entry N+1 cannot commit before entry N. This makes the protocol easier to reason about, easier to implement correctly, and easier to debug. For most use cases, that's the right trade-off, and it's why Raft dominates newer systems.
OceanBase chose Multi-Paxos for a specific reason: large-scale, latency-sensitive, multi-region deployment. Two properties matter here:
Neither choice is universally better. If you're building a control plane with a handful of nodes, Raft is probably the right call. If you're running a multi-region OLTP cluster where one slow link shouldn't stall the whole database, the Multi-Paxos trade-offs become attractive.
OceanBase provides different replica types so that different workloads can balance data reliability, performance scalability, availability, and cost.
Full-featured replicas are currently the most widely used replica type. They contain complete logs and business data. Only full-featured replicas are included in the Paxos member group and participate in quorum decisions. The system's core write path, leader election, and strong consistency guarantees all depend on how full-featured replicas are distributed.
However, simply adding more full-featured replicas to improve read performance is not always a good idea. Since full-featured replicas participate in Paxos voting, expanding the Paxos member group can increase write latency. To solve this problem, OceanBase introduces a class of non-voting observer replicas, including read-only replicas and columnstore replicas.
Read-only replicas store complete business data and catch up with the leader replica through asynchronous replication. They are suitable for workloads with relatively lower read consistency requirements. Columnstore replicas organize baseline data in columnar format and are designed for large-scale analytical queries. Because these observer replicas do not participate in quorum decisions, they can expand read capacity without adding consensus overhead to the core transaction path. This allows transaction processing (TP) and analytical processing (AP) workloads to be separated.
Replica placement also determines what multi-data-center disaster recovery costs. For example, in a three-data-center deployment across two regions, placing full-featured replicas in the third data center incurs additional storage cost and adds cross-region synchronization overhead.
To address this, OceanBase introduces the arbitration service.
Arbitration service nodes do not store business data and do not handle routine data replication tasks. As a result, they require very little compute, storage, or network resources.
Their value becomes especially important in failure scenarios. When a data-center-level failure causes some full-featured replicas to become unreachable and the system can no longer form a majority, arbitration nodes can provide the critical votes needed for the surviving replicas to re-form a majority and complete leader election. This allows the system to quickly restore write capability.
In this way, OceanBase does not need to deploy full data replicas in the third data center, yet it can still achieve cross-data-center disaster recovery and high availability at relatively low cost.
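The vote arithmetic can be sketched as follows. This models only the counting argument made above, under an assumed layout of two full-featured replicas in each of two data centers plus one arbitration node in the third; it does not model how OceanBase handles log durability or membership changes after the failure.

```python
def has_majority(votes_available, group_size):
    return votes_available > group_size // 2

def can_elect(full_alive, full_total, has_arbiter=False, arbiter_alive=False):
    """Count votes from surviving full replicas plus an optional arbiter."""
    group = full_total + (1 if has_arbiter else 0)
    votes = full_alive + (1 if has_arbiter and arbiter_alive else 0)
    return has_majority(votes, group)

# Assumed layout: 2 full replicas in DC1, 2 in DC2. DC1 fails entirely.
print(can_elect(full_alive=2, full_total=4))                # False
print(can_elect(full_alive=2, full_total=4,
                has_arbiter=True, arbiter_alive=True))      # True
```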
Surviving a failure with no data loss is half the job. The other half is recovering fast enough that users don't notice.
OceanBase can achieve automatic recovery within 8 seconds — that is, Recovery Time Objective (RTO) < 8s. This capability is supported by several mechanisms working together.
The leader replica is not statically assigned once and for all. OceanBase uses an election protocol to select a leader replica from multiple replicas, and uses a lease mechanism to ensure that only one node believes it is the leader at any given time. When the leader replica fails or a network partition occurs, the majority waits for the lease to expire before initiating a new election. This prevents the old leader from still being alive somewhere while a new leader is elected elsewhere, which would otherwise create a dual-leader situation.
With this mechanism, failure detection and failover can be completed within seconds, and the new leader can quickly take over service.
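The lease rule can be sketched in a few lines. The `Lease` class, the 4-second duration, and the timestamps are hypothetical values for illustration, and the sketch ignores clock skew between nodes, which a real implementation must account for.

```python
from dataclasses import dataclass

@dataclass
class Lease:
    holder: str
    granted_at: float
    duration: float

    def valid(self, now):
        """The holder may act as leader only while the lease is unexpired."""
        return now < self.granted_at + self.duration

    def renew(self, now):
        self.granted_at = now

def may_start_election(lease, now):
    """Followers wait out the old lease before electing a new leader."""
    return not lease.valid(now)

lease = Lease(holder="n1", granted_at=0.0, duration=4.0)
lease.renew(now=1.0)                        # last renewal before n1 crashes
print(may_start_election(lease, now=3.0))   # False: old leader may still be serving
print(may_start_election(lease, now=5.0))   # True: lease expired, safe to elect
```

The waiting period is the price of safety: a shorter lease speeds up failover but requires more frequent renewals.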
Here's what that looks like in practice when a leader fails:
Unlike traditional databases where the entire database may depend on a single primary node, OceanBase refines the granularity of multi-replica synchronization down to the log stream level. Primary-standby switching and leader election are both performed at the log stream layer.
When a single physical server goes down, only the log streams for which that machine served as the leader are affected. In other words, only part of the data is affected. The remaining data continues to serve requests normally, and leader elections for the affected log streams can proceed in parallel.
This fine-grained design prevents a single point of failure from spreading across the entire cluster. It is also a key reason why OceanBase can reduce RTO to the seconds level: the recovery scope is limited to a local area rather than the entire database instance.
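A small sketch makes the blast radius visible. The node names, the leader map, and the "first surviving replica wins" rule are all assumptions for illustration; the real election protocol is more involved.

```python
def affected_streams(leaders, failed_node):
    """Log streams whose leader lived on the failed node."""
    return sorted(ls for ls, node in leaders.items() if node == failed_node)

def fail_over(leaders, replicas, failed_node):
    """Re-elect only the affected streams; each election is independent."""
    new_leaders = dict(leaders)
    for ls in affected_streams(leaders, failed_node):
        survivors = [n for n in replicas[ls] if n != failed_node]
        new_leaders[ls] = survivors[0]  # hypothetical rule: first survivor wins
    return new_leaders

leaders = {"ls1": "n1", "ls2": "n2", "ls3": "n3", "ls4": "n1"}
replicas = {ls: ["n1", "n2", "n3"] for ls in leaders}

print(affected_streams(leaders, "n1"))  # ['ls1', 'ls4']
after = fail_over(leaders, replicas, "n1")
print(after["ls2"], after["ls3"])       # n2 n3 (untouched)
```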
Applications usually do not connect directly to backend database nodes, known as OBServers. Instead, they connect through OceanBase Database Proxy, also known as ODP or OBProxy. As OceanBase's access-layer component, OBProxy is responsible for request routing and load balancing. It routes requests to the correct node based on partition information and leader/standby status.
When a leader switch occurs, OBProxy discovers the new leader through feedback or periodic refreshes and updates its routing information. As a result, applications do not need to change configurations or restart. They can continue accessing the new leader, and the failure is largely transparent to users.
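The feedback path can be approximated with a toy proxy that keeps a routing cache and corrects it when a node reports that it is not the leader. Every class and method name here is hypothetical and does not reflect OBProxy's real interfaces; in a real outage the failed node would not answer at all, and the proxy would also rely on timeouts and periodic refresh.

```python
class NotLeader(Exception):
    """Hypothetical feedback: the contacted node is not the leader."""
    def __init__(self, new_leader):
        super().__init__(new_leader)
        self.new_leader = new_leader

class FakeCluster:
    def __init__(self, leaders):
        self.leaders = dict(leaders)  # log stream -> current leader node

    def send(self, node, stream, request):
        leader = self.leaders[stream]
        if node != leader:
            raise NotLeader(leader)
        return f"{request} handled by {node}"

class Proxy:
    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = dict(cluster.leaders)  # routing snapshot, may go stale

    def route(self, stream, request, max_retries=1):
        for _ in range(max_retries + 1):
            node = self.cache[stream]
            try:
                return self.cluster.send(node, stream, request)
            except NotLeader as feedback:
                self.cache[stream] = feedback.new_leader  # fix the stale entry
        raise RuntimeError("leader not found within retry budget")

cluster = FakeCluster({"ls1": "n1"})
proxy = Proxy(cluster)
cluster.leaders["ls1"] = "n2"          # failover happens behind the proxy
print(proxy.route("ls1", "SELECT 1"))  # SELECT 1 handled by n2
```

The application issues one call and gets one answer; the stale route and the retry stay inside the proxy.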
OceanBase's Root Service is the central control service of the cluster, and it is itself replicated using Multi-Paxos for high availability. Nodes report their status through heartbeats. If a node remains unreachable for an extended period, it is marked as failed and removed from the Multi-Paxos member group. Replacement replicas are then created on other healthy nodes to ensure that the majority remains intact.
This allows the system to recover automatically even from machine-level or data-center-level failures. After failover completes within seconds, OceanBase can quickly restore its high-availability and disaster recovery capabilities.
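The detect-and-replace loop can be sketched as below. The timeout value, timestamps, and helper names are placeholders for illustration, not OceanBase parameters, and the sketch skips the data copy needed to bring a replacement replica up to date.

```python
def failed_nodes(last_heartbeat, now, timeout):
    """Nodes whose most recent heartbeat is older than `timeout` seconds."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)

def replace_members(members, failed, spares):
    """Drop failed members and fill the gaps from healthy spare nodes."""
    healthy_spares = [s for s in spares if s not in failed]
    kept = [m for m in members if m not in failed]
    missing = len(members) - len(kept)
    return kept + healthy_spares[:missing]

last_heartbeat = {"n1": 100.0, "n2": 100.0, "n3": 40.0}
down = failed_nodes(last_heartbeat, now=101.0, timeout=30.0)
print(down)                                                  # ['n3']
print(replace_members(["n1", "n2", "n3"], down, ["n4"]))     # ['n1', 'n2', 'n4']
```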
The interesting part of OceanBase's HA story isn't any single mechanism — it's how the layers compose. Multi-Paxos alone gives you RPO = 0. Add log streams, and you get it at scale. Add fine-grained leader election, and a node failure becomes a local event instead of a cluster-wide outage. Add OBProxy, and the application never has to know.
If you want to see this in practice:
Reliability isn't built on the assumption that nothing fails. It's built on what the system does when something does.
