Zero Data Loss, Fast Failover: How OceanBase Uses Multi-Paxos for High Availability

Yiwen Qian
Yiwen Qian
Published on June 18, 2026Updated on 2026-06-19
8 minute read
Key Takeaways
  • OceanBase uses Multi-Paxos to guarantee zero data loss (RPO = 0): a transaction is acknowledged only after a majority of replicas have persisted the commit log, so no single-machine — or even single-data-center — failure can lose committed data.
  • Log streams, non-voting replica types (read-only, columnstore), and a lightweight arbitration service let OceanBase scale reads and survive data-center loss without paying the full cost of replicating data everywhere.
  • Lease-based leader election at the log-stream level, combined with OBProxy's transparent rerouting and automatic replica rebuild, brings recovery time under 8 seconds (RTO < 8s) — without application-side changes.

It's 3 a.m. A node in your database cluster goes dark. The on-call engineer has two questions, in this order: Did we lose any committed transactions? And how long until the system is back?

For a distributed database, the honest answer to both questions depends on one thing: how the system reaches agreement when machines fail. OceanBase's answer is Multi-Paxos — a consensus protocol that, combined with log streams, replica types, and a lease-based election mechanism, lets the database guarantee zero data loss (RPO = 0) and recover in under 8 seconds (RTO < 8s) without human intervention.

This post walks through how those guarantees are actually built — starting from the consensus problem, then the protocol choices, the replica topology, and finally the failover path.

The High-Availability Stack at a Glance

Before diving in, here's the layered design this post will unpack:

LayerMechanismWhat It Buys You
ConsensusMulti-Paxos with quorum commitRPO = 0 against single-node and single-DC failure
Replication unitLog streams (multi-partition shared replication)Lower replication overhead at scale
Replica topologyFull + read-only + columnstore + arbitrationRead scalability and cost-efficient cross-DC DR
FailoverLease-based election at log-stream granularityRTO < 8s, partial failure stays partial
Client routingOBProxy with leader feedbackFailover transparent to applications

Each layer below addresses a specific failure mode — we'll work through them in order.

1. What Is Paxos, and How Does It Help OceanBase Prevent Data Loss?

Paxos solves a deceptively simple problem: a group of machines need to agree on a value, even when some of them are slow, unreachable, or returning wrong answers. A useful mental model is a roomful of people trying to agree on one outcome — except some people may arrive late, go offline, or even give incorrect responses. The system still has to reach one decision, and that decision must not be overturned later.

That is the problem distributed consensus is designed to solve. The Paxos consensus protocol, proposed by Turing Award winner Leslie Lamport, is a classic solution to this problem. Its core idea is simple but powerful: a decision is considered valid only after it has been accepted by a majority of participants, also known as a quorum.

Why a majority? Because mathematically, any two majorities must have at least one member in common. In other words, the system cannot end up in a permanent split where "one half says A" and "the other half says B." This prevents split-brain at the root — a dangerous scenario in which two replicas both believe they are the leader and independently accept write requests at the same time.

In OceanBase, data is replicated and distributed across different physical machines, or nodes, in the form of replicas. Database modifications generate commit logs. These logs must not only be written to the current leader replica, but also be durably persisted on a majority of replicas before a successful commit response can be returned to the client.

So even if a single disk fails, or even if the physical node where the leader replica resides goes down, as long as a majority of replicas are still alive on other machines, the data has already been accepted by a quorum and persisted on more than one machine. It will not be lost because of a single-machine failure.

This is an important reason why OceanBase can achieve zero data loss in disaster recovery scenarios — that is, Recovery Point Objective (RPO) = 0. Its reliability is not built on the assumption that one special machine will never fail. Instead, it is built on the majority mechanism: once a change is accepted by a quorum, it becomes part of the durable history of the system.

There's a scaling problem, though. If every partition ran its own independent Paxos instance, a busy cluster would generate thousands of separate log synchronizations and quorum acknowledgments per second — each with its own RPC round trip. The network and coordination overhead would grow quickly.

OceanBase collapses that traffic with log streams — an abstraction that groups many partitions under a single replication channel. Partitions sharing a log stream share one replica group, one leader, and one synchronization path. Changes from different partitions flow along the same log replication path, which reduces replication overhead and improves overall throughput.

On top of this, OceanBase uses the Multi-Paxos optimization: classic Paxos requires two RPC rounds per log entry, but once a stable leader is elected, steady-state log replication needs only one quorum acknowledgment. The result — majority-based durability without the latency cost most people associate with consensus protocols.

2. How Multi-Paxos Compares with Strong Primary-Standby Sync and Raft

Traditional primary-standby architectures often use strong synchronous replication to avoid data loss. In this model, the primary database must wait until the standby database has also persisted the log before returning success to the application.

This ensures that the standby has complete logs if the primary fails. But the cost is also clear: if the primary, the standby, or the network between them has a problem, the primary may become stalled or even unavailable. In other words, the system is often forced into a difficult trade-off between data reliability and service availability. Achieving both at the same time is hard.

Multi-Paxos takes a different approach. Its multi-replica model — typically three or five replicas — is naturally based on quorum decisions. As long as a majority of replicas are alive and can communicate with one another, the system can continue accepting writes and reaching consensus.

Therefore, when a minority of replicas fail — for example, one replica fails in a three-replica deployment — the system can still prevent data loss while keeping the service available. This is difficult for a strongly synchronized primary-standby architecture to achieve cleanly.

Multi-Paxos and Raft share the same goal — majority-based agreement — but optimize for different things.

Raft optimizes for understandability and operational simplicity. Logs are strictly continuous: entry N+1 cannot commit before entry N. This makes the protocol easier to reason about, easier to implement correctly, and easier to debug. For most use cases, that's the right trade-off, and it's why Raft dominates newer systems.

OceanBase chose Multi-Paxos for a specific reason: large-scale, latency-sensitive, multi-region deployment. Two properties matter here:

  • Out-of-order commit. Multi-Paxos allows gaps in the log and acknowledges entries independently. A brief network hiccup on one log entry does not stall every entry behind it. In a Raft system, the same hiccup can cause head-of-line blocking on the entire log stream.
  • Lower heartbeat overhead at scale. Raft's per-leader heartbeats grow with cluster size. Combined with log streams, OceanBase reduces the per-replica chatter, which matters in deployments with hundreds of nodes spread across regions.

Neither choice is universally better. If you're building a control plane with a handful of nodes, Raft is probably the right call. If you're running a multi-region OLTP cluster where one slow link shouldn't stall the whole database, the Multi-Paxos trade-offs become attractive.

3. Replica Selection: Balancing Availability, Performance, and Cost

OceanBase provides different replica types so that different workloads can balance data reliability, performance scalability, availability, and cost.

Full-featured replicas are currently the most widely used replica type. They contain complete logs and business data. Only full-featured replicas are included in the Paxos member group and participate in quorum decisions. The system's core write path, leader election, and strong consistency guarantees all depend on how full-featured replicas are distributed.

However, simply adding more full-featured replicas to improve read performance is not always a good idea. Since full-featured replicas participate in Paxos voting, expanding the Paxos member group can increase write latency. To solve this problem, OceanBase introduces a class of non-voting observer replicas, including read-only replicas and columnstore replicas.

Read-only replicas store complete business data and catch up with the leader replica through asynchronous replication. They are suitable for workloads with relatively lower read consistency requirements. Columnstore replicas organize baseline data in columnar format and are designed for large-scale analytical queries. Because these observer replicas do not participate in quorum decisions, they can expand read capacity without adding consensus overhead to the core transaction path. This allows transaction processing (TP) and analytical processing (AP) workloads to be separated.

Multi-data-center deployment can also reduce disaster recovery costs while ensuring data reliability. For example, in a three-data-center deployment across two regions, deploying full-featured replicas in the third data center would not only incur additional storage costs, but also introduce more cross-region synchronization overhead.

To address this, OceanBase introduces the arbitration service.

Arbitration service nodes do not store business data and do not handle routine data replication tasks. As a result, they require very little compute, storage, or network resources.

Their value becomes especially important in failure scenarios. When a data-center-level failure causes some full-featured replicas to become unreachable and the system can no longer form a majority, arbitration nodes can provide the critical votes needed for the surviving replicas to re-form a majority and complete leader election. This allows the system to quickly restore write capability.

In this way, OceanBase does not need to deploy full data replicas in the third data center, yet it can still achieve cross-data-center disaster recovery and high availability at relatively low cost.

4. Leader Election and Failure Recovery Mechanisms

Surviving a failure with no data loss is half the job. The other half is recovering fast enough that users don't notice.

OceanBase can achieve automatic recovery within 8 seconds — that is, Recovery Time Objective (RTO) < 8s. This capability is supported by several mechanisms working together.

  • Automatic Election and Split-Brain Prevention: The Lease Mechanism

The leader replica is not statically assigned once and for all. OceanBase uses an election protocol to select a leader replica from multiple replicas, and uses a lease mechanism to ensure that only one node believes it is the leader at any given time. When the leader replica fails or a network partition occurs, the majority waits for the lease to expire before initiating a new election. This prevents the old leader from still being alive somewhere while a new leader is elected elsewhere, which would otherwise create a dual-leader situation.

With this mechanism, failure detection and failover can be completed within seconds, and the new leader can quickly take over service.

Here's what that looks like in practice when a leader fails:

  • Fine-Grained Leader Switching: Local Failures Do Not Bring Down the Entire Database

Unlike traditional databases where the entire database may depend on a single primary node, OceanBase refines the granularity of multi-replica synchronization down to the log stream level. Primary-standby switching and leader election are both performed at the log stream layer.

When a single physical server goes down, only the log streams for which that machine served as the leader are affected. In other words, only part of the data is affected. The remaining data continues to serve requests normally, and leader elections for the affected log streams can proceed in parallel.

This fine-grained design prevents a single point of failure from spreading across the entire cluster. It is also a key reason why OceanBase can reduce RTO to the seconds level: the recovery scope is limited to a local area rather than the entire database instance.

  • Application-Layer Transparency: Smart Proxy Routing to the New Leader

Applications usually do not connect directly to backend database nodes, known as OBServers. Instead, they connect through OceanBase Database Proxy, also known as ODP or OBProxy. As OceanBase's access-layer component, OBProxy is responsible for request routing and load balancing. It routes requests to the correct node based on partition information and leader/standby status.

When a leader switch occurs, OBProxy discovers the new leader through feedback or periodic refreshes and updates its routing information. As a result, applications do not need to change configurations or restart. They can continue accessing the new leader, and the failure is largely transparent to users.

  • Automatic Repair After Failure: Removing Faulty Nodes and Rebuilding Replicas

OceanBase's Root Service is the central control service of the cluster, and it is itself replicated using Multi-Paxos for high availability. Nodes report their status through heartbeats. If a node remains unreachable for an extended period, it is marked as failed and removed from the Multi-Paxos member group. Replacement replicas are then created on other healthy nodes to ensure that the majority remains intact.

This allows the system to recover automatically even from machine-level or data-center-level failures. After failover completes within seconds, OceanBase can quickly restore its high-availability and disaster recovery capabilities.

What to Take Away

The interesting part of OceanBase's HA story isn't any single mechanism — it's how the layers compose. Multi-Paxos alone gives you RPO = 0. Add log streams, and you get it at scale. Add fine-grained leader election, and a node failure becomes a local event instead of a cluster-wide outage. Add OBProxy, and the application never has to know.

If you want to see this in practice:

  • Read the failover walkthrough in the OceanBase HA documentation for the exact sequence of events during a leader switch.
  • Try it yourself — spin up a 3-replica cluster with OBD, kill the leader node, and watch RTO with obclient's connection-retry timing.
  • Go deeper on log streams in our follow-up post on partition-to-log-stream mapping (coming soon).

Reliability isn't built on the assumption that nothing fails. It's built on what the system does when something does.


Share
X
linkedin
mail