From High Availability to Business Continuity: What Global Systems Actually Need

Most enterprise teams have high availability configured. Far fewer have business continuity tested.

That gap matters more than it used to. Data sovereignty requirements are pushing workloads into regions that were once optional. Shared control planes mean a regional event can cascade further than the architecture diagram suggests. And the failure modes that actually threaten business continuity — power grid disruption, subsea cable damage, provider-level outages — sit outside the multi-AZ availability model entirely. The question is no longer whether your database is highly available. It's whether your business can continue when a region, a control plane, or a provider-level dependency goes down.

Four levels of failure

An availability architecture typically addresses up to four failure scopes. Most teams are well-covered at Levels 1 and 2. The gap between "highly available" and "business continuity" opens at Level 3.

| Level | Failure scope | What DR requires | CockroachDB | Spanner / AlloyDB | Aurora | OceanBase |
|---|---|---|---|---|---|---|
| 1 | Node | Redirect to surviving replica | Automatic (Raft) | Automatic (Paxos) | Automatic (storage-layer) | Automatic (Multi-Paxos) |
| 2 | Datacenter / AZ | Quorum across physically separated replicas | Multi-AZ default | Multi-AZ default | Multi-AZ default | Multi-AZ; 2F+1A option reduces storage cost |
| 3 | Region | Replicas in a second geographic region | Multi-region native (added write latency) | Synchronous replication (Google Cloud only) | Async replication; RPO>0 for unplanned failover | Synchronous or async replication |
| 4 | Cloud provider | Independent control plane and infrastructure | Cross-cloud for self-hosted clusters (user manages cross-cloud networking); managed cloud is single-provider per cluster | Google Cloud only | AWS only | Runs on 7 clouds with independent control plane; provider outage doesn't block failover |

At Level 3, the approaches diverge: Spanner and OceanBase offer synchronous multi-region with RPO=0; CockroachDB supports multi-region with configurable consistency; and Aurora Global Database provides async replication with RPO>0. The sharpest differentiation is at Level 4 — where control-plane independence and managed cross-cloud operations determine whether a provider-level outage disrupts your recovery.

The transitions that matter

  • Level 1→2 is about physical separation. Surviving a node is straightforward; surviving a datacenter requires replicas that don't share power, network, or cooling. All major distributed databases handle this well today.
  • Level 2→3 is where most "highly available" architectures stop. Three replicas in three AZs may be physically separate, but they typically share the provider's control plane, identity layer, quota mechanisms, and parts of the same network surface. During AWS's December 7, 2021 us-east-1 outage, an automated scaling activity triggered congestion on the networking devices connecting AWS's internal network to the main network. The outage cascaded through internal DNS, monitoring, and authorization services before impacting broader services across the region. Multi-AZ deployments that appeared independent on architecture diagrams shared the same blast radius.
  • Level 3→4 is about vendor independence. A second region on the same cloud still shares the provider's control plane, IAM, and quota systems. True provider-level resilience requires infrastructure that fails independently. CockroachDB supports multi-cloud deployments for self-hosted clusters (a single cluster spanning AWS, GCP, and Azure via user-managed networking). This provides data-layer resilience across providers. However, the managed CockroachDB Cloud service currently runs each cluster on a single provider. Spanner and Aurora are locked to their respective clouds. OceanBase's differentiation at this level is the combination of multi-cloud availability and an independent control plane — so that a provider-level outage doesn't prevent OceanBase Cloud from managing failover, because the control plane isn't hosted on the affected provider's infrastructure.
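The control-plane gap behind the Level 2→3 and Level 3→4 transitions can be audited concretely: group each replica by the control-plane services it depends on and flag anything shared between supposedly independent replicas. A minimal sketch, assuming a hand-maintained inventory (the replica and dependency names here are illustrative, not any provider's API):

```python
from collections import defaultdict

def shared_dependencies(replicas):
    """Map each control-plane dependency to the replicas that rely on it.

    `replicas` is {replica_name: set_of_dependency_ids}. Any dependency
    shared by two or more replicas is a common-mode failure candidate.
    """
    by_dep = defaultdict(set)
    for name, deps in replicas.items():
        for dep in deps:
            by_dep[dep].add(name)
    return {dep: names for dep, names in by_dep.items() if len(names) > 1}

# Three "independent" AZ replicas that all sit behind one IAM and one DNS zone.
inventory = {
    "replica-az1": {"aws-iam", "route53-zone-A", "az1-power"},
    "replica-az2": {"aws-iam", "route53-zone-A", "az2-power"},
    "replica-az3": {"aws-iam", "route53-zone-A", "az3-power"},
}
risks = shared_dependencies(inventory)
# flags "aws-iam" and "route53-zone-A" as shared across all three replicas
```

Run against a real inventory, an empty result is the evidence that the diagram's independence claim holds; a non-empty one is your Level 3 (or Level 4) exposure list.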

Why level 2 doesn't cover level 3

Three gaps recur when teams honestly audit their DR posture:

Fault isolation is weaker than the diagram suggests. Control planes, IAM dependencies, rate limits, and service orchestration layers aren't always isolated the way application teams assume. When the provider has a bad enough day, "independence" between AZs becomes less real than it looks on the architecture slide.

"Can fail over" and "will fail over cleanly" are different things. A database cutover touches DNS, connection routing, TLS certificates, IP allowlists, secrets, dependency configuration, application reconnect behavior, and operational authority. A diagram that looks clean on paper can still become a multi-hour incident if any one of those steps fails under real pressure.

Runbooks decay faster than teams expect. Infrastructure changes. Teams rotate. Ownership transfers. A DR plan written against last year's topology, tested once during an off-peak weekend, and never exercised again is not a current capability. It's a historical artifact.

Continuity can't be measured by architecture alone. It has to be measured by drills.

How OceanBase addresses DR for each level

OceanBase's availability architecture was designed for Levels 1–4 from the start, not added as an afterthought. Here's how each level works — and where the current trade-offs are.

  • Level 1 (Node failure): When a single node fails, Multi-Paxos consensus ensures the remaining replicas continue serving requests. The leader election happens automatically — RTO under 8 seconds.

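On the application side, the practical Level 1 question is whether clients ride out the election window. A minimal reconnect sketch with capped exponential backoff, sized so the total retry budget comfortably covers a sub-8-second leader election (the timings are assumptions, not OceanBase client defaults):

```python
def backoff_schedule(base=0.25, cap=2.0, budget=10.0):
    """Return sleep intervals (seconds) for reconnect attempts.

    Exponential backoff capped at `cap`, stopping once cumulative wait
    would exceed `budget`. A 10 s budget covers a sub-8 s leader election.
    """
    delay, total, schedule = base, 0.0, []
    while total + delay <= budget:
        schedule.append(delay)
        total += delay
        delay = min(delay * 2, cap)
    return schedule

waits = backoff_schedule()
# cumulative wait exceeds the 8 s election window but stays within budget
```

A client that gives up after two quick retries turns an 8-second database event into an application outage; the backoff schedule is what translates the database's RTO into end-to-end continuity.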

  • Level 2 (AZ failure): The 2F+1A topology (two full replicas plus one arbiter) spans three availability zones. If an entire AZ goes down, consensus is preserved across the surviving zones — no data loss, automatic recovery.

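The 2F+1A arithmetic is easy to check: two full replicas plus one log-only arbiter give three Paxos voters, so losing any single zone leaves a 2-of-3 majority, and data survives as long as at least one full replica does. A small sketch of that reasoning (the zone names and role labels are illustrative):

```python
def survives_zone_loss(zones, lost):
    """Check consensus and durability after losing one zone.

    `zones` maps zone -> role: "full" holds data and votes;
    "arbiter" votes on the log but stores no user data.
    Consensus needs a majority of voters; durability needs >=1 full replica.
    """
    survivors = {z: r for z, r in zones.items() if z != lost}
    majority = len(zones) // 2 + 1
    has_data = any(r == "full" for r in survivors.values())
    return len(survivors) >= majority and has_data

topology = {"az-1": "full", "az-2": "full", "az-3": "arbiter"}
ok = all(survives_zone_loss(topology, z) for z in topology)
# True: losing any one of the three zones preserves both quorum and data
```

The cost saving is visible in the same model: the arbiter contributes a vote without a third full copy of the data, which is where the 2F+1A storage reduction comes from.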

  • Level 3 (Region failure): OceanBase supports both synchronous and asynchronous cross-region replication. With synchronous replication, committed transactions are guaranteed to exist in both regions before acknowledgment — RPO=0 under normal network conditions. The trade-off: synchronous cross-region replication adds write latency proportional to the network round-trip between regions.

[Diagram: cold standby (async replication, lower cost)]

[Diagram: warm standby (sync replication, RPO=0)]


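The Level 3 trade-off can be put in numbers: synchronous replication adds roughly one cross-region round trip to every commit, while asynchronous replication keeps local commit latency but leaves an RPO equal to the replication lag. A back-of-envelope sketch (the latency figures are illustrative assumptions, not measurements):

```python
def commit_latency_ms(local_ms, rtt_ms, mode):
    """Estimate commit latency for cross-region replication.

    Sync: the commit waits for the remote region's ack, adding ~1 RTT.
    Async: the commit returns locally; the standby trails by its lag.
    """
    if mode == "sync":
        return local_ms + rtt_ms
    return local_ms

local, rtt = 2.0, 60.0  # e.g. 2 ms local commit, 60 ms inter-region RTT
sync_latency = commit_latency_ms(local, rtt, "sync")    # RPO = 0
async_latency = commit_latency_ms(local, rtt, "async")  # RPO = replication lag
```

With those assumed numbers the synchronous path turns a 2 ms commit into roughly 62 ms, which is why the choice between warm and cold standby is a per-workload decision rather than a global default.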
  • Level 4 (Cloud provider failure): OceanBase Cloud runs on seven major public clouds with an independent control plane. Cross-cloud standby configurations replicate data via OceanBase Migration Service (OMS) with near-real-time sync. If one cloud provider experiences a control-plane outage, OceanBase Cloud can still manage failover to a standby cluster on another provider.


The architecture isn't tied to a specific cloud's replication primitives. This is what makes cross-cloud DR possible without application changes.
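For a Level 4 cold standby, the operational signal to watch is replication lag on the cross-cloud link, since that lag is your effective RPO if the primary's provider goes dark. A minimal classification sketch (the lag readings, standby names, and thresholds are assumptions, not an OMS API):

```python
def standby_readiness(lag_seconds, rpo_target_seconds):
    """Classify a cross-cloud standby by replication lag vs the RPO target."""
    if lag_seconds <= rpo_target_seconds:
        return "ready"      # failing over now would meet the stated RPO
    if lag_seconds <= 2 * rpo_target_seconds:
        return "degraded"   # failover possible, RPO breach likely
    return "stale"          # escalate before relying on this standby

# Illustrative lag readings (seconds) from standbys on other providers.
readings = {"standby-gcp": 4.0, "standby-azure": 12.0, "standby-ali": 45.0}
status = {name: standby_readiness(lag, rpo_target_seconds=10.0)
          for name, lag in readings.items()}
# ready / degraded / stale respectively for the three readings above
```

Whatever the thresholds, the value is in alerting on lag continuously rather than discovering it during the failover itself.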

Where to start

If you're making a platform decision — not reacting to a single incident — this sequence works:

Make cross-region real first. Pick the workloads that matter. Define RTO and RPO in writing. Run drills until you can hit those numbers consistently. This is Level 3 coverage, and it addresses the most common gap.

Add cross-cloud cold standby as the vendor backstop. This is the lowest-cost way to eliminate the total-loss scenario — Level 4 coverage at minimal operational overhead.

Upgrade only the systems that justify tighter coverage. Move truly critical services to warm standby or cross-cloud primary/standby when the business value justifies the cost and latency trade-offs.

The most common failure pattern is skipping step one — jumping straight to "multi-cloud" and then discovering that the cutover process is still manual, still brittle, and still unproven.

Five things to verify before your next DR review

Before the next review, confirm these with evidence, not assumptions:

  1. Backup restorability. When did you last restore from backup under production-like load? How long did it take?
  2. Drill recency. When was the last full end-to-end DR drill? Do you have timestamps, logs, and outcomes?
  3. Measured RTO/RPO. What are your actual measured numbers from that drill — not your target numbers, your real ones?
  4. Control-plane dependency map. Which of your "independent" replicas share IAM, DNS, or orchestration dependencies with the primary?
  5. Failover authority. Is there a documented, tested decision path for who triggers cutover and who owns rollback?

Two governance principles matter here. First, define when the clock starts and what "recovered" actually means for RTO/RPO measurement — teams frequently disagree on this during an actual incident. Second, require proof. If you can't produce drill logs from the last end-to-end exercise, you don't have a tested recovery objective. You have confidence without evidence.
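Both governance principles reduce to agreeing, in writing, on which timestamps bound the measurement. A sketch that computes measured RTO and RPO from a drill's event log, with the clock-start and "recovered" definitions made explicit (the event names and timestamps are illustrative):

```python
def measure_rto_rpo(events):
    """Compute measured RTO and RPO (seconds) from drill event timestamps.

    Conventions made explicit up front:
      - the RTO clock starts at "failure_detected", not at cutover start;
      - "recovered" means "service_verified" (traffic confirmed healthy),
        not merely "standby_promoted";
      - RPO is the gap between the last replicated commit and the failure.
    """
    rto = events["service_verified"] - events["failure_detected"]
    rpo = events["failure_detected"] - events["last_replicated_commit"]
    return rto, rpo

# Timestamps as seconds from drill start (an illustrative drill log).
drill = {
    "last_replicated_commit": 95.0,
    "failure_detected": 100.0,
    "standby_promoted": 160.0,
    "service_verified": 340.0,
}
rto, rpo = measure_rto_rpo(drill)
# measured RTO counts to verified service, not to promotion; RPO is 5 s here
```

Note how the measured RTO is four times the promotion time in this example; teams that stop the clock at "standby promoted" will report numbers the business never experiences.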



What's next

OceanBase Cloud supports cross-region and cross-cloud DR — from cold standby to full primary/standby with transparent failover. Create your OceanBase cluster now to test your cutover assumptions against real infrastructure.

This is the first in a six-part series on multi-cloud disaster recovery. Stay tuned as we dive deeper into the multi-cloud high availability capabilities of OceanBase.
