From High Availability to Business Continuity: What Global Systems Actually Need

Most enterprise teams have high availability configured. Far fewer have business continuity tested.

That gap matters more than it used to. Data sovereignty requirements are pushing workloads into regions that were once optional. Shared control planes mean a regional event can cascade further than the architecture diagram suggests. And the failure modes that actually threaten business continuity — power grid disruption, subsea cable damage, provider-level outages — sit outside the multi-AZ availability model entirely. The question is no longer whether your database is highly available. It's whether your business can continue when a region, a control plane, or a provider-level dependency goes down.

Four levels of failure

An availability architecture typically addresses up to four failure scopes. Most teams are well-covered at Levels 1 and 2. The gap between "highly available" and "business continuity" opens at Level 3.

| Level | Failure scope | What DR requires | CockroachDB | Spanner / AlloyDB | Aurora | OceanBase |
|---|---|---|---|---|---|---|
| 1 | Node | Redirect to surviving replica | Automatic (Raft) | Automatic (Paxos) | Automatic (storage-layer) | Automatic (Multi-Paxos) |
| 2 | Datacenter / AZ | Quorum across physically separated replicas | Multi-AZ default | Multi-AZ default | Multi-AZ default | Multi-AZ; 2F+1A option reduces storage cost |
| 3 | Region | Replicas in a second geographic region | Multi-region native (added write latency) | Synchronous replication (Google Cloud only) | Async replication; RPO>0 for unplanned failover | Synchronous or async replication |
| 4 | Cloud provider | Independent control plane and infrastructure | Cross-cloud for self-hosted clusters (user manages cross-cloud networking); managed cloud is single-provider per cluster | Google Cloud only | AWS only | Runs on 7 clouds with independent control plane; provider outage doesn't block failover |

At Level 3, the approaches diverge: Spanner and OceanBase offer synchronous multi-region with RPO=0; CockroachDB supports multi-region with configurable consistency; and Aurora Global Database provides async replication with RPO>0. The sharpest differentiation is at Level 4 — where control-plane independence and managed cross-cloud operations determine whether a provider-level outage disrupts your recovery.

The transitions that matter

  • Level 1→2 is about physical separation. Surviving a node is straightforward; surviving a datacenter requires replicas that don't share power, network, or cooling. All major distributed databases handle this well today.
  • Level 2→3 is where most "highly available" architectures stop. Three replicas in three AZs may be physically separate, but they typically share the provider's control plane, identity layer, quota mechanisms, and parts of the same network surface. During AWS's December 7, 2021 us-east-1 outage, an automated scaling activity triggered congestion on the networking devices connecting AWS's internal network to the main network. The outage cascaded through internal DNS, monitoring, and authorization services before impacting broader services across the region. Multi-AZ deployments that appeared independent on architecture diagrams shared the same blast radius.
  • Level 3→4 is about vendor independence. A second region on the same cloud still shares the provider's control plane, IAM, and quota systems. True provider-level resilience requires infrastructure that fails independently. CockroachDB supports multi-cloud deployments for self-hosted clusters (a single cluster spanning AWS, GCP, and Azure via user-managed networking). This provides data-layer resilience across providers. However, the managed CockroachDB Cloud service currently runs each cluster on a single provider. Spanner and Aurora are locked to their respective clouds. OceanBase's differentiation at this level is the combination of multi-cloud availability and an independent control plane — so that a provider-level outage doesn't prevent OceanBase Cloud from managing failover, because the control plane isn't hosted on the affected provider's infrastructure.
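The control-plane gap behind the Level 2→3 and Level 3→4 transitions can be audited concretely: group each replica by the control-plane services it depends on and flag anything shared between supposedly independent replicas. A minimal sketch, assuming a hand-maintained inventory (the replica and dependency names here are illustrative, not any provider's API):

```python
from collections import defaultdict

def shared_dependencies(replicas):
    """Map each control-plane dependency to the replicas that rely on it.

    `replicas` is {replica_name: set_of_dependency_ids}. Any dependency
    shared by two or more replicas is a common-mode failure candidate.
    """
    by_dep = defaultdict(set)
    for name, deps in replicas.items():
        for dep in deps:
            by_dep[dep].add(name)
    return {dep: names for dep, names in by_dep.items() if len(names) > 1}

# Three "independent" AZ replicas that all sit behind one IAM and one DNS zone.
inventory = {
    "replica-az1": {"aws-iam", "route53-zone-A", "az1-power"},
    "replica-az2": {"aws-iam", "route53-zone-A", "az2-power"},
    "replica-az3": {"aws-iam", "route53-zone-A", "az3-power"},
}
risks = shared_dependencies(inventory)
# flags "aws-iam" and "route53-zone-A" as shared across all three replicas
```

Run against a real inventory, an empty result is the evidence that the diagram's independence claim holds; a non-empty one is your Level 3 (or Level 4) exposure list.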

Why level 2 doesn't cover level 3

Three gaps recur when teams honestly audit their DR posture:

Fault isolation is weaker than the diagram suggests. Control planes, IAM dependencies, rate limits, and service orchestration layers aren't always isolated the way application teams assume. When the provider has a bad enough day, "independence" between AZs becomes less real than it looks on the architecture slide.

"Can fail over" and "will fail over cleanly" are different things. A database cutover touches DNS, connection routing, TLS certificates, IP allowlists, secrets, dependency configuration, application reconnect behavior, and operational authority. A diagram that looks clean on paper can still become a multi-hour incident if any one of those steps fails under real pressure.

Runbooks decay faster than teams expect. Infrastructure changes. Teams rotate. Ownership transfers. A DR plan written against last year's topology, tested once during an off-peak weekend, and never exercised again is not a current capability. It's a historical artifact.

Continuity can't be measured by architecture alone. It has to be measured by drills.

How OceanBase addresses DR for each level

OceanBase's availability architecture was designed for Levels 1–4 from the start, not added as an afterthought. Here's how each level works — and where the current trade-offs are.

  • Level 1 (Node failure): When a single node fails, Multi-Paxos consensus ensures the remaining replicas continue serving requests. The leader election happens automatically — RTO under 8 seconds.

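On the application side, the practical Level 1 question is whether clients ride out the election window. A minimal reconnect sketch with capped exponential backoff, sized so the total retry budget comfortably covers a sub-8-second leader election (the timings are assumptions, not OceanBase client defaults):

```python
def backoff_schedule(base=0.25, cap=2.0, budget=10.0):
    """Return sleep intervals (seconds) for reconnect attempts.

    Exponential backoff capped at `cap`, stopping once cumulative wait
    would exceed `budget`. A 10 s budget covers a sub-8 s leader election.
    """
    delay, total, schedule = base, 0.0, []
    while total + delay <= budget:
        schedule.append(delay)
        total += delay
        delay = min(delay * 2, cap)
    return schedule

waits = backoff_schedule()
# cumulative wait exceeds the 8 s election window but stays within budget
```

A client that gives up after two quick retries turns an 8-second database event into an application outage; the backoff schedule is what translates the database's RTO into end-to-end continuity.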

  • Level 2 (AZ failure): The 2F+1A topology (two full replicas plus one arbiter) spans three availability zones. If an entire AZ goes down, consensus is preserved across the surviving zones — no data loss, automatic recovery.

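The 2F+1A arithmetic is easy to check: two full replicas plus one log-only arbiter give three Paxos voters, so losing any single zone leaves a 2-of-3 majority, and data survives as long as at least one full replica does. A small sketch of that reasoning (the zone names and role labels are illustrative):

```python
def survives_zone_loss(zones, lost):
    """Check consensus and durability after losing one zone.

    `zones` maps zone -> role: "full" holds data and votes;
    "arbiter" votes on the log but stores no user data.
    Consensus needs a majority of voters; durability needs >=1 full replica.
    """
    survivors = {z: r for z, r in zones.items() if z != lost}
    majority = len(zones) // 2 + 1
    has_data = any(r == "full" for r in survivors.values())
    return len(survivors) >= majority and has_data

topology = {"az-1": "full", "az-2": "full", "az-3": "arbiter"}
ok = all(survives_zone_loss(topology, z) for z in topology)
# True: losing any one of the three zones preserves both quorum and data
```

The cost saving is visible in the same model: the arbiter contributes a vote without a third full copy of the data, which is where the 2F+1A storage reduction comes from.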

  • Level 3 (Region failure): OceanBase supports both synchronous and asynchronous cross-region replication. With synchronous replication, committed transactions are guaranteed to exist in both regions before acknowledgment — RPO=0 under normal network conditions. The trade-off: synchronous cross-region replication adds write latency proportional to the network round-trip between regions.

[Diagram: cold standby (async replication, lower cost)]

[Diagram: warm standby (sync replication, RPO=0)]


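The Level 3 trade-off can be put in numbers: synchronous replication adds roughly one cross-region round trip to every commit, while asynchronous replication keeps local commit latency but leaves an RPO equal to the replication lag. A back-of-envelope sketch (the latency figures are illustrative assumptions, not measurements):

```python
def commit_latency_ms(local_ms, rtt_ms, mode):
    """Estimate commit latency for cross-region replication.

    Sync: the commit waits for the remote region's ack, adding ~1 RTT.
    Async: the commit returns locally; the standby trails by its lag.
    """
    if mode == "sync":
        return local_ms + rtt_ms
    return local_ms

local, rtt = 2.0, 60.0  # e.g. 2 ms local commit, 60 ms inter-region RTT
sync_latency = commit_latency_ms(local, rtt, "sync")    # RPO = 0
async_latency = commit_latency_ms(local, rtt, "async")  # RPO = replication lag
```

With those assumed numbers the synchronous path turns a 2 ms commit into roughly 62 ms, which is why the choice between warm and cold standby is a per-workload decision rather than a global default.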
  • Level 4 (Cloud provider failure): OceanBase Cloud runs on seven major public clouds with an independent control plane. Cross-cloud standby configurations replicate data via OceanBase Migration Service (OMS) with near-real-time sync. If one cloud provider experiences a control-plane outage, OceanBase Cloud can still manage failover to a standby cluster on another provider.


The architecture isn't tied to a specific cloud's replication primitives. This is what makes cross-cloud DR possible without application changes.
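For a Level 4 cold standby, the operational signal to watch is replication lag on the cross-cloud link, since that lag is your effective RPO if the primary's provider goes dark. A minimal classification sketch (the lag readings, standby names, and thresholds are assumptions, not an OMS API):

```python
def standby_readiness(lag_seconds, rpo_target_seconds):
    """Classify a cross-cloud standby by replication lag vs the RPO target."""
    if lag_seconds <= rpo_target_seconds:
        return "ready"      # failing over now would meet the stated RPO
    if lag_seconds <= 2 * rpo_target_seconds:
        return "degraded"   # failover possible, RPO breach likely
    return "stale"          # escalate before relying on this standby

# Illustrative lag readings (seconds) from standbys on other providers.
readings = {"standby-gcp": 4.0, "standby-azure": 12.0, "standby-ali": 45.0}
status = {name: standby_readiness(lag, rpo_target_seconds=10.0)
          for name, lag in readings.items()}
# ready / degraded / stale respectively for the three readings above
```

Whatever the thresholds, the value is in alerting on lag continuously rather than discovering it during the failover itself.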

Where to start

If you're making a platform decision — not reacting to a single incident — this sequence works:

Make cross-region real first. Pick the workloads that matter. Define RTO and RPO in writing. Run drills until you can hit those numbers consistently. This is Level 3 coverage, and it addresses the most common gap.

Add cross-cloud cold standby as the vendor backstop. This is the lowest-cost way to eliminate the total-loss scenario — Level 4 coverage at minimal operational overhead.

Upgrade only the systems that justify tighter coverage. Move truly critical services to warm standby or cross-cloud primary/standby when the business value justifies the cost and latency trade-offs.

The most common failure pattern is skipping step one — jumping straight to "multi-cloud" and then discovering that the cutover process is still manual, still brittle, and still unproven.

Five things to verify before your next DR review

Before the next review, confirm these with evidence, not assumptions:

  1. Backup restorability. When did you last restore from backup under production-like load? How long did it take?
  2. Drill recency. When was the last full end-to-end DR drill? Do you have timestamps, logs, and outcomes?
  3. Measured RTO/RPO. What are your actual measured numbers from that drill — not your target numbers, your real ones?
  4. Control-plane dependency map. Which of your "independent" replicas share IAM, DNS, or orchestration dependencies with the primary?
  5. Failover authority. Is there a documented, tested decision path for who triggers cutover and who owns rollback?

Two governance principles matter here. First, define when the clock starts and what "recovered" actually means for RTO/RPO measurement — teams frequently disagree on this during an actual incident. Second, require proof. If you can't produce drill logs from the last end-to-end exercise, you don't have a tested recovery objective. You have confidence without evidence.
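Both governance principles reduce to agreeing, in writing, on which timestamps bound the measurement. A sketch that computes measured RTO and RPO from a drill's event log, with the clock-start and "recovered" definitions made explicit (the event names and timestamps are illustrative):

```python
def measure_rto_rpo(events):
    """Compute measured RTO and RPO (seconds) from drill event timestamps.

    Conventions made explicit up front:
      - the RTO clock starts at "failure_detected", not at cutover start;
      - "recovered" means "service_verified" (traffic confirmed healthy),
        not merely "standby_promoted";
      - RPO is the gap between the last replicated commit and the failure.
    """
    rto = events["service_verified"] - events["failure_detected"]
    rpo = events["failure_detected"] - events["last_replicated_commit"]
    return rto, rpo

# Timestamps as seconds from drill start (an illustrative drill log).
drill = {
    "last_replicated_commit": 95.0,
    "failure_detected": 100.0,
    "standby_promoted": 160.0,
    "service_verified": 340.0,
}
rto, rpo = measure_rto_rpo(drill)
# measured RTO counts to verified service, not to promotion; RPO is 5 s here
```

Note how the measured RTO is four times the promotion time in this example; teams that stop the clock at "standby promoted" will report numbers the business never experiences.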



What's next

OceanBase Cloud supports cross-region and cross-cloud DR — from cold standby to full primary/standby with transparent failover. Create your OceanBase cluster now to test your cutover assumptions against real infrastructure.

This is the first in a six-part series on multi-cloud disaster recovery. Stay tuned as we dive deeper into the multi-cloud high availability capabilities of OceanBase.
