Overview|V4.3.1| docs|Distributed Database

Overview

Last Updated：2024-06-28 05:30:30 Updated

The architecture and implementation mode of OceanBase Database ensures strong consistency, integrity, and high availability (HA) of data.

Architecture

The data service layer is an OceanBase cluster. This cluster contains three sub-clusters (zones). Each zone comprises multiple physical servers, and each physical server is called a data node, or an OBServer node. OceanBase Database is based on the shared-nothing distributed architecture, and OBServer nodes are equivalent to each other.

The data stored in OceanBase Database is distributed on multiple OBServer nodes in a zone, and other zones store multiple data replicas. The data in the OceanBase cluster has three replicas. Each replica is stored in a separate zone. The three zones comprise a complete database cluster to provide services for users.

OceanBase Database can implement different levels of disaster recovery based on different deployment modes.

Server-level lossless disaster recovery: The unavailability of a single OBServer node is acceptable, and a lossless switchover can be automatically performed.
Zone-level lossless disaster recovery: The unavailability of a single IDC (zone) is acceptable, and a lossless switchover can be automatically performed.
Region-level lossless disaster recovery: The unavailability in a city (region) is acceptable, and a lossless switchover can be automatically performed.

If your cluster is deployed on multiple OBServer nodes in an IDC, server-level disaster recovery can be achieved. If OBServer nodes are deployed in multiple IDCs of a region, IDC-level disaster recovery can be achieved. If OBServer nodes are deployed in multiple IDCs across different regions, region-level disaster recovery can be achieved.

OceanBase Database ensures a recovery point objective (RPO) of 0 and a recovery time objective (RTO) of less than 8 seconds. This is equivalent to Level 6 disaster recovery capability, which is the highest level in the disaster recovery capability standards of China.

High availability

Multiple OBServer nodes in an OceanBase distributed cluster concurrently provide database services to ensure HA of database services. The application layer sends a request to OceanBase Database Proxy (ODP), also known as OBProxy, and ODP routes the request to the OBServer node where the desired service data is located. The request result is then returned to the application layer along the same path in the opposite direction. During the entire process, different components contribute to HA in different ways.

In a cluster consisting of OBServer nodes, all data is stored based on partitions. Each partition has multiple replicas to ensure high availability. The multiple replicas of a partition are distributed across different zones. Among the replicas, only one replica supports modifications and is called the leader, and other replicas are called followers. Data consistency is ensured between the leader and followers based on the Multi-Paxos protocol. If the OBServer node where the leader is located is down, a follower is elected as the new leader to continue to provide services.

The election service is the basis of HA. Among multiple replicas of a partition, one is elected as the leader based on the election protocol. This type of election is carried out when the cluster restarts or when the previously elected leader fails. The election service depends on the clock consistency among the OBServer nodes in the cluster. The clock deviation among the OBServer nodes cannot exceed 200 ms. To ensure clock consistency, you must deploy Network Time Protocol (NTP) or another clock synchronization service on each OBServer node in the cluster. The election service adopts a priority mechanism to ensure that the optimal replica is elected as the leader. The priority mechanism also takes the specified primary zone and the status of OBServer nodes into consideration.

After the leader starts to provide services, user operations produce data modifications. All modifications are logged and synchronized to the followers. OceanBase Database uses the Multi-Paxos distributed consensus protocol for log synchronization. Based on the Multi-Paxos protocol, if the persistence of log data is completed on the majority of followers, the log data is retained even if a minority of replicas fail. Based on the multiple synchronous replicas, the Multi-Paxos protocol ensures no data loss or service interruption when a minority of OBServer nodes are down. The downtime of a minority of OBServer nodes is acceptable to the data written by users. When an OBServer node is down, the system can select a new replica as the leader to continue to provide database services.

Global Timestamp Service (GTS) is activated for each OceanBase tenant. GTS provides a read snapshot version and a commit version for all transactions executed within a tenant to ensure that transactions are committed in order. When GTS is abnormal, transaction-related operations are affected. OceanBase Database ensures the reliability and availability of GTS in the same way as that for partition replicas. The GTS location for a tenant is determined by a special partition, which also has multiple replicas like other partitions. In this special partition, a leader is selected by using the election service, and the OBServer node where the leader is located also hosts GTS. If this OBServer node is down, the special partition selects another replica as the leader to provide services. In addition, GTS automatically switches over to the OBServer node where the new leader is located and continues to provide services.

The preceding content describes the key components that help ensure HA of OBServer nodes. ODP also requires HA to maintain its service performance. User requests are first forwarded to ODP. If ODP is abnormal, user requests cannot be properly processed. ODP also needs to handle OBServer node failures and perform failure tolerance operations.

Unlike a database cluster, ODP has no persistence state. ODP obtains the required database information by accessing the relevant database. Therefore, ODP failures do not cause data loss. ODP is also a cluster service consisting of multiple nodes. The specific ODP node used to execute user requests is subject to F5 or another load balancing component. If an ODP node is down, the corresponding load balancing component automatically removes the node to ensure that new requests are not forwarded to the node.

ODP monitors the database cluster status in real time. It obtains the cluster system table in real time to check the health status of each server and the real-time location of each partition. ODP also monitors the service status of OBServer nodes based on network connections. If ODP identifies an exception, ODP marks the corresponding OBServer node as failed and switches over the services.