This article is the transcript of a tech talk, Infinite Possibilities of a Well-Designed Transaction Engine in Distributed Database, by Fusheng Han, Founding Engineer of OceanBase. Fusheng has more than 14 years of experience in the database field and holds more than 10 patents. He is passionate about distributed technologies and database systems, and his current work is building excellent software to better support massive data processing.
OceanBase is a distributed database system running on commodity servers. Commodity servers have many advantages: they are cheaper and easier to obtain, especially from cloud providers, and they are currently the mainstream on cloud platforms. However, they have comparatively lower performance and higher failure rates than proprietary hardware.
Mission-critical applications rely on the high availability and high performance of database systems, so the objectives of OceanBase are to utilize a cluster of servers to provide a highly available database service and to provide distributed capabilities on top of these hardware resources. Most importantly, OceanBase wants to provide the same experience as conventional database systems.
In a cluster with multiple servers, OceanBase replicates data among servers. When some servers fail, other replicas of the data can be elected as new leaders to continue service. It’s challenging to make sure the data is always consistent between different replicas, so OceanBase uses the Paxos consensus protocol to synchronize transaction mutation logs between replicas. This is the overall architecture:
OceanBase employs the Write-Ahead Logging (WAL) algorithm. Mutations generated by the SQL engine are written to the storage engine directly, and the transaction engine generates redo logs for these mutations at the same time. When the transaction wants to commit, the transaction engine flushes all remaining log records to the logging component. In the logging component, the Paxos consensus protocol replicates the logs to other replicas. Once the logs are accepted by a majority of replicas, the logging component informs the transaction engine that the logs have been persisted successfully, and the transaction engine then responds to the application that the transaction is committed. This is the typical execution path. On a follower, the transaction engine always replays transaction logs read from the logging component.
OceanBase is different from many other distributed systems that rely on the replicated state machine model. The logging component in OceanBase provides a file-system-like interface, which makes it convenient for the transaction engine to flush logs with an append operation. OceanBase also co-designs the consensus protocol with the transaction engine: it uses a ring buffer for group commit, not only to aggregate concurrent transaction logs but also to batch replication and I/O persistence, so the logging component can sustain high throughput within a single Paxos group.
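To make the append-based interface more concrete, here is a minimal sketch in Python of what an append-with-callback logging interface could look like. The class and method names (LogStream, append) are illustrative assumptions for this article, not OceanBase’s actual API, and the majority replication is stubbed out.

```python
from concurrent.futures import Future

# Illustrative sketch only: class and method names are assumptions,
# not OceanBase's actual API.
class LogStream:
    """A file-system-like log abstraction whose only write primitive is append."""

    def __init__(self):
        self._next_lsn = 0

    def append(self, record: bytes) -> Future:
        """Append a log record; the returned future completes once the record
        has been replicated to a majority of replicas by the consensus protocol."""
        lsn = self._next_lsn
        self._next_lsn += len(record)
        fut = Future()
        # In a real system, replication and disk I/O happen asynchronously in the
        # logging component; here we complete the future immediately for brevity.
        fut.set_result(lsn)
        return fut

# At commit, the transaction engine flushes its remaining redo records and
# reports success to the application only after the acknowledgement arrives.
log = LogStream()
ack = log.append(b"redo: update t set c = 1 where id = 42")
print("persisted at log offset", ack.result())
```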
Recovery time refers to how much time the database needs to recover from a failure: the shorter the recovery time, the less the application is interrupted. Recovery in OceanBase is a sophisticated process.
At the bottom, the election algorithm has to elect a new leader from the remaining replicas after the failure occurs. The new leader then has to replay all transaction logs to bring its data up to date and restore database service. The location of the newly elected leader has to be reported so that the location caches in components across the system can be refreshed. The RPC module also has to do automatic failure detection; otherwise, the caller of an RPC may wait a long time for a result from the failed server. OceanBase can recover database service whenever a majority of servers are alive and connected.
OceanBase has a deterministic, priority-based election algorithm with a bounded election time, and the new leader has the best priority among the remaining replicas. Location discovery is fast, too, and there are heartbeats in all communication modules for failure detection. With these methods, the recovery time of OceanBase is less than 8 seconds even in the worst cases.
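As a toy illustration of the deterministic, priority-based idea, here is a small election sketch. The priority values and the elect function are invented for this example; the real algorithm considers more factors than this model.

```python
# Toy illustration of deterministic, priority-based leader election.
# The priority fields below are invented for illustration only.
replicas = [
    {"id": "r1", "alive": False, "priority": 3},   # failed leader
    {"id": "r2", "alive": True,  "priority": 2},
    {"id": "r3", "alive": True,  "priority": 1},
]

def elect(replicas):
    alive = [r for r in replicas if r["alive"]]
    # A majority of the full member list must be reachable to elect a leader.
    if len(alive) <= len(replicas) // 2:
        return None
    # Deterministic: every voter picks the same best-priority survivor.
    return max(alive, key=lambda r: r["priority"])["id"]

print(elect(replicas))  # -> "r2"
```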
Log archiving and restore and an active physical standby database are important features in database management. They may be routine work for a database administrator, but most distributed database systems don’t support them. OceanBase supports mirroring transaction logs between Paxos groups, or from archived log storage to a Paxos group. One of the challenges in accomplishing this is solving the reconfiguration problem in the Paxos consensus protocol. Membership reconfiguration refers to changing the members of a Paxos group when servers are added or removed. Many Paxos implementations write a membership change record into the log, but this would corrupt the contents of the replicated logs of a Paxos group in the standby database, which would then never be physically identical to the original logs. OceanBase solves this problem by separating the membership change record into an embedded meta storage. The meta storage itself is another independent Paxos instance, so changes to the meta storage can be replicated to all the replicas through that independent instance. As a result, you can deploy an active physical standby OceanBase cluster, archive logs, and restore OceanBase from the archived logs.
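The key idea, as I understand it, is to keep membership change records out of the replicated data log so that a standby’s log can stay physically identical. Here is a small illustrative sketch of that separation; the class and field names are made up for this example.

```python
# Sketch of the idea: keep membership-change records out of the data log so
# that the data log stays byte-identical on a physical standby.
# All names are illustrative.
class PaxosGroup:
    def __init__(self):
        self.data_log = []   # transaction redo logs, mirrored to the standby
        self.meta_log = []   # separate Paxos instance: membership changes

    def append_redo(self, record):
        self.data_log.append(record)

    def change_membership(self, new_members):
        # Goes to the meta storage, not into the replicated data log.
        self.meta_log.append({"members": new_members})

primary = PaxosGroup()
primary.append_redo("redo-1")
primary.change_membership(["a", "b", "c", "d"])
primary.append_redo("redo-2")
# The standby mirrors only the data log, which contains no membership records:
print(primary.data_log)  # ['redo-1', 'redo-2']
```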
The transaction engine must support the ACID properties across a cluster of servers. Two-phase commit, a well-known and complex commit protocol, has to be used for distributed transaction commit.
Actually, the transaction engine faces even more challenges in distributed database systems. Read-your-writes means that a SQL statement must see uncommitted mutations made by previous statements in the same transaction. Huge transactions are those modifying a very large number of records in one transaction; most distributed database systems don’t support them. It’s also important to support pessimistic concurrency control. Many distributed database systems only support optimistic concurrency control, so application developers have to add retry loops to deal with contention between concurrent transactions, which complicates application development.
It’s also important to assure external consistency, which is trivial in conventional monolithic database systems; applications may rely on it without being aware of it. The transaction engine must also deliver high performance, meaning high throughput and low latency. Besides, compatibility with conventional database systems is vital for OceanBase because it makes applications easier to migrate from other databases to OceanBase.
OceanBase combines pessimistic record-level locks and multi-version concurrency control. Each record contains two additional fields, the transaction ID and the transaction version. The transaction ID identifies the transaction that generated this record; every transaction has a unique ID, acquired at the beginning of the transaction. The transaction version is the commit version of that transaction, acquired during transaction commit. While a transaction is executing, the transaction version field remains invalid and is filled in when the transaction commits. An uncommitted record also acts as a record lock on its row, which prevents concurrent modifications by other transactions. Row data is never updated in place, so there can be many versions of each row. A concurrent transaction bypasses all other active transactions’ uncommitted data, and it also bypasses committed data whose commit version is greater than its own read version.
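Here is a minimal sketch of the visibility rule just described, assuming the two per-record fields are exposed as trans_id and trans_version (names chosen for illustration, not OceanBase’s actual field names).

```python
# Minimal sketch of the visibility rule described above; field names
# (trans_id, trans_version) are illustrative.
INVALID_VERSION = None

def visible(row_version, reader):
    """row_version: one version of a row; reader: a snapshot reader."""
    if row_version["trans_id"] == reader["trans_id"]:
        return True                       # read your own writes
    if row_version["trans_version"] is INVALID_VERSION:
        return False                      # another transaction's uncommitted data
    return row_version["trans_version"] <= reader["read_version"]

versions = [  # newest first, as in a multi-version chain
    {"trans_id": 9, "trans_version": None, "value": "C"},  # uncommitted
    {"trans_id": 7, "trans_version": 120,  "value": "B"},
    {"trans_id": 3, "trans_version": 80,   "value": "A"},
]
reader = {"trans_id": 11, "read_version": 100}
print(next(v["value"] for v in versions if visible(v, reader)))  # -> "A"
```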
Huge transactions are transactions that modify a large number of records, like a million or even a billion records. Applications usually execute huge transactions during data maintenance or error correction. OceanBase uses a typical Steal/No-Force approach, which can persist uncommitted mutations of a transaction as long as the corresponding redo logs have been persisted in the write-ahead log. OceanBase also flushes mutation redo logs continuously during transaction execution, so at transaction commit only very few logs need to be flushed.
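Here is a rough sketch of the continuous-flushing idea; the buffer threshold and class names are arbitrary choices for illustration, not OceanBase’s actual parameters.

```python
# Sketch of continuous redo flushing for huge transactions: redo records are
# flushed whenever the in-memory buffer fills up, so commit only has to flush
# a small tail. Threshold and names are illustrative.
FLUSH_THRESHOLD = 64 * 1024        # bytes of buffered redo before an early flush

class TxnRedoBuffer:
    def __init__(self):
        self.buffered = []
        self.buffered_bytes = 0
        self.flush_calls = 0

    def add_mutation(self, redo: bytes):
        self.buffered.append(redo)
        self.buffered_bytes += len(redo)
        if self.buffered_bytes >= FLUSH_THRESHOLD:
            self._flush()               # happens during execution, not at commit

    def _flush(self):
        # Hand the buffered records to the WAL / logging component.
        self.flush_calls += 1
        self.buffered.clear()
        self.buffered_bytes = 0

    def commit(self):
        self._flush()                   # only the small tail is left to flush here

txn = TxnRedoBuffer()
for _ in range(100_000):                # a "huge" transaction
    txn.add_mutation(b"x" * 100)
txn.commit()
print(txn.flush_calls, "flushes in total; commit itself flushed only the tail")
```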
OceanBase also supports fast commit.
When a transaction modifies a large number of records, the system must fill the transaction version field of the modified rows with the commit version during transaction commit. For a million or a billion records, this is a resource-consuming task, so OceanBase records the latest transaction status in a transaction status table and delays the filling task. On the read path, the system combines the row data with the latest transaction status acquired from the status table to get the latest transaction information of the records. The correct commit version is actually filled in during the compaction process, which is executed periodically in the background.
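Here is a small sketch of this delayed-fill idea, assuming a lookup table keyed by transaction ID stands in for the transaction status table; all names are illustrative.

```python
# Sketch of the delayed-fill idea: the row keeps only a trans_id until
# compaction fills the version; readers look up the commit version in a
# transaction status table. All names here are illustrative.
trans_status = {
    101: {"state": "COMMITTED", "commit_version": 5000},
    102: {"state": "ABORTED"},
    103: {"state": "RUNNING"},
}

def effective_version(row):
    if row["trans_version"] is not None:
        return row["trans_version"]            # already filled by compaction
    status = trans_status[row["trans_id"]]
    if status["state"] == "COMMITTED":
        return status["commit_version"]        # combine row with status table
    return None                                # aborted or still running

row = {"trans_id": 101, "trans_version": None, "value": "x"}
print(effective_version(row))  # -> 5000, without rewriting a billion rows at commit
```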
OceanBase divides transactions into two kinds, local transactions and distributed transactions, depending on how many streams are involved in the transaction.
In the figure above, Transaction A only modifies data in one stream, so it’s a local transaction. Transactions B and C are distributed transactions. This information is automatically tracked in the session state during transaction execution. For local transactions, OceanBase uses the well-known write-ahead logging algorithm: all redo log records and a commit record are appended to the write-ahead log, and when these logs are persisted successfully, the transaction commits successfully. It is easy in this scenario. But for distributed transactions, the two-phase commit protocol must be used. The challenge with the traditional two-phase commit is the commit latency: there are log writes on both the coordinator and the participants, which incurs more latency because every log write in OceanBase needs to be replicated to other replicas. So OceanBase uses an optimized two-phase commit protocol in which only participants write logs and the coordinator has no persistent state. There is no fixed coordinator in the system; for each transaction, one of the participants takes on the responsibility of the coordinator.
Let’s take a close look at the commit process.
When the transaction commits, the coordinator sends a Prepare request to all participants and asks them to flush their redo log records as well as a prepare record to the write-ahead log. When the flushing finishes, a participant enters the Prepared state and sends a Prepare OK response back to the coordinator. Once the coordinator has received all Prepare OK responses, all redo log records of the transaction have been persisted successfully and the system can always recover the transaction no matter what kind of failure occurs, so the coordinator responds to the client that the transaction has been committed successfully. In this figure, we can see that the commit latency is only one round of log replication. The following tasks are done in the background: the coordinator sends Commit requests to all participants; the participants persist their commit records, transition to the Committed state, and respond to the coordinator with Commit OK messages. After the coordinator receives all Commit OK messages, it sends Clear requests to all participants to clean up the transaction state.
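To summarize the flow, here is a simplified sketch in which message passing is reduced to direct function calls and log replication is implied by the prepare and commit methods. The names are illustrative and failure handling is omitted.

```python
# Sketch of the commit flow described above (one round of replication on the
# critical path). Names are illustrative, not OceanBase's internal API.
class Participant:
    def __init__(self, name):
        self.name, self.state = name, "RUNNING"
    def prepare(self):
        # Flush remaining redo records plus a prepare record via the log,
        # then enter the Prepared state.
        self.state = "PREPARED"
        return "PREPARE_OK"
    def commit(self, version):
        self.state = "COMMITTED"   # commit record is written in the background
        return "COMMIT_OK"
    def clear(self):
        self.state = "CLEARED"

def two_phase_commit(participants, commit_version):
    # Phase 1: once every participant is prepared, the transaction's fate is
    # decided and survives any failure, so the client can be answered now.
    assert all(p.prepare() == "PREPARE_OK" for p in participants)
    respond_to_client = "COMMITTED"
    # Phase 2 runs off the critical path.
    for p in participants:
        p.commit(commit_version)
    for p in participants:
        p.clear()
    return respond_to_client

# The commit version would come from the timestamp service discussed below.
print(two_phase_commit([Participant("p1"), Participant("p2")], 5000))
```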
OceanBase uses a centralized Global Timestamp Service (GTS) to generate all read versions and commit versions. There is one global timestamp service per tenant. It is also a highly available service, backed by an underlying Paxos group, and it guarantees monotonically increasing timestamps, which assures external consistency in the system.
Let’s see two examples.
Transaction 2’s commit operation starts after Transaction 1 finishes its commit, so Transaction 2’s commit version, which is fetched during the commit process, is definitely greater than Transaction 1’s commit version.
For standalone queries, the read versions also assure external consistency. If Query 2 starts after Query 1 finishes its execution, then Query 2’s read version, which is fetched after the query starts, is definitely greater than Query 1’s read version.
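To make the two examples concrete, here is a toy stand-in for a per-tenant GTS that hands out monotonically increasing timestamps; the real service is replicated via Paxos and accessed over the network, which this sketch ignores, and the names are invented.

```python
import itertools
import threading

# Toy stand-in for a per-tenant Global Timestamp Service: a single source of
# monotonically increasing timestamps.
class GTS:
    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()
    def get_timestamp(self):
        with self._lock:
            return next(self._counter)

gts = GTS()
read_version = gts.get_timestamp()     # taken when a query or transaction starts
commit_version = gts.get_timestamp()   # taken during transaction commit
assert commit_version > read_version   # later requests always see larger timestamps
```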
For high availability, OceanBase uses the Paxos consensus protocol to replicate transaction logs and achieves high performance within a single Paxos group. The recovery time of OceanBase after a failure is less than 8 seconds. OceanBase supports log archiving and restore, and a physical standby database can be deployed.
OceanBase combines pessimistic record-level locks and multi-version concurrency control. It supports huge transactions, and it also supports fast commit for huge transactions. OceanBase uses an optimized two-phase commit protocol, and the commit latency is only one round of replication, both for local transactions and distributed transactions. OceanBase also assures external consistency across all transactions.
Here is the transcript of the storage engine tech talk.
Yes. In each stream there is a Paxos group with multiple replicas, so persisting a log record means persisting it on the local hard drive, replicating it to the other replicas, and getting majority consensus. Only then is it successfully persisted.
Yes, so it just uses the file-system-like API of the logging component, doesn’t it?
Yes, it is a feature of the logging component. Many other systems use a Submit-and-Apply scheme to implement the Paxos consensus protocol: the leader submits commands to the Paxos protocol, and after the commands are chosen by a majority, all replicas, leader and followers alike, apply the same commands. But in OceanBase, the leader replica only appends log records to the logging component, so the append operation is very much like a file system interface. The append operation is asynchronous: after it is invoked, the upper component gets a callback structure, and when the logs are chosen by a majority, the callback is invoked to inform the upper layer that the appended logs have been persisted successfully. So the path on the leader is different from that on a follower. The leader appends log records and gets the callback, while followers always use the same file-system-like interface to read logs from the logging component and replay them.
Okay, I got it. And the question is about optimization in OceanBase’s Paxos implementation.
As far as I know, ParallelRaft in PolarFS is largely just a name. It’s actually a typical Paxos implementation, just like most other Paxos implementations, with some differences in the implementation details.
We know that Multi-Paxos is different from Raft because it can accept log entries non-contiguously, while Raft always accepts submitted logs contiguously. But database systems always write redo logs contiguously anyway. For example, for a local transaction a conventional database system writes its redo records and then a commit record. If we allowed holes in a Multi-Paxos implementation, it would incur more complexity: what if the commit record is chosen by a majority but redo record 2 is not? That complicates the recovery process, because redo record 2’s position may be filled with a no-op during Paxos recovery, leaving redo record 1, a no-op, and the commit record in the log. So for database systems, non-contiguous acceptance incurs more complexity, and the contiguous form is better suited. It does not harm performance, because typical database systems use group commit for concurrent transaction log writes, and on a server with more than 100 physical cores this can sustain high throughput. In the Paxos implementation, this kind of optimization is enough to utilize all the physical cores of a single server. In OceanBase’s Paxos implementation, we aggregate all concurrent transactions’ redo logs together for I/O and for replication, and in our tests it can sustain more than 1 GB per second, saturating the network bandwidth. So the protocol itself does not constrain the performance; the implementation matters. If you use fine-grained locks and lock-free data structures to optimize the implementation, you can get good performance.
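As a rough illustration of the group-commit aggregation described here, the sketch below uses a simple queue and a worker thread instead of the ring buffer and lock-free structures mentioned above; all names are invented for the example.

```python
import queue
import threading

# Toy sketch of group commit: concurrent transactions enqueue redo records and
# a single worker aggregates them so that many records share one I/O and one
# replication round. The real engine uses a ring buffer and lock-free
# structures; the queue, worker, and names here are simplifications.
log_queue = queue.Queue()
SHUTDOWN = object()

def group_commit_worker(batches):
    while True:
        batch = [log_queue.get()]            # block for the first record
        while not log_queue.empty():         # drain whatever arrived meanwhile
            batch.append(log_queue.get())
        stop = SHUTDOWN in batch
        batch = [r for r in batch if r is not SHUTDOWN]
        if batch:
            batches.append(batch)            # one write + one replication per batch
        if stop:
            return

batches = []
worker = threading.Thread(target=group_commit_worker, args=(batches,))
worker.start()
for i in range(100):                         # 100 "concurrent" transactions
    log_queue.put(f"redo-{i}")
log_queue.put(SHUTDOWN)
worker.join()
print(sum(len(b) for b in batches), "records flushed in", len(batches), "batched writes")
```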
OceanBase relies only on local storage. We use a shared-nothing architecture, so only local storage is used. We chose this because it is more common for commodity servers to have local storage, while shared storage does not come at a lower price and may have lower performance. So we chose local storage.