Every distributed database must overcome two technical hurdles: SQL and transactions.
Different database systems take different approaches to supporting transactions. This article describes the mechanisms behind transactions in OceanBase Database.
Let’s dive in.
Many people have become so accustomed to the ACID properties of transactions that they no longer think about the principles behind them. Even conventional relational databases implement these properties differently, and distributed databases face new challenges in this respect.
Atomicity
In multithreaded programming, atomicity means that the intermediate result of an operation must be invisible to other threads: they can only see the states when the operation begins and when it ends. In databases, this visibility aspect is handled by isolation. An atomic database transaction either succeeds or fails entirely. The challenge arises when errors occur: the process stops responding, the network is disconnected, or the disk is full or cannot be written. The distributed nature of OceanBase Database poses an additional challenge in that a transaction may modify data on multiple nodes. OceanBase Database uses the two-phase commit protocol to ensure the atomicity of transactions. Under this protocol, a transaction that involves multiple servers is either committed on all of these servers or rolled back on all of them.
Consistency
Consistency refers to a set of rules that an application specifies for its data. These rules may or may not be embodied as database constraints. For example, in accounting, credits and debits must be equal. An application achieves consistency through transactions. A database ensures that the data of the application is consistent through atomicity, isolation, and durability.
Isolation
Isolation describes what happens when concurrent transactions read or write the same pieces of data, that is, when the transactions conflict with each other. Different approaches lead to different results and problems. These problems include dirty read, non-repeatable read, phantom read, lost update, and write skew. The first three are easy for a developer to understand, but the last two are rarely encountered. Database systems offer different isolation levels to address these problems. The highest level is serializable: the concurrent execution of multiple transactions must produce the same results as the sequential execution of those transactions in some order. Read committed and repeatable read are two of the more common isolation levels. OceanBase Database V1.0 and later supports the read committed isolation level; OceanBase Database V2.2 and later also supports the serializable level.
Durability
Durability guarantees that once a transaction has been committed, the modified data will not be lost (or will be recoverable) even in the case of a hardware or database system failure. In relational databases, this is achieved by recording a commit log before each modification and writing the log to a reliable storage medium. At that point, the data itself may still reside in the database cache rather than on disk. In Oracle, data is asynchronously flushed at scheduled times or by other triggers, which will not be discussed here. OceanBase Database employs a unique mechanism in that transaction logs are not flushed to the disk until the transaction is committed. Data modifications stay in memory and are only flushed to the disk once per day. The minor compaction mechanism was added later to dump incremental data to the disk when memory becomes insufficient. Protected by commit logs and the high availability offered by three replicas, OceanBase Database users have little reason to worry about unexpected data loss or slow data restoration. The biggest difference between OceanBase Database and conventional relational databases is that it uses the Paxos protocol (Multi-Paxos, to be precise) to ensure the reliability of transaction logs.
Write operations in OceanBase Database differ from those in conventional databases in that they do not modify data blocks. Instead, they allocate a new memory area to store the incremental data modifications. Multiple incremental modifications to the same record are organized as a linked list and stay in memory. Read operations in OceanBase Database combine the data blocks that are first loaded into memory with the related incremental blocks. This read/write splitting makes transactions in OceanBase Database unique.
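The write path described above can be sketched in a few lines. This is an illustrative model only, not OceanBase internals: the class and field names (`Delta`, `Row`, `base`, `head`) are hypothetical, and it ignores versions and concurrency to show just the delta-chain idea.

```python
# Sketch of read/write splitting: writes never touch the base block; each
# modification is prepended to a per-row linked list of deltas, and reads
# merge the base row with its delta chain.

class Delta:
    def __init__(self, column, value, next_delta=None):
        self.column = column          # column modified by this write
        self.value = value            # new value
        self.next = next_delta        # older delta in the chain

class Row:
    def __init__(self, base):
        self.base = dict(base)        # immutable base version on "disk"
        self.head = None              # newest delta first

    def write(self, column, value):
        # Writes allocate a new delta instead of modifying the base block.
        self.head = Delta(column, value, self.head)

    def read(self):
        # Reads merge the base block with the incremental chain,
        # applying the oldest deltas first so newer ones win.
        deltas = []
        node = self.head
        while node:
            deltas.append(node)
            node = node.next
        merged = dict(self.base)
        for d in reversed(deltas):
            merged[d.column] = d.value
        return merged

row = Row({"id": 1, "balance": 100})
row.write("balance", 80)
row.write("balance", 50)
print(row.read())   # {'id': 1, 'balance': 50}
```

Note how the base block (`{"id": 1, "balance": 100}`) is never mutated; minor compaction would correspond to folding the delta chain back into a new base block.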
OceanBase Database supports both the read committed and serializable isolation levels. Before you learn about the working mechanisms, review some basic concepts about how transactions are implemented in OceanBase Database.
Transaction version
The sequence of different queries and modifications must be determined in a database system. Many may believe that the order can be determined by time. However, time is unreliable. Even in a single-server database, the database time may fluctuate with the system clock of the server. In a distributed database, the clocks of different nodes may not be strictly consistent; even when they are synchronized, it is still impossible to determine the sequence of modifications once network communication delays come into play. This is why OceanBase Database implements a simple, monotonically increasing timestamp. It is related to time, but it is not time, and it is used in many situations. This timestamp is similar to the system change number (SCN) in Oracle.
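A minimal sketch of such a timestamp, under the assumption (mine, not stated in the article) that it is derived from physical time but never allowed to move backwards. The class and method names are hypothetical.

```python
# A monotonically increasing timestamp related to, but not equal to, time:
# it tracks the physical clock when it can, but never repeats or goes
# backwards even if the system clock jumps back.
import threading
import time

class MonotonicTimestamp:
    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            now = time.time_ns() // 1000          # physical time, microseconds
            # Guarantee strict monotonicity regardless of clock adjustments.
            self._last = max(self._last + 1, now)
            return self._last

ts = MonotonicTimestamp()
a, b, c = ts.next(), ts.next(), ts.next()
assert a < b < c   # strictly increasing under any clock behavior
```

This captures why the value "is related to time, but is not time": ordering is the contract, wall-clock accuracy is best-effort.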
Commit version: identifies the sequence in which different transactions are committed. When a transaction is committed, the timestamp is captured as the commit version of the transaction. When a distributed transaction is handled, multiple participants use the same commit version to ensure that the data retrieved by a read request is complete and consistent. Every transaction that modifies a specific data record leaves a commit version on that record. These versions are then sequentially linked and associated with the record.
Snapshot version: a transaction reads a specific version of data before it modifies it. That version is the snapshot version. Read operations in OceanBase Database only read the latest committed data whose commit version is no greater than this snapshot version, and the commit versions of all subsequent transaction commits will be greater than this snapshot version. This ensures that no dirty read, non-repeatable read, or phantom read occurs within an individual SQL statement. Different snapshot versions are selected under different isolation levels. Under the read committed level, each SQL statement selects the latest commit version at the moment the statement is executed as its snapshot version; in other words, data consistency is maintained at the statement level. Under the serializable level, all SQL statements select the latest commit version at the moment the entire transaction began as their snapshot version; in other words, data consistency is maintained at the transaction level.
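The snapshot-read rule above reduces to a simple lookup over the version chain attached to a record. This sketch assumes only what the text states: versions are linked newest-first, and a read at snapshot S returns the newest value whose commit version is no greater than S.

```python
# Snapshot read over a per-record version chain: each committed write leaves
# a (commit_version, value) pair; a read at snapshot_version returns the
# newest value committed at or before that snapshot.

def snapshot_read(version_chain, snapshot_version):
    """version_chain: list of (commit_version, value), newest first."""
    for commit_version, value in version_chain:
        if commit_version <= snapshot_version:
            return value
    return None   # the record did not exist at this snapshot

chain = [(300, "v3"), (200, "v2"), (100, "v1")]   # newest first
print(snapshot_read(chain, 250))   # 'v2' — 300 is too new, 200 qualifies
print(snapshot_read(chain, 50))    # None — record not yet visible
```

Under read committed, `snapshot_version` is refreshed per statement; under serializable, it is fixed once at the start of the transaction.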
Two-phase Commit Protocol
OceanBase Database supports distributed transactions within a tenant (instance). The application is unaware of whether a transaction is distributed; it only needs to start the transaction. OceanBase Database follows the two-phase commit protocol when committing transactions. If all data modified by a transaction is hosted on the same node, the system follows a single-node commit protocol instead. Two roles are involved in the two-phase commit protocol: the coordinator and the participants. A list of participants is maintained on the server that hosts the session related to the transaction. OceanBase Database selects the first participant of the transaction (Px) as the coordinator. When the transaction is to be committed, an end_trans message with the list of participants (such as Px, Py, and Pz) is sent to the coordinator (Px). The workflows of the coordinator and the participants are described below. Both the coordinator and the participants maintain a finite-state machine (FSM).
Coordinator workflow:
1. Px receives the end_trans message and creates the coordinator FSM. The coordinator enters the PREPARE state, sends a “prepare” message to all participants, and waits for their responses.
2. (a) If a “prepare ok” message is received from every participant, the coordinator enters the COMMIT state and sends a “commit” message to all participants.
   (b) If an “abort ok” message is received from any participant, the coordinator enters the ABORT state and sends an “abort” message to all participants.
Participant workflow:
1. A participant FSM is created when the DML statement is executed. When the “prepare” message is received, each participant writes logs for the modifications made by the transaction and begins persisting them. Persistence succeeds only when it completes on a majority of the three replicas.
2. Participants that complete log persistence enter the PREPARED state and respond to the coordinator with a “prepare ok” message. Otherwise, participants enter the ABORTED state and respond with an “abort ok” message.
3. (a) When a “commit” message is received from the coordinator, the participants write a commit record, enter the COMMITTED state, and respond with a “commit ok” message. Other actions performed at this stage include deciding the commit version of the distributed transaction, updating the public version, and releasing row locks.
   (b) When an “abort” message is received from the coordinator, the participants write an abort record, enter the ABORTED state, and respond with an “abort ok” message.
One problem with the conventional two-phase commit protocol is that the large lock granularity of participants blocks data reads. Another problem occurs when the coordinator fails: the transaction is left in a state of uncertainty and the commit may hang indefinitely. This may cause some to worry.
In OceanBase Database, however, the coordinator itself is a high-availability node with three replicas and can recover from a failure in about 15 seconds. When no message is received from the coordinator, participants resend their previous message after a set interval. After the coordinator recovers, it can rebuild the latest state from the messages resent by the participants, even though it keeps no coordinator log of its own.
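The coordinator and participant workflows above can be condensed into a short sketch. This is a single-process illustration with hypothetical class names; a real participant would persist its redo logs to a Paxos majority before answering “prepare ok”, and messages would travel over RPC.

```python
# Minimal two-phase commit following the workflows above: phase 1 collects
# prepare votes; phase 2 commits everywhere only if every vote was
# "prepare ok", otherwise aborts everywhere.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "INIT"

    def prepare(self):
        # Real participants persist transaction logs to a majority of
        # replicas here; success means entering PREPARED.
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.state == "PREPARED"

    def commit(self):
        self.state = "COMMITTED"   # write commit record, release row locks

    def abort(self):
        self.state = "ABORTED"     # write abort record

def two_phase_commit(participants):
    # Phase 1: coordinator sends "prepare" and collects votes.
    if all(p.prepare() for p in participants):
        # Phase 2a: unanimous "prepare ok" -> commit on every participant.
        for p in participants:
            p.commit()
        return "COMMITTED"
    # Phase 2b: any abort vote -> abort on every participant.
    for p in participants:
        p.abort()
    return "ABORTED"

px, py, pz = Participant("Px"), Participant("Py"), Participant("Pz")
print(two_phase_commit([px, py, pz]))                                  # COMMITTED
print(two_phase_commit([Participant("Px"),
                        Participant("Py", can_commit=False)]))          # ABORTED
```

The key property is visible in the return paths: there is no outcome in which some participants commit while others abort.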
GTS (Global Timestamp Service)
External consistency means that if Transaction T1 is complete when Transaction T2 starts committing, the commit version of T1 must be smaller than that of T2. However, this is not guaranteed if commit versions are generated from the local timestamps of different nodes, in which case the data contained in a snapshot may not meet the application's expectations. In distributed database middleware, when an SQL query accesses data from multiple nodes, the middleware sends the query to those nodes and aggregates the returned data at a central node. On each node, the query follows the rules of the read committed isolation level, but strictly speaking, the data aggregated this way may not follow the rules defined for the snapshot version. Unless weak consistency is explicitly enabled by the user, this must be prevented in OceanBase Database.
By default, OceanBase Database enforces strong-consistency reads. Data read by an SQL query must be of the latest commit version that is no later than the snapshot version. Under the read committed isolation level, the system maintains consistency on the statement level. Under the serializable level, the system maintains consistency on the transaction level.
GTS is adopted in OceanBase Database V2.0 and later. Each tenant has its own GTS, deployed in a client-server model. Each node of the tenant has a GTS client that processes the requests of that node. Each tenant has only one GTS server, which relies on the leader replica of the __all_dummy table; this enables the GTS server to achieve high availability. GTS provides a consistent snapshot version, also known as the globally consistent snapshot version.
Transaction Concurrency Control
Transactions are concurrent if they overlap in time rather than executing one after another. Conflicts may arise when concurrent transactions read or write the same pieces of data.
OceanBase Database attempts to place an exclusive lock (row lock) on the record to be modified. If a lock already exists, the transaction waits in a queue. Row locks ensure that only one transaction may modify a given record at a time. When the lock is released, the first transaction in the queue is notified; the queue prevents all waiting transactions from contending for the lock at once. OceanBase Database maintains a linked list on the row to record historical changes and their commit versions.
In OceanBase Database, transactions do not wait forever for the lock to be released. Each SQL query has a timeout specified by the ob_query_timeout variable, which defaults to 10 seconds. When the timeout is reached, the DML SQL query returns a “lock wait timeout exceeded” error. This error message is borrowed from MySQL; the equivalent MySQL variable is innodb_lock_wait_timeout.
As was previously mentioned, read operations in OceanBase Database access a snapshot version of the record. This version is not locked, so other transactions are not prevented from writing to the same record. However, the SELECT…FOR UPDATE syntax does not access the snapshot version: it attempts to place a lock on the record, and the lock is not released until the transaction is committed or rolled back. This conflicts with concurrent write transactions.
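The lock-wait-with-timeout behavior can be sketched as follows. `threading.Lock.acquire` with a timeout stands in for a transaction queueing on a row lock, and the exception class name is illustrative, chosen to echo the error message above.

```python
# Sketch of row-lock acquisition with a query timeout, modeled on the
# ob_query_timeout behavior: a transaction waits on the row lock, and if the
# wait exceeds the limit, the statement fails with a lock-wait error.
import threading

class RowLockTimeout(Exception):
    """Raised in place of 'lock wait timeout exceeded'."""

def update_row(row_lock, timeout_s):
    if not row_lock.acquire(timeout=timeout_s):
        raise RowLockTimeout("lock wait timeout exceeded")
    try:
        pass  # modify the row while holding the exclusive lock
    finally:
        row_lock.release()

lock = threading.Lock()
lock.acquire()                       # another "transaction" holds the row lock
try:
    update_row(lock, timeout_s=0.1)  # our statement queues, then times out
except RowLockTimeout as e:
    print(e)                         # lock wait timeout exceeded
lock.release()
```

Only the statement fails on timeout; as in OceanBase Database, whether the enclosing transaction continues or rolls back is the application's decision.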
In-depth Explanation of Transaction Isolation Levels in OceanBase Database
Two isolation levels supported by OceanBase Database were mentioned earlier in this article: read committed and serializable. The features and associated problems of the read committed isolation level are already well known and will not be repeated here. Instead, this post will talk more about the serializable level. By definition, concurrent transactions under the serializable isolation level must produce the same result as if those transactions were run in series. The straightforward way to satisfy this requirement is to actually run the transactions in sequence: data accessed by a transaction is locked (with an exclusive or shared lock) for its entire duration so that other transactions cannot modify it. This approach reduces concurrency. Oracle instead implements its serializable isolation level by having all statements of a transaction read from the same snapshot version, which reduces read/write conflicts and improves concurrency.
On the other hand, write skew may occur at this isolation level in Oracle. At present, OceanBase Database handles this problem in the same way as Oracle does.
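A concrete write-skew scenario helps show why snapshot-based serializable levels permit it. The on-call example below is a standard textbook illustration, not taken from the article: two transactions each check an invariant against the same snapshot and then write disjoint rows, so neither sees the other's write and no row lock conflicts.

```python
# Write skew under snapshot isolation: the invariant "at least one doctor
# stays on call" holds in each transaction's snapshot, yet both commits
# together violate it, because the writes touch different rows.

on_call = {"alice": True, "bob": True}      # committed state

def txn_go_off_call(snapshot, me):
    # Each transaction validates the invariant against its own snapshot.
    if sum(snapshot.values()) >= 2:
        return (me, False)                   # intended write: me -> off call
    return None                              # invariant would break: do nothing

snap = dict(on_call)                         # both transactions share a snapshot
w1 = txn_go_off_call(snap, "alice")
w2 = txn_go_off_call(snap, "bob")

# Disjoint rows -> no row-lock conflict -> both commits succeed.
for write in (w1, w2):
    if write:
        on_call[write[0]] = write[1]

print(on_call)   # {'alice': False, 'bob': False} — nobody is on call
```

A truly serializable execution would run one transaction after the other, so the second would see only one doctor on call and refuse to go off duty.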
Solutions Related to Transactions in OceanBase Database
Read/Write Splitting
Each piece of data (or partition) in OceanBase Database has three replicas. Applications read and write the leader replica through OBProxy; the follower replicas do not serve requests. However, followers can be read when weak-consistency reads are enabled, at the risk that the data returned may not be up to date. Followers synchronize with the leader by replaying its commit logs (Clogs). Some distributed databases do this with the Raft protocol. Raft is easier to implement, but it requires Clogs to be confirmed in strict sequence; any gap forces a wait, which affects commit performance. OceanBase Database uses Multi-Paxos to synchronize Clogs, so Clogs may be sent in parallel and replayed out of order on follower replicas. As a result, the commit versions on the incremental chain of a follower may not be consecutive at all times, and empty logs may exist. To make sure that a snapshot version read from a follower contains correct data, all commit logs created before that snapshot version must already have been replayed. The snapshot version must not be too old either, because stale data is less useful to applications. The maximum allowed staleness is configured with the max_stale_time_for_weak_consistency parameter, which defaults to 5 seconds. A follower whose most recent replay falls outside this window does not become a candidate; if the delays of both followers exceed this value, the application still reads from the leader. Weak-consistency reads can be enabled in several ways: add a hint to your SQL query, or configure it for the entire session.
select /*+ read_consistency(weak) */ * from t1 ...;
or
set session ob_read_consistency=WEAK;
select * from t1 ...;
Enabling weak consistency reads also allows requests to access the read-only replica. Generally, a separate read-only replica is created in large-scale read/write splitting scenarios.
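The routing decision described above can be sketched as follows. The replica objects and the `last_replayed_at` field are my illustration of "a follower whose most recent replay is within the staleness window"; the actual routing logic inside OBProxy is more involved.

```python
# Sketch of weak-consistency routing: prefer a follower whose Clog replay is
# within max_stale_time_for_weak_consistency; fall back to the leader when
# every follower is too stale.
import time

MAX_STALE_S = 5          # mirrors the parameter's 5-second default

def choose_replica(leader, followers, now=None):
    now = time.time() if now is None else now
    for f in followers:
        # last_replayed_at: wall-clock time of the newest replayed Clog.
        if now - f["last_replayed_at"] <= MAX_STALE_S:
            return f["name"]
    return leader["name"]    # all followers too stale: read from the leader

t = 1_000.0
leader = {"name": "z1-leader"}
followers = [
    {"name": "z2-follower", "last_replayed_at": t - 30},   # 30s behind: skip
    {"name": "z3-follower", "last_replayed_at": t - 2},    # fresh enough
]
print(choose_replica(leader, followers, now=t))    # z3-follower
print(choose_replica(leader, [followers[0]], now=t))  # z1-leader (fallback)
```

The fallback to the leader is what keeps weak reads safe: staleness is bounded, never unbounded.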
Table Replication in OceanBase Database
Some tables, such as base tables, must not be split even in a distributed architecture. After large business tables are split, they are frequently joined with these base tables. Because the business data is distributed among multiple nodes, SQL queries on these tables follow a distributed execution plan with limited performance. To address this problem, some middleware products broadcast the content of the small, unsplit tables to all nodes. This type of synchronization usually relies on external tools and is therefore asynchronous, so small-table broadcasting carries the risk of data delay.
In OceanBase Database V2.2, a similar feature, called the replica table, is supported. OceanBase Database has a natural advantage for implementing this mechanism because synchronization between replicas is performed within the database itself. We only need to introduce a new replica type, the replication replica, and apply strong synchronization between the leader and the replication replicas. Strong synchronization means that the leader commits a transaction only after the Clog is written to disk on all replication replicas, which reduces the commit performance of the leader. However, tables that use the replica table mechanism are not updated frequently, so the performance cost is acceptable. In addition, you can specify the scope of replication replicas in OceanBase Database. Do not apply the replica table mechanism to frequently updated non-partitioned tables.
The replica table mechanism of OceanBase Database has one possible problem: when a replication replica becomes unavailable, the leader must wait for the recovery or timeout of that replica before committing a transaction, which degrades commit performance. Therefore, an unavailable replication replica is soon added to a blacklist, and requests are not sent to blacklisted replicas. This is good for applications. Likewise, once a replication replica recovers, it is soon removed from the blacklist and continues providing services. Detailed use cases of replica tables will be discussed in future posts.
Timeout Mechanism in OceanBase Database
Several timeout mechanisms are available in OceanBase Database, such as SQL execution timeout, transaction timeout, and idle timeout. Larger values can be set for the SQL timeout and transaction timeout parameters to imitate Oracle's behavior (never time out). This, however, only changes how problems manifest, not their root cause.
Recommendations on OceanBase Database Transactions
Large transactions are slower to commit and take up more memory, because a Clog in OceanBase Database is not generated and written to disk until the corresponding transaction is committed. However, no strict rule dictates when a transaction counts as large; factors such as transaction concurrency and installed server memory all play a part. When concurrency is high, we recommend limiting the number of records affected by a transaction to 1,000 or fewer; otherwise, it is fine to raise the limit to 10,000. Note that these are merely recommendations rather than limits enforced in the code, and the optimal setting can only be found by trial and error. We recommend that you do not use a single transaction to update hundreds of thousands or even millions of records in OceanBase Database. Such a statement is easy for the application to write, but it transfers all costs and risks to the database, which is unwise.
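One common way to follow this advice is to split a bulk update into bounded batches, committing each batch separately. The sketch below is generic application-side code; the `apply_batch` callback (e.g. an `UPDATE ... WHERE id IN (...)` followed by `COMMIT` through any DB-API driver) and the batch size are illustrative.

```python
# Split one huge update into bounded per-transaction batches, keeping each
# transaction within the recommended record limit.

def update_in_batches(ids, apply_batch, batch_size=1000):
    """Apply at most `batch_size` row updates per transaction."""
    batches = 0
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        apply_batch(chunk)    # e.g. UPDATE ... WHERE id IN (...); COMMIT
        batches += 1
    return batches

applied = []
n = update_in_batches(list(range(2500)), applied.append, batch_size=1000)
print(n, [len(c) for c in applied])   # 3 [1000, 1000, 500]
```

The trade-off is that the overall operation is no longer atomic, so the application must be able to resume from a partially completed run, for example by filtering on a "not yet updated" condition.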
Recommendations on Transaction Exception Handling
When exceptions occur, whether the transaction is committed or rolled back is determined solely by database rules, that is, by how the finite-state machine moves from one state to another. This result may differ from the message received by the client. Generally speaking, if the client receives a success message, the database must have committed the transaction. If the client receives a message stating that the transaction was rolled back or killed, the database must have rolled back the transaction. If the client receives any other exceptional message, the transaction is in an unknown state in the database. This conclusion applies to all databases, including OceanBase Database. Application developers who run into this problem for the first time may find it hard to understand and mistakenly think that OceanBase Database is at fault. A basic principle of database implementation is that when an error occurs, the database should undo as little of its completed work as possible while still guaranteeing the atomicity, consistency, isolation, and durability (ACID) properties. It is up to the application developer to decide how to recover from an exception. Many developers are optimistic and do not think through the complexity of exception handling when implementing business logic, so exceptions propagate up from the lowest level of the call stack until users see the error message. The whole purpose of aborting a transaction is to make retries safe. Although retrying an abnormal transaction is a simple and effective mechanism, it has some flaws:
· If the database commits a transaction but a network error occurs while the response is being sent, the client will consider the commit to have failed. Retrying would cause the transaction to be executed twice. A unique constraint on the table can reject the duplicate execution and prevent inconsistent business data.
· If a database error is caused by node overload (a performance bottleneck), retrying a transaction makes the performance problem worse. The best approach is to limit the number of retries and handle overload errors separately.
· For a temporary error such as deadlock, network jitter, or failover, we recommend that you retry the transaction. For a permanent error such as constraint violation, retrying the transaction is meaningless.
· If a business transaction includes other non-database operations such as sending emails and writing records, retrying such a transaction would bring additional side effects. Non-database operations are not supposed to be included in a business transaction in the first place. Otherwise, the two-phase commit mechanism must be implemented for the transaction.
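The retry guidance in the bullets above can be sketched as a small wrapper. The exception classes and the idempotency-key scheme are my illustration of the principles, not an OceanBase API: retry only transient errors, cap the retry count, and record an idempotency key so a commit whose response was lost is not applied twice.

```python
# Bounded retry with error classification and an idempotency key.

class TransientError(Exception):  # deadlock, network jitter, failover
    pass

class PermanentError(Exception):  # e.g. constraint violation
    pass

def run_with_retry(txn, idempotency_key, seen_keys, max_retries=3):
    for _attempt in range(max_retries):
        if idempotency_key in seen_keys:
            return "already-applied"      # lost response, not a new commit
        try:
            txn()
            seen_keys.add(idempotency_key)  # would be written in the same txn
            return "committed"
        except PermanentError:
            raise                          # retrying is meaningless
        except TransientError:
            continue                       # bounded retry
    return "gave-up"

seen = set()
calls = {"n": 0}

def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError()             # fails twice, then succeeds

print(run_with_retry(flaky_txn, "txn-42", seen))   # committed
print(run_with_retry(flaky_txn, "txn-42", seen))   # already-applied
```

In a real system the `seen_keys` set would be a table updated inside the same transaction as the business change, so the key and the data commit atomically.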
The preceding exception handling analysis applies to all relational databases, including OceanBase Database. In actual tests, when the load on an OceanBase node is very high (CPU utilization close to 100%), the communication (RPC) between the distributed transaction coordinator and a participant may time out, which results in the client receiving an error with an unknown transaction status. Although this error can be mitigated by adjusting the relevant RPC parameters, that is not a fundamental solution. OceanBase Database can be elastically scaled online: to avoid tenant performance bottlenecks, tenant or cluster resources should be planned in advance so that they can be scaled online as soon as needed.
Although business developers expect the database to return only two kinds of transaction messages (success or failure), the complexity of the distributed architecture and the uncertainty of its environment make the actual results much more varied. The common transaction-related errors in OceanBase Database are listed below:
1. -6001: ERROR 6001 (25000): OB-6001: Transaction set changed during the execution: the transaction set changed during execution. The transaction is rolled back.
2. -6002: ERROR 6002 (40000): OB-6002: the transaction is rolled back: the transaction is rolled back.
3. -6003: ERROR 1205 (HY000): OB-1205: Lock wait timeout exceeded; try restarting transaction: the wait for a lock exceeded the timeout value. The current SQL statement failed, but the transaction state is unchanged; the application can retry the failed statement or perform other operations.
4. -6004: ERROR 6004 (HY000): OB-6004: Shared lock conflict: a row lock conflict occurred while executing an INSERT/UPDATE/DELETE statement.
5. -6005: ERROR 6005 (HY000): OB-6005: Trylock row conflict: a conflict occurred while acquiring a row lock.
6. -6210: ERROR (25000): OB-4012: Transaction is timeout: the transaction has timed out. A rollback must be initiated before the current connection can be reused.
7. -6211: ERROR 6211 (25000): OB-6002: Transaction is killed: the transaction was killed. The transaction has been rolled back in the database.
8. -6213: ERROR 6002 (HY000): OB-6002: Transaction context does not exist: the transaction context does not exist. The transaction has been rolled back in the database.
9. -6224: ERROR 6002 (25000): OB-6002: transaction need rollback: the transaction needs to be rolled back. The database rolls it back; the client does not need to initiate a rollback.
10. -6225: ERROR 4012 (25000): OB-4012: Transaction result is unknown: the transaction result is unknown; it may have been committed or rolled back in the database. The business operation needs to be performed again after verification.
11. -6226: ERROR 1792 (25006): OB-1792: Cannot execute statement in a READ ONLY transaction: a statement cannot be executed in a read-only transaction.
In general, OceanBase Database employs a three-replica architecture: each piece of data (partition) has three replicas distributed across availability zones, one leader and two followers. By default, only the leader provides read and write services. Writes in OceanBase Database do not directly modify records in in-memory data blocks; instead, a new block of memory is allocated to store the incremental changes, and multiple modifications of a record are linked together in a list. OceanBase Database transaction logs contain only redo logs (Clogs) and no undo logs. The leader writes a record to a Clog and asks the other replicas to persist it; the transaction is committed only when a majority of the replicas have received the Clog and written it to disk. This process uses the Multi-Paxos protocol.
OceanBase Database also has other replica types: read-only replicas and replication replicas. The read-only replica contains only data and uses the weak-consistency model by default. The replication replica contains both data and logs, stays in strong synchronization with the leader, and uses the strong-consistency model by default. These two replica types can be used to implement the read/write splitting and table replication solutions. Compared with other databases, the read/write splitting and small-table broadcasting solutions in OceanBase Database are more efficient and more correct. OceanBase Database supports transactions: the two-phase commit protocol is used by default, and the processing of single-node transactions is additionally optimized. In the two-phase commit process, one of the participants acts as the coordinator. Both the coordinator and the participants are protected by the three-replica high-availability mechanism, so whichever of them fails will be restored soon and will continue processing the distributed transaction.
OceanBase Database supports two isolation levels, read committed and serializable, just like traditional databases. It also supports global timestamps, which are similar to the system change number (SCN) used by mainstream commercial databases. Therefore, OceanBase Database V2.x supports globally consistent snapshot reads.