This topic describes how redo logs ensure data durability and how redo logs are archived.
Overview
Redo logs are a key component of OceanBase Database. They are used to recover from downtime and to maintain data consistency among multiple replicas. Redo logs are physical logs that record all data changes in the database; to be specific, they record the results of write operations. By replaying redo logs one by one on top of a persisted data version, the database can be recovered to the latest data version.
OceanBase Database uses redo logs for the following two purposes:
Downtime recovery
Like most mainstream databases, OceanBase Database follows the write-ahead logging (WAL) principle. Redo logs are persisted before the transactions are committed to ensure the atomicity and durability of transactions, which conforms to the principle of atomicity, consistency, isolation, and durability (ACID). If an observer process exits or the OBServer node on which the process resides is down, you can recover data by restarting the OBServer node and scanning and replaying local redo logs. Data that is not persisted when the server is down can be recovered by replaying redo logs.
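The recovery path described above can be sketched as follows. This is a minimal illustration of WAL-style replay, not OceanBase's actual API; the record format and function names are hypothetical.

```python
# Minimal sketch of WAL-style downtime recovery (illustrative only):
# durable state is rebuilt by replaying persisted redo records, in order,
# on top of the last persisted snapshot.

def recover(snapshot: dict, redo_log: list) -> dict:
    """Replay redo records (op, key, value) on top of the last snapshot."""
    state = dict(snapshot)
    for op, key, value in redo_log:
        if op == "put":
            state[key] = value      # re-apply the recorded write result
        elif op == "delete":
            state.pop(key, None)
    return state

# Data written after the snapshot but lost from memory in the crash
# is recovered from the persisted log.
snapshot = {"a": "1"}
log = [("put", "b", "2"), ("put", "a", "3"), ("delete", "b", None)]
print(recover(snapshot, log))  # {'a': '3'}
```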
Multi-replica data consistency
OceanBase Database uses the Multi-Paxos protocol to synchronize redo logs across multiple replicas. For a transaction, a redo log is considered successfully written only if the log is synchronized to the majority of replicas. The transaction can be committed only after all redo logs are successfully written. Eventually, all replicas receive the same redo logs and replay the logs to recover the data. This ensures that changes made by a committed transaction take effect on all replicas and are consistent between the replicas. OceanBase Database provides stronger disaster recovery capabilities by persisting redo logs on multiple replicas.
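The majority-ack rule is the core of this guarantee. A hedged sketch of the quorum check (real Multi-Paxos handling in OceanBase is far more involved):

```python
# Sketch of the majority-ack rule used in Multi-Paxos log replication
# (illustrative only): a log entry is durable once a strict majority
# of replicas have persisted it.

def is_committed(acks: int, replica_count: int) -> bool:
    """True if the acknowledging replicas form a strict majority."""
    return acks >= replica_count // 2 + 1

print(is_committed(2, 3))  # True: 2 of 3 replicas is a majority
print(is_committed(2, 5))  # False: 3 of 5 are needed
```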
Log file types
OceanBase Database uses tenant-level log streams. The logs within a log stream must be logically consecutive and ordered. The logs generated by all tablets of the same tenant on a server are written to the same log stream, and the logs in the log stream are in total order.
In OceanBase Database, redo log files are referred to as commit log (clog) files. Clog files record the content of redo logs. They are located in the store/clog/tenant_xxxx directory, where xxxx represents the tenant ID, and are numbered from 0. File IDs are never reused. Each clog file is 64 MB in size. Clog files record the changes made to data in the database and guarantee data durability.
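The layout above can be mirrored with a small helper. This is purely illustrative of the naming and sizing rules described here; the helper functions are hypothetical, not part of OceanBase.

```python
# Illustrative helpers mirroring the clog layout described above:
# files live under store/clog/tenant_<id>, are numbered from 0,
# and each file is 64 MB.

CLOG_FILE_SIZE = 64 * 1024 * 1024  # 64 MB per clog file

def clog_path(tenant_id: int, file_id: int) -> str:
    """Path of a clog file for a given tenant."""
    return f"store/clog/tenant_{tenant_id}/{file_id}"

def file_id_for_offset(byte_offset: int) -> int:
    """Which clog file a logical byte offset falls into."""
    return byte_offset // CLOG_FILE_SIZE

print(clog_path(1002, 0))                     # store/clog/tenant_1002/0
print(file_id_for_offset(130 * 1024 * 1024))  # 2
```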
Log generation
The maximum size of a single redo log in OceanBase Database is 2 MB. During transaction execution, historical operations such as data writes and locking are maintained in the transaction context. In versions earlier than V3.x, OceanBase Database converts the historical operations in the transaction context into redo logs only when the transaction is committed, and submits the redo logs to the clog module in 2 MB units. The clog module then synchronizes the redo logs to all replicas and persists them. In V3.x and later versions, OceanBase Database provides the real-time log write feature: a redo log is generated and submitted to the clog module as soon as the transaction's pending data reaches 2 MB. The 2 MB unit is chosen for performance: every log submitted to the clog module must be synchronized to the majority of replicas over the Multi-Paxos protocol, which involves substantial network communication and is time-consuming. Unlike a redo log in a traditional database, a single redo log in OceanBase Database therefore aggregates the content of multiple write operations.
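The aggregation idea can be sketched as a simple batching loop. This is a hypothetical model of packing mutations into 2 MB entries, not OceanBase's actual buffering code.

```python
# Sketch of aggregating a transaction's mutations into ~2 MB redo log
# entries before submission to the clog module (hypothetical model).

MAX_LOG_SIZE = 2 * 1024 * 1024  # 2 MB cap per redo log entry

def batch_mutations(mutations: list) -> list:
    """Pack mutation payloads into log entries, flushing at the 2 MB cap."""
    entries, buf = [], b""
    for m in mutations:
        if buf and len(buf) + len(m) > MAX_LOG_SIZE:
            entries.append(buf)  # entry full: submit and start a new one
            buf = b""
        buf += m
    if buf:
        entries.append(buf)      # remaining data submitted at commit time
    return entries

# Eight 512 KB mutations aggregate into two 2 MB entries, so only two
# Multi-Paxos synchronization rounds are needed instead of eight.
muts = [b"x" * (512 * 1024)] * 8
print([len(e) for e in batch_mutations(muts)])  # [2097152, 2097152]
```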
A partition in OceanBase Database may contain three to five replicas. Only one replica can serve as the leader to generate redo logs, whereas other replicas become followers that can only receive logs.
Log compression
OceanBase Database supports log compression for transmission. You can specify the tenant-level parameter log_transport_compress_all to determine whether to compress redo logs for network transmission. Log compression for transmission can reduce the network bandwidth occupied for redo log synchronization. OceanBase Database also supports log compression for storage. You can specify the tenant-level parameter log_storage_compress_all to determine whether to compress redo logs for storage. Log compression for storage can reduce the occupied I/O bandwidth and storage space of the log disk.
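The trade-off these parameters control can be demonstrated with a generic compressor. OceanBase's actual codecs and log framing differ; zlib is used here only to show how compressing repetitive redo content reduces the bytes transmitted and stored.

```python
# Illustrative effect of log compression on transmitted/stored bytes.
# zlib stands in for whatever codec the database actually uses; redo
# content is often repetitive, so it compresses well.
import zlib

redo_payload = b"UPDATE t SET c1 = 42 WHERE pk = 7;" * 1000  # repetitive log content
compressed = zlib.compress(redo_payload)

print(len(redo_payload), len(compressed))
assert len(compressed) < len(redo_payload)  # less bandwidth and disk I/O used
```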
Log replay
Note
OceanBase Database supports parallel replay and parallel submission of redo logs at the transaction layer.
The replay of redo logs is the foundation of the high availability capabilities of OceanBase Database. After logs are synchronized to a follower replica, the follower hashes each log based on its transaction_id and its index in the linked list of callback operations, and then distributes the logs into different task queues of the replay thread pool of the current tenant. In OceanBase Database, redo logs of different transactions can be replayed in parallel, and different redo logs of the same transaction can also be replayed in parallel. This ensures both the correctness and the speed of log replay. During replay on a replica, a transaction context is first created, the operation history is then reconstructed within that context, and finally, when the commit log is replayed, the transaction is committed. Replay is, in effect, a re-execution of the transaction on the replica's data image.
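The dispatch rule can be sketched as a hash over (transaction_id, log index). This is a hypothetical model of the queue-selection step only; OceanBase's actual hashing and thread-pool internals are not shown here.

```python
# Sketch of distributing replay work: each log is hashed by its
# transaction id and its index among the transaction's logs, so logs of
# different transactions (and different logs of the same transaction)
# can land in different queues and replay in parallel. Illustrative only.

QUEUE_COUNT = 4  # hypothetical number of replay task queues per tenant

def replay_queue(transaction_id: int, log_index: int) -> int:
    """Pick the replay task queue for one redo log."""
    return hash((transaction_id, log_index)) % QUEUE_COUNT

# The same log always maps to the same queue, preserving per-queue order,
# while distinct logs spread across queues for parallelism.
```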
Log-based disaster recovery
By replaying redo logs, a follower executes the transactions that have been executed by the leader and thereby reaches the same data state as the leader. If the server on which the leader of a partition resides fails or is too overloaded to provide services, a replica on another server can be elected as the new leader. The new leader can continue to provide services because it holds the same logs and data as the original leader. OceanBase Database can continue to provide services as long as a majority of replicas remain available. A failed replica can replay redo logs after it restarts to recover the data that was not yet persisted, and eventually reaches the same data state as the leader.
In a traditional database, the state of an active transaction is lost together with the in-memory information when the server goes down or a new leader replica is elected. Active transactions recovered by replaying logs can only be rolled back because their states are unknown: from the perspective of redo logs, no record of a commit operation is found for them after all redo logs are replayed. In OceanBase Database, an active transaction can periodically write its data and state to logs and submit the logs to the majority of replicas, so that when a new replica is elected as the leader, the transaction can continue to execute on the new leader.
Log supplement
When executing transactions, the database system generates redo logs to ensure operation durability and recoverability for UPDATE statements. In the default full mode, a redo log records the complete information of the updated rows, including the column values that have not been modified. In minimal mode, a redo log records only the modified columns and their necessary context information to reduce the log size and enhance storage efficiency.
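The difference between the two modes can be sketched for a single UPDATE. The record layout below is hypothetical; it only contrasts what each mode keeps, as described above.

```python
# Hypothetical redo record for an UPDATE in the two modes described above:
# full mode keeps the complete post-update row image (including unmodified
# columns); minimal mode keeps only the modified columns plus the key
# needed to locate the row.

def make_redo(row: dict, changed: dict, mode: str, key: str = "id") -> dict:
    if mode == "full":
        return {**row, **changed}      # complete row image after the update
    return {key: row[key], **changed}  # key + modified columns only

row = {"id": 7, "name": "alice", "age": 30, "city": "SH"}
full = make_redo(row, {"age": 31}, "full")
minimal = make_redo(row, {"age": 31}, "minimal")
print(len(full), len(minimal))  # 4 2 -- minimal mode records fewer fields
```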
Log control and recycling
Logs record all changes made to data in the database. Before recycling logs, you need to make sure that data related to the logs is persisted to disks. If you recycle logs before related data is persisted, the data cannot be recovered after a fault.
In the log recycling strategy of OceanBase Database, the following two parameters are exposed to users:

log_disk_utilization_limit_threshold: specifies the maximum disk space available for clog files, as a percentage. The default value is 95, indicating that clog files can occupy at most 95% of the total log disk space. This is a hard limit. If actual usage exceeds this limit, the OBServer node rejects writes from new transactions and stops accepting logs synchronized from other OBServer nodes. In this case, all read-write transactions that access this OBServer node receive the "transaction needs rollback" error.

log_disk_utilization_threshold: specifies the log disk usage threshold that triggers clog file reuse. When the system is running properly, the earliest log files on the clog disk are reused once disk usage reaches this threshold. The default value is 80% of the clog disk space and cannot be modified. Therefore, clog disk usage does not exceed 80% in normal cases. If usage does exceed 80%, the "clog disk is almost full" error is reported to remind the database administrator (DBA).
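The two thresholds can be summarized as a simple decision rule. This is a behavioral model of the policy described above, not OceanBase's actual recycling code.

```python
# Behavioral model of the two clog disk thresholds described above:
# at 80% usage the oldest clog files are reused; at the hard limit
# (default 95%) new transaction writes are rejected.

REUSE_THRESHOLD = 0.80  # log_disk_utilization_threshold (fixed)
HARD_LIMIT = 0.95       # log_disk_utilization_limit_threshold (default)

def clog_disk_action(used: int, total: int) -> str:
    usage = used / total
    if usage >= HARD_LIMIT:
        return "reject writes: transaction needs rollback"
    if usage >= REUSE_THRESHOLD:
        return "reuse oldest clog files: clog disk is almost full"
    return "normal"

print(clog_disk_action(70, 100))  # normal
print(clog_disk_action(85, 100))  # reuse oldest clog files: ...
print(clog_disk_action(96, 100))  # reject writes: ...
```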