This topic describes how redo logs ensure data durability and how redo logs are archived.
Overview
Redo logs are a key component of OceanBase Database. They are used for downtime recovery and for maintaining data consistency among multiple replicas. Redo logs are physical logs that record all data changes in the database. Specifically, they record the results of write operations. Starting from a specific persistent data version, redo logs can be replayed one by one to bring the data to the latest version.
OceanBase Database uses the redo logs for the following two purposes:
Downtime recovery
Like most mainstream databases, OceanBase Database follows the write-ahead logging (WAL) principle. Redo logs are persisted before a transaction is committed, which ensures the atomicity and durability required by the atomicity, consistency, isolation, and durability (ACID) properties. If an observer process exits or the server on which the process resides goes down, you can recover data by restarting the OBServer, which scans and replays the local redo logs. Data that had not been persisted when the server went down is recovered by this replay.
Multi-replica data consistency
OceanBase Database uses the Multi-Paxos protocol to synchronize redo logs across multiple replicas. For a transaction, a redo log is considered successfully written only if the log is synchronized to the majority of replicas. The transaction can be committed only after all redo logs are successfully written. Eventually, all replicas receive the same redo logs and replay the logs to recover the data. This ensures that changes made by a committed transaction take effect on all replicas and are consistent between the replicas. OceanBase Database provides stronger disaster recovery capabilities by persisting redo logs on multiple replicas.
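The two purposes above can be illustrated with a minimal sketch. It is not the Multi-Paxos implementation of OceanBase Database: the `Replica` class and the `commit` function are hypothetical names, a Python list stands in for durable log storage, and the handling of entries that reach only a minority of replicas is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica: a persisted redo log plus in-memory data."""
    name: str
    online: bool = True
    log: list = field(default_factory=list)    # stands in for durable storage
    data: dict = field(default_factory=dict)   # volatile state, lost on a crash

    def persist(self, entry) -> bool:
        """Append a redo entry and return True as the acknowledgement."""
        if not self.online:
            return False
        self.log.append(entry)
        return True

    def replay(self) -> None:
        """Rebuild the in-memory state by replaying the log in order."""
        self.data = {}
        for key, value in self.log:
            self.data[key] = value


def commit(replicas, key, value) -> bool:
    """Write-ahead: persist the redo entry first, apply it only on a majority ack.

    Simplification: entries persisted on a minority only are not cleaned up.
    """
    entry = (key, value)
    acks = sum(1 for r in replicas if r.persist(entry))
    if acks <= len(replicas) // 2:
        return False  # no majority, so the transaction cannot commit
    for r in replicas:
        if r.online:
            r.data[key] = value
    return True
```

In this sketch, a replica that loses its in-memory `data` can call `replay()` to rebuild it from `log`, which mirrors downtime recovery, and `commit` succeeds with one of three replicas offline, which mirrors the majority rule.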
Log file types
OceanBase Database uses partition-level log streams. The logs of a partition must be logically consecutive and ordered. All log streams on a server are ultimately written to the same log file.
OceanBase Database provides the following two types of redo log files:
clog
A commit log (clog) stores the content of a redo log. clog files are located in the store/clog directory. The log file ID starts from 1 and increases continuously. A file ID is never reused, and each log file is 64 MB in size. clog files record changes made to the data in the database and guarantee data durability.
ilog
An index log (ilog) records the location information of commit logs with the same file ID in the same partition. These commit logs have already been confirmed by the majority of replicas. ilog files are located in the storage/ilog directory. The log file ID starts from 1 and increases continuously. A file ID is never reused, and the file size is variable. ilog files are indexes for clogs and help optimize log management. Deleting ilog files does not affect data durability but may increase the system recovery time. There is no one-to-one correspondence between ilog files and clog files. An ilog records much less content than a clog, so in general there are far fewer ilog files than clog files.
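The relationship between the two file types can be sketched as follows. This is an illustration only: the `ClogWriter` class, its in-memory `index` dictionary, and the rule that an entry never spans two files are assumptions, not the on-disk format of OceanBase Database.

```python
CLOG_FILE_SIZE = 64 * 1024 * 1024  # 64 MB per clog file, as stated above


class ClogWriter:
    """Hypothetical writer that tracks where each log entry lands."""

    def __init__(self):
        self.file_id = 1   # file IDs start at 1 and are never reused
        self.offset = 0    # write position inside the current file
        self.index = {}    # ilog-like index: (partition_id, log_id) -> (file_id, offset)

    def append(self, partition_id, log_id, size):
        """Record an entry of `size` bytes and index its location."""
        if size > CLOG_FILE_SIZE:
            raise ValueError("entry is larger than a clog file")
        if self.offset + size > CLOG_FILE_SIZE:
            self.file_id += 1   # roll over to the next file
            self.offset = 0
        location = (self.file_id, self.offset)
        self.index[(partition_id, log_id)] = location
        self.offset += size
        return location
```

Because the index stores only a small location record per entry, it stays much smaller than the log content itself, which is consistent with ilog files being far fewer than clog files.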
Log generation
The maximum size of a redo log in OceanBase Database is 2 MB. During transaction execution, historical operations such as data writes and locking are maintained in the transaction context. In versions earlier than V3.x, OceanBase Database converts the historical operations in the transaction context to redo logs only when the transaction is committed. The redo logs are submitted to the clog module in units of 2 MB. Then, the clog module synchronizes the redo logs to all replicas and persists the logs. In V3.x and later versions, OceanBase Database provides the real-time log write feature: a redo log is generated and submitted to the clog module as soon as the data of the transaction reaches 2 MB in size. The 2 MB unit is chosen for performance. Every log that is submitted to the clog module must be synchronized to the majority of replicas over the Multi-Paxos protocol, which requires a large amount of network communication and is time-consuming. For this reason, unlike in traditional databases, a single redo log in OceanBase Database aggregates the content of multiple write operations.
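The aggregation described above can be sketched as a buffer that batches writes into logs of at most 2 MB. The `TxRedoBuffer` class and its `submit` callback are hypothetical; the callback stands in for the clog module, and flushing just before the limit would be exceeded is an assumption of this sketch.

```python
MAX_REDO_LOG_SIZE = 2 * 1024 * 1024  # 2 MB upper bound per redo log


class TxRedoBuffer:
    """Hypothetical per-transaction buffer that aggregates writes into redo logs."""

    def __init__(self, submit):
        self.submit = submit      # callback standing in for the clog module
        self.pending = []         # write results not yet turned into a redo log
        self.pending_bytes = 0

    def write(self, mutation: bytes) -> None:
        """Buffer one write; emit a redo log first if the limit would be exceeded."""
        if len(mutation) > MAX_REDO_LOG_SIZE:
            raise ValueError("a single write exceeds the redo log limit")
        if self.pending_bytes + len(mutation) > MAX_REDO_LOG_SIZE:
            self.flush()
        self.pending.append(mutation)
        self.pending_bytes += len(mutation)

    def flush(self) -> None:
        """Submit everything buffered so far as one redo log."""
        if self.pending:
            self.submit(b"".join(self.pending))
            self.pending = []
            self.pending_bytes = 0

    def commit(self) -> None:
        """Submit the remaining buffered writes when the transaction commits."""
        self.flush()
```

Batching many small writes into one log means one round of majority synchronization covers all of them, which is the performance argument made above.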
A partition in OceanBase Database may contain three to five replicas. Only one replica can serve as the leader to generate redo logs, whereas other replicas become followers that can only receive logs.
Log replay
The replay of redo logs is the basis on which OceanBase Database provides high availability. After logs are synchronized to a follower replica, the follower replays them in different task queues of the same thread pool. Logs are assigned to task queues based on the hash value of their transaction_id. As a result, redo logs of different transactions are replayed in parallel, whereas redo logs of the same transaction are replayed in sequence, which makes replay both fast and accurate. Before the logs of a transaction are replayed on a replica, the context of the transaction is created. Historical operations are recovered in this context during the log replay. When the log recording the commit operation is replayed, the transaction is committed. In effect, the transaction is executed again on the image of the replica.
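The queue assignment and per-transaction replay can be sketched as follows. The log record layout (dictionaries with `transaction_id`, `type`, and `changes` keys), the queue count, and the function names are assumptions made for illustration, and the queues are processed one after another here rather than by a real thread pool.

```python
from collections import defaultdict

NUM_QUEUES = 4  # hypothetical number of task queues


def assign_queues(logs, num_queues=NUM_QUEUES):
    """Route each log to a queue chosen by the hash of its transaction_id."""
    queues = defaultdict(list)
    for log in logs:
        queues[hash(log["transaction_id"]) % num_queues].append(log)
    return queues


def replay_queue(queue, data):
    """Replay one queue in order and return the contexts of still-active transactions."""
    contexts = {}
    for log in queue:
        tx_id = log["transaction_id"]
        ctx = contexts.setdefault(tx_id, {})  # create the context on first use
        if log["type"] == "redo":
            ctx.update(log["changes"])        # recover historical operations
        elif log["type"] == "commit":
            data.update(contexts.pop(tx_id))  # commit makes the changes visible
    return contexts
```

All logs of one transaction hash to the same queue, so their relative order is preserved, while different queues are free to run in parallel.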
Log-based disaster recovery
By replaying redo logs, a follower executes the transactions that have been executed by the leader. This way, the follower reaches the same data state as the leader. If the server on which the leader of a partition resides fails or is overloaded and cannot provide services, a replica on another server can be elected as the new leader. The new leader can continue to provide services because it shares the same logs and data with the original leader. OceanBase Database can continue to provide services as long as the majority of replicas remain available. After a failed replica is restarted, it replays redo logs to recover the data that had not been persisted. Eventually, the failed replica reaches the same data state as the leader.
In traditional databases, the state of an active transaction is lost together with the in-memory information when the server goes down or a new leader replica is elected. Active transactions that are recovered by replaying logs can only be rolled back because their states are unknown. From the perspective of redo logs, no log recording the commit operation is found after all the redo logs are replayed. In OceanBase Database, when a new replica is elected as the leader, an active transaction can write its data and state to logs within a certain period of time and submit the logs to the majority of replicas. In this way, the transaction can continue to be executed on the new leader.
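The availability condition comes down to simple majority arithmetic, sketched below. The function names `majority` and `can_serve` are hypothetical.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def can_serve(replica_count: int, failed: int) -> bool:
    """Service continues only while the surviving replicas still form a majority."""
    return replica_count - failed >= majority(replica_count)
```

With three replicas, one failure is tolerated; with five replicas, two failures are tolerated.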
Log control and recycling
Logs record all changes made to data in the database. Before recycling logs, you need to make sure that data related to the logs is persisted to disks. If you recycle logs before related data is persisted, the data cannot be recovered after a fault.
The log recycling strategy of OceanBase Database exposes the following two configuration items:
clog_disk_usage_limit_percentage
This item specifies the maximum percentage of clog or ilog disk space that can be used. The default value is 95. This is a forcible limit. If the actual percentage exceeds the upper limit, an OBServer disallows data writes from any new transactions and does not accept logs synchronized from other OBServers. For all the read and write transactions that access this OBServer, the "transaction needs rollback" error is returned.
clog_disk_utilization_threshold
This item specifies the water level for reusing the clog or ilog disk. When the system is running properly, the earliest log files stored on the clog or ilog disk are reused if the disk usage reaches the specified water level. The default water level is 80% of the independent space of the clog or ilog disk, which cannot be modified. Therefore, the percentage of the clog or ilog disk space occupied does not exceed 80% in normal cases. If the percentage exceeds 80%, the "clog disk is almost full" error is returned to remind the database administrator (DBA).
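How the two items interact can be sketched as a simple threshold check. The `clog_disk_state` function and its return labels are hypothetical, and the constants only mirror the default values stated above; this is not how an OBServer evaluates disk usage internally.

```python
CLOG_DISK_USAGE_LIMIT_PERCENTAGE = 95  # forcible limit (default value)
CLOG_DISK_UTILIZATION_THRESHOLD = 80   # water level for reusing log files


def clog_disk_state(used_bytes: int, total_bytes: int) -> str:
    """Classify clog disk usage against the two limits described above."""
    used_percentage = used_bytes * 100 / total_bytes
    if used_percentage > CLOG_DISK_USAGE_LIMIT_PERCENTAGE:
        return "reject_writes"   # new transaction writes and synced logs are refused
    if used_percentage >= CLOG_DISK_UTILIZATION_THRESHOLD:
        return "reuse_oldest"    # the earliest log files become eligible for reuse
    return "normal"
```

For example, usage of 50% is classified as normal, 80% triggers reuse of the earliest files, and 96% crosses the forcible limit.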