Committing a 1 TB Transaction as Fast as 1 MB — How MaLT Makes Commit, Rollback, and Recovery Constant-Time

Chen Qian
Chen Qian
Published on June 24, 2026Updated on 2026-06-24
2 minute read
Key Takeaways
  • MaLT embeds transaction metadata directly into LSM-tree internals, making commit, rollback, and crash recovery take constant time regardless of transaction size (1 MB or 1 TB).
  • A 2.5-million-row bulk import completes 23.9% faster end-to-end, and crash recovery stays under 25 seconds even for 200 GB uncommitted transactions.
  • The key insight: stop treating the storage engine as a black-box KV store — let the LSM-tree understand transactions. MaLT ships in OceanBase 4.x production, serving thousands of customers.


Whether you commit 1 MB or 1 TB, MaLT's commit, rollback, and recovery time remains virtually unchanged. A single 2.5-million-row bulk data import saw its end-to-end time drop from 11,778 seconds to 8,966 seconds — 23.9% faster.

Both numbers come from the same change: MaLT stops treating the storage engine as a black-box key-value store and instead embeds transaction information directly into the LSM-tree. Developed by the OceanBase team and published at SIGMOD 2025, it is the first framework to manage large transactions natively within an LSM-tree — rather than abstracting them into KV reads and writes. It is deployed in OceanBase 4.x production, serving thousands of customers.

This article covers one thing: what happens when you move transaction state from outside the storage engine to inside the LSM-tree — and how that changes commit, rollback, recovery, and everyday reads and writes. We start with the problem: why a large transaction can bring an LSM-tree-based database to its knees.

Background: Why Large Transactions Break LSM-Trees

First, let's define "large transaction." Not a few dozen rows in a funds transfer, but hundreds of thousands to millions of rows in a single batch write: end-of-day batch processing in financial systems, data imports committing 2.5 million rows at once, with data volumes ranging from hundreds of gigabytes to 1 TB. What they share: the volume of uncommitted data far exceeds available memory.

LSM-trees are particularly vulnerable to this workload. Their write path works like this: data enters an in-memory MemTable first, then gets flushed to immutable on-disk SSTables once the MemTable is full. Before the transaction commits, these modifications are in an uncommitted state — invisible to other transactions, yet must remain rollback-ready at all times. Existing LSM-based systems handle this uncommitted data in two ways, each with a ceiling:

  1. Keep everything in memory (systems using the Percolator model): Simple to implement, but transaction size is limited by available memory — exceed it and you OOM.
  2. Allow spilling to disk (RocksDB's steal policy): Uncommitted data can be written to disk, but the entire WAL must be retained until commit — so transaction size is bounded by log capacity.

The second problem is crash recovery. ARIES-style recovery has two phases: redo and undo. The undo phase's duration is proportional to the size of unfinished transactions — the larger the transaction, the more operations to roll back, and row locks remain held throughout. The database is unavailable during this time. A 1 TB transaction that was mid-flight when the system crashed can leave the database offline for hours.

These problems share the same root cause. The core tension: most LSM-tree-based databases treat the storage engine as a black-box key-value store. Under this abstraction, committing a transaction means rewriting every modified row to update its version number to the commit version (backfill). Rolling back means overwriting them back (undo). Since SSTables are immutable, these rewrites can only be appended as new KV entries. The result: the additional I/O for commit and rollback is exactly proportional to the transaction's size, and because the engine has no visibility into transaction semantics, this overhead cannot be optimized away.

MaLT's goal is to tear down this black box.

Core Insight: Embedding Transaction Information into the LSM-Tree

Tearing down the black box means the LSM-tree no longer sees only keys and values — it understands transactions. MaLT does this in two steps: equip the LSM-tree with two transaction tables, and embed transaction metadata into every row.

Start with the two tables.

TCT (Transaction Context Table) manages active transactions. Every in-flight transaction has its context stored in the TCT: transaction state, commit version, and an operation index called Tx_Opts — recording which rows the transaction has modified, in execution order. The TCT resides entirely in memory. It only flushes to disk during checkpoints, via the standard LSM-tree flush path — no disk access during normal operation.

TDT (Transaction Data Table) manages terminated transactions. Once a transaction commits or aborts, its transaction data is transferred from the TCT to the TDT. The critical detail: the TDT is itself an LSM-tree (in-memory TMemTable + on-disk TSSTable), inheriting two LSM-tree properties: sequential writes and periodic background compaction to reclaim expired data. Transaction data moves from TCT to TDT via zero-copy — no data duplication, only reference transfer.

These two tables alone solve the "uncommitted data doesn't fit" problem: uncommitted modifications don't need to stay entirely in memory (the TCT stores only context), and the WAL doesn't need to be pinned. Row data lives in the LSM-tree; its state is tracked externally by TCT/TDT. Transaction size is no longer limited by memory or log capacity.

The second step embeds transaction metadata into every row. LSM-trees are already multi-versioned (each row carries a TxID and a version number — this globally incrementing commit version is called SCN in OceanBase, a convention we'll use throughout). MaLT adds three fields to each row on top of this:

  • Exist: Whether the row exists (distinguishes inserts from deletes);
  • Lock: Whether the row is held by a transaction's row lock;
  • Max_Version: The highest committed version number for this row.

These three fields enable concurrency control to complete entirely in memory: primary key conflicts check Exist, row lock conflicts check Lock, lost update detection checks Max_Version — none require reading the original row back from disk.

The entire design in one sentence: move transaction state and transaction metadata from outside the storage engine to inside the LSM-tree, embedded in rows. Let's trace a concrete transaction to see how this changes commit, rollback, recovery, and read/write paths.

After Moving Transactions into the LSM-Tree

Embedding transaction state into the LSM-tree is not an isolated optimization — it propagates through commit, rollback, recovery, and read/write paths. We examine each in turn. For concreteness, follow a transaction T: it updates user:42 from v5 to v6, and also inserts hundreds of thousands of new rows.

Commit and Rollback in Constant Time

During execution, T's modifications are written into the MemTable as uncommitted versions, each tagged with T's TxID. The TCT records T's context; Tx_Opts records which rows T has modified, in order.

At commit time, systems that treat the LSM-tree as a KV store must rewrite all those hundreds of thousands of rows, updating each version number to the commit version (backfill) — cost proportional to transaction size. MaLT does not rewrite. It marks T as committed in the TCT, records the commit SCN, transfers the transaction data to TDT — commit is done. The uncommitted versions remain in place, to be processed by compaction: when merging data across levels, each uncommitted version triggers a lookup of its owning transaction's state in TDT — committed means backfill the version number; aborted means discard. The WAL maintains a reclamation watermark (an SCN); compaction only processes terminated transaction data older than this watermark, then reclaims it.

Rollback follows the same path: abort just flips the state in TDT, and uncommitted versions are cleaned up during the next compaction. Commit and rollback are therefore independent of transaction size — whether T modified 1 MB or 1 TB, this step is just flipping a few state fields and transferring a reference.

Small transactions below a threshold take a different path: they use Tx_Opts to backfill/undo in-place within the MemTable, bypassing TDT and compaction entirely.

This works because the LSM-tree can now distinguish which versions belong to which transaction and whether that transaction has terminated. A black-box KV store lacks this information and must pay the full rewrite cost at commit time.

Recovery Without Undo

Suppose T hasn't finished committing when the database crashes. ARIES recovery runs redo then undo; the undo phase erases T's uncommitted modifications row by row, with duration proportional to T's size — the root cause of slow large-transaction recovery.

MaLT skips undo entirely. Redo proceeds as normal: first restore transaction contexts from the checkpoint-persisted TCT, then replay the log from after the checkpoint. The log is a physiological log — each entry carries an SCN — and replay can proceed out of order in parallel: when replaying a record, compare its SCN against the target row's current version SCN to decide whether to apply. No strict ordering required. After redo completes, all active transactions (including T) are restored to their pre-crash state.

Undo is eliminated entirely. Uncommitted versions don't need to be cleaned up during recovery: when they're read later, the system checks the corresponding transaction's status in TDT — if aborted, skip that version. Rollback cleanup is deferred to subsequent compaction. Recovery time is therefore independent of transaction size — state is visible in TDT, reads can make the determination on the fly, with no need to pause for row-by-row undo while holding locks.

One Fewer Disk Access on Read and Write Paths

The first two optimizations target large transactions specifically. This one benefits all transactions. With transaction information embedded in data, both reads and writes save one disk I/O each.

Reads: Each MemTable and SSTable carries version bounds — an SCN range and a RowKey range. A snapshot read first identifies the data sets whose bounds intersect with the query's snapshot version to determine the lower bound, then fetches data. Entire blocks whose bounds don't intersect are skipped — no need to read them in only to discover they don't contain the target version. To further reduce the cost of reading transaction data itself, MaLT adds a two-level TxDataCache for TDT: a query-level small cache (a few records, list structure) handles repeated reads of the same transaction data within a single query; an LRU large cache handles hot spots across queries.

Writes: Primary key conflict detection, row lock checks, and lost update detection traditionally require reading the corresponding row from disk to confirm. MaLT directly checks the inline Exist, Lock, and Max_Version fields — these write-path operations no longer access disk.

On TPC-C, MaLT with these optimizations enabled achieves ~23% higher tpmC and ~22% lower average latency than the version treating LSM-tree as a black-box KV store (OB-NOP).

Performance

Test machine: 64-core Intel Xeon Platinum 8369B, 400 GB memory. Comparison group includes OceanBase itself (OB), OB without MaLT (OB-NM), OB without read/write optimizations treating LSM-tree as pure KV (OB-NOP), MySQL 8.0, and two commercial distributed databases DB-A and DB-B. Benchmarks are primarily single-node to isolate the storage layer and exclude distributed factors like networking.

Bulk import. Simulating financial batch imports, end-to-end time drops from 11,778.4 seconds to 8,966.0 seconds — 23.9% faster.

Commit and rollback. Using Sysbench, scaling a single transaction from 1 GB to 1 TB and measuring commit and rollback latency separately. MaLT's curves are essentially flat; all other systems rise with transaction size. MySQL reaches hour-level rollback times at 1 TB scale, with overhead dominated by page-by-page undo and binlog sync. DB-B fails with errors once transactions exceed 50 GB. DB-A completes but is very inefficient with large transactions because it must actually rewrite data.

Throughput tells the same story: 10 threads concurrent, transactions from 512 MB to 500 GB — MaLT achieves the highest TPS at every size.

Crash recovery. Testing time-to-available after abnormal shutdown with transactions ranging from 20 GB to 200 GB. MaLT holds steady at 23–25 seconds, virtually unaffected by transaction size. OB-NM rises from 27 seconds at 20 GB to 175 seconds at 200 GB.

Online workload and overhead. Returning to regular workloads, TPC-C shows MaLT achieves ~23% higher tpmC and ~22% lower average latency compared to OB-NOP. TDT's space overhead is modest: 3 million transactions occupy 0.3 GB, 30 million transactions occupy 3.09 GB, and since TDT is itself an LSM-tree, compaction periodically reclaims space. Transaction data read latency is reduced by the TxDataCache, with the benefit more pronounced when repeatedly accessing the same batch of transaction data.

The ablation study (OB / OB-NOP / OB-NM) maps each layer of benefit back to the same change: embedding transaction information into the LSM-tree. The gains are not a stack of independent optimizations.

Limitations and Future Directions

Embedding transaction information into the LSM-tree has costs. More in-row versions mean reclamation depends on compaction — poor scheduling can cause expired transaction data to accumulate. Merging backfill/undo into compaction changes compaction's burden and triggering heuristics. TDT is itself an LSM-tree — space-efficient, but still represents additional space amplification. Furthermore, constant-time operations primarily target large transactions — for small, high-concurrency transactions, TCT/TDT introduces overhead, which is why MaLT provides a threshold-based fast path for small transactions that uses Tx_Opts without touching the transaction tables.

Two engineering challenges remain in distributed scenarios. The two-phase commit participant list can grow very long, bloating the log, requiring some participant information to be placed in TCT's execution context. Follower replay must keep up with the leader, requiring the same Tx_Opts index as the leader to maintain consistent progress.

One future direction is to backfill limited transaction information directly into SSTable metadata, enabling read/write optimizations to take effect without waiting for compaction, and removing the transaction tables from the critical path.

Regardless, MaLT transforms what was a cost linearly proportional to transaction size into a constant: whether 1 MB or 1 TB, commit, rollback, and recovery take roughly the same time. The method is one sentence — don't treat the storage engine as a black box; let the LSM-tree understand transactions. It ships in OceanBase 4.x production, serves thousands of customers, and the source code is in OceanBase's open-source repository.

Paper: MaLT: A Framework for Managing Large Transactions in OceanBase (SIGMOD 2025). Source code: OceanBase GitHub.
Share
X
linkedin
mail