perf: investigate using RocksDB 2PC mechanism for Raft log #16948

@petermattis

Description

RocksDB supports a two-phase commit (2PC) mechanism for use by RocksDB transactions (which we do not currently use).

The 2PC mechanism allows persisting a WriteBatch (via a prepare operation) to the RocksDB WAL without the mutations being applied to the MemTable. A subsequent WriteBatch can be applied which either commits or rolls back the prepared mutations. The upshot of this mechanism is that a set of mutations can be written to the WAL for persistence and only later added to the MemTable. This differs from the normal mode of applying a WriteBatch where the batch is atomically written to the log and the operations are immediately added to the MemTable.
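The prepare/commit split can be illustrated with a toy model. This is a hypothetical Go sketch with invented types, not the actual RocksDB API (in RocksDB proper the entry points are `Transaction::SetName`, `Transaction::Prepare`, and `Transaction::Commit` on a pessimistic `TransactionDB`):

```go
package main

import "fmt"

// Engine is a toy storage engine with a WAL and a MemTable, modeling
// the difference between a normal batch write and a 2PC prepare/commit.
type Engine struct {
	wal      []string                     // ordered log records
	memtable map[string]string            // applied key/value state
	prepared map[string]map[string]string // named prepared-but-uncommitted batches
}

func NewEngine() *Engine {
	return &Engine{
		memtable: map[string]string{},
		prepared: map[string]map[string]string{},
	}
}

// Write applies a batch the normal way: one WAL record plus immediate
// insertion into the MemTable.
func (e *Engine) Write(batch map[string]string) {
	e.wal = append(e.wal, fmt.Sprintf("write %v", batch))
	for k, v := range batch {
		e.memtable[k] = v
	}
}

// Prepare persists a batch to the WAL only; the MemTable is untouched.
func (e *Engine) Prepare(name string, batch map[string]string) {
	e.wal = append(e.wal, fmt.Sprintf("prepare %s %v", name, batch))
	e.prepared[name] = batch
}

// Commit writes a small commit marker to the WAL and only then applies
// the previously prepared mutations to the MemTable.
func (e *Engine) Commit(name string) {
	e.wal = append(e.wal, "commit "+name)
	for k, v := range e.prepared[name] {
		e.memtable[k] = v
	}
	delete(e.prepared, name)
}

func main() {
	e := NewEngine()
	e.Prepare("txn1", map[string]string{"a": "1"})
	_, applied := e.memtable["a"]
	fmt.Println(len(e.wal), applied) // 1 false: durable in the WAL, not yet applied
	e.Commit("txn1")
	fmt.Println(len(e.wal), e.memtable["a"]) // 2 1: commit marker written, batch applied
}
```

The key property for our purposes is that the payload bytes hit the WAL exactly once, at prepare time; the commit record is just a small marker.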

Currently, the Raft leaders and followers have essentially the same behavior with respect to the Raft log. The Raft state machine tells us to append some entries to the Raft log and sometime later it tells us to "commit" those entries (i.e. apply them). In the steady/happy state, Raft log entries are appended to the log and almost immediately committed. A short while later, a heuristic triggers truncation of the Raft log, removing entries that have been committed to all of the replicas. Currently that heuristic allows a modest number of entries to build up before truncation, but there isn't a strict need for that. We could truncate the Raft log on followers immediately after an entry has been committed. Doing so could make catch-up following a change in leadership more expensive, but let's ignore that for now.

So the steady/happy state on a follower essentially looks like:

  1. Write Raft log entry X
  2. Apply Raft log entry X
  3. Delete Raft log entry X

And these operations happen in quick succession: the time between writing a Raft log entry and applying it is measured in milliseconds, and deletion follows rapidly as well. Under the hood, the above operations look like:

  1. Write Raft log entry X (write Raft log entry to RocksDB WAL and insert into MemTable)
  2. Apply Raft log entry X (apply batch inside Raft log entry: write to RocksDB WAL and insert into MemTable)
  3. Delete Raft log entry X (add tombstone for Raft log entry: write to RocksDB WAL and insert into MemTable)

The contents of the Raft log entry are written twice to the WAL and twice to the MemTable. The MemTable is often flushed before the deletion occurs, so we experience 4x write amplification even before we start talking about normal RocksDB write amplification.
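The 4x figure can be made concrete with some toy accounting over the three steps above, assuming the entry payload of `n` bytes dominates and the deletion tombstone is negligible (hypothetical helper, for illustration only):

```go
package main

import "fmt"

// bytesWritten tallies the bytes an n-byte Raft log entry costs in the
// steady state today: the entry is written to the WAL and MemTable when
// appended, and its contained batch is written to both again when applied.
// The small tombstone from the deletion step is ignored.
func bytesWritten(n int) (wal, memtable int) {
	// 1. Write Raft log entry: payload to WAL + MemTable.
	wal += n
	memtable += n
	// 2. Apply Raft log entry: the batch inside goes to WAL + MemTable again.
	wal += n
	memtable += n
	// 3. Delete Raft log entry: tombstone only, negligible here.
	return wal, memtable
}

func main() {
	wal, mem := bytesWritten(1000)
	fmt.Println(wal+mem, "bytes written for a 1000-byte entry") // 4000: ~4x amplification
}
```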

Can we use the RocksDB 2PC mechanism to eliminate part of this overhead? The high-level idea is to "prepare" a Raft log entry to RocksDB, causing it to be written to the WAL. If the entry is committed, we then write a commit marker to the WAL, which both applies the entry and deletes it from the Raft log. On startup, RocksDB scans the WAL and gives us access to the prepared-but-not-committed transactions, which correspond to uncommitted Raft log entries. On leadership change, we'd want to rebuild the indexable Raft log so that we can fall back to the existing code. I haven't put much thought into how that would work.
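Under this scheme the follower's three steps collapse into a prepare at append time plus a single commit marker that both applies the entry and retires it from the log. A hypothetical sketch with toy types (not the actual CockroachDB or RocksDB interfaces):

```go
package main

import "fmt"

// Entry is a Raft log entry whose payload is the WriteBatch to apply.
type Entry struct {
	Index uint64
	Batch map[string]string
}

// ProposedEngine sketches the 2PC-based scheme: Append prepares the
// entry (WAL write only), and CommitEntry writes one commit marker that
// applies the batch. No separate log-entry key or tombstone is written.
type ProposedEngine struct {
	wal      []string
	state    map[string]string
	prepared map[uint64]Entry // prepared-but-uncommitted entries = the Raft log tail
}

func NewProposedEngine() *ProposedEngine {
	return &ProposedEngine{
		state:    map[string]string{},
		prepared: map[uint64]Entry{},
	}
}

// Append persists the entry durably via a prepare; nothing is applied
// yet, so the payload is written once instead of twice.
func (e *ProposedEngine) Append(ent Entry) {
	e.wal = append(e.wal, fmt.Sprintf("prepare %d", ent.Index))
	e.prepared[ent.Index] = ent
}

// CommitEntry writes the commit marker and applies the prepared batch;
// removing the entry from `prepared` plays the role of log truncation.
func (e *ProposedEngine) CommitEntry(index uint64) {
	e.wal = append(e.wal, fmt.Sprintf("commit %d", index))
	for k, v := range e.prepared[index].Batch {
		e.state[k] = v
	}
	delete(e.prepared, index)
}

// Recover returns the indexes of prepared-but-uncommitted entries, i.e.
// the uncommitted Raft log tail reconstructed from the WAL on startup.
func (e *ProposedEngine) Recover() []uint64 {
	var idxs []uint64
	for i := range e.prepared {
		idxs = append(idxs, i)
	}
	return idxs
}

func main() {
	e := NewProposedEngine()
	e.Append(Entry{Index: 7, Batch: map[string]string{"k": "v"}})
	fmt.Println(e.Recover()) // [7]: entry is durable but unapplied
	e.CommitEntry(7)
	fmt.Println(e.state["k"], len(e.prepared)) // v 0: applied and truncated in one step
}
```

Note what the sketch elides: the real design would need to name each prepared transaction by Raft index, handle rollback of entries that lose a leadership election, and rebuild an indexable log when a follower falls behind.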

This would be a large and tricky change. There are likely gaping holes in this proposal. The migration story is absent. The expected performance gains need to be verified via experimentation. What does the system do in the unhappy state where one of the replicas is significantly behind?

Cc @irfansharif, @bdarnell, @tschottdorf

Labels: A-storage (Relating to our storage engine (Pebble) on-disk storage), C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)
