
storage: skip Pebble write-ahead log during Raft entry application #38322

@nvb

Description

CockroachDB currently stores Raft log entries and data in the same RocksDB storage engine. Both log entries and their applied state are written to the RocksDB WAL, even though only log entries are required to be durable before acknowledging writes. We use this fact to avoid a call to fdatasync when applying Raft log entries, opting to only do so when appending them to a Replica's Raft log. #17500 suggests that we could even go further and delay Raft log application significantly, moving it to an asynchronous process.

Problems

This approach has two main performance problems:

Raft Log Entries + Data in a shared WAL

Raft log appends share a WAL with applied entry state, which is written but not fsync-ed. This means that even though applied entry state isn't immediately made durable when it is written, it is synced almost immediately afterward by whoever next appends a log entry. This slows down appending to the Raft log, because each append often has to flush more of the WAL than it intends to.

#7807 observed that we could move the Raft log to a separate storage engine and that doing so would avoid this issue. The Raft log would use its own write-ahead log and would get tighter control over the amount of data that is flushed when it calls fdatasync. That change also hints at the use of a specialized storage engine for Raft log entries. The unmerged implementation ended up using RocksDB, but it could have opted for something else.

Data written to a WAL

Beyond the issue of applied state sharing a WAL with the Raft log, it is a problem that applied state is written to a WAL at all. A write-ahead log is meant to provide control over the durability and atomicity of writes, but the Raft log already serves this purpose by durably and atomically recording the writes that will soon be applied. Paying twice for durability has a cost: write amplification. A piece of data is often written to the Raft log WAL, the Raft log LSM (if not deleted in the memtable, see #8979 (comment)), the data engine WAL, and the data engine LSM. This extra write amplification reduces the overall write throughput that the system can support. Ideally, a piece of data would only be written to the Raft log WAL and the data engine LSM.

Proposed Solution

One fairly elegant solution to address both of these concerns is to move the Raft log to a separate storage engine and to disable the data engine's WAL. Doing so on its own almost works, but it runs into issues with crash recovery and with Raft log truncation.
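A sketch of what the two-engine configuration might look like with Pebble; the option and function names below are Pebble's actual API (`github.com/cockroachdb/pebble`), but the keys and overall wiring are illustrative, not the proposed implementation:

```go
// Raft log engine: keeps its own WAL, synced on every log append.
raftEng, err := pebble.Open("raft-log", &pebble.Options{})
if err != nil {
	panic(err)
}

// Data engine: WAL disabled entirely; durability comes from the Raft log.
dataEng, err := pebble.Open("data", &pebble.Options{DisableWAL: true})
if err != nil {
	panic(err)
}

// Appends to the Raft log sync; entry application does not (and cannot,
// with the WAL disabled). Keys/values here are placeholders.
_ = raftEng.Set([]byte("raft-entry-key"), []byte("entry"), pebble.Sync)
_ = dataEng.Set([]byte("applied-state-key"), []byte("value"), pebble.NoSync)
```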

Both of these concerns share the same root cause: naively, this approach makes it hard to know when applied entry state has become durable. This is hard to determine because applied entry state skips any WAL and is added only to RocksDB's memtable. To determine when this data becomes durable, we would need to know when it is flushed from the memtable to L0 of the LSM. Furthermore, we often want to know which Raft log entries this now-durable data corresponds to, which is difficult to answer. For instance, to determine how much of a Replica's Raft log we can safely truncate, we'd like to ask the question: "what is the highest applied index for this Replica that is durably applied?".

The solution here is to use the RangeAppliedStateKey, which is updated when each log entry is applied (or at least in the same batch of entries, see #37426 (comment)). This key contains the RaftAppliedIndex, which in a sense allows us to map a RocksDB sequence number to a Raft log index. We then ask the question: "what is the durable state of this key?". To answer this question, we can query this key using the ReadTier::kPersistedTier option in ReadOptions.read_tier. This instructs the lookup to skip the memtable and only read from the LSM, which is the exact set of semantics that we want.

Using this technique, we can then ask the question about the highest applied Raft log index for a particular Replica. We can then use this, in combination with recent improvements in #34660, to perform Raft log truncation only up to durable applied log indexes. To do this, the raftLogQueue on a Replica would simply query the RangeAppliedState.raft_applied_index using ReadTier::kPersistedTier and bound the index it can truncate to up to this value. Similarly, crash recovery could use the same strategy to know where it needs to reapply from. In the second case, explicitly using ReadTier::kPersistedTier might not be necessary because the memtable updates from before the crash will already be gone and the memtable will be empty.

For all of this to work, we need to maintain the property that the application of a Raft log entry is always in the same RocksDB batch as the corresponding update to the RangeAppliedStateKey, and that RocksDB will always guarantee that these updates move atomically from the memtable to the LSM. These properties should hold today.

This solution fixes both of the performance problems listed above and should increase write throughput in CockroachDB. It also gives CockroachDB future flexibility with the Raft log storage engine: even if the Raft log were initially moved to another RocksDB instance, that engine could be specialized and iterated on in isolation going forward.

Jira issue: CRDB-5635
