storage: snapshots should not include Raft log #34287

@tbg

In #34269, we had an issue with snapshots being rejected because they contained an overly large Raft log. The Raft log can get large if there is a problem with the log truncation queue. In effect, by having to reject snapshots based on log size, we're creating an unfortunate dependency between the two queues where problems with one have caused problems in the other.

A Raft snapshot contains the replicated data as of some log index N. In addition to some metadata, it can contain past and "future" log entries. Future log entries don't have to be sent, though it makes sense to send them along at least up to some size so that the follower catches up faster post-snapshot. The past log entries are less easy to justify: the snapshot already reflects them. Their only utility is that, should the node receiving the snapshot step up to be the leader very soon, it could catch up other followers who are still in need of recent log entries. This isn't a good justification, as that is a very unusual scenario.
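
The selection rule described above (drop past entries, send future ones up to a size budget) could be sketched roughly as follows. The `Entry` type, the `entriesForSnapshot` function, and the byte budget are hypothetical names made up for illustration; this is not CockroachDB's actual snapshot code.

```go
package main

import "fmt"

// Entry is a simplified, hypothetical stand-in for a Raft log entry.
type Entry struct {
	Index uint64
	Data  []byte
}

func (e Entry) String() string {
	return fmt.Sprintf("entry %d (%d bytes)", e.Index, len(e.Data))
}

// entriesForSnapshot picks the log entries to ship with a snapshot taken at
// snapIndex. It assumes log is sorted by ascending, contiguous Index.
func entriesForSnapshot(log []Entry, snapIndex uint64, maxBytes int) []Entry {
	var out []Entry
	size := 0
	for _, e := range log {
		if e.Index <= snapIndex {
			// Past entry: the snapshot data already reflects it.
			continue
		}
		if size+len(e.Data) > maxBytes {
			// Stop at the first entry that would exceed the budget so the
			// result stays a contiguous run starting at snapIndex+1.
			break
		}
		size += len(e.Data)
		out = append(out, e)
	}
	return out
}

func main() {
	log := []Entry{
		{Index: 9, Data: make([]byte, 100)},
		{Index: 10, Data: make([]byte, 100)},
		{Index: 11, Data: make([]byte, 40)},
		{Index: 12, Data: make([]byte, 40)},
		{Index: 13, Data: make([]byte, 40)},
	}
	for _, e := range entriesForSnapshot(log, 10, 100) {
		fmt.Println(e)
	}
}
```

With a budget this small the follower still has to fetch the remaining entries through normal replication; the budget only bounds how much log rides along with the snapshot.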

I think that we include the past Raft log entries for technical reasons only. The RaftTruncatedState (which essentially remembers the first index in the log) is a replicated key for historical reasons:

cockroach/pkg/keys/keys.go

Lines 908 to 911 in a05ee7b

// RaftTruncatedStateKey returns a system-local key for a RaftTruncatedState.
func (b RangeIDPrefixBuf) RaftTruncatedStateKey() roachpb.Key {
	return append(b.replicatedPrefix(), LocalRaftTruncatedStateSuffix...)
}

and this forces us to provide it with the snapshot and also send all log entries that the truncated state promises are still around.

But this limitation can be refactored out. We make the truncated state key unreplicated, at which point replicas are free to truncate their logs to whatever past index they deem possible. Ordinarily truncations would still be triggered by the Raft log queue, though now additionally we're free to send out snapshots that contain no past log entries, and to synthesize a corresponding truncated state.
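
As a rough sketch of what "synthesizing" a truncated state on the receiving side might look like: the types and function names below are hypothetical, and the sketch assumes the truncated state records the index and term of the last discarded entry, so that the local log starts at `Index+1`.

```go
package main

import "fmt"

// TruncatedState is a simplified stand-in for RaftTruncatedState. Assumption:
// it names the last entry that is no longer in the log.
type TruncatedState struct {
	Index uint64
	Term  uint64
}

// Snapshot is a hypothetical, stripped-down snapshot: the data as of
// AppliedIndex (not modeled here) plus optional "future" entry indexes.
type Snapshot struct {
	AppliedIndex uint64
	Term         uint64
	Entries      []uint64
}

// synthesizeTruncatedState builds a local truncated state for a snapshot that
// carries no past log entries: everything up to and including the snapshot
// index is treated as truncated.
func synthesizeTruncatedState(snap Snapshot) TruncatedState {
	return TruncatedState{Index: snap.AppliedIndex, Term: snap.Term}
}

// validateEntries checks that any entries sent along start right after the
// snapshot index and are contiguous, i.e. that no past entries are included.
func validateEntries(snap Snapshot) error {
	next := snap.AppliedIndex + 1
	for _, idx := range snap.Entries {
		if idx != next {
			return fmt.Errorf("snapshot at index %d: expected entry %d, got %d",
				snap.AppliedIndex, next, idx)
		}
		next++
	}
	return nil
}

func main() {
	snap := Snapshot{AppliedIndex: 100, Term: 7, Entries: []uint64{101, 102}}
	if err := validateEntries(snap); err != nil {
		fmt.Println("rejecting snapshot:", err)
		return
	}
	ts := synthesizeTruncatedState(snap)
	fmt.Printf("truncated state: index=%d term=%d, log starts at %d\n",
		ts.Index, ts.Term, ts.Index+1)
}
```

Because the state is computed by the receiver rather than copied from the sender, the sender's own log length no longer affects the size of the snapshot.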

To migrate the key out of the replicated keyspace, we start using the new key once a corresponding cluster version is reached (atomically deleting the old key during queue-triggered log truncations, which go through the Raft log). Reading the truncated state queries both locations and uses the maximum. This doesn't guarantee that the key couldn't continue to exist in perpetuity on some ranges that never see a log truncation, but that doesn't matter; the code that uses it can go away one release later.
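
The migration above can be illustrated with a small sketch. The in-memory `store`, the key names, and the `newVersionActive` flag are all hypothetical placeholders for the real storage engine, key encoding, and cluster version check.

```go
package main

import "fmt"

// TruncatedState is a simplified stand-in for RaftTruncatedState.
type TruncatedState struct {
	Index uint64
	Term  uint64
}

func (ts TruncatedState) String() string {
	return fmt.Sprintf("index=%d term=%d", ts.Index, ts.Term)
}

// store is a hypothetical key-value view of one replica's truncated state.
type store map[string]TruncatedState

// Hypothetical key names for the old (replicated) and new (unreplicated) key.
const (
	legacyReplicatedKey = "replicated/truncated-state"
	unreplicatedKey     = "unreplicated/truncated-state"
)

// loadTruncatedState queries both locations and returns the state with the
// larger index. The boolean is false if neither key exists.
func loadTruncatedState(s store) (TruncatedState, bool) {
	legacy, hasLegacy := s[legacyReplicatedKey]
	cur, hasCur := s[unreplicatedKey]
	switch {
	case hasLegacy && hasCur:
		if legacy.Index > cur.Index {
			return legacy, true
		}
		return cur, true
	case hasLegacy:
		return legacy, true
	case hasCur:
		return cur, true
	}
	return TruncatedState{}, false
}

// truncate models a queue-triggered log truncation. Once the new cluster
// version is active it writes the unreplicated key and deletes the legacy one.
// A real engine would do both in one atomic batch; the map only stands in.
func truncate(s store, ts TruncatedState, newVersionActive bool) {
	if !newVersionActive {
		s[legacyReplicatedKey] = ts
		return
	}
	delete(s, legacyReplicatedKey)
	s[unreplicatedKey] = ts
}

func main() {
	s := store{}
	truncate(s, TruncatedState{Index: 10, Term: 1}, false) // before the upgrade
	truncate(s, TruncatedState{Index: 20, Term: 2}, true)  // after the upgrade
	if ts, ok := loadTruncatedState(s); ok {
		fmt.Println("truncated state:", ts)
	}
}
```

A range that never sees a truncation after the upgrade simply keeps answering from the legacy key, which is why the read path has to consult both locations for one release.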

Touches #31947

Labels

C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
