storage: snapshots should not include Raft log

In https://github.com/cockroachdb/cockroach/issues/34269, we had an issue with snapshots being rejected because they contained an overly large Raft log. The Raft log can get large if there is a problem with the log truncation queue. In effect, by having to reject snapshots based on log size, we're creating an unfortunate dependency between the two queues where problems with one have caused problems in the other.

A Raft snapshot contains the replicated data as of some log index N. In addition to some more metadata, it can contain past and "future" log entries. Future log entries don't have to be sent, though it makes sense to send them along at least up to some size so that the follower catches up faster post-snapshot. The past log entries are harder to justify: the snapshot already reflects them. Their only utility is that should the node receiving the snapshot step up to be the leader very soon, it could catch up other followers who are still in need of recent log entries. This isn't a good justification as that is a very unusual scenario.
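To make the past/future distinction concrete, here is a minimal sketch of how a sender might pick the entries to attach to a snapshot at index N. The `Entry` type, the `entriesForSnapshot` helper, and the byte budget are hypothetical names for illustration and are not taken from the CockroachDB code base; the sketch also assumes the log slice is sorted by index.

```go
package main

import "fmt"

// Entry is a hypothetical, simplified stand-in for a Raft log entry.
type Entry struct {
	Index uint64
	Data  []byte
}

// entriesForSnapshot returns the log entries to attach to a snapshot taken at
// snapIndex. Entries at or below snapIndex ("past" entries) are skipped since
// the snapshot data already reflects them. Entries above snapIndex ("future"
// entries) are included in order until the next one would exceed maxBytes.
// The log is assumed to be sorted by Index.
func entriesForSnapshot(log []Entry, snapIndex uint64, maxBytes int) []Entry {
	var out []Entry
	size := 0
	for _, e := range log {
		if e.Index <= snapIndex {
			continue // already reflected in the snapshot
		}
		if size+len(e.Data) > maxBytes {
			break // budget exhausted; the follower catches up via Raft
		}
		size += len(e.Data)
		out = append(out, e)
	}
	return out
}

func main() {
	var log []Entry
	for i := uint64(8); i <= 13; i++ {
		log = append(log, Entry{Index: i, Data: make([]byte, 10)})
	}
	// Snapshot at index 10 with a 25-byte budget for future entries.
	for _, e := range entriesForSnapshot(log, 10, 25) {
		fmt.Println("sending entry", e.Index)
	}
}
```

With this shape, a bloated log of past entries never influences the snapshot size; only the explicit budget for future entries does.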

I think that we include the past Raft log entries for technical reasons only. The RaftTruncatedState (which essentially remembers the first index in the log) is a replicated key for historical reasons

https://github.com/cockroachdb/cockroach/blob/a05ee7bf97580556544c061b228477013c2200d5/pkg/keys/keys.go#L908-L911

and this forces us to provide it with the snapshot and also send all log entries that the truncated state promises are still around.

But this limitation can be refactored out. We make the truncated state key unreplicated, at which point replicas are free to truncate their logs to whatever past index they deem possible. Ordinarily truncations would still be triggered by the Raft log queue, though now additionally we're free to send out snapshots that contain no past log entries, and to synthesize a corresponding truncated state.
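As a rough illustration of "synthesizing" a truncated state, the sketch below derives one directly from the snapshot's index and term, so that the recipient's log begins right after the snapshot. The types and function names are simplified stand-ins chosen for this example, and the assumption that the truncated state records the index and term of the last truncated entry is mine.

```go
package main

import "fmt"

// TruncatedState is a simplified stand-in for RaftTruncatedState. It is
// assumed here to record the index and term of the last entry that is no
// longer present in the log.
type TruncatedState struct {
	Index uint64
	Term  uint64
}

// SnapshotMeta is a hypothetical description of the log position a snapshot
// was taken at.
type SnapshotMeta struct {
	Index uint64
	Term  uint64
}

// synthesizeTruncatedState builds the truncated state for a recipient of a
// snapshot that carries no past log entries: everything up to and including
// the snapshot index is treated as truncated.
func synthesizeTruncatedState(m SnapshotMeta) TruncatedState {
	return TruncatedState{Index: m.Index, Term: m.Term}
}

// firstIndex returns the first log index that may still be present locally.
func firstIndex(ts TruncatedState) uint64 {
	return ts.Index + 1
}

func main() {
	ts := synthesizeTruncatedState(SnapshotMeta{Index: 100, Term: 7})
	fmt.Printf("truncated state: %+v, first index: %d\n", ts, firstIndex(ts))
}
```

Because the value is computed locally by the recipient, it no longer needs to match what the sender or any other replica has stored, which is what making the key unreplicated buys.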

To migrate the key out of the replicated keyspace, we start using the new key once a corresponding cluster version is reached (atomically deleting the old key during queue-triggered log truncations, which go through the Raft log). Reading the truncated state queries both locations and uses the maximum. This doesn't guarantee that the key couldn't continue to exist in perpetuity on some ranges that never see a log truncation, but that doesn't matter; the code that uses it can go away one release later.
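The migration read and write paths could look roughly like the following sketch. The map-backed `store`, the key strings, and the `newVersionActive` flag are hypothetical placeholders for the real storage engine, key encoding, and cluster version check.

```go
package main

import "fmt"

// TruncatedState is a simplified stand-in for RaftTruncatedState.
type TruncatedState struct {
	Index uint64
	Term  uint64
}

// store is a hypothetical key-value view of one replica's local state.
type store map[string]TruncatedState

// Placeholder key names; the real keys are encoded differently.
const (
	legacyKey       = "replicated/truncated-state"
	unreplicatedKey = "unreplicated/truncated-state"
)

// loadTruncatedState queries both locations and returns the state with the
// larger index. The boolean is false if neither key exists.
func loadTruncatedState(s store) (TruncatedState, bool) {
	legacy, okLegacy := s[legacyKey]
	unrepl, okUnrepl := s[unreplicatedKey]
	switch {
	case okLegacy && okUnrepl:
		if legacy.Index > unrepl.Index {
			return legacy, true
		}
		return unrepl, true
	case okLegacy:
		return legacy, true
	case okUnrepl:
		return unrepl, true
	}
	return TruncatedState{}, false
}

// applyTruncation models a queue-triggered log truncation. Once the new
// cluster version is active, it writes the unreplicated key and deletes the
// legacy key in the same step; before that, it keeps using the legacy key.
func applyTruncation(s store, ts TruncatedState, newVersionActive bool) {
	if newVersionActive {
		s[unreplicatedKey] = ts
		delete(s, legacyKey)
		return
	}
	s[legacyKey] = ts
}

func main() {
	s := store{}
	applyTruncation(s, TruncatedState{Index: 10, Term: 2}, false)
	applyTruncation(s, TruncatedState{Index: 25, Term: 3}, true)
	ts, ok := loadTruncatedState(s)
	fmt.Println(ts, ok, len(s))
}
```

A range that never sees another truncation simply keeps its legacy key, and the reader above still returns it.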

Touches https://github.com/cockroachdb/cockroach/issues/31947

storage: snapshots should not include Raft log #34287


	// RaftTruncatedStateKey returns a system-local key for a RaftTruncatedState.
	func (b RangeIDPrefixBuf) RaftTruncatedStateKey() roachpb.Key {
		return append(b.replicatedPrefix(), LocalRaftTruncatedStateSuffix...)
	}
