kvserver: prevent truncating log entries needed for entry application #70109
Description
Describe the problem
We do not seem to have appropriate safeguards in place that prevent log
truncations from removing committed-but-unapplied entries from the log.
To Reproduce
I have not reproduced this, nor is there evidence that it has occurred in
the wild. This is based on observations in #66843.
Expected behavior
Log truncations should only remove entries that are no longer required. Our
log truncation has evolved over time. Originally, the truncated state index
(i.e. the last truncated log position) was part of the replicated state,
meaning that all replicas kept the same tail of the log around (as a function
of their applied index). This was loosened several releases ago to free us from
the requirement that the raft log be included in raft snapshots.
Log truncations are and were always triggered by the Raft leader through a queue:
cockroach/pkg/kv/kvserver/raft_log_queue.go
Lines 194 to 198 in 2e00964

```go
// Is this the raft leader? We only perform log truncation on the raft leader
// which has the up to date info on followers.
if raftStatus.RaftState != raft.StateLeader {
	return truncateDecision{}, nil
}
```
and the main concern is not cutting followers off from being able to obtain
the entire raft log. No thought was given to the fact that followers may not
have applied the acked portion of their raft log; indeed, the raft leader
does not even learn this information from the followers. In the short term, we
should allow followers to adjust the requested truncation so that it does not
delete entries that have yet to be applied.
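The follower-side adjustment could be as simple as clamping the requested
truncation index to the follower's local applied index. The sketch below is a
hypothetical illustration of that idea (the function and parameter names are
made up for this example and are not CockroachDB APIs):

```go
package main

import "fmt"

// clampTruncationIndex sketches how a follower could adjust a leader-requested
// truncation index so that it never deletes committed-but-unapplied entries.
// Hypothetical illustration only; these names are not CockroachDB APIs.
func clampTruncationIndex(requestedIndex, appliedIndex uint64) uint64 {
	// Entries at indexes <= appliedIndex have already been applied locally
	// and are safe to remove; anything above appliedIndex must be retained
	// so it can still be applied.
	if requestedIndex > appliedIndex {
		return appliedIndex
	}
	return requestedIndex
}

func main() {
	fmt.Println(clampTruncationIndex(120, 100)) // lagging follower: clamp to 100
	fmt.Println(clampTruncationIndex(90, 100))  // caught-up follower: keep 90
}
```

The key property is that the leader's request is treated as an upper bound,
not a command: each follower only ever truncates a prefix it has applied.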
We should generally transition off this mechanism and truncate the log in a
purely local fashion. This should also include performance optimizations:
at present, we block the raft handling loop and perform regular MVCC deletions
in a loop:
cockroach/pkg/kv/kvserver/replica_raft.go
Lines 1845 to 1852 in 8f97dcf

```go
for idx := oldTruncatedState.Index + 1; idx <= newTruncatedState.Index; idx++ {
	// NB: RangeIDPrefixBufs have sufficient capacity (32 bytes) to
	// avoid allocating when constructing Raft log keys (16 bytes).
	unsafeKey := prefixBuf.RaftLogKey(idx)
	if err := readWriter.ClearUnversioned(unsafeKey); err != nil {
		return false, errors.Wrapf(err, "unable to clear truncated Raft entries for %+v", newTruncatedState)
	}
}
```
However, only the update to the truncated state itself needs to be part of
the raft cycle. The actual deletion can be carried out asynchronously, and
additionally could benefit from using a combination of SingleDelete and
pebble range deletions (or even, gasp, range single deletions, but let's
not go there).