kvflowcontrol,admission: use flow control during raft log catchup post node-restart #98710
Description
Is your feature request related to a problem? Please describe.
We've seen in write-heavy workloads that node restarts can result in LSM inversion due to a rapid onset of raft log catchup appends. This problem was touched on recently in #96521 + #95159 -- those issues amounted to 23.1 changes that avoid immediately transferring leases to newly-restarted nodes until their LSM is healthier, in order to stave off latency impact for leaseholder traffic. But catchup appends can still invert the LSM, which affects non-leaseholder traffic.
This issue proposes using the general flow control mechanism we're introducing in #95563 to pace the rate of catchup raft log appends and prevent LSM inversion entirely. With such a mechanism, we'd be able to transfer leases immediately to newly-restarted nodes without leaseholder impact, and also avoid latency impact on follower traffic. #80607 is slightly related -- we could apply flow tokens to raft snapshots too, to cover the general case of "catchup write traffic".
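As a rough illustration of the shape of such a mechanism (the names `catchupPacer`, `tryDeduct`, and `admitReturn` are hypothetical, not the actual kvflowcontrol API), a sender-side token pool could gate catchup MsgApps and only refill once the receiver's admission control logically admits the entries:

```go
package main

import (
	"fmt"
	"sync"
)

// catchupPacer is an illustrative sketch of flow-token pacing for catchup
// raft log appends: tokens are deducted before sending a MsgApp and returned
// only once the receiver reports the entry as logically admitted (i.e. after
// accounting for apply-time work on its LSM).
type catchupPacer struct {
	mu        sync.Mutex
	available int64 // flow tokens, in bytes
}

func newCatchupPacer(burst int64) *catchupPacer {
	return &catchupPacer{available: burst}
}

// tryDeduct attempts to deduct tokens for an append of the given size,
// returning false if the sender should hold off (backpressure).
func (p *catchupPacer) tryDeduct(bytes int64) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.available < bytes {
		return false
	}
	p.available -= bytes
	return true
}

// admitReturn returns tokens once the receiver has logically admitted the
// corresponding log entries.
func (p *catchupPacer) admitReturn(bytes int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.available += bytes
}

func main() {
	p := newCatchupPacer(4 << 10)     // 4 KiB of flow tokens
	fmt.Println(p.tryDeduct(3 << 10)) // true: tokens available
	fmt.Println(p.tryDeduct(3 << 10)) // false: would overdraw; sender waits
	p.admitReturn(3 << 10)            // receiver admitted the earlier entries
	fmt.Println(p.tryDeduct(3 << 10)) // true: tokens replenished
}
```

The key property is that token return is tied to logical admission rather than physical append, so a newly-restarted node with an unhealthy LSM naturally slows the sender down.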
Describe the solution you'd like
cockroach/pkg/kv/kvserver/kvflowcontrol/doc.go
Lines 310 to 328 in b84f10c
```go
// I11. What happens when a node is restarted and is being caught up rapidly
// through raft log appends? We know of cases where the initial log appends
// and subsequent state machine application can be large enough to invert the
// LSM[^9]. Imagine large block writes with a uniform key distribution; we
// may persist log entries rapidly across many replicas (without inverting
// the LSM, so follower pausing is also of no help) and during state
// machine application, create lots of overlapping files/sublevels in L0.
// - We want to pace the initial rate of log appends while factoring in the
//   effect of the subsequent state machine application on L0 (modulo [^9]).
//   We can use flow tokens for this too. In I3a we outlined how, for quorum
//   writes that include a replica on some recently-restarted node, we need to
//   wait for it to be sufficiently caught up before deducting/blocking for
//   flow tokens. Until that point we can use flow tokens on sender nodes that
//   wish to send catchup MsgApps to the newly-restarted node. Similar to the
//   steady state, flow tokens are only returned once log entries are
//   logically admitted (which takes into account any apply-time write
//   amplification, modulo [^9]). Once the node is sufficiently caught up with
//   respect to all its raft logs, it can transition into the mode described
//   in I3a where we deduct/block for flow tokens for subsequent quorum writes.
```
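The two modes described in this excerpt (catchup pacing, then the steady-state I3a regime) could be sketched as a per-replica-stream state transition. All names here are illustrative, not the actual kvflowcontrol code:

```go
package main

import "fmt"

type streamMode int

const (
	// modeCatchup: the follower recently restarted; senders pace catchup
	// MsgApps with flow tokens but don't block quorum writes on it.
	modeCatchup streamMode = iota
	// modeSteadyState: the follower is caught up; quorum writes
	// deduct/block for flow tokens as described in I3a.
	modeSteadyState
)

type replicaStream struct {
	mode         streamMode
	matchIndex   uint64 // highest log index known replicated to the follower
	lastLogIndex uint64 // leader's last raft log index
	catchupSlack uint64 // how close counts as "sufficiently caught up"
}

// maybeTransition moves the stream into steady state once the follower is
// within catchupSlack entries of the leader's log.
func (s *replicaStream) maybeTransition() {
	if s.mode == modeCatchup && s.lastLogIndex-s.matchIndex <= s.catchupSlack {
		s.mode = modeSteadyState
	}
}

func main() {
	s := &replicaStream{mode: modeCatchup, matchIndex: 100, lastLogIndex: 1000, catchupSlack: 50}
	s.maybeTransition()
	fmt.Println(s.mode == modeCatchup) // true: still far behind
	s.matchIndex = 980
	s.maybeTransition()
	fmt.Println(s.mode == modeSteadyState) // true: sufficiently caught up
}
```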
cockroach/pkg/kv/kvserver/kvflowcontrol/doc.go
Lines 353 to 375 in b84f10c
```go
// [^9]: With async raft storage writes (#17500, etcd-io/raft#8), we can
// decouple raft log appends and state machine application (see #94854 and
// #94853). So we could append at a higher rate than applying. Since
// application can be arbitrarily deferred, we can cause severe LSM
// inversions. Do we want some form of pacing of log appends then, relative
// to observed state machine application? Perhaps specifically in cases
// where we're more likely to append faster than apply, like node restarts.
// We're likely to defeat AC's IO control otherwise.
// - For what it's worth, this "deferred application with high read-amp"
//   was also a problem before async raft storage writes. Consider many
//   replicas on an LSM, all of which appended a few raft log entries
//   without applying; at apply time across all those replicas, we end up
//   inverting the LSM.
// - Since we don't want to wait below raft, one way to bound the lag between
//   appended entries and applied ones is to only release flow tokens for
//   an entry at position P once the applied state position >= P - delta.
//   We'd have to be careful: if we're not applying due to quorum loss
//   (as a result of remote node failure(s)), we don't want to deplete
//   flow tokens and cause interference on other ranges.
// - If this all proves too complicated, we could just not let state
//   machine application get significantly behind due to local scheduling
//   reasons, by using the same goroutine to do both async raft log writes
//   and state machine application.
```
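The "release tokens for position P once the applied state position >= P - delta" rule above amounts to a simple bound. A minimal sketch, with hypothetical helper names:

```go
package main

import "fmt"

// releasableUpTo returns the highest log position whose flow tokens may be
// released: applying the rule "release tokens for entry at position p only
// once appliedIndex >= p - delta", which rearranges to p <= appliedIndex + delta.
func releasableUpTo(appliedIndex, delta uint64) uint64 {
	return appliedIndex + delta
}

// canRelease reports whether tokens for the entry at position p may be
// released given the current applied index and allowed append/apply lag.
func canRelease(p, appliedIndex, delta uint64) bool {
	return p <= releasableUpTo(appliedIndex, delta)
}

func main() {
	// With applied index 100 and delta 10, tokens for entries up to
	// position 110 are releasable; entry 111 must wait for application
	// to advance, bounding how far appends can outrun application.
	fmt.Println(canRelease(110, 100, 10)) // true
	fmt.Println(canRelease(111, 100, 10)) // false
}
```

This captures why the excerpt warns about quorum loss: if application stalls for reasons unrelated to local overload, the bound stops releasing tokens and could starve other traffic unless handled specially.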
This is also touched on here: https://reviewable.io/reviews/cockroachdb/cockroach/96642#-NOF0fl20XuBrK4Ywlzp.
Jira issue: CRDB-25460