
kvflowcontrol,admission: use flow control during raft log catchup post node-restart #98710

@irfansharif

Description


Is your feature request related to a problem? Please describe.

We've seen in write-heavy workloads that node restarts can result in LSM inversion due to a rapid onset of raft log catchup appends. This problem was touched on recently in #96521 + #95159 -- those issues amounted to 23.1 changes that avoid immediately transferring leases to newly-restarted nodes until their LSM is healthier, in order to stave off latency impact for leaseholder traffic. But even with those changes it's still possible to invert the LSM, which affects non-leaseholder traffic.

This issue proposes using the general flow control mechanism we're introducing in #95563 to pace the rate of catchup raft log appends and prevent LSM inversion entirely. With such a mechanism, we'd be able to transfer leases immediately to newly restarted nodes without leaseholder impact, and also avoid latency impact on follower traffic. #80607 is slightly related -- we could apply flow tokens to raft snapshots too, covering the general case of "catchup write traffic".
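To make the pacing idea concrete, here's a rough, hypothetical sketch of sender-side pacing: flow tokens are deducted before shipping each catchup MsgApp to the restarted node and only returned once the corresponding entries are logically admitted there. None of the names below (catchupFlowController, admit, returnTokens) are the actual kvflowcontrol API; this is just the shape of the idea.

```go
package main

import (
	"fmt"
	"sync"
)

// catchupFlowController is a hypothetical, simplified token bucket (in bytes)
// gating catchup raft log appends destined to a recently-restarted store.
type catchupFlowController struct {
	mu        sync.Mutex
	cond      *sync.Cond
	available int64 // flow tokens currently available, in bytes
}

func newCatchupFlowController(budget int64) *catchupFlowController {
	c := &catchupFlowController{available: budget}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// admit blocks until size tokens are available and deducts them. The blocking
// here is what paces the rate of catchup MsgApps.
func (c *catchupFlowController) admit(size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.available < size {
		c.cond.Wait()
	}
	c.available -= size
}

// returnTokens is called once the corresponding entries have been logically
// admitted on the restarted node (i.e. after below-raft admission has
// accounted for apply-time write amplification), not merely once appended.
func (c *catchupFlowController) returnTokens(size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.available += size
	c.cond.Broadcast()
}

func main() {
	fc := newCatchupFlowController(4 << 20) // e.g. 4 MiB of catchup tokens

	// Sender-side loop: deduct tokens sized to the entries in each catchup
	// MsgApp before sending it.
	for i, msgAppBytes := range []int64{1 << 20, 2 << 20, 1 << 20} {
		fc.admit(msgAppBytes)
		fmt.Printf("sent catchup MsgApp %d (%d bytes)\n", i, msgAppBytes)
		// In the real system this would happen asynchronously, once the
		// receiver reports logical admission of these entries.
		fc.returnTokens(msgAppBytes)
	}
}
```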

Describe the solution you'd like

// I11. What happens when a node is restarted and is being caught up rapidly
// through raft log appends? We know of cases where the initial log appends
// and subsequent state machine application can be large enough to invert the
// LSM[^9]. Imagine large block writes with a uniform key distribution; we
// may persist log entries rapidly across many replicas (without inverting
// the LSM, so follower pausing is also of no help) and during state
// machine application, create lots of overlapping files/sublevels in L0.
// - We want to pace the initial rate of log appends while factoring in the
// effect of the subsequent state machine application on L0 (modulo [^9]). We
// can use flow tokens for this too. In I3a we outlined how for quorum writes
// that includes a replica on some recently re-started node, we need to wait
// for it to be sufficiently caught up before deducting/blocking for flow tokens.
// Until that point we can use flow tokens on sender nodes that wish to send
// catchup MsgApps to the newly-restarted node. Similar to the steady state,
// flow tokens are only returned once log entries are logically admitted
// (which takes into account any apply-time write amplification, modulo [^9]).
// Once the node is sufficiently caught up with respect to all its raft logs,
// it can transition into the mode described in I3a where we deduct/block for
// flow tokens for subsequent quorum writes.
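A small, illustrative sketch of the mode transition described above (hypothetical names, not the kvflowcontrol implementation): a per-replica stream starts in a catchup mode where only sender-side catchup appends consume tokens, and flips to the I3a steady-state mode, where quorum writes deduct/block, once the follower is sufficiently caught up.

```go
package main

import "fmt"

type streamMode int

const (
	// modeCatchup: the remote store was recently restarted and is being
	// caught up via raft log appends. Senders deduct tokens for catchup
	// MsgApps; regular quorum writes do not yet block on this stream.
	modeCatchup streamMode = iota
	// modeSteadyState: the store is caught up; subsequent quorum writes
	// deduct/block for flow tokens as usual (per I3a).
	modeSteadyState
)

type replicaStream struct {
	mode          streamMode
	matchIndex    uint64 // highest log index known to be durable on the follower
	leaderIndex   uint64 // leader's last log index
	catchupSlackN uint64 // how far behind still counts as "caught up"
}

// maybeTransition flips a stream from catchup to steady state once the
// follower is within catchupSlackN entries of the leader's log.
func (s *replicaStream) maybeTransition() {
	if s.mode == modeCatchup && s.leaderIndex-s.matchIndex <= s.catchupSlackN {
		s.mode = modeSteadyState
	}
}

// quorumWriteBlocksOnStream reports whether a regular quorum write should
// deduct/block for flow tokens against this stream.
func (s *replicaStream) quorumWriteBlocksOnStream() bool {
	return s.mode == modeSteadyState
}

func main() {
	s := &replicaStream{mode: modeCatchup, matchIndex: 100, leaderIndex: 5000, catchupSlackN: 64}
	fmt.Println(s.quorumWriteBlocksOnStream()) // false: still catching up
	s.matchIndex = 4990
	s.maybeTransition()
	fmt.Println(s.quorumWriteBlocksOnStream()) // true: caught up, steady state
}
```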

// [^9]: With async raft storage writes (#17500, etcd-io/raft#8), we can
// decouple raft log appends and state machine application (see #94854 and
// #94853). So we could append at a higher rate than applying. Since
// application can be arbitrarily deferred, we can cause severe LSM
// inversions. Do we want some form of pacing of log appends then,
// relative to observed state machine application? Perhaps specifically in
// cases where we're more likely to append faster than apply, like node
// restarts. We're likely to defeat AC's IO control otherwise.
// - For what it's worth, this "deferred application with high read-amp"
// was also a problem before async raft storage writes. Consider many
// replicas on an LSM, all of which appended a few raft log entries
// without applying, and at apply time across all those replicas, we end
// up inverting the LSM.
// - Since we don't want to wait below raft, one way to bound the lag between
// appended entries and applied ones is to only release flow tokens for
// an entry at position P once the applied state position >= P - delta.
// We'd have to be careful: if we're not applying due to quorum loss
// (as a result of remote node failure(s)), we don't want to deplete
// flow tokens and cause interference on other ranges.
// - If this all proves too complicated, we could just not let state
// machine application get significantly behind due to local scheduling
// reasons by using the same goroutine to do both async raft log writes
// and state machine application.
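The "release tokens only once application has caught up to within some delta" idea from the second bullet could look roughly like the following; the names and structure are purely illustrative and not tied to the actual kvflowcontrol code.

```go
package main

import "fmt"

type pendingEntry struct {
	position uint64 // raft log position of the appended entry
	tokens   int64  // flow tokens (bytes) held for this entry
}

type lagBoundedReleaser struct {
	delta   uint64         // max allowed lag between append and apply
	pending []pendingEntry // appended-but-not-yet-releasable entries, in position order
}

// onAppend records tokens deducted when the entry at pos was appended.
func (r *lagBoundedReleaser) onAppend(pos uint64, tokens int64) {
	r.pending = append(r.pending, pendingEntry{position: pos, tokens: tokens})
}

// onApplied returns the tokens that become releasable once the applied state
// machine position advances to appliedIndex: everything at position
// P <= appliedIndex + delta (i.e. appliedIndex >= P - delta).
func (r *lagBoundedReleaser) onApplied(appliedIndex uint64) (released int64) {
	i := 0
	for ; i < len(r.pending) && r.pending[i].position <= appliedIndex+r.delta; i++ {
		released += r.pending[i].tokens
	}
	r.pending = r.pending[i:]
	return released
}

func main() {
	r := &lagBoundedReleaser{delta: 2}
	r.onAppend(10, 1<<20)
	r.onAppend(11, 1<<20)
	r.onAppend(15, 1<<20)
	// Applying up through index 9 releases entries at positions <= 11.
	fmt.Println(r.onApplied(9) >> 20) // 2 (MiB released)
	fmt.Println(r.onApplied(13) >> 20) // 1
	// Note: if application stalls because of quorum loss elsewhere, a scheme
	// like this must avoid holding tokens indefinitely and starving other
	// ranges, per the caveat above.
}
```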

This is also touched on here: https://reviewable.io/reviews/cockroachdb/cockroach/96642#-NOF0fl20XuBrK4Ywlzp.

Jira issue: CRDB-25460

Labels: A-admission-control, C-enhancement, O-support, P-3, T-admission-control
