kvflowcontrol,admission: productionize replication admission control #98703

@irfansharif

Is your feature request related to a problem? Please describe.

Tracking issue to productionize #95563 and roll it out into the wild (enabled by default, made safe to opt into for production clusters):

  • Merge #98308, which integrates various kvflowcontrol,admission components end-to-end, gated by cluster settings.
    • Support a "flow token tracking only" mode where we do end-to-end flow control token tracking but don't actually block at admit time when the requisite flow tokens are unavailable. It'll let us look at production systems and understand whether we are losing performance isolation due to a lack of write flow control.
    • Support a "flow tokens only for elastic traffic" mode, to use flow control only for elastic traffic (index backfills, etc.).
    • Backport as disabled-entirely to 23.1 release branch.
  • Add randomized/integration testing to verify that we don't leak flow tokens, since leakage could result in complete write throughput collapse. We want to test all the interactions listed here, which include the raft transport stream breaking, nodes crashing, followers being paused/unpaused, followers being caught up via snapshots or post-restart log appends, leaseholder/leadership changes, prolonged leaseholder != leader scenarios, replicas being GC-ed, command reproposals, lossy raft transport, ranges splitting/merging, log truncations, and raft membership changes.
  • Add roachtest(s) to quantify the impact of index backfills with and without replication admission control, and make sure we don't regress.
  • Enable "flow tokens for {regular,elastic} traffic" on 23.2 master.
  • Monitor and address CI fallout for two-ish weeks on master. Backport any bug fixes to 23.1 (where it's disabled by default).
  • Roll out the "flow tokens only for elastic traffic" to test-only/POC 23.1 clusters for actual clients, on CC or otherwise.
  • Roll out the "flow token tracking only" mode (described above) to 23.1 CC clusters.

Jira issue: CRDB-25455

Epic CRDB-25348

Metadata

Labels

  • A-admission-control
  • C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)
  • T-kv (KV Team)
