kvserver: snapshot ingestion rates are not bounded #74694

@irfansharif

Is your feature request related to a problem? Please describe.

We have cluster settings that control how quickly rebalance/recovery snapshots are generated: kv.snapshot_{rebalance,recovery}.max_rate. What we're missing is any sort of control over how quickly snapshots can be ingested. Even for medium-sized clusters (say, 50 nodes) with our default recovery rate of 32 MB/s, various cluster events (decommissioning, outages, zone config changes) could end up hammering specific nodes with snapshots much faster than they're equipped to handle. We saw evidence of this in a recent escalation; it can manifest as a burst of writes to the node's underlying LSM, possibly inverting it. The inversion can then lead to high read amplification, which disrupts the node's handling of foreground traffic.

Describe the solution you'd like

It seems bad that a node is at the mercy of sender rates, which, if sufficiently high, can topple it quite easily. Some form of crude receiver-side throttling could perhaps help with flow control. A few options:

  • through a symmetric kv.snapshot_{rebalance,recovery}_receiver.max_rate;
  • (from @nvanbenschoten) snapshot ingestion is not subject to Pebble’s PreIngestDelay the way AddSSTable is (both above and below Raft). I don’t think that was intentional, and applying it may serve as a kind of receiver-side snapshot throttling;
  • could be something we consider integrating into admission control.

+cc @cockroachdb/kv-notifications.

Jira issue: CRDB-12214

Labels

  • A-kv-distribution: Relating to rebalancing and leasing.
  • A-kv-recovery
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • T-kv: KV Team
