kvserver: snapshot ingestion rates are not bounded #74694
Description
Is your feature request related to a problem? Please describe.
We have cluster settings that control how quickly rebalance/recovery snapshots are generated: kv.snapshot_{rebalance,recovery}.max_rate. What we're missing is any control over how quickly snapshots can be ingested. Even for medium-sized clusters (say, 50 nodes) with our default recovery rate of 32 MB/s, various cluster events (decommissioning, outages, zone config changes) could end up hammering specific nodes with snapshots much more quickly than they're equipped to handle. We saw evidence of this in a recent escalation; it can manifest as a burst of writes to the node's underlying LSM, possibly inverting it. The inversion can then lead to high read amplification, which disrupts the node's handling of foreground traffic.
Describe the solution you'd like
It seems bad that a node is at the mercy of sender rates, which, if sufficiently high, can topple it quite easily. Some form of crude receiver-side throttling could perhaps help with flow control. A few options:
- through a symmetric kv.snapshot_{rebalance,recovery}_receiver.max_rate;
- (from @nvanbenschoten) snapshot ingestion is not subject to Pebble's PreIngestDelay like (above and below Raft) AddSSTable is. I don't think that was intentional and it may serve as a kind of snapshot receiver-side throttling;
- could be something we consider integrating into admission control.
+cc @cockroachdb/kv-notifications.
Jira issue: CRDB-12214