kvserver: throttle writes on followers #79215
Closed
Labels
- A-admission-control
- A-kv-distribution: Relating to rebalancing and leasing.
- A-kv-replication: Relating to Raft, consensus, and coordination.
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
- O-sre: For issues SRE opened or otherwise cares about tracking.
Description
- roachtest repro:
  - roachtest: exercise follower replica IO overload #81834
- prototype short-term mitigations:
  - admission: experiment with quota pool awareness of IO overload on remote stores #82132
  - kvserver: prototype dropping incoming raft messages based on bytes budget #79752
- productionize short-term mitigations:
  - TBD
- issues to address:
  - kvserver: ignore draining nodes in proposal quota #55806
  - kvserver: improve quota pool metrics #75978
  - kvserver: remove below-raft throttling #57247
  - kvserver: raft receive queue may OOM under overload #71805
  - kvserver: provide escape hatch for per-replica proposal quota pool #77251
  - kvserver: quota pool observability #79756
- future work:
  - kvserver: e2e flow control for raft messages #79755
  - kvserver: unbounded memory use when falling behind on sideloaded MsgApp #73376
  - kvserver: provide a way for replicas to re-enter the quota pool #82403
We currently throttle writes on the receiving store based on store health (e.g. via admission control or via specialized AddSSTable throttling). However, this accounts only for the health of the local store, not for the write cost that replication imposes on followers, which isn't always throttled. We've seen this lead to hotspots where follower stores get overwhelmed, since the follower writes bypass admission control. A similar problem exists with snapshot application.
This has been touched on in several other issues as well:
- admission,kv,bulk: unify (local) store overload protection via admission control #75066
- admission: graceful degradation #82114
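To make the gap concrete, here is a minimal Go sketch contrasting an admission check that consults only the local store with one that also consults follower stores. It is not CockroachDB code: `StoreHealth`, `IOScore`, `overloadThreshold`, `AdmitLocalOnly`, and `AdmitWithFollowers` are hypothetical names, and "reject if any follower is overloaded" is just one assumed policy among several possible ones.

```go
package main

// StoreHealth is a hypothetical summary of a store's IO load.
// By assumption, a score at or above overloadThreshold means "overloaded".
type StoreHealth struct {
	StoreID int
	IOScore float64
}

// overloadThreshold is an assumed cutoff for illustration, not a real setting.
const overloadThreshold = 1.0

// healthy reports whether a store is below the assumed overload cutoff.
func healthy(s StoreHealth) bool {
	return s.IOScore < overloadThreshold
}

// AdmitLocalOnly models the behavior described in this issue: only the
// store receiving the write is consulted before admitting it.
func AdmitLocalOnly(local StoreHealth) bool {
	return healthy(local)
}

// AdmitWithFollowers models one possible fix: the write is admitted only
// if the local store and every follower store are below the cutoff.
func AdmitWithFollowers(local StoreHealth, followers []StoreHealth) bool {
	if !healthy(local) {
		return false
	}
	for _, f := range followers {
		if !healthy(f) {
			return false
		}
	}
	return true
}
```

With a healthy local store and one overloaded follower, `AdmitLocalOnly` admits the write while `AdmitWithFollowers` rejects it, which is the hotspot scenario described above. A real policy would likely throttle or pace rather than reject outright, and might tolerate a minority of slow followers.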
Jira issue: CRDB-14642
Epic CRDB-15069