kvserver: provide a way for replicas to re-enter the quota pool #82403
Description
Describe the problem
Replicas can get "ignored" by the quota pool under certain conditions. This primarily happens when a node restarts: we ignore the restarted follower so that it does not stall foreground traffic until it has caught up. However, there is no guarantee that the follower will ever catch up.
To Reproduce
I haven't actually done this, but if you take a write-heavy workload with a few hot ranges, take down a node for 1-2 minutes, then bring it up in a state in which it is slightly underprovisioned for the workload, it should forever lag behind, and the quota pool will not be helping it catch up.
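To make the "forever lag behind" claim concrete, here is a back-of-the-envelope model of the scenario above. It is only a sketch: `ProjectedLag` and `TimeToCatchUp` are hypothetical helpers, not kvserver code, and the rates are assumed to be constant log entries per second.

```go
package main

// ProjectedLag returns how many log entries the follower is behind after
// `seconds`, given an initial lag and constant rates (entries per second).
// All names and units here are illustrative assumptions.
func ProjectedLag(initialLag, writeRate, applyRate, seconds int64) int64 {
	lag := initialLag + (writeRate-applyRate)*seconds
	if lag < 0 {
		return 0 // the follower cannot be ahead of the leader
	}
	return lag
}

// TimeToCatchUp returns the number of seconds until the lag reaches zero.
// The boolean is false when the follower never catches up, i.e. when it
// applies entries no faster than the leader appends them.
func TimeToCatchUp(lag, writeRate, applyRate int64) (int64, bool) {
	if lag == 0 {
		return 0, true
	}
	if applyRate <= writeRate {
		return 0, false
	}
	drain := applyRate - writeRate
	return (lag + drain - 1) / drain, true // round up to whole seconds
}
```

For example, a node that is down for 90 seconds under 1000 entries/s starts 90000 entries behind. If it can only apply 950 entries/s once it is back, `TimeToCatchUp(90000, 1000, 950)` reports that it never catches up, and the lag grows by 50 entries every second unless something throttles the writers.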
Expected behavior
Hard to formulate! There are different regimes. If a follower is behind and is "hopelessly slow", foreground traffic shouldn't slow down in response to it (see #79215). But if it's only marginally slower (making "good progress"), perhaps only because it is a read-only satellite in a faraway region, we need to slowly bring it back into circulation. Otherwise AOST reads on this replica will fail forever, and, if it's a voter, availability will remain compromised forever, since the replica has to catch up before it can make forward progress.
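One way to picture the regimes is as a classification based on how fast the follower applies entries relative to how fast the leader appends them. The sketch below is purely illustrative: the `Regime` type, `ClassifyFollower`, and both thresholds are assumptions made up for this example, not an existing kvserver API or a proposed design.

```go
package main

// Regime is a hypothetical classification of a follower that the quota
// pool is currently ignoring.
type Regime int

const (
	CaughtUp       Regime = iota // lag is small enough to track normally
	GoodProgress                 // marginally slower; worth throttling writers for
	HopelesslySlow               // do not slow foreground traffic for it
)

// ClassifyFollower buckets a follower by its lag (in log entries) and by
// the ratio of its apply rate to the leader's append rate. maxLag and
// minRateRatio are made-up tuning knobs.
func ClassifyFollower(lag uint64, leaderRate, followerRate float64,
	maxLag uint64, minRateRatio float64) Regime {
	if lag <= maxLag {
		return CaughtUp
	}
	if leaderRate <= 0 {
		// No foreground writes: any follower that applies at all converges.
		if followerRate > 0 {
			return GoodProgress
		}
		return HopelesslySlow
	}
	if followerRate/leaderRate >= minRateRatio {
		return GoodProgress
	}
	return HopelesslySlow
}

// ShouldReenterQuotaPool reports whether the quota pool should start
// waiting on this follower again.
func ShouldReenterQuotaPool(r Regime) bool {
	return r != HopelesslySlow
}
```

With `maxLag = 100` and `minRateRatio = 0.8`, a follower applying 950 entries/s against a leader appending 1000 entries/s would be classified as `GoodProgress` and re-admitted, while one applying 200 entries/s would stay ignored. Where to draw that line, and how gradually to re-admit, is exactly the part that is hard to formulate.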
Additional context
I'm not sure we have struggled with this in practice, but it is a legitimate concern, and it becomes more important if, for #79215, we take an approach where the quota pool "temporarily" ignores followers that are overloaded (and stops sending appends to them). These nodes will "intentionally" fall behind, but nothing will ensure that they catch up once they have become healthy again.
Jira issue: CRDB-16355