kvserver: consider additional triggers for replica circuit breakers #74712
Description
Touches #33007.
Is your feature request related to a problem? Please describe.
As of #71806, the per-replica circuit breaker trips when a proposal is in-flight for more than 15s. However, there are other scenarios in which tripping the breaker may be worthwhile. For example, a read-only command may have deadlocked. It would then never release its latches, possibly blocking all write traffic to the range.
The current trigger may also never fire. If there is no lease and the client uses a statement timeout of less than 15s throughout, lease proposals (which have their own reproposal treatment, #74711) never stay in flight long enough to trigger the breaker. If there is a lease, the client's actual write will be in the proposals map and will stay there even if the client gives up, so in that case the trigger works (or should, anyway). Most client traffic consists of pipelined writes, so in some sense this is the common case too.
Another scenario is when a node goes non-live. This is an event associated with potential loss of quorum. Each Store could iterate through its replicas and, based on the active descriptor, consider tripping the breaker based on an expectation that this range (and thus replica) will now be unavailable. This will generally move affected replicas into the desirable fail-fast regime more efficiently; at the time of writing (#71806), replicas only start failing fast 15s after they have received a request that got stuck. When most ranges receive only sporadic writes, clients will see latencies of 15s or more for some time, which is not the UX we're ultimately after.
As more triggers are introduced, the risk of false positives (i.e. tripping the breaker of a healthy replica) increases. We could consider a semantic change: instead of outright tripping the breaker, a trigger only launches the breaker's probe. If the probe passes, all is well and the breaker never trips. If the probe fails, it trips the breaker.
Related (figuring out what "slow" request means): #74509
Related (more coverage for the probe): #74701
Related (limits of per-Replica circuit breakers): #74616 #74503
Jira issue: CRDB-12225