kvserver: fail fast when a range becomes unavailable#61311
lunevalex wants to merge 1 commit into cockroachdb:master from
Conversation
Fixes cockroachdb#33007. This commit introduces a new timer to the Raft proposal write path. If the timer fires while waiting for a proposal to commit and the range's applied index has not moved in that time, the range is assumed to be unavailable. This kicks off a new async task that periodically writes a dummy value to the range. Once a write on the range succeeds, the range is marked as available again. Release note (performance improvement): if a range becomes unavailable, all operations on that range will fail fast until it comes back up.
tbg
left a comment
There was a problem hiding this comment.
The broad strokes look like what I expected, thanks for getting this out! I gave this some regular code review comments which I don't think are appropriate for the level of detail this PR operates at so feel free to disregard them for now.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lunevalex)
pkg/base/constants.go, line 35 at r1 (raw file):
// UnavailableRangeCheckThreshold is the amount of time to wait before checking
// if a range is stuck and unavailable.
UnavailableRangeCheckThreshold = 5 * time.Second
This should be a cluster setting, and zero should disable the feature.
pkg/keys/constants.go, line 216 at r1 (raw file):
makeKey(LocalRangeLockTablePrefix, roachpb.Key(LockTableSingleKeyInfix).PrefixEnd()))
LocalRangeProbeSuffix = roachpb.RKey("prbw")
Add a comment.
pkg/kv/kvserver/client_replica_test.go, line 3839 at r1 (raw file):
tc.WaitForValues(t, key, []int64{0, 0, 0})
leader := tc.GetRaftLeader(t, roachpb.RKey(key))
Put that into the SucceedsSoon, or the flake risk is high. Also, don't you want the leaseholder here? That's the one who proposes.
pkg/kv/kvserver/client_replica_test.go, line 3846 at r1 (raw file):
return errors.Errorf("Expected replica %v to be marked as unavailable", leader)
})
This would be the place to verify the fail-fast behavior (by the way, I'd also expect your inc-value to get hit by the fail-fast, which I don't think it is right now).
pkg/kv/kvserver/replica_write.go, line 280 at r1 (raw file):
case <-slowTimer.C:
	slowTimer.Read = true
	r.store.metrics.SlowRaftRequests.Inc(1)
Should keep logging the old message.
pkg/kv/kvserver/replica_write.go, line 295 at r1 (raw file):
if err := r.checkRangeAvailable(); err != nil {
	log.Errorf(ctx, "unexpected failure %v while checking if the range has become available", err)
}
This doesn't look like it fails fast though? Shouldn't this now do what is done in the case of <-ctxDone below?
pkg/kv/kvserver/replica_write.go, line 830 at r1 (raw file):
})
err := contextutil.RunWithTimeout(ctx, "range-unavailable-checker-send", 60*time.Second, func(ctx context.Context) error {
	_, pErr := r.sendWithRangeID(ctx, r.RangeID, ba)
This feels like the wrong level for this check - and has weird behavior. For example, if you're on a follower, and there is no valid lease, then a proposal trying to get the lease might hang, then trigger this code, which would send another proposal that would then block on the original lease request. Which I think is fine, even, but it's ... complex? I think I'd rather have a way to make a proposal and put it right into raft to avoid looping back to a higher layer from a lower one, or at least that is my initial reaction. This seems fine for a prototype though!
What a fun issue & PR. This seems like something we should (eventually) export metrics about also? Those metrics would certainly be useful for debugging, as they home in on cause: you know concretely you've got long proposal times without movement of the appliedIndex leading to failing fast. By the way: how often is that a symptom one observes when a range is unavailable for writes?

Okay, with that out of the way, if you want to go down the rabbit hole of whether SRE should alert on this, why, & how it connects to a kvprober that does writes, please continue reading...

Would these metrics be a good alerting signal for SRE? I think the answer could be yes, so long as we can't think of anything the user can do to cause that symptom (other than overload the whole cluster). My guess is that contention can't cause long proposal times without movement of the appliedIndex? Could a hot range cause this specific symptom?

But why alert SRE on this if already alerting SRE on a kvprober that does writes (which covers a broader class of failures, e.g. kvclient is covered)? One possible benefit of such an additional alert would be faster detection time than with probing (probing requires time to hit the problem ranges). The con is more alerts / a more complex alerting stack. IMO it's good to minimize the number of alerts in the system (minimize the # of alerts AND maximize the # of metrics) given goals around coverage, speed of detection, etc. (& those goals are tied to a reliability objective for the system being monitored).

This brings me to my last Q: would the additional alert ever fire at a time when a kvprober that does writes would NOT fire? That would be an interesting result. I think the answer is NO, as a kvprober write can't succeed without movement of the appliedIndex. Or perhaps more to the point, if fail fast is on for a range, kvprober is obviously gonna error. So the benefit of the additional alert is NOT more coverage, just faster detection time, & possibly much faster!
Also can't wait to learn how many things I said above that are right-ish (hopefully) but not quite right... saying true statements about KV is hard ;) CC @andreimatei.
It's closely connected to the kvprober. In a sense, they are the same thing, just differing in the range selection: kvprober, as you know, attempts to do it pretty much "randomly", whereas this mechanism does it exactly where it's likely that there will be issues. I would be a little bummed if SREs didn't alert on this mechanism, as it will dramatically cut down on the response time (the prober might take hours to get there...). We should work through what it would take to get that done as part of productionizing this.
Always, which is why it was chosen (for this prototype).
Yup makes sense! Based on your response, seems like it has potential to both detect issues quickly & be very high signal to noise (truly unexpected issues only), the ideal alert.
Mhm! Interesting frame.
Well, it's lower in the stack than kvprober; can't you have issues higher up in the stack leading to write unavailability that don't even hit this codepath? For example, the distribution layer can't find the leaseholder, or there is no leaseholder, etc. Not saying this is not the right check for the circuit-breaker logic; just noting that ^^ ties back to the Q of how kvprober & this relate.
andreimatei
left a comment
There was a problem hiding this comment.
Have we considered tying the short-circuiting of requests (and possibly other downstream effects like the alerting being discussed) not to a per-proposal timer, but rather to network connectivity to a majority of replicas? Like, trigger a connectivity check on the timer, which check could be implemented by the RPCContext (which is in charge of maintaining connections). That would seem to me like a more stable system which centralizes somewhat the healthy/non-healthy decisions: instead of having each range decide by itself, you have the RPCContext decide for each pair of nodes. This way even a range that doesn't have a proposal in-flight could conceivably be considered unavailable.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lunevalex)
I am not so bold as to give an opinion about the short-circuit Q, but I'd rather alert SRE on the actual bad result happening (a write IS stuck) than on a specific thing like network issues that might be the cause of the bad result (the network is "bad"). I expect better signal & better coverage from the former (a write IS stuck) than the latter (the network is "bad"), as (i) it's hard to define a "bad" network & (ii) (probably more importantly) there are causes of unavailability other than the network.

BUT wouldn't it be great if CRDB / metrics exported by CRDB could tell us (i) a write to range X IS stuck AND (ii) the likely cause IS a "bad" network. So I like the idea but view the benefit as being about better observability, not better alerting. Do you see what I mean, Andrei?
I can see this as being the counter-argument to what I'm saying, but I think we should focus on getting to a state where we reliably alert on actual unavailability before we alert on specific causes that could lead to unavailability later.
A signal coming from this PR should always imply that the prober would also log a failure if it happened to choose this range. The prober will also squarely fall over if the network is bad. In my view, the right way to find out if the network is bad is checking something specific to the network, such as probing the bidirectional connectivity to all other members of the cluster. The specific check was also inspired by a customer incident: if the network is fine and the range is still stuck, we want to know. We want to specifically short-circuit the range if it is unavailable. The check was intentionally chosen to be agnostic as to the "why".
Yes, this PR is lower-level. If DistSender doesn't route anything, this PR won't fire but the kvprober will.
We want to know when a request is "stuck", but I don't know if that necessarily means that we want to start short-circuiting other requests to the range unless we know what's going on. I'm not sure we should aim to protect against arbitrary bugs. Also, with a 5s timeout, I'm afraid this will be too trigger happy.
In the very case motivating this PR, the range was unavailable for a reason we previously did not have on the radar. The point of the circuit breaker is to limit the blast radius in that case. A general mechanism makes sense, but we have to tune it well and it needs an escape hatch in case it does fire unexpectedly.
On Fri, Mar 19, 2021, 18:02 Andrei Matei wrote:
The specific check was also inspired by a customer incident: if the
network is fine and the range is still stuck, we want to know. We want to
specifically short-circuit the range if it is unavailable. The check was
intentionally chosen to be agnostic as to the "why".
We want to *know* when a request is "stuck", but I don't know if that
necessarily means that we want to start short-circuiting other requests to
the range unless we know what's going on. I'm not sure we should aim to
protect against arbitrary bugs. Also, with a 5s timeout, I'm afraid this
will be too trigger happy.
But having said this, I'm not actually sure that we can pull off lifting
the unavailability detection to the RPCContext given the dynamicity of
the Raft group.
This work will likely inform #53410.
Fixes cockroachdb#33007. Closes cockroachdb#61311.

This PR introduces a new circuit breaker package that was first prototyped in cockroachdb#70485. These circuit breakers never recruit regular requests to do the probing but instead have a configurable probe attached that determines when the breaker untrips. (It can be tripped proactively or by client traffic, similar to the old breaker.)

They are then used to address cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, whenever a request (a lease acquisition attempt or a replicated write) does not manage to replicate within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped. The corresponding probe uses a newly introduced `NoopWrite`, which is a writing request that does not mutate state but which always goes through the replication layer and which gets to bypass the lease.

TODO (generally pulling sizeable chunks out into their own PRs and landing them in some good order):

- [ ] rewrite circuit breaker internals to avoid all of the `unsafe`
- [ ] make base.SlowRequestThreshold overridable via TestingKnob
- [ ] add end-to-end test using TestCluster verifying the tripping and fail-fast behavior under various unavailability conditions (for example blocking during evaluation, or making the liveness range unavailable)
- [ ] add version gate for NoopWriteRequest (own PR)
- [ ] add targeted tests for NoopWriteRequest (in PR above)
- [ ] add cluster setting to disable breakers
- [ ] introduce a structured error for circuit breaker failures and file issue for SQL Observability to render this error nicely (translating table names, etc.)
- [ ] make sure the breaker also trips on pipelined writes
- [ ] address, file issues for, or explicitly discard any inline TODOs added in the diff
- [ ] write the final release note

Release note (ops change): TODO
Fixes cockroachdb#33007. Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address issue cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that is sensitive to the circuit breaker tripping (relying on context cancellation makes sure we don't end up with a second channel that needs to be plumbed to everyone and their dog; Go makes it really difficult to join two cancellation chains); if the breaker is tripped when the request arrives, it is refused right away. In either case, the outgoing error is augmented to carry information about the tripped breaker. This isn't 100% trivial, since retreating from in-flight replication typically causes an `AmbiguousResultError`, and while we could avoid it in some cases we can't in all. A similar problem occurs when the canceled request was waiting on a lease, in which case the result is a `NotLeaseholderError`.

For any request that made it in, if it enters replication but does not manage to succeed within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped, cancelling all inflight and future requests until the breaker heals. Perhaps surprisingly, the existing "slowness" detection (the informational "have been waiting ... for proposal" warning in `executeWriteBatch`) was moved deeper into replication (`refreshProposalsLocked`), where it now trips the breaker. This has the advantage of providing a unified path for lease requests (which don't go through `executeWriteBatch`) and pipelined writes (which return before waiting on the inflight replication process). To make this work, we need to pick a little fight with how leases are (not) refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal was first inserted (despite potential reproposals).
A tripped breaker deploys a background probe which sends a `ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is roughly the case whenever replicating a command through Raft is possible for the Replica, either by appending to the log as the Raft leader, or by forwarding to the Raft leader. A tiny bit of special casing allows requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the `kv.replica_circuit_breakers.enabled` cluster setting while we determine which of the many follow-up items to target for 22.1 (see cockroachdb#74705). Primarily though, the breaker-sensitive context cancelation is a toy implementation that comes with too much of a performance overhead (one extra goroutine per request); cockroachdb#74707 will address that. Other things not done here that we certainly want in the 22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
Fixes cockroachdb#33007. Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address issue cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that is sensitive to the circuit breaker tripping (relying on context cancellation makes sure we don't end up with a second channel that needs to be plumbed to everyone and their dog; Go makes it really difficult to join two cancellation chains); if the breaker is tripped when the request arrives, it is refused right away. In either case, the outgoing error is augmented to carry information about the tripped breaker. This isn't 100% trivial, since retreating from in-flight replication typically causes an `AmbiguousResultError`, and while we could avoid it in some cases we can't in all. A similar problem occurs when the canceled request was waiting on a lease, in which case the result is a `NotLeaseholderError`.

For any request that made it in, if it enters replication but does not manage to succeed within the timeout specified by the `kv.replica_circuit_breaker.slow_replication_threshold` cluster setting, the breaker is tripped, cancelling all inflight and future requests until the breaker heals. Perhaps surprisingly, the existing "slowness" detection (the informational "have been waiting ... for proposal" warning in `executeWriteBatch`) was moved deeper into replication (`refreshProposalsLocked`), where it now trips the breaker. This has the advantage of providing a unified path for lease requests (which don't go through `executeWriteBatch`) and pipelined writes (which return before waiting on the inflight replication process). To make this work, we need to pick a little fight with how leases are (not) refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal was first inserted (despite potential reproposals).
Perhaps surprisingly, when the breaker is tripped, *all* traffic to the Replica gets the fail-fast behavior, not just mutations. This is because even though a read may look good to serve based on the lease, we would also need to check for latches, and in particular we would need to fail fast if there is a transitive dependency on any write (since that write is presumably stuck). This is not trivial, and so we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a `ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is roughly the case whenever replicating a command through Raft is possible for the Replica, either by appending to the log as the Raft leader, or by forwarding to the Raft leader. A tiny bit of special casing allows requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the entire mechanism) until some necessary follow-up items have been addressed (see cockroachdb#74705). For example, the breaker-sensitive context cancelation is a toy implementation that comes with too much of a performance overhead (one extra goroutine per request); cockroachdb#74707 will address that. Other things not done here that we certainly want in the 22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505). The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Fixes #33007
This commit introduces a new Timer to the raft proposal write path.
If the timer fires while waiting for a proposal to commit and the
range's applied index has not moved in that time, the replica
assumes the range is unavailable. This kicks off a new async task
that periodically writes a dummy value to the range. Once a write
on the range succeeds, the range is marked as available again.
Release note (performance improvement): if a range becomes unavailable,
all operations on that range will fail fast until it comes back up.