kvserver: fail fast when a range becomes unavailable#61311

Closed
lunevalex wants to merge 1 commit into cockroachdb:master from lunevalex:alex/fail-fast-unavailable-range

Conversation

@lunevalex

Fixes #33007

This commit introduces a new Timer to the raft proposal write path.
If the timer fires while waiting for a proposal to commit and the
range's appliedIndex has not moved in that time, it will assume the
range is unavailable. This will kick off a new async task to
periodically write a dummy value to the range. Once a write on the
range succeeds, it is marked as available.

Release note (performance improvement): if a range becomes unavailable,
all operations on that range will fail fast until it comes back up.

@lunevalex lunevalex requested a review from tbg March 2, 2021 03:47
@cockroach-teamcity
Member

This change is Reviewable

Member

@tbg tbg left a comment


The broad strokes look like what I expected, thanks for getting this out! I gave this some regular code review comments which I don't think are appropriate for the level of detail this PR operates at so feel free to disregard them for now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex)


pkg/base/constants.go, line 35 at r1 (raw file):

	// UnavailableRangeCheckThreshold is the amount of time to wait before checking
	// if a range is stuck and unavailable.
	UnavailableRangeCheckThreshold = 5 * time.Second

This should be a cluster setting, and zero should disable the feature.


pkg/keys/constants.go, line 216 at r1 (raw file):

		makeKey(LocalRangeLockTablePrefix, roachpb.Key(LockTableSingleKeyInfix).PrefixEnd()))

	LocalRangeProbeSuffix = roachpb.RKey("prbw")

Add a comment


pkg/kv/kvserver/client_replica_test.go, line 3839 at r1 (raw file):

	tc.WaitForValues(t, key, []int64{0, 0, 0})

	leader := tc.GetRaftLeader(t, roachpb.RKey(key))

Put that into the `SucceedsSoon`, or the flake risk is high. Also, don't you want the leaseholder here? That's the one who proposes.


pkg/kv/kvserver/client_replica_test.go, line 3846 at r1 (raw file):

		return errors.Errorf("Expected replica %v to be marked as unavailable", leader)
	})

This would be the place to verify the fail-fasting (btw, I'd also expect your inc-value to get hit by the fail-fast, which I don't think it is right now).


pkg/kv/kvserver/replica_write.go, line 280 at r1 (raw file):

		case <-slowTimer.C:
			slowTimer.Read = true
			r.store.metrics.SlowRaftRequests.Inc(1)

Should keep logging the old message.


pkg/kv/kvserver/replica_write.go, line 295 at r1 (raw file):

					if err := r.checkRangeAvailable(); err != nil {
						log.Errorf(ctx, "unexpected failure %v while checking if the range has become available", err)
					}

This doesn't look like it fails fast though? Shouldn't this now do what is done in the case of <-ctxDone below?


pkg/kv/kvserver/replica_write.go, line 830 at r1 (raw file):

			})
			err := contextutil.RunWithTimeout(ctx, "range-unavailable-checker-send", 60*time.Second, func(ctx context.Context) error {
				_, pErr := r.sendWithRangeID(ctx, r.RangeID, ba)

This feels like the wrong level for this check - and has weird behavior. For example, if you're on a follower, and there is no valid lease, then a proposal trying to get the lease might hang, then trigger this code, which would send another proposal that would then block on the original lease request. Which I think is fine, even, but it's ... complex? I think I'd rather have a way to make a proposal and put it right into raft to avoid looping back to a higher layer from a lower one, or at least that is my initial reaction. This seems fine for a prototype though!

@joshimhoff
Collaborator

What a fun issue & PR.

This seems like something we should (eventually) export metrics about also? Those metrics would certainly be useful for debugging, as they hone in on cause. You know concretely you've got long proposal times without movement of the appliedIndex leading to failing fast. By the way: How often is that a symptom one observes when a range is unavailable for writes?

Okay, with that out of the way, if you want to go down the rabbit hole of should SRE alert on this, why, & how it connects to a kvprober that does writes, plz continue reading...

Would these metrics be a good alerting signal for SRE? I think the answer could be yes, so long as we can't think of anything the user can do to cause that symptom (other than overload the whole cluster). My guess is that contention can't cause long proposal times without movement of the appliedIndex? Could a hot range cause this specific symptom?

But why alert SRE on this if we're already alerting SRE on a kvprober that does writes (which covers a broader class of failures, e.g. kvclient is covered)? One possible benefit of such an additional alert would be faster detection time than with probing (probing requires time to hit the problem ranges). The con is more alerts / a more complex alerting stack. IMO it's good to minimize the number of alerts in the system (minimize the # of alerts AND maximize the # of metrics) given goals around coverage, speed of detection, etc. (& those goals are tied to a reliability objective for the system being monitored).

This brings me to my last Q: Would the additional alert ever fire at a time when a kvprober that does writes would NOT fire? That would be an interesting result. I think the answer is NO as a kvprober write can't succeed without movement of the appliedIndex. Or perhaps more to the point, if fail fast is on for a range kvprober is obviously gonna error. So then the benefit of the additional alert is NOT more coverage, just faster detection time, & possibly much faster!

Also can't wait to learn how many things I said above that are right-ish (hopefully) but not quite right... saying true statements about KV is hard ;)

CC @andreimatei.

@tbg
Member

tbg commented Mar 19, 2021

It's closely connected to the kvprober. In a sense, they are the same thing, just differing in the range selection. kvprober, as you know, attempts to do pretty much "random" selection, whereas this mechanism does it exactly where it's likely that there will be issues.

I would be a little bummed if SREs didn't alert on this mechanism, as it will dramatically cut down on the response time (the prober might take hours to get there...). We should work through what it would take to get that done as part of productionizing this.

@tbg
Member

tbg commented Mar 19, 2021

By the way: How often is that a symptom one observes when a range is unavailable for writes?

Always, which is why it was chosen (for this prototype)

@joshimhoff
Collaborator

joshimhoff commented Mar 19, 2021

I would be a little bummed if SREs didn't alert on this mechanism, as it will dramatically cut down on the response time (the prober might take hours to get there...).

Yup makes sense! Based on your response, seems like it has potential to both detect issues quickly & be very high signal to noise (truly unexpected issues only), the ideal alert.

In a sense, they are the same thing just differing in the range selection. kvprober as you know attempts to do pretty much "random", whereas this mechanism does it exactly where it's likely that there will be issues.

Mhm! Interesting frame.

Always, which is why it was chosen (for this prototype)

Well it's lower in the stack than kvprober; can't you have issues higher up in the stack leading to write unavailability that don't even hit this codepath? For example the distribution layer can't find the leaseholder or there is no leaseholder, etc. etc.

Not saying this is not the right check for the circuit break logic; just ^^ ties back to the Q of how kvprober & this relate.

Contributor

@andreimatei andreimatei left a comment


Have we considered tying the short-circuiting of requests (and possibly other downstream effects, like the alerting being discussed) not directly to a per-proposal timer, but rather to network connectivity to a majority of replicas? Like, trigger a connectivity check on the timer, which check could be implemented by the RPCContext (which is in charge of maintaining connections). That would seem to me like a more stable system which centralizes somewhat the healthy/non-healthy decisions: instead of having each range decide by itself, you have the RPCContext decide for each pair of nodes. This way even a range that doesn't have a proposal in-flight could conceivably be considered unavailable.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex)

@joshimhoff
Collaborator

joshimhoff commented Mar 19, 2021

Have we considered tying the short-circuiting of requests (and possibly other downstream effects, like the alerting being discussed) not directly to a per-proposal timer, but rather to network connectivity to a majority of replicas?

I am not so bold as to give an opinion about the short-circuit Q, but I'd rather alert SRE on the actual bad result happening (a write IS stuck), rather than on a specific thing like network issues that might be the cause of the bad result (the network is "bad"). I expect better signal & better coverage from the former (a write IS stuck) than the latter (the network is "bad"), as (i) it's hard to define a "bad" network & (ii) (prob more importantly) there are causes of unavailability other than the network.

BUT wouldn't it be great if CRDB / metrics exported by CRDB could tell us (i) a write to range X IS stuck AND (ii) the likely cause IS a "bad" network. So I like the idea but view the benefit as being about better observability, not better alerting. Do you see what I mean, Andrei?

This way even a range that doesn't have a proposal in-flight could conceivably be considered unavailable.

I can see this as being the counterargument to what I'm saying, but I think we should focus on getting to a state where we reliably alert on actual unavailability before we alert on specific causes that could lead to unavailability later.

@tbg
Member

tbg commented Mar 19, 2021

A signal coming from this PR should always imply that the prober would also log a failure if it happened to choose this range. The prober will also squarely fall over if the network is bad. In my view, the right way to find out if the network is bad is to check something specific to the network, such as probing the bidirectional connectivity to all other members of the cluster.

The specific check was also inspired by a customer incident: if the network is fine and the range is still stuck, we want to know. We want to specifically short-circuit the range if it is unavailable. The check was intentionally chosen to be agnostic as to the "why".

@tbg
Member

tbg commented Mar 19, 2021

Well it's lower in the stack than kvprober; can't you have issues higher up in the stack leading to write unavailability that don't even hit this codepath? For example the distribution layer can't find the leaseholder or there is no leaseholder, etc. etc.

Yes, this PR is lower-level. If DistSender doesn't route anything, this PR won't fire but the kvprober will.

@andreimatei
Contributor

The specific check was also inspired by a customer incident: if the network is fine and the range is still stuck, we want to know. We want to specifically short-circuit the range if it is unavailable. The check was intentionally chosen to be agnostic as to the "why".

We want to know when a request is "stuck", but I don't know if that necessarily means that we want to start short-circuiting other requests to the range unless we know what's going on. I'm not sure we should aim to protect against arbitrary bugs. Also, with a 5s timeout, I'm afraid this will be too trigger-happy.
But having said this, I'm not actually sure that we can pull off lifting the unavailability detection to the RPCContext given the dynamicity of the Raft group.

@tbg
Member

tbg commented Mar 19, 2021 via email

@knz
Contributor

knz commented Apr 22, 2021

This work will likely inform #53410.

tbg added a commit to tbg/cockroach that referenced this pull request Nov 9, 2021
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR introduces a new circuit breaker package that was first
prototyped in cockroachdb#70485. These circuit breakers never recruit regular
requests to do the probing but instead have a configurable probe
attached that determines when the breaker untrips. (It can be tripped
proactively or by client traffic, similar to the old breaker).

They are then used to address cockroachdb#33007: when a replica becomes
unavailable, it should eagerly refuse traffic that it believes
would simply hang.

Concretely, whenever a request (a lease acquisition attempt or
a replicated write) does not manage to replicate within
`base.SlowRequestThreshold` (15s at time of writing), the
breaker is tripped. The corresponding probe uses a newly
introduced `NoopWrite` which is a writing request that does
not mutate state but which always goes through the replication
layer and which gets to bypass the lease.

TODO (generally pulling sizeable chunks out into their own PRs and
landing them in some good order):

- [ ] rewrite circuit breaker internals to avoid all of the `unsafe`
- [ ] make base.SlowRequestThreshold overridable via TestingKnob
- [ ] add end-to-end test using TestCluster verifying the tripping and
      fail-fast behavior under various unavailability conditions (for
      example blocking during evaluation, or making the liveness range
      unavailable).
- [ ] add version gate for NoopWriteRequest (own PR)
- [ ] add targeted tests for NoopWriteRequest (in PR above)
- [ ] add cluster setting to disable breakers
- [ ] introduce a structured error for circuit breaker failures and file
      issue for SQL Observability to render this error nicely
      (translating table names, etc)
- [ ] Make sure the breaker also trips on pipelined writes.
- [ ] address, file issues for, or explicitly discard any inline TODOs
      added in the diff.
- [ ] write the final release note.

Release note (ops change): TODO
tbg added a commit to tbg/cockroach that referenced this pull request Dec 9, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Dec 23, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Jan 3, 2022
tbg added a commit to tbg/cockroach that referenced this pull request Jan 5, 2022
tbg added a commit to tbg/cockroach that referenced this pull request Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not)
refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal
was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is off via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).
Primarily though, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 11, 2022
tbg added a commit to tbg/cockroach that referenced this pull request Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
`NotLeaseholderError`.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

craig bot pushed a commit that referenced this pull request Jan 14, 2022
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Fixes #33007.
Closes #61311.

This PR uses the circuit breaker package introduced in #73641 to address
issue #33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
`NotLeaseholderError`.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in #72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see #74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); #74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (#74713) and metrics (#74505).

The release note is deferred to #74705 (where breakers are turned on).

Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@craig craig bot closed this in 6664d0c Jan 14, 2022
