kvcoord: add DistSender circuit breakers by erikgrinaker · Pull Request #118943 · cockroachdb/cockroach

erikgrinaker · 2024-02-08T12:57:57Z

This patch adds an initial implementation of DistSender replica circuit breakers. Their primary purpose is to prevent the DistSender getting stuck on non-functional replicas. In particular, the DistSender relies on receiving a NLHE from the replica to update its range cache and try other replicas, otherwise it will keep sending requests to the same broken replica which will continue to get stuck, giving the appearance of an unavailable range. This can happen if:

The replica stalls, e.g. with a disk stall or mutex deadlock.
Clients time out before the replica lease acquisition attempt times out, e.g. if the replica is partitioned away from the leader.

If a replica has returned only errors in the past few seconds, or hasn't returned any responses at all, the circuit breaker will probe the replica by sending a LeaseInfo request. This must either return success or a NLHE pointing to a leaseholder. Otherwise, the circuit breaker trips, and the DistSender will skip it for future requests, optionally also cancelling in-flight requests.

Currently, only replica-level circuit breakers are implemented. If a range is unavailable, the DistSender will continue to retry replicas as today. Range-level circuit breakers can be added later if needed, but are considered out of scope here.

Some follow-up work is needed:

Improve probe scalability. Currently, a goroutine is spawned per replica probe, which is likely too expensive at large scales. We should consider batching probes to nodes/stores, and using a bounded worker pool.
Consider follower read handling, e.g. by tracking the replica's closed timestamp and allowing requests that may still be served by it even if it's partitioned away from the leaseholder.
Improve observability, with metrics, tracing, and logging.
Comprehensive testing and benchmarking.

This will be addressed separately.

Resolves #105168.
Resolves #104262.
Resolves #81100.
Resolves #80713.
Epic: none
Release note (general change): gateways will now detect faulty or stalled replicas and use other replicas instead, which can prevent them getting stuck in certain cases (e.g. with disk stalls). This behavior can be disabled via the cluster setting kv.dist_sender.circuit_breaker.enabled.

cockroach-teamcity · 2024-02-08T12:58:05Z

This change is

erikgrinaker · 2024-02-12T12:54:37Z

I decided to remove the intermediate range-level tracking for now, and keep range-level circuit breakers out of scope -- it comes with a cost, and we should aggressively limit scope for 24.1 to the problem at hand (stalled/partitioned replicas). We can deal with range-level circuit breakers later.

erikgrinaker · 2024-02-14T12:56:25Z

I think this is starting to take shape. By default, we don't cancel in-flight requests when the breaker trips, only subsequent requests. Here are some rough benchmarks (single run) after some initial optimization, which show negligible overhead of about 110 ns per request in the common case, even at high concurrency. This incurs a single allocation due to a returned closure, which can possibly be eliminated.

BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1-24          	 9872906	       112.0 ns/op	      48 B/op	       1 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=16-24         	 7917506	       142.0 ns/op	      48 B/op	       1 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=64-24         	10722211	       114.4 ns/op	      48 B/op	       1 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1024-24       	10988108	       110.6 ns/op	      48 B/op	       1 allocs/op

If we enable cancellation of in-flight requests, the cost increases substantially due to the additional context injection and tracking, especially at high concurrency. Some of this can be optimized away, but it gives us an upper bound on what this tracking would cost.

BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1-24           	 3737722	       308.9 ns/op	     144 B/op	       3 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=16-24          	 1580942	       770.1 ns/op	     144 B/op	       2 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=64-24          	 1486292	       810.9 ns/op	     144 B/op	       3 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1024-24        	 1458637	       841.4 ns/op	     144 B/op	       3 allocs/op

For comparison, BenchmarkKV/Scan/Native/rows=1 takes about 41,000 ns/op, so this adds 0.3% and 2% overhead for the non-cancellation and cancellation cases respectively. On an end-to-end kv100 workload this overhead will likely be negligible.

We'll need to decide whether cancellation is necessary or not. It may be possible to get the cost under high concurrency down by about half (to ~400 ns/op), which would give about 1% overhead on BenchmarkKV.

erikgrinaker · 2024-02-17T11:52:00Z

Got rid of the closure allocation.

BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1-24          	14411810	        82.75 ns/op	       0 B/op	       0 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=16-24         	 7982485	       149.6 ns/op	       0 B/op	       0 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=64-24         	10309857	       111.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=false/alone=true/err=nil/conc=1024-24       	10863777	       113.9 ns/op	       0 B/op	       0 allocs/op

Might be possible to make the cancel tracking lock-free. However, I think WithCancel() always allocates, and there may not be a way to avoid that -- but a single allocation may also be an acceptable cost as long as we get rid of the mutex. Will revisit later.

BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1-24           	 4122973	       277.1 ns/op	      96 B/op	       2 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=16-24          	 1587454	       763.2 ns/op	      96 B/op	       1 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=64-24          	 1487504	       802.7 ns/op	      96 B/op	       2 allocs/op
BenchmarkDistSenderCircuitBreakersTrack/cancel=true/alone=true/err=nil/conc=1024-24        	 1442265	       835.6 ns/op	      96 B/op	       2 allocs/op

blathers-crl · 2024-02-18T14:44:24Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

erikgrinaker · 2024-02-22T09:29:47Z

@nvanbenschoten I added the context timeout check we discussed, such that we won't set up a cancellable context if the context already has a timeout below the probe threshold + timeout.

I also enabled both the circuit breakers and cancellation by default.

nvb

Reviewed 1 of 3 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @erikgrinaker)

pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 642 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yeah, this was bugging me as well. The complexity here comes from the desire to have the invariant that stallSince == 0 when inflightReqs == 0. This is only because the semantics seemed simpler, but we don't really need it -- we can simply keep bumping stallSince on first requests and any response, and additionally require checking inflightReqs before considering the replica stalled. I've made that change.

I think the broader question we should answer is whether we should always enable cancellation. If we do, then we'll likely have to use a mutex anyway, and we can just do all of this stuff under lock.

This is much nicer now. Agreed on not enforcing invariants between separate atomic values when doing so isn't strictly necessary.

pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 289 at r2 (raw file):

				key := cbKey{rangeID: cb.rangeID, replicaID: cb.desc.ReplicaID}
				_, ok := d.mu.replicas[key]
				delete(d.mu.replicas, key) // nolint:deferunlockcheck

Out of curiosity, why do these two lines need the nolint:deferunlockcheck? Why isn't this on the Unlock above?

pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 295 at r2 (raw file):

				// will only be closed once.
				if ok {
					close(cb.closedC) // nolint:deferunlockcheck

If the breaker was removed and a new one was re-inserted under the same key, shouldn't we be closing the channel on the new breaker? In other words, should we either have logic like:

cb2, ok := d.mu.replicas[key]
if ok && (cb == cb2) {
    delete(d.mu.replicas, key)
    close(cb.closedC)
}

Or logic like:

cb2, ok := d.mu.replicas[key]
...
close(cb2.closedC)

If the latter, then maybe gc should be a list of cbKey structs to avoid confusion.

pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 37 at r3 (raw file):

		"kv.dist_sender.circuit_breaker.enabled",
		"enable circuit breakers for failing or stalled replicas",
		true, // TODO(erikgrinaker): disable by default?

Minor suggestion: for your own sake, consider setting this to false until you get back from your week off, then switching it to true.

erikgrinaker

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)

pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 289 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Out of curiosity, why do these two lines need the nolint:deferunlockcheck? Why isn't this on the Unlock above?

The linter gets confused by the periodic mutex release. It thinks this line is between an inline lock/unlock. The linter allows some operations between inline locks, but operations that aren't (e.g. function calls) needs the nolint exception.

pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 295 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If the breaker was removed and a new one was re-inserted under the same key, shouldn't we be closing the channel on the new breaker? In other words, should we either have logic like:
cb2, ok := d.mu.replicas[key]
if ok && (cb == cb2) {
    delete(d.mu.replicas, key)
    close(cb.closedC)
}
Or logic like:
cb2, ok := d.mu.replicas[key]
...
close(cb2.closedC)
If the latter, then maybe gc should be a list of cbKey structs to avoid confusion.

Good call, thanks -- made gc []cbKey.

pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 37 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Minor suggestion: for your own sake, consider setting this to false until you get back from your week off, then switching it to true.

Sure, done. The guidance is to merge all major changes before the focus period, which I guess means we should have this enabled on master by Monday, but since noone will be around to follow up on any fallout anyway I guess it's fine to leave it off for another week.

erikgrinaker

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)

pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 37 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Sure, done. The guidance is to merge all major changes before the focus period, which I guess means we should have this enabled on master by Monday, but since noone will be around to follow up on any fallout anyway I guess it's fine to leave it off for another week.

Reverted to enabled by default, since we didn't merge this. Should be good to go now?

erikgrinaker · 2024-03-04T13:28:38Z

Verified that this recovers failover/non-system/deadlock/lease=expiration in less than 10 seconds.

This patch adds an initial implementation of DistSender replica circuit breakers. Their primary purpose is to prevent the DistSender getting stuck on non-functional replicas. In particular, the DistSender relies on receiving a NLHE from the replica to update its range cache and try other replicas, otherwise it will keep sending requests to the same broken replica which will continue to get stuck, giving the appearance of an unavailable range. This can happen if: - The replica stalls, e.g. with a disk stall or mutex deadlock. - Clients time out before the replica lease acquisition attempt times out, e.g. if the replica is partitioned away from the leader. If a replica has returned only errors in the past few seconds, or hasn't returned any responses at all, the circuit breaker will probe the replica by sending a `LeaseInfo` request. This must either return success or a NLHE pointing to a leaseholder. Otherwise, the circuit breaker trips, and the DistSender will skip it for future requests, optionally also cancelling in-flight requests. Currently, only replica-level circuit breakers are implemented. If a range is unavailable, the DistSender will continue to retry replicas as today. Range-level circuit breakers can be added later if needed, but are considered out of scope here. The circuit breakers are disabled by default for now. Some follow-up work is likely needed before they can be enabled by default: * Improve probe scalability. Currently, a goroutine is spawned per replica probe, which is likely too expensive at large scales. We should consider batching probes to nodes/stores, and using a bounded worker pool. * Consider follower read handling, e.g. by tracking the replica's closed timestamp and allowing requests that may still be served by it even if it's partitioned away from the leaseholder. * Improve observability, with metrics, tracing, and logging. * Comprehensive testing and benchmarking. This will be addressed separately. Epic: none Release note (general change): gateways will now detect faulty or stalled replicas and use other replicas instead, which can prevent them getting stuck in certain cases (e.g. with disk stalls). This behavior can be disabled via the cluster setting `kv.dist_sender.circuit_breaker.enabled`.

nvb

Reviewed 3 of 3 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)

nvb · 2024-03-04T22:18:06Z

Merging on behalf of Erik.

bors r+

craig · 2024-03-04T23:42:48Z

Build succeeded:

Bazel Essential CI (Cockroach)

erikgrinaker requested review from andrewbaptist and nvb February 8, 2024 12:57

erikgrinaker requested a review from a team as a code owner February 8, 2024 12:57

erikgrinaker requested a review from a team February 8, 2024 12:57

erikgrinaker marked this pull request as draft February 8, 2024 13:06

erikgrinaker mentioned this pull request Feb 8, 2024

kvserver: DistSender lease detection prototype #118755

Closed

erikgrinaker force-pushed the distsender-circuit-breaker branch 2 times, most recently from ff1bfdb to 91eb907 Compare February 12, 2024 12:49

erikgrinaker force-pushed the distsender-circuit-breaker branch from 91eb907 to 1a263ba Compare February 12, 2024 12:57

erikgrinaker mentioned this pull request Feb 12, 2024

kvserver: DistSender should detect lease expiration and redirect requests #105168

Closed

erikgrinaker force-pushed the distsender-circuit-breaker branch from 1a263ba to c523167 Compare February 12, 2024 16:37

erikgrinaker mentioned this pull request Feb 13, 2024

batcheval: don't take out latches for LeaseInfo #119130

Merged

erikgrinaker force-pushed the distsender-circuit-breaker branch 3 times, most recently from 5398b7c to e7e1a4c Compare February 14, 2024 12:38

erikgrinaker force-pushed the distsender-circuit-breaker branch from e7e1a4c to 47224e2 Compare February 17, 2024 11:43

erikgrinaker force-pushed the distsender-circuit-breaker branch 3 times, most recently from 6d7faf1 to ffffde2 Compare February 18, 2024 14:44

erikgrinaker force-pushed the distsender-circuit-breaker branch from ffffde2 to 2296f4e Compare February 18, 2024 16:19

This was referenced Feb 18, 2024

kvcoord: add Transport.Reset() #119349

Merged

util/circuit: allow nil event handler #119350

Merged

kvserver: add Replica.GetMutexForTesting() #119351

Merged

erikgrinaker force-pushed the distsender-circuit-breaker branch 2 times, most recently from 74e65c1 to d26a1be Compare February 18, 2024 17:50

erikgrinaker force-pushed the distsender-circuit-breaker branch from c749999 to f6730c1 Compare February 22, 2024 08:56

nvb reviewed Feb 22, 2024

View reviewed changes

erikgrinaker force-pushed the distsender-circuit-breaker branch from f6730c1 to 20444e0 Compare February 23, 2024 11:30

erikgrinaker requested a review from nvb February 23, 2024 11:31

erikgrinaker commented Feb 23, 2024

View reviewed changes

erikgrinaker force-pushed the distsender-circuit-breaker branch from 20444e0 to dc02c57 Compare March 4, 2024 12:05

erikgrinaker commented Mar 4, 2024

View reviewed changes

erikgrinaker mentioned this pull request Mar 4, 2024

kvcoord: don't probe idle replicas with tripped circuit breakers #119865

Merged

erikgrinaker force-pushed the distsender-circuit-breaker branch from dc02c57 to 34a44e8 Compare March 4, 2024 13:53

erikgrinaker force-pushed the distsender-circuit-breaker branch from 34a44e8 to 68a90a6 Compare March 4, 2024 14:01

erikgrinaker mentioned this pull request Mar 4, 2024

kvserver: disk stall prevents lease transfer #81100

Closed

nvb approved these changes Mar 4, 2024

View reviewed changes

cockroachdb deleted a comment from craig bot Mar 4, 2024

craig bot merged commit bf013ea into cockroachdb:master Mar 4, 2024

yuzefovich mentioned this pull request Mar 6, 2024

roachtest: tpch_concurrency/no_streamer failed #119983

Closed

erikgrinaker mentioned this pull request Mar 11, 2024

kvcoord: separate read/write tracking for DistSender circuit breakers #120195

Open

erikgrinaker mentioned this pull request Mar 25, 2024

logictest: [result is ambiguous: after X.XXs of attempting command: context canceled] TestLogic_upsert failed #120395

Closed

arulajmani mentioned this pull request May 8, 2024

kvclient: DistSender circuit breakers for unreachable leaseholders #93501

Closed

Conversation

erikgrinaker commented Feb 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Feb 8, 2024

Uh oh!

erikgrinaker commented Feb 12, 2024

Uh oh!

erikgrinaker commented Feb 14, 2024

Uh oh!

erikgrinaker commented Feb 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

blathers-crl bot commented Feb 18, 2024

Uh oh!

erikgrinaker commented Feb 22, 2024

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

erikgrinaker left a comment

Choose a reason for hiding this comment

Uh oh!

erikgrinaker left a comment

Choose a reason for hiding this comment

Uh oh!

erikgrinaker commented Mar 4, 2024

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

nvb commented Mar 4, 2024

Uh oh!

craig bot commented Mar 4, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

erikgrinaker commented Feb 8, 2024 •

edited

Loading

erikgrinaker commented Feb 17, 2024 •

edited

Loading