kvserver: replace circuit breaker cancel goroutine with per-Replica registry #75224

Merged: craig[bot] merged 1 commit into cockroachdb:master on Feb 4, 2022
Conversation
Force-pushed: 15b2d70 to 1c1f5d5; 2c7d1cd to 8c06818; b3602a6 to 412a65a
erikgrinaker (Contributor) approved these changes on Feb 2, 2022, leaving a comment:
Looks great, only minor comments. Performance-wise, I think this is about as good as one can expect for an initial implementation, although I am a bit surprised that it shows up that prominently in kv95.
Author (Member): No need to re-review just yet, I'll ping you.
kvserver: replace circuit breaker cancel goroutine with per-Replica registry

PR cockroachdb#71806 originally landed with a very simple solution for the
problem of joining the breaker's cancellation chain with that of the
incoming context: spawning a goroutine. This was never meant to be the
solution in prod (spawning goroutines is not cheap), and this PR switches
to a real solution (which may still have some room for improvement).

Replica circuit breakers are still off (cockroachdb#74705), so this change
should not change any behavior unless breakers are turned on. The provided
benchmarks show both the "end-to-end" `Replica.Send` perf and the
micro-benchmark `breaker.Register`/`Unregister` perf, and contrast them
with that of having the breakers disabled. This sets the stage for
evaluating and, if necessary, minimizing the overhead, which (along with
some additional end-to-end testing) then allows us to turn breakers on by
default (cockroachdb#74705).

`BenchmarkReplicaCircuitBreakerSendOverhead` measures `Replica.Send` of a
no-result single-point read. We see that we lose a few percent, though this
benchmark is pretty micro already.

`BenchmarkReplicaCircuitBreaker_Register` is even more micro, measuring the
overhead of making a cancel and calling `Register()` followed by
`Unregister()`. The "old" measurement is essentially just creating a
cancelable context (so the +XXXX% numbers don't mean a lot; we're doing
work for which there was previously no analog). Here we can clearly see
that sharding the mutex map can help, though note that this would only
become visible at the `Replica.Send` level at much higher concurrencies
than in this test (16).

I also ran the kv95 roachtest with what corresponds to the `mutexmap-1`
strategy above (i.e. a single mutex); [raw data here]. In this
configuration (relative to disabling breakers), `kv95/enc=false/nodes=1`
and `kv95/enc=false/nodes=1/cpu=32` sustained a -1.62% hit (in qps, median
over five runs). Once we've chosen a suitable shard count, I expect that to
drop a bit as well, and think we can turn on breakers in production, at
least from a performance point of view.

[raw data here]: https://gist.github.com/tbg/44b53a10d18490a6dabd57e1d1d3225e

old = disabled, new = enabled:

```
name                                                  old time/op    new time/op    delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16     1.63µs ± 4%    1.77µs ± 5%    +8.38%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16     1.66µs ± 4%    1.75µs ± 5%    +5.74%  (p=0.000 n=10+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16     1.63µs ± 6%    1.75µs ± 4%    +7.48%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16     1.58µs ± 7%    1.73µs ± 4%    +9.47%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16    1.57µs ± 4%    1.62µs ± 6%    +3.19%  (p=0.046 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16    1.56µs ± 5%    1.63µs ± 7%    +4.20%  (p=0.007 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16    1.53µs ± 5%    1.62µs ± 8%    +6.01%  (p=0.001 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16    1.55µs ± 7%    1.61µs ± 5%    +3.50%  (p=0.049 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16    1.57µs ± 7%    1.62µs ± 6%    +3.53%  (p=0.042 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16    1.53µs ± 7%    1.64µs ± 5%    +6.84%  (p=0.000 n=11+12)

name                                                  old alloc/op   new alloc/op   delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16     1.42kB ± 0%    1.44kB ± 0%    +1.25%  (p=0.000 n=7+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16     1.42kB ± 0%    1.44kB ± 0%    +1.19%  (p=0.000 n=9+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16     1.42kB ± 0%    1.43kB ± 0%    +1.14%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16     1.42kB ± 0%    1.43kB ± 0%    +1.24%  (p=0.000 n=9+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16    1.42kB ± 0%    1.43kB ± 0%    +1.08%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16    1.42kB ± 0%    1.43kB ± 0%    +1.14%  (p=0.000 n=10+10)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16    1.42kB ± 0%    1.43kB ± 0%    +1.12%  (p=0.000 n=10+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16    1.42kB ± 0%    1.43kB ± 0%    +1.09%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16    1.41kB ± 0%    1.43kB ± 0%    +1.13%  (p=0.000 n=9+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16    1.41kB ± 0%    1.43kB ± 0%    +1.23%  (p=0.000 n=9+12)

name                                                  old allocs/op  new allocs/op  delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16     11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16     11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16     11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16     11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16    11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16    11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16    11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16    11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16    11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16    11.0 ± 0%      12.0 ± 0%      +9.09%  (p=0.000 n=11+12)
```

```
name                                              old time/op    new time/op    delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16    30.8ns ±21%    787.6ns ± 2%   +2457.74%  (p=0.000 n=11+11)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16    31.6ns ± 3%    782.0ns ± 2%   +2376.38%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16    31.0ns ± 0%    778.8ns ± 2%   +2409.61%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16    31.0ns ± 0%    775.8ns ± 1%   +2403.11%  (p=0.000 n=10+10)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16   31.0ns ± 1%    288.8ns ± 2%   +833.06%   (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16   31.1ns ± 1%    324.6ns ± 4%   +945.05%   (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16   31.1ns ± 0%    193.3ns ± 2%   +522.26%   (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16   31.0ns ± 0%    286.1ns ± 1%   +822.80%   (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16   31.0ns ± 0%    194.4ns ± 2%   +527.12%   (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16   31.0ns ± 1%    121.3ns ± 2%   +291.01%   (p=0.000 n=10+12)

name                                              old alloc/op   new alloc/op   delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16    80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16    80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16    80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16    80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16   80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16   80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16   80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16   80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16   80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16   80.0B ± 0%     96.0B ± 0%     +20.00%    (p=0.000 n=11+12)

name                                              old allocs/op  new allocs/op  delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16    2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16    2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16    2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16    2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16   2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16   2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16   2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16   2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16   2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16   2.00 ± 0%      3.00 ± 0%      +50.00%    (p=0.000 n=11+12)
```

Created via:

```
$ ./dev bench ./pkg/kv/kvserver/ -f BenchmarkReplicaCircuitBreaker --count 10 | tee out.txt
$ for f in enabled=true enabled=false; do grep -F "${f}" out.txt | sed "s/${f}/X/" > "${f}"; done && benchstat 'enabled=false' 'enabled=true'
```

For the record, I also tried a `sync.Map`, but it was vastly inferior for
this use case as it is allocation-heavy and its writes are fairly slow.

Touches cockroachdb#33007.

Fixes cockroachdb#74707.

Release note: None
Author (Member): RFAL.
Author (Member): TFTR! bors r=erikgrinaker
Contributor: Build succeeded.