storage: replace CommandQueue with spanlatch.Manager #32865
craig[bot] merged 3 commits into cockroachdb:master
Conversation
ajwerner
left a comment
LGTM! excited to see the numbers
Reviewed 31 of 31 files at r2, 15 of 15 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 2400 at r3 (raw file):
	}
	if beforeLatch != (time.Time{}) {
ajwerner
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained
vilterp
left a comment
LGTM on command queue page removal.
pkg/storage/metrics.go
@@ -924,8 +873,8 @@ var (
	// Slow request metrics.
	metaSlowCommandQueueRequests = metric.Metadata{
nit: change this variable name
petermattis
left a comment
I appreciate you going the extra mile in removing all references to command queue.
Reviewable status:
complete! 2 of 0 LGTMs obtained
pkg/server/status/health_check.go, line 53 at r3 (raw file):
	"ranges.underreplicated":      gaugeZero,
	"requests.backpressure.split": gaugeZero,
	"requests.slow.latch":         gaugeZero,
That makes so much more sense.
pkg/storage/replica.go, line 2347 at r3 (raw file):
	// With clockless reads, everything is treated as a write.
	clockless := r.store.Clock().MaxOffset() == timeutil.ClocklessMaxOffset
I thought we removed support for clockless mode. @tbg?
pkg/storage/replica_test.go, line 3405 at r3 (raw file):
// TestReplicaCommandQueuePrereqDebugSummary tests the debug summary logged
// about a request's prerequisites when entering the command queue.
func TestReplicaCommandQueuePrereqDebugSummary(t *testing.T) {
I assume all of these tests are being moved elsewhere.
tbg
left a comment
Reviewable status:
complete! 3 of 0 LGTMs obtained
pkg/storage/replica.go, line 2347 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I thought we removed support for clockless mode. @tbg?
Might still be in the code, but you definitely don't have to add code that maintains it. Needs to be ripped out.
cockroachdb#31997 (review) Release note: None
This commit replaces the CommandQueue with the spanlatch.Manager, which was introduced in cockroachdb#31997. See that PR for an introduction to how the structure differs from the CommandQueue and how it improves performance on microbenchmarks.

This is mostly a mechanical change. One important detail is that it removes the CommandQueue debug page. We found that the page was buggy (or straight up broken) and it wasn't actively used by members of Core when debugging problems. In its place, the commit revives the "slow requests" metric for latching, which hasn't been hooked up in over a year.

### Benchmarks

#### Standard Benchmarks

These benchmarks are standard benchmarks that we commonly run. They were run with varying node sizes, cluster sizes, and pre-split counts.

```
name                              old ops/sec  new ops/sec  delta
kv0/cores=4/nodes=1/splits=0      1.99k ± 2%   2.06k ± 1%   +3.22%  (p=0.008 n=5+5)
kv0/cores=4/nodes=1/splits=100    2.25k ± 1%   2.38k ± 1%   +6.01%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0      1.60k ± 0%   1.69k ± 2%   +5.53%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100    3.52k ± 6%   3.65k ± 9%     ~     (p=0.421 n=5+5)
kv0/cores=16/nodes=1/splits=0     19.9k ± 1%   21.8k ± 1%   +9.34%  (p=0.008 n=5+5)
kv0/cores=16/nodes=1/splits=100   24.4k ± 1%   26.1k ± 1%   +7.17%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0     14.9k ± 1%   16.1k ± 1%   +8.03%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100   20.6k ± 1%   22.8k ± 1%  +10.79%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=0     31.2k ± 2%   35.3k ± 1%  +13.28%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100   45.7k ± 1%   51.1k ± 1%  +11.80%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0     23.7k ± 2%   27.1k ± 2%  +14.39%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=100   34.9k ± 2%   45.1k ± 1%  +29.44%  (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0     12.7k ± 2%   12.9k ± 2%   +1.39%  (p=0.151 n=5+5)
kv95/cores=4/nodes=1/splits=100   12.8k ± 2%   13.1k ± 2%   +2.10%  (p=0.032 n=5+5)
kv95/cores=4/nodes=3/splits=0     10.6k ± 1%   10.8k ± 1%   +1.58%  (p=0.056 n=5+5)
kv95/cores=4/nodes=3/splits=100   12.3k ± 7%   12.6k ± 8%   +2.61%  (p=0.095 n=5+5)
kv95/cores=16/nodes=1/splits=0    50.9k ± 1%   52.2k ± 1%   +2.37%  (p=0.008 n=5+5)
kv95/cores=16/nodes=1/splits=100  52.2k ± 1%   53.0k ± 1%   +1.49%  (p=0.008 n=5+5)
kv95/cores=16/nodes=3/splits=0    46.2k ± 1%   46.8k ± 1%   +1.32%  (p=0.032 n=5+5)
kv95/cores=16/nodes=3/splits=100  51.0k ± 1%   53.2k ± 1%   +4.25%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0    79.8k ± 2%  101.6k ± 1%  +27.31%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=100   104k ± 1%    107k ± 1%   +2.60%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0    85.8k ± 1%   91.8k ± 1%   +7.08%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=100   106k ± 1%    112k ± 1%   +5.51%  (p=0.008 n=5+5)

name                              old p50(ms)  new p50(ms)  delta
kv0/cores=4/nodes=1/splits=0      3.52 ± 5%    3.40 ± 0%    -3.41%  (p=0.016 n=5+4)
kv0/cores=4/nodes=1/splits=100    3.30 ± 0%    3.00 ± 0%    -9.09%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0      4.70 ± 0%    4.14 ± 9%   -11.91%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100    1.50 ± 0%    1.48 ± 8%      ~     (p=0.968 n=4+5)
kv0/cores=16/nodes=1/splits=0     1.40 ± 0%    1.40 ± 0%      ~     (all equal)
kv0/cores=16/nodes=1/splits=100   1.20 ± 0%    1.20 ± 0%      ~     (all equal)
kv0/cores=16/nodes=3/splits=0     2.00 ± 0%    1.90 ± 0%    -5.00%  (p=0.000 n=5+4)
kv0/cores=16/nodes=3/splits=100   1.40 ± 0%    1.40 ± 0%      ~     (all equal)
kv0/cores=36/nodes=1/splits=0     1.76 ± 3%    1.60 ± 0%    -9.09%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100   1.40 ± 0%    1.30 ± 0%    -7.14%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0     2.56 ± 2%    2.40 ± 0%    -6.25%  (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100   1.70 ± 0%    1.40 ± 0%   -17.65%  (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0     0.50 ± 0%    0.50 ± 0%      ~     (all equal)
kv95/cores=4/nodes=1/splits=100   0.50 ± 0%    0.50 ± 0%      ~     (all equal)
kv95/cores=4/nodes=3/splits=0     0.60 ± 0%    0.60 ± 0%      ~     (all equal)
kv95/cores=4/nodes=3/splits=100   0.60 ± 0%    0.60 ± 0%      ~     (all equal)
kv95/cores=16/nodes=1/splits=0    0.50 ± 0%    0.50 ± 0%      ~     (all equal)
kv95/cores=16/nodes=1/splits=100  0.50 ± 0%    0.50 ± 0%      ~     (all equal)
kv95/cores=16/nodes=3/splits=0    0.70 ± 0%    0.64 ± 9%    -8.57%  (p=0.167 n=5+5)
kv95/cores=16/nodes=3/splits=100  0.60 ± 0%    0.60 ± 0%      ~     (all equal)
kv95/cores=36/nodes=1/splits=0    0.50 ± 0%    0.50 ± 0%      ~     (all equal)
kv95/cores=36/nodes=1/splits=100  0.50 ± 0%    0.50 ± 0%      ~     (all equal)
kv95/cores=36/nodes=3/splits=0    0.66 ± 9%    0.60 ± 0%    -9.09%  (p=0.167 n=5+5)
kv95/cores=36/nodes=3/splits=100  0.60 ± 0%    0.60 ± 0%      ~     (all equal)

name                              old p99(ms)  new p99(ms)  delta
kv0/cores=4/nodes=1/splits=0      11.0 ± 0%    10.5 ± 0%    -4.55%  (p=0.000 n=5+4)
kv0/cores=4/nodes=1/splits=100    7.90 ± 0%    7.60 ± 0%    -3.80%  (p=0.000 n=5+4)
kv0/cores=4/nodes=3/splits=0      15.7 ± 0%    15.2 ± 0%    -3.18%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100    8.90 ± 0%    8.12 ± 3%    -8.76%  (p=0.016 n=4+5)
kv0/cores=16/nodes=1/splits=0     3.46 ± 2%    3.00 ± 0%   -13.29%  (p=0.000 n=5+4)
kv0/cores=16/nodes=1/splits=100   4.50 ± 0%    3.36 ± 2%   -25.33%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0     4.50 ± 0%    3.90 ± 0%   -13.33%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100   5.80 ± 0%    4.10 ± 0%   -29.31%  (p=0.029 n=4+4)
kv0/cores=36/nodes=1/splits=0     6.80 ± 0%    5.20 ± 0%   -23.53%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100   5.80 ± 0%    4.32 ± 4%   -25.52%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0     7.72 ± 2%    6.30 ± 0%   -18.39%  (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100   7.98 ± 2%    5.20 ± 0%   -34.84%  (p=0.000 n=5+4)
kv95/cores=4/nodes=1/splits=0     5.38 ± 3%    5.20 ± 0%    -3.35%  (p=0.167 n=5+5)
kv95/cores=4/nodes=1/splits=100   5.00 ± 0%    5.00 ± 0%      ~     (all equal)
kv95/cores=4/nodes=3/splits=0     5.68 ± 3%    5.50 ± 0%    -3.17%  (p=0.095 n=5+4)
kv95/cores=4/nodes=3/splits=100   3.60 ±31%    2.93 ± 3%   -18.75%  (p=0.016 n=5+4)
kv95/cores=16/nodes=1/splits=0    4.10 ± 0%    4.10 ± 0%      ~     (all equal)
kv95/cores=16/nodes=1/splits=100  4.50 ± 0%    4.10 ± 0%    -8.89%  (p=0.000 n=5+4)
kv95/cores=16/nodes=3/splits=0    2.60 ± 0%    2.60 ± 0%      ~     (all equal)
kv95/cores=16/nodes=3/splits=100  2.50 ± 0%    1.90 ± 5%   -24.00%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0    6.60 ± 0%    6.00 ± 0%    -9.09%  (p=0.029 n=4+4)
kv95/cores=36/nodes=1/splits=100  5.50 ± 0%    5.12 ± 2%    -6.91%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0    4.18 ± 2%    4.02 ± 3%    -3.71%  (p=0.000 n=4+5)
kv95/cores=36/nodes=3/splits=100  3.80 ± 0%    2.80 ± 0%   -26.32%  (p=0.008 n=5+5)
```

#### Large-machine Benchmarks

These benchmarks are standard benchmarks run on a single-node cluster with 72 vCPUs.

```
name                              old ops/sec  new ops/sec  delta
kv0/cores=72/nodes=1/splits=0     31.0k ± 4%   36.4k ± 1%  +17.57%  (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100   44.0k ± 0%   49.0k ± 1%  +11.41%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0    52.7k ±18%   72.6k ±26%  +37.70%  (p=0.016 n=5+5)
kv95/cores=72/nodes=1/splits=100  66.8k ±17%   68.5k ± 5%     ~     (p=0.286 n=5+4)

name                              old p50(ms)  new p50(ms)  delta
kv0/cores=72/nodes=1/splits=0     2.30 ±13%    2.52 ± 5%      ~     (p=0.214 n=5+5)
kv0/cores=72/nodes=1/splits=100   3.00 ± 0%    2.90 ± 0%    -3.33%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0    0.46 ±13%    0.50 ± 0%      ~     (p=0.444 n=5+5)
kv95/cores=72/nodes=1/splits=100  0.44 ±14%    0.50 ± 0%   +13.64%  (p=0.167 n=5+5)

name                              old p99(ms)  new p99(ms)  delta
kv0/cores=72/nodes=1/splits=0     18.9 ± 6%    13.3 ± 5%   -29.56%  (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100   13.4 ± 2%    11.0 ± 0%   -17.91%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0    34.4 ±34%    23.5 ±24%   -31.74%  (p=0.048 n=5+5)
kv95/cores=72/nodes=1/splits=100  21.0 ± 0%    19.1 ± 4%    -8.81%  (p=0.029 n=4+4)
```

#### Motivating Benchmarks

These are benchmarks that used to generate a lot of contention in the CommandQueue. They have small cycle-lengths, indicated by the `c` specifier. The last one also includes 20% scan operations, which increases contention between non-overlapping point operations.
```
name                                    old ops/sec  new ops/sec  delta
kv95-c5/cores=16/nodes=1/splits=0       45.1k ± 1%   47.2k ± 4%   +4.59%  (p=0.008 n=5+5)
kv95-c5/cores=36/nodes=1/splits=0       44.6k ± 1%   76.3k ± 1%  +71.05%  (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0     27.2k ± 2%   29.4k ± 1%   +8.12%  (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0     42.6k ± 2%   50.0k ± 1%  +17.39%  (p=0.008 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0  28.7k ± 1%   29.8k ± 3%   +3.87%  (p=0.008 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0  41.9k ± 4%   52.8k ± 2%  +25.97%  (p=0.008 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0       0.60 ± 0%    0.60 ± 0%      ~     (all equal)
kv95-c5/cores=36/nodes=1/splits=0       0.90 ± 0%    0.80 ± 0%   -11.11%  (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0     1.10 ± 0%    1.06 ± 6%      ~     (p=0.444 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0     1.26 ± 5%    1.30 ± 0%      ~     (p=0.444 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0  0.66 ± 9%    0.60 ± 0%    -9.09%  (p=0.167 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0  0.70 ± 0%    0.50 ± 0%   -28.57%  (p=0.008 n=5+5)

name                                    old p99(ms)  new p99(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0       2.40 ± 0%    2.10 ± 0%   -12.50%  (p=0.000 n=5+4)
kv95-c5/cores=36/nodes=1/splits=0       5.80 ± 0%    3.30 ± 0%   -43.10%  (p=0.000 n=5+4)
kv50-c128/cores=16/nodes=1/splits=0     3.50 ± 0%    3.00 ± 0%   -14.29%  (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0     6.80 ± 0%    4.70 ± 0%   -30.88%  (p=0.079 n=4+5)
kv70-20-c128/cores=16/nodes=1/splits=0  5.00 ± 0%    4.70 ± 0%    -6.00%  (p=0.029 n=4+4)
kv70-20-c128/cores=36/nodes=1/splits=0  11.0 ± 0%     6.8 ± 0%   -38.18%  (p=0.008 n=5+5)
```

#### Batching Benchmarks

One optimization left out of the new spanlatch.Manager was the "covering" optimization, where commands were initially added to the interval tree as a single spanning interval and only expanded later. I ran a series of benchmarks to verify that this optimization was not needed. My hypothesis was that the order-of-magnitude increase in the speed of the interval tree would make the optimization unnecessary.

It turns out that removing the optimization hurt a few benchmarks to a small degree but sped up others tremendously (some benchmarks improved by over 400%). I suspect that the covering optimization could actually hurt in cases where it causes non-overlapping requests to overlap. It is interesting how quickly a few of these benchmarks oscillate from small losses to big wins. It makes me think that there's some non-linear behavior with the old CommandQueue that would cause its performance to quickly degrade once it became a contention bottleneck.

```
name                                    old ops/sec  new ops/sec  delta
kv0-b16/cores=4/nodes=1/splits=0        2.41k ± 0%   2.06k ± 3%   -14.75%  (p=0.008 n=5+5)
kv0-b16/cores=4/nodes=1/splits=100        514 ± 0%     534 ± 1%    +3.88%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0       2.95k ± 0%   4.35k ± 0%   +47.74%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100     1.80k ± 1%   1.88k ± 1%    +4.46%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0       2.74k ± 0%   4.92k ± 1%   +79.55%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100     2.39k ± 1%   2.45k ± 1%    +2.41%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0         422 ± 0%     518 ± 1%   +22.60%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100      98.4 ± 1%    98.8 ± 1%      ~     (p=0.810 n=5+5)
kv0-b128/cores=16/nodes=1/splits=0        532 ± 0%    1059 ± 0%   +99.16%  (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100      291 ± 1%     307 ± 1%    +5.18%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0        483 ± 0%    1288 ± 1%  +166.37%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100      394 ± 1%     408 ± 1%    +3.51%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0       49.7 ± 1%    72.8 ± 1%   +46.52%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100     30.8 ± 0%    23.4 ± 0%   -24.03%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0      48.9 ± 2%   160.6 ± 0%  +228.38%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=100     101 ± 1%      80 ± 0%   -21.64%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0      37.5 ± 0%   208.1 ± 1%  +454.99%  (p=0.016 n=4+5)
kv0-b1024/cores=36/nodes=1/splits=100     162 ± 0%     124 ± 0%   -23.22%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0       5.93k ± 0%   6.20k ± 1%    +4.55%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100     2.27k ± 1%   2.32k ± 1%    +2.28%  (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=0      5.15k ± 1%  18.79k ± 1%  +264.73%  (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=100    8.31k ± 1%   8.57k ± 1%    +3.16%  (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0      3.96k ± 0%  10.67k ± 1%  +169.81%  (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=100    15.7k ± 2%   16.2k ± 4%    +2.75%  (p=0.151 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0      1.12k ± 1%   1.27k ± 0%   +13.28%  (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100      290 ± 1%     299 ± 1%    +3.02%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0     1.06k ± 0%   3.31k ± 0%  +213.09%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100     662 ±91%    1095 ± 1%   +65.42%  (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0       715 ± 2%    3586 ± 0%  +401.21%  (p=0.008 n=5+5)
kv95-b128/cores=36/nodes=1/splits=100   1.15k ±90%   2.01k ± 2%   +74.79%  (p=0.016 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0       134 ± 1%     170 ± 1%   +26.59%  (p=0.008 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=100    54.8 ± 3%    53.3 ± 3%    -2.84%  (p=0.056 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=0      104 ± 3%     367 ± 1%  +252.37%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100    210 ± 1%     214 ± 1%    +1.86%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0     76.5 ± 2%   383.9 ± 1%  +401.67%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100    431 ± 1%     436 ± 1%    +1.17%  (p=0.016 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0        3.00 ± 0%    3.40 ± 0%   +13.33%  (p=0.016 n=5+4)
kv0-b16/cores=4/nodes=1/splits=100      15.2 ± 0%    14.7 ± 0%    -3.29%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0       10.5 ± 0%     7.7 ± 2%   -26.48%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100     17.8 ± 0%    16.8 ± 0%    -5.62%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0       26.2 ± 0%    14.2 ± 0%   -45.80%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100     29.0 ± 2%    28.3 ± 0%    -2.28%  (p=0.095 n=5+4)
kv0-b128/cores=4/nodes=1/splits=0       17.8 ± 0%    15.2 ± 0%   -14.61%  (p=0.000 n=5+4)
kv0-b128/cores=4/nodes=1/splits=100     79.7 ± 0%    79.7 ± 0%      ~     (all equal)
kv0-b128/cores=16/nodes=1/splits=0      65.0 ± 0%    32.5 ± 0%   -50.00%  (p=0.029 n=4+4)
kv0-b128/cores=16/nodes=1/splits=100     109 ± 0%     105 ± 0%    -3.85%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0       168 ± 0%      50 ± 0%   -70.02%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100     184 ± 0%     176 ± 0%    -4.50%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0       159 ± 0%     109 ± 0%   -31.56%  (p=0.000 n=5+4)
kv0-b1024/cores=4/nodes=1/splits=100     252 ± 0%     319 ± 0%   +26.66%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0      705 ± 0%     193 ± 0%   -72.62%  (p=0.000 n=5+4)
kv0-b1024/cores=16/nodes=1/splits=100    319 ± 0%     386 ± 0%   +21.05%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0     1.88k ± 0%   0.24k ± 0%  -87.05%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100    436 ± 0%     570 ± 0%   +30.77%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0       1.20 ± 0%    1.20 ± 0%      ~     (all equal)
kv95-b16/cores=4/nodes=1/splits=100     2.60 ± 0%    2.60 ± 0%      ~     (all equal)
kv95-b16/cores=16/nodes=1/splits=0      6.30 ± 0%    1.40 ± 0%   -77.78%  (p=0.000 n=5+4)
kv95-b16/cores=16/nodes=1/splits=100    1.74 ± 3%    1.76 ± 3%      ~     (p=1.000 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0      11.5 ± 0%     5.5 ± 0%   -52.17%  (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100    2.42 ±20%    2.42 ±45%      ~     (p=0.579 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0      6.60 ± 0%    6.00 ± 0%    -9.09%  (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100    21.4 ± 3%    21.0 ± 0%      ~     (p=0.444 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0     30.4 ± 0%     9.4 ± 0%   -69.08%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100   38.2 ±76%    21.2 ± 4%   -44.31%  (p=0.063 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0     88.1 ± 0%    16.8 ± 0%   -80.93%  (p=0.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100   56.6 ±85%    29.6 ±15%      ~     (p=0.873 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0     52.4 ± 0%    44.0 ± 0%   -16.03%  (p=0.029 n=4+4)
kv95-b1024/cores=4/nodes=1/splits=100    132 ± 2%     143 ± 0%    +8.29%  (p=0.016 n=5+4)
kv95-b1024/cores=16/nodes=1/splits=0     325 ± 3%      80 ± 0%   -75.51%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100   151 ± 0%     151 ± 0%      ~     (all equal)
kv95-b1024/cores=36/nodes=1/splits=0     973 ± 0%     180 ± 3%   -81.55%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100   168 ± 0%     168 ± 0%      ~     (all equal)

name                                    old p99(ms)  new p99(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0        8.40 ± 0%   10.30 ± 3%   +22.62%  (p=0.016 n=4+5)
kv0-b16/cores=4/nodes=1/splits=100      29.4 ± 0%    27.3 ± 0%    -7.14%  (p=0.000 n=5+4)
kv0-b16/cores=16/nodes=1/splits=0       16.3 ± 0%    15.5 ± 2%    -4.91%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100     31.5 ± 0%    29.4 ± 0%    -6.67%  (p=0.000 n=5+4)
kv0-b16/cores=36/nodes=1/splits=0       37.7 ± 0%    28.7 ± 2%   -23.77%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100     62.1 ± 2%    68.4 ±10%   +10.15%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0       37.7 ± 0%    39.4 ± 6%    +4.46%  (p=0.167 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100      143 ± 0%     151 ± 0%    +5.89%  (p=0.016 n=4+5)
kv0-b128/cores=16/nodes=1/splits=0      79.7 ± 0%    55.8 ± 2%   -30.04%  (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100     198 ± 3%     188 ± 3%    -5.09%  (p=0.048 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0       184 ± 0%     126 ± 3%   -31.82%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100     319 ± 0%     336 ± 0%    +5.24%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0       322 ± 6%     253 ± 4%   -21.35%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100     470 ± 0%     772 ± 4%   +64.28%  (p=0.016 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=0     1.41k ± 0%   0.56k ±11%  -60.00%  (p=0.000 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=100    530 ± 2%     772 ± 0%   +45.57%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0     4.05k ± 7%   1.17k ± 3%  -71.19%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100    792 ±14%    1020 ± 2%   +28.81%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0       3.90 ± 0%    3.22 ± 4%   -17.44%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100     21.0 ± 0%    19.9 ± 0%    -5.24%  (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=0      15.2 ± 0%     7.1 ± 0%   -53.29%  (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=100    38.5 ± 3%    37.7 ± 0%      ~     (p=0.333 n=5+4)
kv95-b16/cores=36/nodes=1/splits=0       128 ± 2%      52 ± 0%   -59.16%  (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100    41.1 ±13%    39.2 ±33%      ~     (p=0.984 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0      17.8 ± 0%    14.7 ± 0%   -17.42%  (p=0.079 n=4+5)
kv95-b128/cores=4/nodes=1/splits=100     107 ± 2%     106 ± 5%      ~     (p=0.683 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0     75.5 ± 0%    23.1 ± 0%   -69.40%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100    107 ±34%     120 ± 2%      ~     (p=1.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0      253 ± 4%      71 ± 0%   -71.86%  (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100    166 ±19%     164 ±74%      ~     (p=0.310 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=0      146 ± 3%     101 ± 0%   -31.01%  (p=0.000 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=100    348 ± 4%     366 ± 6%      ~     (p=0.317 n=4+5)
kv95-b1024/cores=16/nodes=1/splits=0     624 ± 3%     221 ± 2%   -64.52%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100   325 ± 3%     319 ± 0%      ~     (p=0.444 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0    1.56k ± 5%   0.41k ± 2%  -73.71%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100   336 ± 0%     336 ± 0%      ~     (all equal)
```

Release note (performance improvement): Replace Replica latching mechanism with new optimized data structure that improves throughput, especially under heavy contention.
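The suspicion in the Batching Benchmarks notes, that a covering span can make non-overlapping requests overlap, can be illustrated with a small Go sketch. This is only an illustration under stated assumptions: the `span` type and the `overlaps`, `cover`, and `anyOverlap` helpers are hypothetical names made up for this example and are not the spanlatch.Manager or CommandQueue API.

```go
package main

import "fmt"

// span is a half-open key interval [start, end). Hypothetical type for
// illustration only; it is not the span type used by spanlatch.Manager.
type span struct{ start, end string }

// overlaps reports whether two half-open intervals intersect.
func overlaps(a, b span) bool {
	return a.start < b.end && b.start < a.end
}

// cover returns the smallest single span that contains every span in the
// batch, which is what a "covering" entry would occupy in an interval tree.
func cover(batch []span) span {
	c := batch[0]
	for _, s := range batch[1:] {
		if s.start < c.start {
			c.start = s.start
		}
		if s.end > c.end {
			c.end = s.end
		}
	}
	return c
}

// anyOverlap reports whether any span in a intersects any span in b.
func anyOverlap(a, b []span) bool {
	for _, x := range a {
		for _, y := range b {
			if overlaps(x, y) {
				return true
			}
		}
	}
	return false
}

func main() {
	// Batch A touches keys near both ends of the keyspace; batch B touches
	// a key in the middle. None of their individual spans intersect.
	batchA := []span{{"a", "b"}, {"x", "y"}}
	batchB := []span{{"m", "n"}}

	fmt.Println("exact spans conflict:   ", anyOverlap(batchA, batchB))
	fmt.Println("covering spans conflict:", overlaps(cover(batchA), cover(batchB)))
}
```

With exact spans the two batches are independent, but batch A's covering span `[a, y)` swallows batch B's `[m, n)`, so the batches appear to conflict until the cover is expanded. Wide batches over a single range (the `splits=0` rows) are where such false conflicts would be most likely.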
After this commit there are no references to the CommandQueue in the code base. No code changes. Release note: None
8162662 to
3c76b15
Compare
nvb
left a comment
Thanks for all the reviews!
I appreciate you going the extra mile in removing all references to command queue.
Not really the extra mile, I just never want to hear that name again.
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 3 stale)
pkg/storage/metrics.go, line 875 at r3 (raw file):
Previously, vilterp (Pete Vilter) wrote…
nit: change this variable name
Good catch, done.
pkg/storage/replica.go, line 2347 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Might still be in the code, but you definitely don't have to add code that maintains it. Needs to be ripped out.
Done.
Is there a discussion on why we removed clockless mode? Was it related to leases and their expiration?
pkg/storage/replica.go, line 2400 at r3 (raw file):
Previously, ajwerner wrote…
Nice, done.
pkg/storage/replica_test.go, line 3405 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I assume all of these tests are being moved elsewhere.
All of the tests that we still need were either already moved to the storage/spanlatch package or are renamed in this commit. This is deleting a few tests around command cancellation while waiting in the CommandQueue because the equivalent action is significantly simpler now that we no longer build explicit dependency trees. b81216e added a few new tests in storage/spanlatch that test exactly that.
32865: storage: replace CommandQueue with spanlatch.Manager r=nvanbenschoten a=nvanbenschoten This commit replaces the CommandQueue with the spanlatch.Manager, which was introduced in #31997. See that PR for an introduction to how the structure differs from the CommandQueue and how it improves performance on microbenchmarks. This is mostly a mechanical change. One important detail is that it removes the CommandQueue debug change. We found that the page was buggy (or straight up broken) and it wasn't actively used by members of Core when debugging problems. In its place, the commit revives the "slow requests" metric for latching, which hasn't been hooked up in over a year. ### Benchmarks #### Standard Benchmarks These benchmarks are standard benchmarks that we commonly run. They were run with varying node sizes, cluster sizes, and pre-split counts. ``` name old ops/sec new ops/sec delta kv0/cores=4/nodes=1/splits=0 1.99k ± 2% 2.06k ± 1% +3.22% (p=0.008 n=5+5) kv0/cores=4/nodes=1/splits=100 2.25k ± 1% 2.38k ± 1% +6.01% (p=0.008 n=5+5) kv0/cores=4/nodes=3/splits=0 1.60k ± 0% 1.69k ± 2% +5.53% (p=0.008 n=5+5) kv0/cores=4/nodes=3/splits=100 3.52k ± 6% 3.65k ± 9% ~ (p=0.421 n=5+5) kv0/cores=16/nodes=1/splits=0 19.9k ± 1% 21.8k ± 1% +9.34% (p=0.008 n=5+5) kv0/cores=16/nodes=1/splits=100 24.4k ± 1% 26.1k ± 1% +7.17% (p=0.008 n=5+5) kv0/cores=16/nodes=3/splits=0 14.9k ± 1% 16.1k ± 1% +8.03% (p=0.008 n=5+5) kv0/cores=16/nodes=3/splits=100 20.6k ± 1% 22.8k ± 1% +10.79% (p=0.008 n=5+5) kv0/cores=36/nodes=1/splits=0 31.2k ± 2% 35.3k ± 1% +13.28% (p=0.008 n=5+5) kv0/cores=36/nodes=1/splits=100 45.7k ± 1% 51.1k ± 1% +11.80% (p=0.008 n=5+5) kv0/cores=36/nodes=3/splits=0 23.7k ± 2% 27.1k ± 2% +14.39% (p=0.008 n=5+5) kv0/cores=36/nodes=3/splits=100 34.9k ± 2% 45.1k ± 1% +29.44% (p=0.008 n=5+5) kv95/cores=4/nodes=1/splits=0 12.7k ± 2% 12.9k ± 2% +1.39% (p=0.151 n=5+5) kv95/cores=4/nodes=1/splits=100 12.8k ± 2% 13.1k ± 2% +2.10% (p=0.032 n=5+5) kv95/cores=4/nodes=3/splits=0 10.6k ± 1% 
10.8k ± 1% +1.58% (p=0.056 n=5+5) kv95/cores=4/nodes=3/splits=100 12.3k ± 7% 12.6k ± 8% +2.61% (p=0.095 n=5+5) kv95/cores=16/nodes=1/splits=0 50.9k ± 1% 52.2k ± 1% +2.37% (p=0.008 n=5+5) kv95/cores=16/nodes=1/splits=100 52.2k ± 1% 53.0k ± 1% +1.49% (p=0.008 n=5+5) kv95/cores=16/nodes=3/splits=0 46.2k ± 1% 46.8k ± 1% +1.32% (p=0.032 n=5+5) kv95/cores=16/nodes=3/splits=100 51.0k ± 1% 53.2k ± 1% +4.25% (p=0.008 n=5+5) kv95/cores=36/nodes=1/splits=0 79.8k ± 2% 101.6k ± 1% +27.31% (p=0.008 n=5+5) kv95/cores=36/nodes=1/splits=100 104k ± 1% 107k ± 1% +2.60% (p=0.008 n=5+5) kv95/cores=36/nodes=3/splits=0 85.8k ± 1% 91.8k ± 1% +7.08% (p=0.008 n=5+5) kv95/cores=36/nodes=3/splits=100 106k ± 1% 112k ± 1% +5.51% (p=0.008 n=5+5) name old p50(ms) new p50(ms) delta kv0/cores=4/nodes=1/splits=0 3.52 ± 5% 3.40 ± 0% -3.41% (p=0.016 n=5+4) kv0/cores=4/nodes=1/splits=100 3.30 ± 0% 3.00 ± 0% -9.09% (p=0.008 n=5+5) kv0/cores=4/nodes=3/splits=0 4.70 ± 0% 4.14 ± 9% -11.91% (p=0.008 n=5+5) kv0/cores=4/nodes=3/splits=100 1.50 ± 0% 1.48 ± 8% ~ (p=0.968 n=4+5) kv0/cores=16/nodes=1/splits=0 1.40 ± 0% 1.40 ± 0% ~ (all equal) kv0/cores=16/nodes=1/splits=100 1.20 ± 0% 1.20 ± 0% ~ (all equal) kv0/cores=16/nodes=3/splits=0 2.00 ± 0% 1.90 ± 0% -5.00% (p=0.000 n=5+4) kv0/cores=16/nodes=3/splits=100 1.40 ± 0% 1.40 ± 0% ~ (all equal) kv0/cores=36/nodes=1/splits=0 1.76 ± 3% 1.60 ± 0% -9.09% (p=0.008 n=5+5) kv0/cores=36/nodes=1/splits=100 1.40 ± 0% 1.30 ± 0% -7.14% (p=0.008 n=5+5) kv0/cores=36/nodes=3/splits=0 2.56 ± 2% 2.40 ± 0% -6.25% (p=0.000 n=5+4) kv0/cores=36/nodes=3/splits=100 1.70 ± 0% 1.40 ± 0% -17.65% (p=0.008 n=5+5) kv95/cores=4/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal) kv95/cores=4/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all equal) kv95/cores=4/nodes=3/splits=0 0.60 ± 0% 0.60 ± 0% ~ (all equal) kv95/cores=4/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal) kv95/cores=16/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal) kv95/cores=16/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all 
equal) kv95/cores=16/nodes=3/splits=0 0.70 ± 0% 0.64 ± 9% -8.57% (p=0.167 n=5+5) kv95/cores=16/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal) kv95/cores=36/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal) kv95/cores=36/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all equal) kv95/cores=36/nodes=3/splits=0 0.66 ± 9% 0.60 ± 0% -9.09% (p=0.167 n=5+5) kv95/cores=36/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal) name old p99(ms) new p99(ms) delta kv0/cores=4/nodes=1/splits=0 11.0 ± 0% 10.5 ± 0% -4.55% (p=0.000 n=5+4) kv0/cores=4/nodes=1/splits=100 7.90 ± 0% 7.60 ± 0% -3.80% (p=0.000 n=5+4) kv0/cores=4/nodes=3/splits=0 15.7 ± 0% 15.2 ± 0% -3.18% (p=0.008 n=5+5) kv0/cores=4/nodes=3/splits=100 8.90 ± 0% 8.12 ± 3% -8.76% (p=0.016 n=4+5) kv0/cores=16/nodes=1/splits=0 3.46 ± 2% 3.00 ± 0% -13.29% (p=0.000 n=5+4) kv0/cores=16/nodes=1/splits=100 4.50 ± 0% 3.36 ± 2% -25.33% (p=0.008 n=5+5) kv0/cores=16/nodes=3/splits=0 4.50 ± 0% 3.90 ± 0% -13.33% (p=0.008 n=5+5) kv0/cores=16/nodes=3/splits=100 5.80 ± 0% 4.10 ± 0% -29.31% (p=0.029 n=4+4) kv0/cores=36/nodes=1/splits=0 6.80 ± 0% 5.20 ± 0% -23.53% (p=0.008 n=5+5) kv0/cores=36/nodes=1/splits=100 5.80 ± 0% 4.32 ± 4% -25.52% (p=0.008 n=5+5) kv0/cores=36/nodes=3/splits=0 7.72 ± 2% 6.30 ± 0% -18.39% (p=0.000 n=5+4) kv0/cores=36/nodes=3/splits=100 7.98 ± 2% 5.20 ± 0% -34.84% (p=0.000 n=5+4) kv95/cores=4/nodes=1/splits=0 5.38 ± 3% 5.20 ± 0% -3.35% (p=0.167 n=5+5) kv95/cores=4/nodes=1/splits=100 5.00 ± 0% 5.00 ± 0% ~ (all equal) kv95/cores=4/nodes=3/splits=0 5.68 ± 3% 5.50 ± 0% -3.17% (p=0.095 n=5+4) kv95/cores=4/nodes=3/splits=100 3.60 ±31% 2.93 ± 3% -18.75% (p=0.016 n=5+4) kv95/cores=16/nodes=1/splits=0 4.10 ± 0% 4.10 ± 0% ~ (all equal) kv95/cores=16/nodes=1/splits=100 4.50 ± 0% 4.10 ± 0% -8.89% (p=0.000 n=5+4) kv95/cores=16/nodes=3/splits=0 2.60 ± 0% 2.60 ± 0% ~ (all equal) kv95/cores=16/nodes=3/splits=100 2.50 ± 0% 1.90 ± 5% -24.00% (p=0.008 n=5+5) kv95/cores=36/nodes=1/splits=0 6.60 ± 0% 6.00 ± 0% -9.09% (p=0.029 n=4+4) 
kv95/cores=36/nodes=1/splits=100 5.50 ± 0% 5.12 ± 2% -6.91% (p=0.008 n=5+5) kv95/cores=36/nodes=3/splits=0 4.18 ± 2% 4.02 ± 3% -3.71% (p=0.000 n=4+5) kv95/cores=36/nodes=3/splits=100 3.80 ± 0% 2.80 ± 0% -26.32% (p=0.008 n=5+5) ``` #### Large-machine Benchmarks These benchmarks are standard benchmarks run on a single-node cluster with 72 vCPUs. ``` name old ops/sec new ops/sec delta kv0/cores=72/nodes=1/splits=0 31.0k ± 4% 36.4k ± 1% +17.57% (p=0.008 n=5+5) kv0/cores=72/nodes=1/splits=100 44.0k ± 0% 49.0k ± 1% +11.41% (p=0.008 n=5+5) kv95/cores=72/nodes=1/splits=0 52.7k ±18% 72.6k ±26% +37.70% (p=0.016 n=5+5) kv95/cores=72/nodes=1/splits=100 66.8k ±17% 68.5k ± 5% ~ (p=0.286 n=5+4) name old p50(ms) new p50(ms) delta kv0/cores=72/nodes=1/splits=0 2.30 ±13% 2.52 ± 5% ~ (p=0.214 n=5+5) kv0/cores=72/nodes=1/splits=100 3.00 ± 0% 2.90 ± 0% -3.33% (p=0.008 n=5+5) kv95/cores=72/nodes=1/splits=0 0.46 ±13% 0.50 ± 0% ~ (p=0.444 n=5+5) kv95/cores=72/nodes=1/splits=100 0.44 ±14% 0.50 ± 0% +13.64% (p=0.167 n=5+5) name old p99(ms) new p99(ms) delta kv0/cores=72/nodes=1/splits=0 18.9 ± 6% 13.3 ± 5% -29.56% (p=0.008 n=5+5) kv0/cores=72/nodes=1/splits=100 13.4 ± 2% 11.0 ± 0% -17.91% (p=0.008 n=5+5) kv95/cores=72/nodes=1/splits=0 34.4 ±34% 23.5 ±24% -31.74% (p=0.048 n=5+5) kv95/cores=72/nodes=1/splits=100 21.0 ± 0% 19.1 ± 4% -8.81% (p=0.029 n=4+4) ``` #### Motivating Benchmarks These are benchmarks that used to generate a lot of contention in the CommandQueue. They have small cycle-lengths, indicated by the `c` specifier. The last one also includes 20% scan operations, which increases contention between non-overlapping point operations. 
``` name old ops/sec new ops/sec delta kv95-c5/cores=16/nodes=1/splits=0 45.1k ± 1% 47.2k ± 4% +4.59% (p=0.008 n=5+5) kv95-c5/cores=36/nodes=1/splits=0 44.6k ± 1% 76.3k ± 1% +71.05% (p=0.008 n=5+5) kv50-c128/cores=16/nodes=1/splits=0 27.2k ± 2% 29.4k ± 1% +8.12% (p=0.008 n=5+5) kv50-c128/cores=36/nodes=1/splits=0 42.6k ± 2% 50.0k ± 1% +17.39% (p=0.008 n=5+5) kv70-20-c128/cores=16/nodes=1/splits=0 28.7k ± 1% 29.8k ± 3% +3.87% (p=0.008 n=5+5) kv70-20-c128/cores=36/nodes=1/splits=0 41.9k ± 4% 52.8k ± 2% +25.97% (p=0.008 n=5+5) name old p50(ms) new p50(ms) delta kv95-c5/cores=16/nodes=1/splits=0 0.60 ± 0% 0.60 ± 0% ~ (all equal) kv95-c5/cores=36/nodes=1/splits=0 0.90 ± 0% 0.80 ± 0% -11.11% (p=0.008 n=5+5) kv50-c128/cores=16/nodes=1/splits=0 1.10 ± 0% 1.06 ± 6% ~ (p=0.444 n=5+5) kv50-c128/cores=36/nodes=1/splits=0 1.26 ± 5% 1.30 ± 0% ~ (p=0.444 n=5+5) kv70-20-c128/cores=16/nodes=1/splits=0 0.66 ± 9% 0.60 ± 0% -9.09% (p=0.167 n=5+5) kv70-20-c128/cores=36/nodes=1/splits=0 0.70 ± 0% 0.50 ± 0% -28.57% (p=0.008 n=5+5) name old p99(ms) new p99(ms) delta kv95-c5/cores=16/nodes=1/splits=0 2.40 ± 0% 2.10 ± 0% -12.50% (p=0.000 n=5+4) kv95-c5/cores=36/nodes=1/splits=0 5.80 ± 0% 3.30 ± 0% -43.10% (p=0.000 n=5+4) kv50-c128/cores=16/nodes=1/splits=0 3.50 ± 0% 3.00 ± 0% -14.29% (p=0.008 n=5+5) kv50-c128/cores=36/nodes=1/splits=0 6.80 ± 0% 4.70 ± 0% -30.88% (p=0.079 n=4+5) kv70-20-c128/cores=16/nodes=1/splits=0 5.00 ± 0% 4.70 ± 0% -6.00% (p=0.029 n=4+4) kv70-20-c128/cores=36/nodes=1/splits=0 11.0 ± 0% 6.8 ± 0% -38.18% (p=0.008 n=5+5) ``` #### Batching Benchmarks One optimization left out of the new spanlatch.Manager was the "covering" optimization, where commands were initially added to the interval tree as a single spanning interval and only expanded later. I ran a series of benchmarks to verify that this optimization was not needed. My hypothesis was that the order of magnitude increase the speed of the interval tree would make the optimization unnecessary. 
It turns out that removing the optimization hurt a few benchmarks to a small degree but sped up others tremendously (some benchmarks improved by over 400%). I suspect that the covering optimization could actually hurt in cases where it causes non-overlapping requests to overlap. It is interesting how quickly a few of these benchmarks oscillate from small losses to big wins. It makes me think that there's some non-linear behavior with the old CommandQueue that would cause its performance to quickly degrade once it became a contention bottleneck.

```
name                                    old ops/sec  new ops/sec  delta
kv0-b16/cores=4/nodes=1/splits=0        2.41k ± 0%   2.06k ± 3%   -14.75%   (p=0.008 n=5+5)
kv0-b16/cores=4/nodes=1/splits=100      514 ± 0%     534 ± 1%     +3.88%    (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0       2.95k ± 0%   4.35k ± 0%   +47.74%   (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100     1.80k ± 1%   1.88k ± 1%   +4.46%    (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0       2.74k ± 0%   4.92k ± 1%   +79.55%   (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100     2.39k ± 1%   2.45k ± 1%   +2.41%    (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0       422 ± 0%     518 ± 1%     +22.60%   (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100     98.4 ± 1%    98.8 ± 1%    ~         (p=0.810 n=5+5)
kv0-b128/cores=16/nodes=1/splits=0      532 ± 0%     1059 ± 0%    +99.16%   (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100    291 ± 1%     307 ± 1%     +5.18%    (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0      483 ± 0%     1288 ± 1%    +166.37%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100    394 ± 1%     408 ± 1%     +3.51%    (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0      49.7 ± 1%    72.8 ± 1%    +46.52%   (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100    30.8 ± 0%    23.4 ± 0%    -24.03%   (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0     48.9 ± 2%    160.6 ± 0%   +228.38%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=100   101 ± 1%     80 ± 0%      -21.64%   (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0     37.5 ± 0%    208.1 ± 1%   +454.99%  (p=0.016 n=4+5)
kv0-b1024/cores=36/nodes=1/splits=100   162 ± 0%     124 ± 0%     -23.22%   (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0       5.93k ± 0%   6.20k ± 1%   +4.55%    (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100     2.27k ± 1%   2.32k ± 1%   +2.28%    (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=0      5.15k ± 1%   18.79k ± 1%  +264.73%  (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=100    8.31k ± 1%   8.57k ± 1%   +3.16%    (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0      3.96k ± 0%   10.67k ± 1%  +169.81%  (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=100    15.7k ± 2%   16.2k ± 4%   +2.75%    (p=0.151 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0      1.12k ± 1%   1.27k ± 0%   +13.28%   (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100    290 ± 1%     299 ± 1%     +3.02%    (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0     1.06k ± 0%   3.31k ± 0%   +213.09%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100   662 ±91%     1095 ± 1%    +65.42%   (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0     715 ± 2%     3586 ± 0%    +401.21%  (p=0.008 n=5+5)
kv95-b128/cores=36/nodes=1/splits=100   1.15k ±90%   2.01k ± 2%   +74.79%   (p=0.016 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0     134 ± 1%     170 ± 1%     +26.59%   (p=0.008 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=100   54.8 ± 3%    53.3 ± 3%    -2.84%    (p=0.056 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=0    104 ± 3%     367 ± 1%     +252.37%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100  210 ± 1%     214 ± 1%     +1.86%    (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0    76.5 ± 2%    383.9 ± 1%   +401.67%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100  431 ± 1%     436 ± 1%     +1.17%    (p=0.016 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0        3.00 ± 0%    3.40 ± 0%    +13.33%   (p=0.016 n=5+4)
kv0-b16/cores=4/nodes=1/splits=100      15.2 ± 0%    14.7 ± 0%    -3.29%    (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0       10.5 ± 0%    7.7 ± 2%     -26.48%   (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100     17.8 ± 0%    16.8 ± 0%    -5.62%    (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0       26.2 ± 0%    14.2 ± 0%    -45.80%   (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100     29.0 ± 2%    28.3 ± 0%    -2.28%    (p=0.095 n=5+4)
kv0-b128/cores=4/nodes=1/splits=0       17.8 ± 0%    15.2 ± 0%    -14.61%   (p=0.000 n=5+4)
kv0-b128/cores=4/nodes=1/splits=100     79.7 ± 0%    79.7 ± 0%    ~         (all equal)
kv0-b128/cores=16/nodes=1/splits=0      65.0 ± 0%    32.5 ± 0%    -50.00%   (p=0.029 n=4+4)
kv0-b128/cores=16/nodes=1/splits=100    109 ± 0%     105 ± 0%     -3.85%    (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0      168 ± 0%     50 ± 0%      -70.02%   (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100    184 ± 0%     176 ± 0%     -4.50%    (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0      159 ± 0%     109 ± 0%     -31.56%   (p=0.000 n=5+4)
kv0-b1024/cores=4/nodes=1/splits=100    252 ± 0%     319 ± 0%     +26.66%   (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0     705 ± 0%     193 ± 0%     -72.62%   (p=0.000 n=5+4)
kv0-b1024/cores=16/nodes=1/splits=100   319 ± 0%     386 ± 0%     +21.05%   (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0     1.88k ± 0%   0.24k ± 0%   -87.05%   (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100   436 ± 0%     570 ± 0%     +30.77%   (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0       1.20 ± 0%    1.20 ± 0%    ~         (all equal)
kv95-b16/cores=4/nodes=1/splits=100     2.60 ± 0%    2.60 ± 0%    ~         (all equal)
kv95-b16/cores=16/nodes=1/splits=0      6.30 ± 0%    1.40 ± 0%    -77.78%   (p=0.000 n=5+4)
kv95-b16/cores=16/nodes=1/splits=100    1.74 ± 3%    1.76 ± 3%    ~         (p=1.000 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0      11.5 ± 0%    5.5 ± 0%     -52.17%   (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100    2.42 ±20%    2.42 ±45%    ~         (p=0.579 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0      6.60 ± 0%    6.00 ± 0%    -9.09%    (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100    21.4 ± 3%    21.0 ± 0%    ~         (p=0.444 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0     30.4 ± 0%    9.4 ± 0%     -69.08%   (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100   38.2 ±76%    21.2 ± 4%    -44.31%   (p=0.063 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0     88.1 ± 0%    16.8 ± 0%    -80.93%   (p=0.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100   56.6 ±85%    29.6 ±15%    ~         (p=0.873 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0     52.4 ± 0%    44.0 ± 0%    -16.03%   (p=0.029 n=4+4)
kv95-b1024/cores=4/nodes=1/splits=100   132 ± 2%     143 ± 0%     +8.29%    (p=0.016 n=5+4)
kv95-b1024/cores=16/nodes=1/splits=0    325 ± 3%     80 ± 0%      -75.51%   (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100  151 ± 0%     151 ± 0%     ~         (all equal)
kv95-b1024/cores=36/nodes=1/splits=0    973 ± 0%     180 ± 3%     -81.55%   (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100  168 ± 0%     168 ± 0%     ~         (all equal)

name                                    old p99(ms)  new p99(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0        8.40 ± 0%    10.30 ± 3%   +22.62%   (p=0.016 n=4+5)
kv0-b16/cores=4/nodes=1/splits=100      29.4 ± 0%    27.3 ± 0%    -7.14%    (p=0.000 n=5+4)
kv0-b16/cores=16/nodes=1/splits=0       16.3 ± 0%    15.5 ± 2%    -4.91%    (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100     31.5 ± 0%    29.4 ± 0%    -6.67%    (p=0.000 n=5+4)
kv0-b16/cores=36/nodes=1/splits=0       37.7 ± 0%    28.7 ± 2%    -23.77%   (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100     62.1 ± 2%    68.4 ±10%    +10.15%   (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0       37.7 ± 0%    39.4 ± 6%    +4.46%    (p=0.167 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100     143 ± 0%     151 ± 0%     +5.89%    (p=0.016 n=4+5)
kv0-b128/cores=16/nodes=1/splits=0      79.7 ± 0%    55.8 ± 2%    -30.04%   (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100    198 ± 3%     188 ± 3%     -5.09%    (p=0.048 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0      184 ± 0%     126 ± 3%     -31.82%   (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100    319 ± 0%     336 ± 0%     +5.24%    (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0      322 ± 6%     253 ± 4%     -21.35%   (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100    470 ± 0%     772 ± 4%     +64.28%   (p=0.016 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=0     1.41k ± 0%   0.56k ±11%   -60.00%   (p=0.000 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=100   530 ± 2%     772 ± 0%     +45.57%   (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0     4.05k ± 7%   1.17k ± 3%   -71.19%   (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100   792 ±14%     1020 ± 2%    +28.81%   (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0       3.90 ± 0%    3.22 ± 4%    -17.44%   (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100     21.0 ± 0%    19.9 ± 0%    -5.24%    (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=0      15.2 ± 0%    7.1 ± 0%     -53.29%   (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=100    38.5 ± 3%    37.7 ± 0%    ~         (p=0.333 n=5+4)
kv95-b16/cores=36/nodes=1/splits=0      128 ± 2%     52 ± 0%      -59.16%   (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100    41.1 ±13%    39.2 ±33%    ~         (p=0.984 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0      17.8 ± 0%    14.7 ± 0%    -17.42%   (p=0.079 n=4+5)
kv95-b128/cores=4/nodes=1/splits=100    107 ± 2%     106 ± 5%     ~         (p=0.683 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0     75.5 ± 0%    23.1 ± 0%    -69.40%   (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100   107 ±34%     120 ± 2%     ~         (p=1.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0     253 ± 4%     71 ± 0%      -71.86%   (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100   166 ±19%     164 ±74%     ~         (p=0.310 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=0     146 ± 3%     101 ± 0%     -31.01%   (p=0.000 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=100   348 ± 4%     366 ± 6%     ~         (p=0.317 n=4+5)
kv95-b1024/cores=16/nodes=1/splits=0    624 ± 3%     221 ± 2%     -64.52%   (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100  325 ± 3%     319 ± 0%     ~         (p=0.444 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0    1.56k ± 5%   0.41k ± 2%   -73.71%   (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100  336 ± 0%     336 ± 0%     ~         (all equal)
```

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded
This commit replaces the CommandQueue with the spanlatch.Manager, which was introduced in #31997. See that PR for an introduction to how the structure differs from the CommandQueue and how it improves performance on microbenchmarks.
This is mostly a mechanical change. One important detail is that it removes the CommandQueue debug page. We found that the page was buggy (or straight up broken) and it wasn't actively used by members of Core when debugging problems. In its place, the commit revives the "slow requests" metric for latching, which hasn't been hooked up in over a year.
Benchmarks
Standard Benchmarks
These benchmarks are standard benchmarks that we commonly run. They were run with varying node sizes, cluster sizes, and pre-split counts.
Large-machine Benchmarks
These benchmarks are standard benchmarks run on a single-node cluster with 72 vCPUs.
Motivating Benchmarks
These are benchmarks that used to generate a lot of contention in the CommandQueue. They have small cycle-lengths, indicated by the `c` specifier. The last one also includes 20% scan operations, which increases contention between non-overlapping point operations.
Batching Benchmarks
One optimization left out of the new spanlatch.Manager was the "covering" optimization, where commands were initially added to the interval tree as a single spanning interval and only expanded later. I ran a series of benchmarks to verify that this optimization was not needed. My hypothesis was that the order of magnitude increase in the speed of the interval tree would make the optimization unnecessary.
It turns out that removing the optimization hurt a few benchmarks to a small degree but sped up others tremendously (some benchmarks improved by over 400%). I suspect that the covering optimization could actually hurt in cases where it causes non-overlapping requests to overlap. It is interesting how quickly a few of these benchmarks oscillate from small losses to big wins. It makes me think that there's some non-linear behavior with the old CommandQueue that would cause its performance to quickly degrade once it became a contention bottleneck.
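To make the suspicion about false overlap concrete, here is a minimal sketch of how a covering span can make two batches appear to conflict even though none of their individual spans do. The `span` type and the `overlaps`, `cover`, and `anyOverlap` helpers are hypothetical names chosen for illustration; they are not the CommandQueue or spanlatch.Manager API, and the real structures track more state (timestamps, read/write access) than this does.

```go
package main

import "fmt"

// span is a half-open key interval [start, end).
// Hypothetical type for illustration only.
type span struct{ start, end string }

// overlaps reports whether two half-open spans share any key.
func overlaps(a, b span) bool {
	return a.start < b.end && b.start < a.end
}

// cover returns the smallest single span that contains every span
// in a non-empty batch. This models the "covering" idea: the whole
// batch is represented by one spanning interval.
func cover(spans []span) span {
	c := spans[0]
	for _, s := range spans[1:] {
		if s.start < c.start {
			c.start = s.start
		}
		if s.end > c.end {
			c.end = s.end
		}
	}
	return c
}

// anyOverlap reports whether any span in a overlaps any span in b.
func anyOverlap(a, b []span) bool {
	for _, x := range a {
		for _, y := range b {
			if overlaps(x, y) {
				return true
			}
		}
	}
	return false
}

func main() {
	// Batch A touches keys near both ends of the keyspace.
	batchA := []span{{"a", "b"}, {"y", "z"}}
	// Batch B touches a single span in the middle.
	batchB := []span{{"m", "n"}}

	// Compared span by span, the batches are independent.
	fmt.Println("precise overlap: ", anyOverlap(batchA, batchB))

	// Compared by covering span, A becomes [a, z) and swallows B.
	fmt.Println("covering overlap:", overlaps(cover(batchA), cover(batchB)))
}
```

Under these assumptions the precise check reports no conflict while the covering check does, which is the kind of spurious dependency that larger batches (the `b128` and `b1024` runs) would be more likely to produce.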