admission,kvserver: introduce an elastic cpu limiter#86638
craig[bot] merged 5 commits into cockroachdb:master
Conversation
Force-pushed ddb5095 to 1994ca7
Force-pushed 4b6aede to 08aab94
@sumeerbhola @andrewbaptist this is ready for a look. There are some TODOs (marked with …)
Force-pushed 08aab94 to 213ec01
sumeerbhola
left a comment
Reviewed 1 of 1 files at r1, 27 of 27 files at r5, 23 of 29 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andrewbaptist, @irfansharif, and @sumeerbhola)
pkg/kv/kvserver/store.go line 3815 at r7 (raw file):
}
bypassAdmission := ba.IsAdmin()
There is also rearrangement of logic here. It looks correct to me, but this is also a part that has no existing tests (my fault).
Can you add a TODO to write a unit test for this -- there are enough cases that make me nervous. Fine to do it after the branch cut.
pkg/kv/kvserver/store.go line 3856 at r7 (raw file):
if admissionEnabled {
	defer func() {
		if retErr != nil {
does a return from this function that uses a different variable automatically populate retErr before the defer is run?
pkg/kv/kvserver/kvadmissionhandle/handle.go line 54 at r7 (raw file):
// ContextWithHandle returns a Context wrapping the supplied kvadmission handle.
func ContextWithHandle(ctx context.Context, h Handle) context.Context {
	return context.WithValue(ctx, handleKey{}, h)
this is a memory allocation, yes?
Is this only being used to plumb the ability to call OverElasticCPULimit? If yes, can we call this something like ContextWithElasticCPUWorkHandle and only do the allocation for those elastic cases. Those seem heavyweight enough for the allocation to not matter.
pkg/storage/mvcc.go line 5880 at r7 (raw file):
// prefer callers being able to use SSTs directly). Going over limit is
// accounted for in admission control by penalizing the subsequent
// request, so doing it slightly is find.
s/find/fine/
pkg/util/admission/elastic_cpu_granter.go line 182 at r7 (raw file):
tokens := e.requester.granted(noGrantChain)
if tokens == 0 {
	return // requester didn't accept, nothing to do
don't we give back the 1 token that was consumed?
pkg/util/admission/elastic_cpu_granter.go line 191 at r7 (raw file):
// TODO(irfansharif): Provide separate enums for different elastic CPU token
// sizes? (1ms, 10ms, 100ms). Write up something about picking the right value.
// Can this value be auto-estimated?
(note to self) haven't looked at these metrics yet
pkg/util/admission/elastic_cpu_utilization_adjuster.go line 21 at r7 (raw file):
)

var _ elasticCPUUtilizationAdjuster = &elasticCPUGranter{}
Is this interface being abstracted for testing?
Even if this is, it seems unnecessary to put this simple functionality in a separate file from elasticCPUGranter.
"Adjuster" suggests that this is the one making the decision based on load info, like kvSlotAdjuster. This is just a listener.
pkg/util/admission/elastic_cpu_utilization_adjuster.go line 51 at r7 (raw file):
float64(int64(runtime.GOMAXPROCS(0))*time.Second.Nanoseconds())
return e.getTargetUtilization() * float64(int64(totalElasticCPUTime)-availableElasticCPUTime.Nanoseconds()) / totalElasticCPUTime
I didn't understand what is going on in this function. We have some math that seems to be usedCPUTimeTokens/totalCPUTimeTokens. Are the numerator and denominator the tokens for 1s? Then the numerator is doing a subtraction: what invariant ensures that the result is not negative? Can the first value in a-b be small because we just recently reduced the target utilization, but b be large because we still have some burst tokens accumulated?
This could use code comments and a crisper statement of invariants.
pkg/util/admission/elastic_cpu_work_queue.go line 56 at r7 (raw file):
}
e.metrics.AcquiredNanos.Inc(duration.Nanoseconds())
e.metrics.Acquisitions.Inc(1)
there is a workQueue metric for admitted count already.
pkg/util/admission/elastic_cpu_work_queue.go line 136 at r7 (raw file):
	return true, grunning.Subtract(runningTime, h.allotted)
}
return false, grunning.Subtract(h.allotted, runningTime)
changing the subtraction order for the 2 cases is confusing. There should be a single meaning to what this return value means.
pkg/util/admission/grant_coordinator.go line 979 at r7 (raw file):
// SQL-level admission. All this informs why its structured as a separate grant
// coordinator.
type ElasticCPUGrantCoordinator struct {
I didn't expect to see a new GrantCoordinator implementation here. Even StoreGrantCoordinators is able to use a GrantCoordinator for KV write work with tokens. Using the normal implementation allows us to reuse a bunch of mediating code. In that KV write case we needed a way to break out of the abstraction at done time, which is why we "leaked" a granterWithStoreWriteDone to the StoreWorkQueue. In this elastic cpu case, I think the situation is even simpler, in that we can simply call granter.{returnGrant,tookWithoutPermission} at done time.
I do see why this works in the sense that you can intercept all the work in the granter implementation itself, and the granter implementation knows that it is being used in this context only (and not with a regular GrantCoordinator).
This is ok-ish for now. But please add a code comment stating how this implementation is breaking with the normal mode of things. The existing abstractions have some creaking aspects now, so they are worth revisiting.
pkg/util/admission/scheduler_latency_listener.go line 2 at r7 (raw file):
// Copyright 2022 The Cockroach Authors.
//
haven't read the code in this file
pkg/util/goschedstats/latency.go line 70 at r7 (raw file):
}
if len(newCBs)+1 != len(schedulerLatencyCallbackInfo.callbacks) {
	panic(errors.AssertionFailedf("unexpected unregister: new count %d, old count %d",
seems extreme to panic if someone tried to unregister with a non-existent id. Is this because unregister will not be called in production code? If so, add a code comment justifying this.
pkg/util/goschedstats/latency.go line 83 at r7 (raw file):
var schedulerLatencyCallbackInfo struct {
	mu syncutil.Mutex
	id int64
what is the purpose of this id?
pkg/util/goschedstats/latency.go line 91 at r7 (raw file):
type SchedulerLatencyStatsTicker struct {
	lastSample *metrics.Float64Histogram
}
I don't quite understand the split between the code in server.go and the code here.
We have a single set of callbacks registered with the var declared above. That means all those callbacks are happy with a single sampling frequency, which is also declared as a cluster setting here. Can we expose a StartStatsTicker(*cluster.Settings, *stop.Stopper) function and hide all the ticking details in this file?
pkg/util/goschedstats/latency.go line 131 at r7 (raw file):
}

func clone(h *metrics.Float64Histogram) *metrics.Float64Histogram {
all this histogram fiddling logic needs a unit test.
Ok if you want to take a TODO to do this after the branch cut.
pkg/util/goschedstats/latency.go line 143 at r7 (raw file):
func sub(a, b *metrics.Float64Histogram) *metrics.Float64Histogram {
	res := clone(a)
	for i := 0; i < len(res.Counts); i++ {
The bucket counts are guaranteed to not change in a running system?
Force-pushed 213ec01 to 5d87ea1
irfansharif
left a comment
Flushing out some comments, review changes, and tests. There are still more tests I need to write.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @irfansharif, and @sumeerbhola)
pkg/kv/kvserver/store.go line 3815 at r7 (raw file):
Previously, sumeerbhola wrote…
There is also rearrangement of logic here. It looks correct to me, but this is also a part that has no existing tests (my fault).
Can you add a TODO to write a unit test for this -- there are enough cases that make me nervous. Fine to do it after the branch cut.
Done.
pkg/kv/kvserver/store.go line 3856 at r7 (raw file):
Previously, sumeerbhola wrote…
does a return from this function that uses a different variable automatically populate retErr before the defer is run?
Yes, and it makes code structure slightly simpler.
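For reference, a minimal standalone sketch (function names here are illustrative, not the actual kvserver code) of the Go semantics in question: a plain `return err` assigns to the named return value retErr before any deferred function runs, so the deferred cleanup observes the final error.

```go
package main

import (
	"errors"
	"fmt"
)

// admit sketches the pattern: the deferred func reads the named return
// value retErr, which `return err` populates before the defer runs.
func admit(fail bool) (retErr error) {
	defer func() {
		if retErr != nil {
			fmt.Println("defer saw error:", retErr)
		}
	}()
	if fail {
		err := errors.New("admission rejected")
		return err // assigns to retErr first, then runs the defer
	}
	return nil
}

func main() {
	_ = admit(true)                      // prints "defer saw error: admission rejected"
	fmt.Println("ok:", admit(false) == nil) // prints "ok: true"
}
```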
pkg/kv/kvserver/kvadmissionhandle/handle.go line 54 at r7 (raw file):
Previously, sumeerbhola wrote…
this is a memory allocation, yes?
Is this only being used to plumb the ability to call OverElasticCPULimit? If yes, can we call this something like ContextWithElasticCPUWorkHandle and only do the allocation for those elastic cases. Those seem heavyweight enough for the allocation to not matter.
Good idea, done. The context helpers now felt better placed near where ElasticCPUWorkHandle is defined (pkg/util/admission), making this kvadmissionhandle package redundant (only needed it to take on a dependency from pkg/storage). TL;DR this package is now gone and the handle type moves back to kvserver.
pkg/storage/mvcc.go line 5880 at r7 (raw file):
Previously, sumeerbhola wrote…
s/find/fine/
Done.
pkg/util/admission/elastic_cpu_granter.go line 182 at r7 (raw file):
Previously, sumeerbhola wrote…
don't we give back the 1 token that was consumed?
Done. Was trying to avoid introducing a variant of returnGrant that didn't in turn call tryGrant, but that's a bad reason.
pkg/util/admission/elastic_cpu_utilization_adjuster.go line 21 at r7 (raw file):
Previously, sumeerbhola wrote…
Is this interface being abstracted for testing?
Even if this is, it seems unnecessary to put this simple functionality in a separate file from elasticCPUGranter.
"Adjuster" suggests that this is the one making the decision based on load info, like kvSlotAdjuster. This is just a listener.
Moved it to elastic_cpu_granter.go. And yes, for tests that I had yet to add.
pkg/util/admission/elastic_cpu_utilization_adjuster.go line 51 at r7 (raw file):
Previously, sumeerbhola wrote…
I didn't understand what is going on in this function. We have some math that seems to be usedCPUTimeTokens/totalCPUTimeTokens. Are the numerator and denominator the tokens for 1s? Then the numerator is doing a subtraction: what invariant ensures that the result is not negative? Can the first value in a-b be small because we just recently reduced the target utilization, but b be large because we still have some burst tokens accumulated?
This could use code comments and a crisper statement of invariants.
Since this feeds into the following TODO, I'm just going to replace it instead in my next pass (tomorrow) on this PR:
// TODO(irfansharif): This setting is flawed. It can make for a very slow
// rise (the observed utilization value is not smoothed) or be altogether
// unreactive when operating at low elastic CPU % limits. Consider if the
// target utilization is at 1% in an 8vCPU machine. The burst capacity of
// the token bucket = 1% * 8s = 80ms. If we're relying on observing 90% of
// that value being in use, i.e. 72ms to have been acquired by elastic work,
// this is simply not possible when the smallest unit of acquisition is
// larger, say 100ms.
//
// We really only want this for one reason: only increase CPU allotment if
// there are active users of this quota that possibly benefit from a larger
// allotment. So we could instead make this conditional on there being >= 1
// waiters X% of time over some recent time window.
elasticCPUGranterUtilizationFractionForAdditionalCPU = settings.RegisterFloatSetting(
settings.SystemOnly,
"elastic_cpu_granter.utilization_fraction_for_additional_cpu",
"sets the minimum utilization of the current limit needed before increasing elastic CPU %",
0.9, // 90%
)
pkg/util/admission/elastic_cpu_work_queue.go line 56 at r7 (raw file):
Previously, sumeerbhola wrote…
there is a workQueue metric for admitted count already.
Removed. This was detritus from before, when I wasn't using a work queue but still wanted to see admitted metrics.
pkg/util/admission/elastic_cpu_work_queue.go line 136 at r7 (raw file):
Previously, sumeerbhola wrote…
changing the subtraction order for the 2 cases is confusing. There should be a single meaning to what this return value means.
Simplified.
pkg/util/admission/grant_coordinator.go line 979 at r7 (raw file):
Previously, sumeerbhola wrote…
I didn't expect to see a new GrantCoordinator implementation here. Even StoreGrantCoordinators is able to use a GrantCoordinator for KV write work with tokens. Using the normal implementation allows us to reuse a bunch of mediating code. In that KV write case we needed a way to break out of the abstraction at done time, which is why we "leaked" a granterWithStoreWriteDone to the StoreWorkQueue. In this elastic cpu case, I think the situation is even simpler, in that we can simply call granter.{returnGrant,tookWithoutPermission} at done time.
I do see why this works in the sense that you can intercept all the work in the granter implementation itself, and the granter implementation knows that it is being used in this context only (and not with a regular GrantCoordinator).
This is ok-ish for now. But please add a code comment stating how this implementation is breaking with the normal mode of things. The existing abstractions have some creaking aspects now, so they are worth revisiting.
I first tried reusing the existing GrantCoordinator but the code got too confusing and littered with edge cases. Consider:
- grant chains (don't apply here), or
- the use of tryGrant() (we only want "grant forwarding" for a given work class and work kind, and not all APIs on the GrantCoordinator consider both).
Perhaps these are the creaking aspects you're talking about which made it feel easier to copy-paste this mediation code. I'm also wondering whether we need to reuse this one concrete type as opposed to a grantCoordinator interface. But I also don't clearly see how to pull out such an interface. Added a TODO.
pkg/util/goschedstats/latency.go line 70 at r7 (raw file):
Previously, sumeerbhola wrote…
seems extreme to panic if someone tried to unregister with a non-existent id. Is this because unregister will not be called in production code? If so, add a code comment justifying this.
Downgraded the panic. The unregister is called in production code, it's just that only tests would typically add additional registrations -- normally there'd be just the one.
pkg/util/goschedstats/latency.go line 83 at r7 (raw file):
Previously, sumeerbhola wrote…
what is the purpose of this id?
To make tests that register their own listeners (just added one) able to clean up after themselves. Imagine two tests that independently need to add such listeners and are run concurrently. I actually cargo-culted a lot of this from the RunnableCountCallback variant in this package BTW, so possibly it's too much for this listener since I'm doing no smoothing, fixed-point arithmetic, or dynamic changing of sampling periods. But I'll keep this as is unless you feel differently.
pkg/util/goschedstats/latency.go line 91 at r7 (raw file):
Previously, sumeerbhola wrote…
I don't quite understand the split between the code in server.go and the code here.
We have a single set of callbacks registered with the var declared above. That means all those callbacks are happy with a single sampling frequency, which is also declared as a cluster setting here. Can we expose a StartStatsTicker(*cluster.Settings, *stop.Stopper) function and hide all the ticking details in this file?
Added a StartStatsTicker here to house all the ticking code. (Aside: see new test that declares its own callback and may be run concurrently with server.go code doing the same.)
pkg/util/goschedstats/latency.go line 131 at r7 (raw file):
Previously, sumeerbhola wrote…
all this histogram fiddling logic needs a unit test.
Ok if you want to take a TODO to do this after the branch cut.
Added a couple of tests in this PR.
pkg/util/goschedstats/latency.go line 143 at r7 (raw file):
Previously, sumeerbhola wrote…
The bucket counts are guaranteed to not change in a running system?
Yes, these values are copied over to a new object when read: https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/runtime/metrics.go;l=328-334;bpv=0;bpt=0. There are aliasing caveats around the Buckets field (https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/runtime/metrics/histogram.go;l=29-32) which is why we clone it explicitly when subtracting two cumulative histograms to get interval statistics.
Force-pushed 5cd3349 to 1bcee40
- Added tests for SchedulerLatencyListener, showing graphically how the
elastic CPU controller behaves in response to various terms in the
control loop (delta, multiplicative factor, smoothing constant, etc)
-- see snippet below for an example. Added datadriven tests for
ElasticCPUWorkQueue and ElasticCPUGranter. Added tests for
quotapool.RateLimiter changes.
# With more lag (first half of the graph), we're more likely to
# observe a large difference between the set-point we need to hit
# and the utilization we currently have, making for larger
# scheduling latency fluctuations (i.e. an ineffective controller).
plot width=70 height=20
----
----
1069 ┤ ╭╮ ╭╮╭╮
1060 ┤ ││ ││││
1052 ┤ ││ ││││╭╮ ╭╮
1044 ┤ ││ ╭╮ ││││││ ││
1035 ┤ ╭╮││ ││ ╭╮ ╭╮ ││││││ ││
1027 ┤ │││╰╮ ││ ││ ││ ││││││ ╭╯│ ╭╮ ╭╮ ╭╮
1019 ┤ │││ │ ││ ╭╮││ ││╭╮ ││││││ │ │ ││ ││ ││
1010 ┤ │││ │ ││ │╰╯│ ││││ ││││││ │ │ ││ ╭──╮│╰─╮ ││ ╭╮╭─
1002 ┼────────────────────────────────────────────────────────────────────
993 ┤╰╮│││ │ │││││ ││││ │ │││││╰╮│ ╰╮╭╯││╰─╯╰╯╰╮╭╮ │ ╰─╯╰╯ ╰╯
985 ┤ ││╰╯ │ │╰╯╰╯ ││││ │ │││││ ││ ╰╯ ╰╯ ╰╯╰─╯
977 ┤ ││ │ │ ││││ │ │││││ ││
968 ┤ ││ │╭╯ ││││ ╰╮ │││││ ││
960 ┤ ││ ││ ││││ │ │││││ ╰╯
951 ┤ ││ ││ ││││ │ │││││
943 ┤ ││ ││ ││││ │╭╯││││
935 ┤ ││ ││ ││╰╯ ││ ╰╯││
926 ┤ ││ ││ ││ ││ ╰╯
918 ┤ ╰╯ ╰╯ ││ ╰╯
910 ┤ ││
901 ┤ ╰╯
p99 scheduler latencies (μs)
21.7 ┤ ╭╮
20.6 ┤ ╭───╮ ╭╯╰───╮
19.5 ┼─────────╮ ╭──╮ ╭╮ ╭────╮ ╭────╮╭╮╭╮╭─╯ ╰─╯ ╰╮╭
18.4 ┤ │ │ │╭─╯╰╮ ╭───╮╮ │╭─╮ ╰────│ ╰╯╰╯╰╯ ╰╯
17.3 ┤ ╰──╭╯╮╭╰────╮╭─╯╯ ││╭╭╯ │ ╭╯
16.2 ┤ │ ╰╯ ╰╯╭╯ ╰╮╭╯ │ │
15.2 ┤ ╭╯ ╰╯ ╰╯ │ ╭╯
14.1 ┤ │ │ │
13.0 ┤ │ │ │
11.9 ┤ ╭╯ │ ╭╯
10.8 ┤ │ │ │
9.7 ┤ │ │ ╭╯
8.7 ┤ ╭╯ │ │
7.6 ┤ │ │ │
6.5 ┤ ╭╯ │ ╭╯
5.4 ┤ │ │ │
4.3 ┤ │ │╭╯
3.2 ┤ ╭╯ ││
2.2 ┤ │ ││
1.1 ┤ │ ╰╯
0.0 ┼───────╯
elastic cpu utilization and limit (%)
- Moved the "scheduler-latency-ticker" loop into the (new)
schedulerlatency package, housing just the sampler and callbacks.
Added tests for code mucking with histogram state.
- Also decoupled the sample period (controlling the sampling
frequency) from the sample duration (how far in the past the latency
histograms apply over). This should reduce some p99 jaggedness we
saw with latency histograms collected over short durations while
still driving the control loop at a high frequency (necessary for
good utilization).
- Added micro-benchmarks to evaluate the overhead of checking against
ElasticCPUHandle.OverLimit in a tight loop within MVCCExportToSST.
Used the following:
$ dev bench pkg/storage \
--filter BenchmarkMVCCExportToSST/useElasticCPUHandle --count 10 \
--timeout 20m -v --stream-output --ignore-cache 2>&1 | tee bench.txt
$ for flavor in useElasticCPUHandle=true useElasticCPUHandle=false
do
grep -E "${flavor}[^0-9]+" bench.txt | sed -E "s/${flavor}+/X/" > $flavor.txt
done
# goos: linux
# goarch: amd64
# cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
$ benchstat useElasticCPUHandle\={false,true}.txt
name old time/op new time/op delta
MVCCExportToSST/X 2.03s ± 3% 2.14s ± 2% +5.47% (p=0.000 n=9+10)
Given the 5% hit, this commit also adds some simple estimation of
per-iteration running time within the ElasticCPUHandle (with
corresponding tests) to avoid calling grunning.Time() so frequently.
We're able to claw back the perf hit:
$ benchstat useElasticCPUHandle\={false,true}.txt
name old time/op new time/op delta
MVCCExportToSST/X 2.54s ± 2% 2.53s ± 2% ~ (p=0.549 n=10+9)
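A toy sketch of the estimation idea described above (all names and constants here are illustrative; the real handle samples grunning.Time()): re-read the expensive running-time clock only every few iterations, and extrapolate in between using the last measured per-iteration cost.

```go
package main

import "fmt"

const stride = 8 // how many cheap checks between expensive samples

// handle amortizes an expensive running-time read across iterations.
// The fake clock advances 50ns per sample to keep the sketch deterministic.
type handle struct {
	allotted        int64 // allotted CPU time, nanos
	now             int64 // fake running-time clock
	iters           int64
	lastSampled     int64
	perIterEstimate int64
}

// sampleRunningTime stands in for the expensive grunning.Time() call.
func (h *handle) sampleRunningTime() int64 {
	h.now += 50
	return h.now
}

// overLimit does an expensive sample only every `stride` calls; otherwise
// it extrapolates from the last sample using the per-iteration estimate.
func (h *handle) overLimit() bool {
	h.iters++
	if h.iters%stride != 0 {
		estimated := h.lastSampled + (h.iters%stride)*h.perIterEstimate
		return estimated > h.allotted
	}
	now := h.sampleRunningTime()
	h.perIterEstimate = (now - h.lastSampled) / stride
	h.lastSampled = now
	return now > h.allotted
}

func main() {
	h := &handle{allotted: 200}
	checks := 1
	for !h.overLimit() {
		checks++
	}
	// Far fewer expensive samples than OverLimit checks.
	fmt.Println("tripped after", checks, "checks, with", h.now/50, "expensive samples")
}
```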
Release note: None
- Rename some private cluster settings/metrics to be more internally
consistent; add validation for cluster setting ranges. Move some of
them out of the admission package.
- Substitute the RateLimiter used in the elasticCPUGranter by the
simpler TokenBucket instead.
- Forward grants to all waiting requests instead of bailing after the
first one.
- Improve the controller:
- Add commentary for elastic cpu controller; cribbing from the PR
description over at cockroachdb#86638.
- Skip the ewma smoothing of p99 latencies; now that latencies are
measured over large windows, the controller input is fairly stable,
obviating the need for this smoothing.
- When limit is unused, slowly decrease it to reduce likelihood of
(temporary) over admission.
- When adjusting CPU limit, don't look at observed utilization; look
only at current limit and whether there are requests waiting. There
were a few problems with the previous scheme:
- The way we measured utilization (used quota from quota pool) was
very jagged since the points at which we refill the quota pool are
only when we interact with it (i.e. there's no dedicated goroutine
on a timer). Experimentally, this made for a slower decrease than
necessary. When comparing a metric of in-use quota vs. used cpu
time, the former was a lot smoother.
- When the CPU limit was low, the controller had a very slow rise or
was altogether unreactive.
- Unreactive: Because we only raised the limit after observing
some fraction of available limit being used, consider what
happens if the utilization limit is at 1% in an 8vCPU machine.
The burst capacity of the token bucket = 1% * 8s = 80ms. If
we're relying on observing 90% of that value being in use, i.e.
72ms to have been acquired by elastic work, this is simply not
possible when the smallest unit of acquisition is larger, say 100ms.
- Slow-rise: Even if the limit is slightly higher, since the
controller uses a non-smoothed value, unless we sample right
when some work is being done, we're not going to react.
We really only introduced this min-util-fraction concept for one
reason: only increase CPU % if there are requests waiting on quota
that could possibly benefit from a larger %. This is something we can
check directly through the wait queue; we now do so. We previously
had a test demonstrating deficiencies of this min-util-fraction
scheme. The test was removed since it no longer applies, but
here's roughly what the change looks like:
- 25.0 ┼─╮ ╭─
- 23.8 ┤ ╰╮ ╭─╭─
- 22.5 ┤ ╰╮ ╭─╭─╯
- 21.2 ┤ │ ╭╯╭╯
- 20.0 ┤╭─╮╰╮ ╭─╭─╯
- 18.8 ┤│ ╰╮╰╮ ╭╯╭╯
- 17.5 ┤│ ╰╮╰╮ ╭─╭─╯
- 16.2 ┤│ ╰╮│ ╭╭─╯
- 15.0 ┤│ ╰╮╮ ╭─╭╯
- 13.8 ┤│ ╰╮╮ ╭─╭─╯
- 12.5 ┤│ ╰╮╮ ╭╭─╯
- 11.2 ┤│ ╰╮ ╭─╭╯
- 10.0 ┤│ ╰╮ ╭╭─╯
- 8.8 ┤│ ╰╮ ╭╭─╯
- 7.5 ┤│ │╮ ╭╭─╯
- 6.2 ┤│ ╰╮╮ ╭╭─╯
- 5.0 ┤│ ╰╮ ╭╭──╯
- 3.8 ┤│ ╰╮ ╭╭───╯
- 2.5 ┤│ ╰─╮ ╭╭────╯
- 1.2 ┤│ ╰╰──────────────╯
+ 25.2 ┼─╮
+ 23.9 ┤ ╰╮ ╭───────────
+ 22.6 ┤ ╰╮ ╭──────╭───────────
+ 21.4 ┤ │ ╭╯╭─────╯
+ 20.1 ┤╭─╮╰╮ ╭─╭─╯
+ 18.9 ┤│ ╰╮╰╮ ╭╯╭╯
+ 17.6 ┤│ ╰╮╰╮ ╭─╭─╯
+ 16.4 ┤│ ╰╮│ ╭╭─╯
+ 15.1 ┤│ ╰╮╮ ╭─╭╯
+ 13.8 ┤│ ╰╮╮ ╭╭─╯
+ 12.6 ┤│ ╰╮╮ ╭─╭╯
+ 11.3 ┤│ ╰╮╮ ╭╭─╯
+ 10.1 ┤│ ╰╮ ╭╭─╯
+ 8.8 ┤│ ╰╮ ╭─╭╯
+ 7.5 ┤│ ╰╮ ╭╭─╯
+ 6.3 ┤│ ╰╮ ╭╭─╯
+ 5.0 ┤│ ╰╮ ╭╭╯
+ 3.8 ┤│ ╰╮ ╭╭─╯
+ 2.5 ┤│ ╰─╮╮ ╭╭╯
+ 1.3 ┤│ ╰─────────╯
0.0 ┼╯
elastic cpu utilization and limit (%)
Force-pushed 1bcee40 to 815bf3b
irfansharif
left a comment
TFTR -- that was a lot to read through! Only force-pushing to rebase off of a pebble vendor bump.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @sumeerbhola)
pkg/util/admission/scheduler_latency_listener_test.go line 214 at r22 (raw file):
There's a bit of code duplication here that I didn't bother refactoring. This ticking happens under the "tick" directive where each tick is manually specified; the same loop exists in "auto" where only the tick count is specified.
Is it because the utilDelta, utilFrac etc. are also remembered and used in "auto"?
Indeed.
This can use some commentary at the top of this test.
Added.
And for the degenerate case that we are going to do many ticks with "auto", is there a reason we tick once in this tick command (in the testdata files), instead of specifying ticks=0?
No good reason, that too would work.
bors r+
Build succeeded
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
Error creating merge commit from e5152ab to blathers/backport-release-22.2-86638: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict. You may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.2.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Today when admission control admits a request, it is able to run indefinitely consuming arbitrary CPU. For long-running (~1s of CPU work per request) "elastic" (not latency sensitive) work like backups, this can have detrimental effects on foreground latencies – once such work is admitted, it can take up available CPU cores until completion, which prevents foreground work from running. The scheme below aims to change this behavior; there are two components in play:
Elastic work acquires CPU tokens representing some predetermined slice of CPU time, blocking until these tokens become available. We found that 100ms of tokens works well enough experimentally. A larger value, say 250ms, would translate to less preemption and fewer RPCs. What's important is that it isn't "too much", like 2s of CPU time, since that would let a single request hog a core potentially for 2s and allow for a large build-up of runnable goroutines (serving foreground traffic) on that core, affecting scheduling/foreground latencies.
The work preempts itself once the slice is used up (as a form of cooperative scheduling). Once preempted, the request returns to the caller with a resumption key. This scheme is effective in clamping down on scheduling latency that's due to an excessive amount of elastic work. We have proof from direct trace captures and instrumentation that reducing scheduling latencies directly translates to reduced foreground latencies. They're primarily felt when straddling goroutines, typically around RPC boundaries (request/response handling goroutines); the effects are multiplicative for statements that issue multiple requests.
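The preempt-and-resume idea above can be sketched as follows (a toy model with hypothetical names; the real handle tracks CPU time via admission control, and the resumption key is a storage key, not an index): work checks its allotment as it scans, and once over limit it returns where it stopped so the caller can re-admit and continue.

```go
package main

import "fmt"

// workHandle is a stand-in for the admission handle: it tracks how much of
// a fixed CPU-time allotment (abstract "units" here) has been consumed.
type workHandle struct{ used, allotted int }

func (h *workHandle) overLimit() bool { return h.used >= h.allotted }
func (h *workHandle) consume(n int)   { h.used += n }

// processKeys scans keys doing one unit of work each. Once the allotment is
// exhausted it preempts itself, returning a resumption index so the caller
// can acquire fresh tokens and continue from there.
func processKeys(h *workHandle, keys []string, resumeFrom int) (done bool, resumeIdx int) {
	for i := resumeFrom; i < len(keys); i++ {
		if h.overLimit() {
			return false, i // cooperative preemption: resume at keys[i]
		}
		h.consume(1) // pretend each key costs one unit of CPU time
	}
	return true, len(keys)
}

func main() {
	keys := []string{"a", "b", "c", "d", "e"}
	done, idx := processKeys(&workHandle{allotted: 3}, keys, 0)
	fmt.Println(done, idx) // prints "false 3": preempted partway through
	// The caller acquires a fresh allotment and resumes.
	done, idx = processKeys(&workHandle{allotted: 3}, keys, idx)
	fmt.Println(done, idx) // prints "true 5"
}
```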
The controller uses fixed deltas for adjustments, adjusting down a bit more aggressively than adjusting up. This is due to the nature of the work being paced — we care more about quickly introducing a ceiling rather than staying near it (though experimentally we’re able to stay near it just fine). It adjusts upwards only when seeing a reasonably high % of utilization with the allotted CPU quota (assuming it’s under the p99 target). The adjustments are small to reduce {over,under}shoot and controller instability at the cost of being somewhat dampened. We use a smoothed form of the p99 latency captures to add stability to the controller input, which consequently affects the controller output. We use a relatively low frequency when sampling scheduler latencies; since the p99 is computed off of histogram data, we saw a lot more jaggedness when taking p99s off of a smaller set of scheduler events (every 50ms for ex.) compared to computing p99s over a larger set of scheduler events (every 2500ms). This, with the small deltas used for adjustments, can make for a dampened response, but assuming a stable-ish foreground CPU load against a node, it works fine. The controller output is limited to a well-defined range that can be tuned through cluster settings.
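The controller's core step described above might be sketched like this (constants and the waiting-requests signal are illustrative assumptions; the real values are cluster settings and the real input is a smoothed p99 from scheduler histograms): fixed additive deltas, a more aggressive decrease than increase, an increase only when there is demand, and output clamped to a well-defined range.

```go
package main

import "fmt"

// Assumed constants; the real bounds and deltas are cluster settings.
const (
	minUtilPct = 0.01 // floor on elastic CPU %
	maxUtilPct = 0.25 // ceiling on elastic CPU %
	deltaUp    = 0.01 // additive increase
	deltaDown  = 0.05 // more aggressive additive decrease
)

// adjust is one controller tick: if observed p99 scheduling latency is over
// target, clamp down; otherwise raise the limit only if there are waiting
// requests that could actually use more quota. Output stays in range.
func adjust(cur, p99, target float64, hasWaiting bool) float64 {
	if p99 > target {
		cur -= deltaDown
	} else if hasWaiting {
		cur += deltaUp
	}
	if cur < minUtilPct {
		cur = minUtilPct
	}
	if cur > maxUtilPct {
		cur = maxUtilPct
	}
	return cur
}

func main() {
	fmt.Println(adjust(0.10, 1.5, 1.0, true))  // over target: decrease
	fmt.Println(adjust(0.10, 0.5, 1.0, true))  // under target, waiters: increase
	fmt.Println(adjust(0.10, 0.5, 1.0, false)) // under target, no waiters: hold
}
```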
Miscellaneous code details: To evaluate the overhead of checking against ElasticCPUHandle.OverLimit in a tight loop within MVCCExportToSST, we used the micro-benchmarks shown in the commit notes above. Underneath the hood the handle does simple estimation of per-iteration running time to avoid calling grunning.Time() frequently; not doing so caused a 5% slowdown in the same benchmark.
The tests for SchedulerLatencyListener show graphically how the elastic CPU controller behaves in response to various terms in the control loop (delta, multiplicative factor, smoothing constant, etc) -- see snippet below for an example.
[1]: Specifically the time between a goroutine being ready to run and when it's scheduled to do so by the Go scheduler.
Release note: None
Release justification: Non-production code