admission,kvserver: introduce an elastic cpu limiter#86638
craig[bot] merged 5 commits into cockroachdb:master
Conversation
Force-pushed ddb5095 to 1994ca7
Force-pushed 4b6aede to 08aab94
@sumeerbhola @andrewbaptist this is ready for a look. There are some TODOs (marked with …)
Force-pushed 08aab94 to 213ec01
sumeerbhola
left a comment
Reviewed 1 of 1 files at r1, 27 of 27 files at r5, 23 of 29 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andrewbaptist, @irfansharif, and @sumeerbhola)
pkg/kv/kvserver/store.go line 3815 at r7 (raw file):
}
bypassAdmission := ba.IsAdmin()
There is also rearrangement of logic here. It looks correct to me, but this is also a part that has no existing tests (my fault).
Can you add a TODO to write a unit test for this -- there are enough cases that make me nervous. Fine to do it after the branch cut.
pkg/kv/kvserver/store.go line 3856 at r7 (raw file):
if admissionEnabled {
	defer func() {
		if retErr != nil {
does a return from this function that uses a different variable automatically populate retErr before the defer is run?
pkg/kv/kvserver/kvadmissionhandle/handle.go line 54 at r7 (raw file):
// ContextWithHandle returns a Context wrapping the supplied kvadmission handle.
func ContextWithHandle(ctx context.Context, h Handle) context.Context {
	return context.WithValue(ctx, handleKey{}, h)
this is a memory allocation, yes?
Is this only being used to plumb the ability to call OverElasticCPULimit? If yes, can we call this something like ContextWithElasticCPUWorkHandle and only do the allocation for those elastic cases. Those seem heavyweight enough for the allocation to not matter.
pkg/storage/mvcc.go line 5880 at r7 (raw file):
// prefer callers being able to use SSTs directly). Going over limit is
// accounted for in admission control by penalizing the subsequent
// request, so doing it slightly is find.
s/find/fine/
pkg/util/admission/elastic_cpu_granter.go line 182 at r7 (raw file):
tokens := e.requester.granted(noGrantChain)
if tokens == 0 {
	return // requester didn't accept, nothing to do
don't we give back the 1 token that was consumed?
pkg/util/admission/elastic_cpu_granter.go line 191 at r7 (raw file):
// TODO(irfansharif): Provide separate enums for different elastic CPU token
// sizes? (1ms, 10ms, 100ms). Write up something about picking the right value.
// Can this value be auto-estimated?
(note to self) haven't looked at these metrics yet
pkg/util/admission/elastic_cpu_utilization_adjuster.go line 21 at r7 (raw file):
)

var _ elasticCPUUtilizationAdjuster = &elasticCPUGranter{}
Is this interface being abstracted for testing?
Even if this is, it seems unnecessary to put this simple functionality in a separate file from elasticCPUGranter.
"Adjuster" suggests that this is the one making the decision based on load info, like kvSlotAdjuster. This is just a listener.
pkg/util/admission/elastic_cpu_utilization_adjuster.go line 51 at r7 (raw file):
float64(int64(runtime.GOMAXPROCS(0))*time.Second.Nanoseconds())
return e.getTargetUtilization() * float64(int64(totalElasticCPUTime)-availableElasticCPUTime.Nanoseconds()) / totalElasticCPUTime
I didn't understand what is going on in this function. We have some math that seems to be usedCPUTimeTokens/totalCPUTimeTokens. Are the numerator and denominator the tokens for 1s? Then the numerator is doing a subtraction: what invariant ensures that the result is not negative? Can the first value in a-b be small because we just recently reduced the target utilization, but b be large because we still have some burst tokens accumulated?
This could use code comments and a crisper statement of invariants.
pkg/util/admission/elastic_cpu_work_queue.go line 56 at r7 (raw file):
}
e.metrics.AcquiredNanos.Inc(duration.Nanoseconds())
e.metrics.Acquisitions.Inc(1)
there is a workQueue metric for admitted count already.
pkg/util/admission/elastic_cpu_work_queue.go line 136 at r7 (raw file):
	return true, grunning.Subtract(runningTime, h.allotted)
}
return false, grunning.Subtract(h.allotted, runningTime)
changing the subtraction order for the 2 cases is confusing. There should be a single meaning to what this return value means.
pkg/util/admission/grant_coordinator.go line 979 at r7 (raw file):
// SQL-level admission. All this informs why its structured as a separate grant
// coordinator.
type ElasticCPUGrantCoordinator struct {
I didn't expect to see a new GrantCoordinator implementation here. Even StoreGrantCoordinators is able to use a GrantCoordinator for KV write work with tokens. Using the normal implementation allows us to reuse a bunch of mediating code. In that KV write case we needed a way to break out of the abstraction at done time, which is why we "leaked" a granterWithStoreWriteDone to the StoreWorkQueue. In this elastic cpu case, I think the situation is even simpler, in that we can simply call granter.{returnGrant,tookWithoutPermission} at done time.
I do see why this works in the sense that you can intercept all the work in the granter implementation itself, and the granter implementation knows that it is being used in this context only (and not with a regular GrantCoordinator).
This is ok-ish for now. But please add a code comment stating how this implementation is breaking with the normal mode of things. The existing abstractions have some creaking aspects now, so they are worth revisiting.
pkg/util/admission/scheduler_latency_listener.go line 2 at r7 (raw file):
// Copyright 2022 The Cockroach Authors.
//
haven't read the code in this file
pkg/util/goschedstats/latency.go line 70 at r7 (raw file):
}
if len(newCBs)+1 != len(schedulerLatencyCallbackInfo.callbacks) {
	panic(errors.AssertionFailedf("unexpected unregister: new count %d, old count %d",
seems extreme to panic if someone tried to unregister with a non-existent id. Is this because unregister will not be called in production code? If so, add a code comment justifying this.
pkg/util/goschedstats/latency.go line 83 at r7 (raw file):
var schedulerLatencyCallbackInfo struct {
	mu syncutil.Mutex
	id int64
what is the purpose of this id?
pkg/util/goschedstats/latency.go line 91 at r7 (raw file):
type SchedulerLatencyStatsTicker struct {
	lastSample *metrics.Float64Histogram
}
I don't quite understand the split between the code in server.go and the code here.
We have a single set of callbacks registered with the var declared above. That means all those callbacks are happy with a single sampling frequency, which is also declared as a cluster setting here. Can we expose a StartStatsTicker(*cluster.Settings, *stop.Stopper) function and hide all the ticking details in this file?
pkg/util/goschedstats/latency.go line 131 at r7 (raw file):
}

func clone(h *metrics.Float64Histogram) *metrics.Float64Histogram {
all this histogram fiddling logic needs a unit test.
Ok if you want to take a TODO to do this after the branch cut.
pkg/util/goschedstats/latency.go line 143 at r7 (raw file):
func sub(a, b *metrics.Float64Histogram) *metrics.Float64Histogram {
	res := clone(a)
	for i := 0; i < len(res.Counts); i++ {
The bucket counts are guaranteed to not change in a running system?
Force-pushed 213ec01 to 5d87ea1
irfansharif
left a comment
Flushing out some comments, review changes, and tests. There are still more tests I need to write.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @irfansharif, and @sumeerbhola)
pkg/kv/kvserver/store.go line 3815 at r7 (raw file):
Previously, sumeerbhola wrote…
There is also rearrangement of logic here. It looks correct to me, but this is also a part that has no existing tests (my fault).
Can you add a TODO to write a unit test for this -- there are enough cases that make me nervous. Fine to do it after the branch cut.
Done.
pkg/kv/kvserver/store.go line 3856 at r7 (raw file):
Previously, sumeerbhola wrote…
does a return from this function that uses a different variable automatically populate retErr before the defer is run?
Yes, and it makes code structure slightly simpler.
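For reference, a minimal standalone sketch (function names here are illustrative, not the actual kvserver code) of the Go semantics in question: a plain `return err` assigns to the named return value retErr before any deferred function runs, so the deferred cleanup observes the final error.

```go
package main

import (
	"errors"
	"fmt"
)

// admit sketches the pattern: the deferred func reads the named return
// value retErr, which `return err` populates before the defer runs.
func admit(fail bool) (retErr error) {
	defer func() {
		if retErr != nil {
			fmt.Println("defer saw error:", retErr)
		}
	}()
	if fail {
		err := errors.New("admission rejected")
		return err // assigns to retErr first, then runs the defer
	}
	return nil
}

func main() {
	_ = admit(true)                      // prints "defer saw error: admission rejected"
	fmt.Println("ok:", admit(false) == nil) // prints "ok: true"
}
```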
pkg/kv/kvserver/kvadmissionhandle/handle.go line 54 at r7 (raw file):
Previously, sumeerbhola wrote…
this is a memory allocation, yes?
Is this only being used to plumb the ability to call OverElasticCPULimit? If yes, can we call this something like ContextWithElasticCPUWorkHandle and only do the allocation for those elastic cases. Those seem heavyweight enough for the allocation to not matter.
Good idea, done. The context helpers now felt better placed near where ElasticCPUWorkHandle is defined (pkg/util/admission), making this kvadmissionhandle package redundant (only needed it to take on a dependency from pkg/storage). TL;DR this package is now gone and the handle type moves back to kvserver.
pkg/storage/mvcc.go line 5880 at r7 (raw file):
Previously, sumeerbhola wrote…
s/find/fine/
Done.
pkg/util/admission/elastic_cpu_granter.go line 182 at r7 (raw file):
Previously, sumeerbhola wrote…
don't we give back the 1 token that was consumed?
Done. Was trying to avoid introducing a variant of returnGrant that didn't in turn call tryGrant, but that's a bad reason.
pkg/util/admission/elastic_cpu_utilization_adjuster.go line 21 at r7 (raw file):
Previously, sumeerbhola wrote…
Is this interface being abstracted for testing?
Even if this is, it seems unnecessary to put this simple functionality in a separate file from elasticCPUGranter.
"Adjuster" suggests that this is the one making the decision based on load info, like kvSlotAdjuster. This is just a listener.
Moved it to elastic_cpu_granter.go. And yes, for tests that I had yet to add.
pkg/util/admission/elastic_cpu_utilization_adjuster.go line 51 at r7 (raw file):
Previously, sumeerbhola wrote…
I didn't understand what is going on in this function. We have some math that seems to be usedCPUTimeTokens/totalCPUTimeTokens. Are the numerator and denominator the tokens for 1s? Then the numerator is doing a subtraction: what invariant ensures that the result is not negative? Can the first value in a-b be small because we just recently reduced the target utilization, but b be large because we still have some burst tokens accumulated?
This could use code comments and a crisper statement of invariants.
Since this feeds into the following TODO, I'm just going to replace it instead in my next pass (tomorrow) on this PR:
// TODO(irfansharif): This setting is flawed. It can make for a very slow
// rise (the observed utilization value is not smoothed) or be altogether
// unreactive when operating at low elastic CPU % limits. Consider if the
// target utilization is at 1% in an 8vCPU machine. The burst capacity of
// the token bucket = 1% * 8s = 80ms. If we're relying on observing 90% of
// that value being in use, i.e. 72ms to have been acquired by elastic work,
// this is simply not possible when the smallest unit of acquisition is
// larger, say 100ms.
//
// We really only want this for one reason: only increase CPU allotment if
// there are active users of this quota that possibly benefit from a larger
// allotment. So we could instead make this conditional on there being >= 1
// waiters X% of time over some recent time window.
elasticCPUGranterUtilizationFractionForAdditionalCPU = settings.RegisterFloatSetting(
settings.SystemOnly,
"elastic_cpu_granter.utilization_fraction_for_additional_cpu",
"sets the minimum utilization of the current limit needed before increasing elastic CPU %",
0.9, // 90%
)
pkg/util/admission/elastic_cpu_work_queue.go line 56 at r7 (raw file):
Previously, sumeerbhola wrote…
there is a workQueue metric for admitted count already.
Removed. This was detritus from before, when I wasn't using a work queue but still wanted to see admitted metrics.
pkg/util/admission/elastic_cpu_work_queue.go line 136 at r7 (raw file):
Previously, sumeerbhola wrote…
changing the subtraction order for the 2 cases is confusing. There should be a single meaning to what this return value means.
Simplified.
pkg/util/admission/grant_coordinator.go line 979 at r7 (raw file):
Previously, sumeerbhola wrote…
I didn't expect to see a new GrantCoordinator implementation here. Even StoreGrantCoordinators is able to use a GrantCoordinator for KV write work with tokens. Using the normal implementation allows us to reuse a bunch of mediating code. In that KV write case we needed a way to break out of the abstraction at done time, which is why we "leaked" a granterWithStoreWriteDone to the StoreWorkQueue. In this elastic cpu case, I think the situation is even simpler, in that we can simply call granter.{returnGrant,tookWithoutPermission} at done time.
I do see why this works in the sense that you can intercept all the work in the granter implementation itself, and the granter implementation knows that it is being used in this context only (and not with a regular GrantCoordinator).
This is ok-ish for now. But please add a code comment stating how this implementation is breaking with the normal mode of things. The existing abstractions have some creaking aspects now, so they are worth revisiting.
I first tried reusing the existing GrantCoordinator but the code got too confusing and littered with edge cases. Consider:
- grant chains (don't apply here), or
- the use of tryGrant() (we only want "grant forwarding" for a given work class and work kind, and not all APIs on the GrantCoordinator consider both).
Perhaps these are the creaking aspects you're talking about which made it feel easier to copy-paste this mediation code. I'm also wondering whether we need to reuse this one concrete type as opposed to a grantCoordinator interface. But I also don't clearly see how to pull out such an interface. Added a TODO.
pkg/util/goschedstats/latency.go line 70 at r7 (raw file):
Previously, sumeerbhola wrote…
seems extreme to panic if someone tried to unregister with a non-existent id. Is this because unregister will not be called in production code? If so, add a code comment justifying this.
Downgraded the panic. The unregister is called in production code, it's just that only tests would typically add additional registrations -- normally there'd be just the one.
pkg/util/goschedstats/latency.go line 83 at r7 (raw file):
Previously, sumeerbhola wrote…
what is the purpose of this id?
To make tests that register their own listeners (just added one) able to clean up after themselves. Imagine two tests that independently need to add such listeners and are run concurrently. I actually cargo-culted a lot of this from the RunnableCountCallback variant in this package BTW, so possibly it's too much for this listener since I'm doing no smoothing, fixed-point arithmetic, or dynamic changing of sampling periods. But I'll keep this as is unless you feel differently.
pkg/util/goschedstats/latency.go line 91 at r7 (raw file):
Previously, sumeerbhola wrote…
I don't quite understand the split between the code in server.go and the code here.
We have a single set of callbacks registered with the var declared above. That means all those callbacks are happy with a single sampling frequency, which is also declared as a cluster setting here. Can we expose a StartStatsTicker(*cluster.Settings, *stop.Stopper) function and hide all the ticking details in this file?
Added a StartStatsTicker here to house all the ticking code. (Aside: see new test that declares its own callback and may be run concurrently with server.go code doing the same.)
pkg/util/goschedstats/latency.go line 131 at r7 (raw file):
Previously, sumeerbhola wrote…
all this histogram fiddling logic needs a unit test.
Ok if you want to take a TODO to do this after the branch cut.
Added a couple of tests in this PR.
pkg/util/goschedstats/latency.go line 143 at r7 (raw file):
Previously, sumeerbhola wrote…
The bucket counts are guaranteed to not change in a running system?
Yes, these values are copied over to a new object when read: https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/runtime/metrics.go;l=328-334;bpv=0;bpt=0. There are aliasing caveats around the Buckets field (https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/runtime/metrics/histogram.go;l=29-32) which is why we clone it explicitly when subtracting two cumulative histograms to get interval statistics.
Force-pushed 5cd3349 to 1bcee40
- Added tests for SchedulerLatencyListener, showing graphically how the
elastic CPU controller behaves in response to various terms in the
control loop (delta, multiplicative factor, smoothing constant, etc)
-- see snippet below for an example. Added datadriven tests for
ElasticCPUWorkQueue and ElasticCPUGranter. Added tests for
quotapool.RateLimiter changes.
# With more lag (first half of the graph), we're more likely to
# observe a large difference between the set-point we need to hit
# and the utilization we currently have, making for larger
# scheduling latency fluctuations (i.e. an ineffective controller).
plot width=70 height=20
----
----
1069 ┤ ╭╮ ╭╮╭╮
1060 ┤ ││ ││││
1052 ┤ ││ ││││╭╮ ╭╮
1044 ┤ ││ ╭╮ ││││││ ││
1035 ┤ ╭╮││ ││ ╭╮ ╭╮ ││││││ ││
1027 ┤ │││╰╮ ││ ││ ││ ││││││ ╭╯│ ╭╮ ╭╮ ╭╮
1019 ┤ │││ │ ││ ╭╮││ ││╭╮ ││││││ │ │ ││ ││ ││
1010 ┤ │││ │ ││ │╰╯│ ││││ ││││││ │ │ ││ ╭──╮│╰─╮ ││ ╭╮╭─
1002 ┼────────────────────────────────────────────────────────────────────
993 ┤╰╮│││ │ │││││ ││││ │ │││││╰╮│ ╰╮╭╯││╰─╯╰╯╰╮╭╮ │ ╰─╯╰╯ ╰╯
985 ┤ ││╰╯ │ │╰╯╰╯ ││││ │ │││││ ││ ╰╯ ╰╯ ╰╯╰─╯
977 ┤ ││ │ │ ││││ │ │││││ ││
968 ┤ ││ │╭╯ ││││ ╰╮ │││││ ││
960 ┤ ││ ││ ││││ │ │││││ ╰╯
951 ┤ ││ ││ ││││ │ │││││
943 ┤ ││ ││ ││││ │╭╯││││
935 ┤ ││ ││ ││╰╯ ││ ╰╯││
926 ┤ ││ ││ ││ ││ ╰╯
918 ┤ ╰╯ ╰╯ ││ ╰╯
910 ┤ ││
901 ┤ ╰╯
p99 scheduler latencies (μs)
21.7 ┤ ╭╮
20.6 ┤ ╭───╮ ╭╯╰───╮
19.5 ┼─────────╮ ╭──╮ ╭╮ ╭────╮ ╭────╮╭╮╭╮╭─╯ ╰─╯ ╰╮╭
18.4 ┤ │ │ │╭─╯╰╮ ╭───╮╮ │╭─╮ ╰────│ ╰╯╰╯╰╯ ╰╯
17.3 ┤ ╰──╭╯╮╭╰────╮╭─╯╯ ││╭╭╯ │ ╭╯
16.2 ┤ │ ╰╯ ╰╯╭╯ ╰╮╭╯ │ │
15.2 ┤ ╭╯ ╰╯ ╰╯ │ ╭╯
14.1 ┤ │ │ │
13.0 ┤ │ │ │
11.9 ┤ ╭╯ │ ╭╯
10.8 ┤ │ │ │
9.7 ┤ │ │ ╭╯
8.7 ┤ ╭╯ │ │
7.6 ┤ │ │ │
6.5 ┤ ╭╯ │ ╭╯
5.4 ┤ │ │ │
4.3 ┤ │ │╭╯
3.2 ┤ ╭╯ ││
2.2 ┤ │ ││
1.1 ┤ │ ╰╯
0.0 ┼───────╯
elastic cpu utilization and limit (%)
- Moved the "scheduler-latency-ticker" loop into the (new)
schedulerlatency package, housing just the sampler and callbacks.
Added tests for code mucking with histogram state.
- Also decoupled the sample period (controlling the sampling
frequency) from the sample duration (how far in the past the latency
histograms apply over). This should reduce some p99 jaggedness we
saw with latency histograms collected over short durations while
still driving the control loop at a high frequency (necessary for
good utilization).
- Added micro-benchmarks to evaluate the overhead of checking against
ElasticCPUHandle.OverLimit in a tight loop within MVCCExportToSST.
Used the following:
$ dev bench pkg/storage \
--filter BenchmarkMVCCExportToSST/useElasticCPUHandle --count 10 \
--timeout 20m -v --stream-output --ignore-cache 2>&1 | tee bench.txt
$ for flavor in useElasticCPUHandle=true useElasticCPUHandle=false
do
grep -E "${flavor}[^0-9]+" bench.txt | sed -E "s/${flavor}+/X/" > $flavor.txt
done
# goos: linux
# goarch: amd64
# cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
$ benchstat useElasticCPUHandle\={false,true}.txt
name old time/op new time/op delta
MVCCExportToSST/X 2.03s ± 3% 2.14s ± 2% +5.47% (p=0.000 n=9+10)
Given the 5% hit, this commit also adds some simple estimation of
per-iteration running time within the ElasticCPUHandle (with
corresponding tests) to avoid calling grunning.Time() so frequently.
We're able to claw back the perf hit:
$ benchstat useElasticCPUHandle\={false,true}.txt
name old time/op new time/op delta
MVCCExportToSST/X 2.54s ± 2% 2.53s ± 2% ~ (p=0.549 n=10+9)
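A toy sketch of the estimation idea described above (all names and constants here are illustrative; the real handle samples grunning.Time()): re-read the expensive running-time clock only every few iterations, and extrapolate in between using the last measured per-iteration cost.

```go
package main

import "fmt"

const stride = 8 // how many cheap checks between expensive samples

// handle amortizes an expensive running-time read across iterations.
// The fake clock advances 50ns per sample to keep the sketch deterministic.
type handle struct {
	allotted        int64 // allotted CPU time, nanos
	now             int64 // fake running-time clock
	iters           int64
	lastSampled     int64
	perIterEstimate int64
}

// sampleRunningTime stands in for the expensive grunning.Time() call.
func (h *handle) sampleRunningTime() int64 {
	h.now += 50
	return h.now
}

// overLimit does an expensive sample only every `stride` calls; otherwise
// it extrapolates from the last sample using the per-iteration estimate.
func (h *handle) overLimit() bool {
	h.iters++
	if h.iters%stride != 0 {
		estimated := h.lastSampled + (h.iters%stride)*h.perIterEstimate
		return estimated > h.allotted
	}
	now := h.sampleRunningTime()
	h.perIterEstimate = (now - h.lastSampled) / stride
	h.lastSampled = now
	return now > h.allotted
}

func main() {
	h := &handle{allotted: 200}
	checks := 1
	for !h.overLimit() {
		checks++
	}
	// Far fewer expensive samples than OverLimit checks.
	fmt.Println("tripped after", checks, "checks, with", h.now/50, "expensive samples")
}
```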
Release note: None
- Rename some private cluster settings/metrics to be more internally
consistent; add validation for cluster setting ranges. Move some of
them out of the admission package.
- Substitute the RateLimiter used in the elasticCPUGranter by the
simpler TokenBucket instead.
- Forward grants to all waiting requests instead of bailing after the
first one.
- Improve the controller:
- Add commentary for elastic cpu controller; cribbing from the PR
description over at cockroachdb#86638.
- Skip the ewma smoothing of p99 latencies; now that latencies are
measured over large windows, the controller input is fairly stable,
obviating the need for this smoothing.
- When limit is unused, slowly decrease it to reduce likelihood of
(temporary) over admission.
- When adjusting CPU limit, don't look at observed utilization; look
only at current limit and whether there are requests waiting. There
were a few problems with the previous scheme:
- The way we measured utilization (used quota from quota pool) was
very jagged since the points at which we refill the quota pool are
only when we interact with it (i.e. there's no dedicated goroutine
on a timer). Experimentally, this made for a slower decrease than
necessary. When comparing a metric of in-use quota vs. used cpu
time, the former was a lot smoother.
- When the CPU limit was low, the controller had a very slow rise or
was altogether unreactive.
- Unreactive: Because we only raised the limit after observing
some fraction of available limit being used, consider what
happens if the utilization limit is at 1% in an 8vCPU machine.
The burst capacity of the token bucket = 1% * 8s = 80ms. If
we're relying on observing 90% of that value being in use, i.e.
72ms to have been acquired by elastic work, this is simply not
possible when the smallest unit of acquisition is larger, say 100ms.
- Slow-rise: Even if the limit is slightly higher, since the
controller uses a non-smoothed value, unless we sample right
when some work is being done, we're not going to react.
We really only introduced this min-util-fraction concept for one
reason: only increase CPU % if there are requests waiting on quota
that could possibly benefit from a larger %. This is something we can
check directly through the wait queue; we now do so. We previously
had a test demonstrating deficiencies of this min-util-fraction
scheme. The test was removed since it no longer applies, but
here's roughly what the change looks like:
- 25.0 ┼─╮ ╭─
- 23.8 ┤ ╰╮ ╭─╭─
- 22.5 ┤ ╰╮ ╭─╭─╯
- 21.2 ┤ │ ╭╯╭╯
- 20.0 ┤╭─╮╰╮ ╭─╭─╯
- 18.8 ┤│ ╰╮╰╮ ╭╯╭╯
- 17.5 ┤│ ╰╮╰╮ ╭─╭─╯
- 16.2 ┤│ ╰╮│ ╭╭─╯
- 15.0 ┤│ ╰╮╮ ╭─╭╯
- 13.8 ┤│ ╰╮╮ ╭─╭─╯
- 12.5 ┤│ ╰╮╮ ╭╭─╯
- 11.2 ┤│ ╰╮ ╭─╭╯
- 10.0 ┤│ ╰╮ ╭╭─╯
- 8.8 ┤│ ╰╮ ╭╭─╯
- 7.5 ┤│ │╮ ╭╭─╯
- 6.2 ┤│ ╰╮╮ ╭╭─╯
- 5.0 ┤│ ╰╮ ╭╭──╯
- 3.8 ┤│ ╰╮ ╭╭───╯
- 2.5 ┤│ ╰─╮ ╭╭────╯
- 1.2 ┤│ ╰╰──────────────╯
+ 25.2 ┼─╮
+ 23.9 ┤ ╰╮ ╭───────────
+ 22.6 ┤ ╰╮ ╭──────╭───────────
+ 21.4 ┤ │ ╭╯╭─────╯
+ 20.1 ┤╭─╮╰╮ ╭─╭─╯
+ 18.9 ┤│ ╰╮╰╮ ╭╯╭╯
+ 17.6 ┤│ ╰╮╰╮ ╭─╭─╯
+ 16.4 ┤│ ╰╮│ ╭╭─╯
+ 15.1 ┤│ ╰╮╮ ╭─╭╯
+ 13.8 ┤│ ╰╮╮ ╭╭─╯
+ 12.6 ┤│ ╰╮╮ ╭─╭╯
+ 11.3 ┤│ ╰╮╮ ╭╭─╯
+ 10.1 ┤│ ╰╮ ╭╭─╯
+ 8.8 ┤│ ╰╮ ╭─╭╯
+ 7.5 ┤│ ╰╮ ╭╭─╯
+ 6.3 ┤│ ╰╮ ╭╭─╯
+ 5.0 ┤│ ╰╮ ╭╭╯
+ 3.8 ┤│ ╰╮ ╭╭─╯
+ 2.5 ┤│ ╰─╮╮ ╭╭╯
+ 1.3 ┤│ ╰─────────╯
0.0 ┼╯
elastic cpu utilization and limit (%)
Force-pushed 1bcee40 to 815bf3b
irfansharif
left a comment
TFTR -- that was a lot to read through! Only force-pushing to rebase off of a pebble vendor bump.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @sumeerbhola)
pkg/util/admission/scheduler_latency_listener_test.go line 214 at r22 (raw file):
There's a bit of code duplication here that I didn't bother refactoring. This ticking happens under the "tick" directive where each tick is manually specified; the same loop exists in "auto" where only the tick count is specified.
Is it because the utilDelta, utilFrac etc. are also remembered and used in "auto"?
Indeed.
This can use some commentary at the top of this test.
Added.
And for the degenerate case that we are going to do many ticks with "auto", is there a reason we tick once in this tick command (in the testdata files), instead of specifying ticks=0?
No good reason, that too would work.
bors r+
Build succeeded
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
Error creating merge commit from e5152ab to blathers/backport-release-22.2-86638: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict. You may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.2.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Today when admission control admits a request, it is able to run indefinitely consuming arbitrary CPU. For long-running (~1s of CPU work per request) "elastic" (not latency sensitive) work like backups, this can have detrimental effects on foreground latencies – once such work is admitted, it can take up available CPU cores until completion, which prevents foreground work from running. The scheme below aims to change this behavior; there are two components in play:
Elastic work acquires CPU tokens representing some predetermined slice of CPU time, blocking until these tokens become available. We found that 100ms of tokens works well enough experimentally. A larger value, say 250ms, would translate to less preemption and fewer RPCs. What's important is that it isn't "too much", like 2s of CPU time, since that would let a single request hog a core potentially for 2s and allow for a large build-up of runnable goroutines (serving foreground traffic) on that core, affecting scheduling/foreground latencies.
The work preempts itself once the slice is used up (as a form of cooperative scheduling). Once preempted, the request returns to the caller with a resumption key. This scheme is effective in clamping down on scheduling latency that's due to an excessive amount of elastic work. We have proof from direct trace captures and instrumentation that reducing scheduling latencies directly translates to reduced foreground latencies. They're primarily felt when straddling goroutines, typically around RPC boundaries (request/response handling goroutines); the effects are multiplicative for statements that issue multiple requests.
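The preempt-and-resume idea above can be sketched as follows (a toy model with hypothetical names; the real handle tracks CPU time via admission control, and the resumption key is a storage key, not an index): work checks its allotment as it scans, and once over limit it returns where it stopped so the caller can re-admit and continue.

```go
package main

import "fmt"

// workHandle is a stand-in for the admission handle: it tracks how much of
// a fixed CPU-time allotment (abstract "units" here) has been consumed.
type workHandle struct{ used, allotted int }

func (h *workHandle) overLimit() bool { return h.used >= h.allotted }
func (h *workHandle) consume(n int)   { h.used += n }

// processKeys scans keys doing one unit of work each. Once the allotment is
// exhausted it preempts itself, returning a resumption index so the caller
// can acquire fresh tokens and continue from there.
func processKeys(h *workHandle, keys []string, resumeFrom int) (done bool, resumeIdx int) {
	for i := resumeFrom; i < len(keys); i++ {
		if h.overLimit() {
			return false, i // cooperative preemption: resume at keys[i]
		}
		h.consume(1) // pretend each key costs one unit of CPU time
	}
	return true, len(keys)
}

func main() {
	keys := []string{"a", "b", "c", "d", "e"}
	done, idx := processKeys(&workHandle{allotted: 3}, keys, 0)
	fmt.Println(done, idx) // prints "false 3": preempted partway through
	// The caller acquires a fresh allotment and resumes.
	done, idx = processKeys(&workHandle{allotted: 3}, keys, idx)
	fmt.Println(done, idx) // prints "true 5"
}
```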
The controller uses fixed deltas for adjustments, adjusting down a bit more aggressively than adjusting up. This is due to the nature of the work being paced — we care more about quickly introducing a ceiling rather than staying near it (though experimentally we’re able to stay near it just fine). It adjusts upwards only when seeing a reasonably high % of utilization with the allotted CPU quota (assuming it’s under the p99 target). The adjustments are small to reduce {over,under}shoot and controller instability at the cost of being somewhat dampened. We use a smoothed form of the p99 latency captures to add stability to the controller input, which consequently affects the controller output. We use a relatively low frequency when sampling scheduler latencies; since the p99 is computed off of histogram data, we saw a lot more jaggedness when taking p99s off of a smaller set of scheduler events (every 50ms for ex.) compared to computing p99s over a larger set of scheduler events (every 2500ms). This, with the small deltas used for adjustments, can make for a dampened response, but assuming a stable-ish foreground CPU load against a node, it works fine. The controller output is limited to a well-defined range that can be tuned through cluster settings.
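The controller's core step described above might be sketched like this (constants and the waiting-requests signal are illustrative assumptions; the real values are cluster settings and the real input is a smoothed p99 from scheduler histograms): fixed additive deltas, a more aggressive decrease than increase, an increase only when there is demand, and output clamped to a well-defined range.

```go
package main

import "fmt"

// Assumed constants; the real bounds and deltas are cluster settings.
const (
	minUtilPct = 0.01 // floor on elastic CPU %
	maxUtilPct = 0.25 // ceiling on elastic CPU %
	deltaUp    = 0.01 // additive increase
	deltaDown  = 0.05 // more aggressive additive decrease
)

// adjust is one controller tick: if observed p99 scheduling latency is over
// target, clamp down; otherwise raise the limit only if there are waiting
// requests that could actually use more quota. Output stays in range.
func adjust(cur, p99, target float64, hasWaiting bool) float64 {
	if p99 > target {
		cur -= deltaDown
	} else if hasWaiting {
		cur += deltaUp
	}
	if cur < minUtilPct {
		cur = minUtilPct
	}
	if cur > maxUtilPct {
		cur = maxUtilPct
	}
	return cur
}

func main() {
	fmt.Println(adjust(0.10, 1.5, 1.0, true))  // over target: decrease
	fmt.Println(adjust(0.10, 0.5, 1.0, true))  // under target, waiters: increase
	fmt.Println(adjust(0.10, 0.5, 1.0, false)) // under target, no waiters: hold
}
```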
Miscellaneous code details: To evaluate the overhead of checking against ElasticCPUHandle.OverLimit in a tight loop within MVCCExportToSST, we used the micro-benchmarks shown in the commit notes above. Underneath the hood the handle does simple estimation of per-iteration running time to avoid calling grunning.Time() frequently; not doing so caused a 5% slowdown in the same benchmark.
The tests for SchedulerLatencyListener show graphically how the elastic CPU controller behaves in response to various terms in the control loop (delta, multiplicative factor, smoothing constant, etc) -- see snippet below for an example.
[1]: Specifically the time between a goroutine being ready to run and when it's scheduled to do so by the Go scheduler.
Release note: None
Release justification: Non-production code