
admission,storage: introduce flush tokens to constrain write admission#82440

Merged
craig[bot] merged 1 commit into cockroachdb:master from sumeerbhola:flush_tokens on Jul 14, 2022

Conversation

@sumeerbhola
Collaborator

@sumeerbhola sumeerbhola commented Jun 4, 2022

In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables, faster than the system can flush them results in
memtable-induced write stalls, which create a latency hiccup for all
write traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):

  • Measure and smooth the peak rate at which the flush loop can operate.
    This relies on the recently added pebble.InternalIntervalMetrics.
  • The peak rate causes 100% utilization of the single flush thread,
    and that is potentially too high to prevent write stalls (depending
    on how long it takes to do a single flush). So we multiply the
    smoothed peak rate by a utilization-target-fraction which is
    dynamically adjusted and by default is constrained to the interval
    [0.5, 1.5]. There is additive increase and decrease of this
    fraction:
    • High usage of tokens and no write stalls cause an additive increase.
    • Write stalls cause an additive decrease. A small multiplier is used
      if there are multiple write stalls, so that the probing falls
      more in the region where there are no write stalls.
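The additive increase/decrease above can be sketched roughly as follows. This is a minimal illustration, not the actual CockroachDB code: the 0.025-style increment and the [0.5, 1.5] bounds come from the description, while the function shape and the multiplier of 2 for repeated stalls are assumptions.

```go
package main

import "fmt"

// Illustrative constants from the description above; the exact CockroachDB
// values and code structure may differ.
const (
	minUtilFraction = 0.5
	maxUtilFraction = 1.5
	increment       = 0.025
)

// adjustUtilTargetFraction nudges the utilization-target-fraction: additive
// increase on high token usage without stalls, additive decrease (with a
// small multiplier for repeated stalls) when write stalls occur.
func adjustUtilTargetFraction(cur float64, writeStalls int, highTokenUsage bool) float64 {
	switch {
	case writeStalls > 1:
		cur -= 2 * increment // multiple stalls: probe deeper into the stall-free region
	case writeStalls == 1:
		cur -= increment
	case highTokenUsage:
		cur += increment
	}
	if cur < minUtilFraction {
		cur = minUtilFraction
	}
	if cur > maxUtilFraction {
		cur = maxUtilFraction
	}
	return cur
}

func main() {
	f := 1.0
	f = adjustUtilTargetFraction(f, 0, true) // no stalls, high usage: increase
	fmt.Printf("%.3f\n", f)
	f = adjustUtilTargetFraction(f, 2, false) // repeated stalls: larger decrease
	fmt.Printf("%.3f\n", f)
}
```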

Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.

For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.
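The combination can be sketched as below. Only the min rule, the 15s computation interval, and the 250ms distribution interval come from the text; the function name, inputs, and example values are hypothetical.

```go
package main

import "fmt"

// computeTokensPerTick sketches the token handout described above: writes to
// L0 get the minimum of compaction-based and flush-based byte tokens, and the
// 15s token budget is released in 250ms slices to reduce burstiness.
// Names and inputs are illustrative, not the actual CockroachDB code.
func computeTokensPerTick(smoothedCompactionTokens, smoothedFlushTokens, utilTargetFraction float64) float64 {
	flushTokens := smoothedFlushTokens * utilTargetFraction
	total := smoothedCompactionTokens
	if flushTokens < total {
		total = flushTokens
	}
	const ticksPerInterval = 15.0 / 0.25 // 60 ticks of 250ms per 15s interval
	return total / ticksPerInterval
}

func main() {
	// E.g. 400MB of compaction tokens, 300MB of flush tokens, fraction 0.875.
	perTick := computeTokensPerTick(400<<20, 300<<20, 0.875)
	fmt.Printf("%.0f bytes per 250ms tick\n", perTick)
}
```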

There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.

The scheme can be disabled by increasing the cluster setting
admission.min_flush_util_fraction, which defaults to 0.5 (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
10.

The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:

  • early-life: when the store had 10-20GB of data and the compaction
    backlog was not very heavy, so there was less queueing for the
    limited disk bandwidth (it was still usually saturated).
  • later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.

In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer
to the admission control layer (where we have the capability to reorder
work).

Screenshots for early-life: Flush tokens were turned off at 22:32:30. Prior to that the flush utilization-target-fraction was 0.625.
Screen Shot 2022-06-03 at 6 35 14 PM
Screen Shot 2022-06-03 at 6 35 28 PM

Screenshots for later-life: Flush tokens were turned off at 22:03:20. Prior to that the flush utilization-target-fraction was 0.875.
Screen Shot 2022-06-03 at 6 07 50 PM
Screen Shot 2022-06-03 at 6 08 03 PM

These results were produced by running

roachprod create -n 2 --clouds aws  --aws-machine-type=c5d.9xlarge --local-ssd=false --aws-ebs-volume-type=gp2 sumeer-io
roachprod put sumeer-io:1 cockroach ./cockroach
roachprod put sumeer-io:2 workload ./workload
roachprod start sumeer-io --env "COCKROACH_ROCKSDB_CONCURRENCY=8"
roachprod run sumeer-io:2 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=1024 --splits=1000 --duration=30m0s --read-percent=0 --min-block-bytes=4096 --max-block-bytes=4096 {pgurl:1-1}

Fixes #77357

Release note (ops change): Write tokens are now also limited based on
flush throughput, so as to reduce storage layer write stalls. This
behavior is enabled by default. The cluster setting
admission.min_flush_util_fraction, defaulting to 0.5, can be used to
disable or tune flush throughput based admission tokens, for writes
to a store. Setting to a value much greater than 1, say 10, will
disable flush based tokens.

@sumeerbhola sumeerbhola requested a review from a team June 4, 2022 01:15
@sumeerbhola sumeerbhola requested review from a team as code owners June 4, 2022 01:15
@cockroach-teamcity
Member

This change is Reviewable

Member

@tbg tbg left a comment


This looks pretty good. The IO granter code is getting quite large, though I'm hesitant to suggest breaking it down in this PR.

Could you reword the release note a bit? It's a little choppy. Also, make sure to point out that the setting comes with a default of 0.5 (I guess 50 since we're not doing fractions) and so this protection is on by default.


// MinFlushUtilizationPercentage is a lower-bound on the dynamically adjusted
// flush utilization target that attempts to reduce write stalls. Set it to a
// high percentage, like 1000, to effectively disable flush based tokens.
Member


is using a "0-100" percentage consistent with other knobs or should this be a fractional number?

var MinFlushUtilizationPercentage = settings.RegisterIntSetting(
settings.SystemOnly,
"admission.min_flush_util_percent",
"when computing flush tokens, this percentage is used as a lower bound on the dynamic"+
Member


Can you give an example here as well? I think it's fairly cryptic as is. Nobody will know what to do with this knob; would setting it to 1 be too much? What are acceptable low/high values? In which direction should the knob be moved if write stalls occur (down)?

Maybe

A lower bound on the multiplier which is applied to the observed peak memtable flush rate which factors into IO admission control to reduce the number of write stalls. A large value (such as 1000) effectively disables admission control based on flush rate.

smoothedIntIngestedAccountedL0BytesFraction float64 // '1.0' means: all ingested bytes went to L0
smoothedTotalNumByteTokens float64 // smoothed history of token bucket sizes
// Smoothed history of byte tokens calculated based on compactions out of L0.
smoothedCompactionByteTokens float64
Member


note to self: this is renamed from smoothedTotalNumByteTokens.

// Smoothed history of byte tokens calculated based on compactions out of L0.
smoothedCompactionByteTokens float64

// Number of flush tokens, before multiplying by target fraction.
Member


// Smoothed history of flush tokens calculated based on memtable flushes, before multiplying by target fraction.

smoothedNumFlushTokens = alpha*intFlushTokens + (1-alpha)*prev.smoothedNumFlushTokens
}
const flushUtilTargetFractionIncrement = 0.025
// Have we used >= 90% of the tokens.
Member


// Have we used, over the last (15s) cycle, more than 90% of the tokens we would give out for the next cycle?

// Else avoid overflow by using the previously set unlimitedTokens. This
// should not really happen.
}
// Else intFlushUtilization is too low. We don't want to make token
Member


// Else intFlushUtilization is too low. We don't want to make token determination based on a very low utilization, so we hand out unlimited tokens. Note that ...

res.flushUtilTargetFraction, ib(int64(res.smoothedNumFlushTokens*res.flushUtilTargetFraction)))
p.Printf("admitting ")
if n := res.ioLoadListenerState.totalNumByteTokens; n < unlimitedTokens {
p.Printf("%s", ib(n))
Member


while we're here, can we make this a rate? So something like "%s/s", ib(float64(n)/float64(intervalLength*time.Second))?

p.Printf("compacted %s [≈%s]; ", ib(res.aux.intL0CompactedBytes), ib(res.smoothedIntL0CompactedBytes))

p.Printf("compacted %s [≈%s], ", ib(res.aux.intL0CompactedBytes), ib(res.smoothedIntL0CompactedBytes))
p.Printf("flush tokens %s with multiplier %.3f [≈%s]; ", ib(int64(res.aux.intFlushTokens)),
Member


This isn't the right place for the multiplier, no? I would expect something here like

p.Printf("flushed %s [≈%s]; ", ib(int64(res.aux.intFlushTokens)), ib(res.smoothedIntFlushTokens))

so basically just printing the flush stats.

Then, in the "admitting" section, it should say whether the token count was derived from L0 or from flushes:

admitting 4.02mb/s due to L0 growth with L0 penalty: X, Y, flush multiplier %.3f
admitting 4.02mb/s due to memtable flush with L0 penalty: X, Y, flush multiplier %.3f

@sumeerbhola
Collaborator Author

No regression in tpccbench/nodes=3/cpu=16 with this change. Both this and master ended with MAX WAREHOUSES = 3460

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

Reworded the release note.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)


pkg/kv/kvserver/metrics.go line 492 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Consider adding an alerting rule for this metric. I assume SRE (@joshimhoff) would like to alert on write stalls eventually.

func createAndRegisterUnavailableRangesRule(
	ctx context.Context, ruleRegistry *metric.RuleRegistry,
) {
	expr := "(sum by(instance, cluster) (ranges_unavailable)) > 0"
	var annotations []metric.LabelPair
	annotations = append(annotations, metric.LabelPair{
		Name:  proto.String("summary"),
		Value: proto.String("Instance {{ $labels.instance }} has {{ $value }} unavailable ranges"),
	})
	recommendedHoldDuration := 10 * time.Minute
	help := "This check detects when the number of ranges with less than quorum replicas live are non-zero for too long"
	rule, err := metric.NewAlertingRule(
		unavailableRangesRuleName,
		expr,
		annotations,
		nil,
		recommendedHoldDuration,
		help,
		true,
	)
	MaybeAddRuleToRegistry(ctx, err, unavailableRangesRuleName, rule, ruleRegistry)
}

Are these for alerts based on prometheus metrics? I am not sure what threshold is reasonable for SRE. I don't think there should be any aggregation here. I am inclined towards waiting for some incidents where this metric turned out to be useful as an early signal of badness.
@joshimhoff do you have an opinion?


pkg/util/admission/granter.go line 61 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

is using a "0-100" percentage consistent with other knobs or should this be a fractional number?

Changed to a fraction.


pkg/util/admission/granter.go line 65 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you give an example here as well? I think as is fairly cryptic. Nobody will know what to do with this knob; would setting it to 1 be too much? What are acceptable low/high values? In which direction should the knob be moved if write stalls occur (down)?

Maybe

A lower bound on the multiplier which is applied to the observed peak memtable flush rate which factors into IO admission control to reduce the number of write stalls. A large value (such as 1000) effectively disables admission control based on flush rate.

Done


pkg/util/admission/granter.go line 1635 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

note to self: this is renamed from smoothedTotalNumByteTokens.

Ack


pkg/util/admission/granter.go line 1637 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// Smoothed history of flush tokens calculated based on memtable flushes, before multiplying by target fraction.

Done


pkg/util/admission/granter.go line 2073 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// Have we used, over the last (15s) cycle, more than 90% of the tokens we would give out for the next cycle?

Done


pkg/util/admission/granter.go line 2111 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// Else intFlushUtilization is too low. We don't want to make token determination based on a very low utilization, so we hand out unlimited tokens. Note that ...

Done


pkg/util/admission/granter.go line 2233 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This isn't the right place for the multiplier, no? I would expect something here like

p.Printf("flushed %s [≈%s]; ", ib(int64(res.aux.intFlushTokens)), ib(res.smoothedIntFlushTokens))

so basically just printing the flush stats.

Then, in the "admitting" section, it should say whether the token count was derived from L0 or from flushes:

admitting 4.02mb/s due to L0 growth with L0 penalty: X, Y, flush multiplier %.3f
admitting 4.02mb/s due to memtable flush with L0 penalty: X, Y, flush multiplier %.3f

Done


pkg/util/admission/granter.go line 2237 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

while we're here, can we make this a rate? So something like "%s/s", ib(float64(n)/float64(intervalLength*time.Second))?

Done

@joshimhoff
Collaborator

joshimhoff commented Jun 14, 2022

I am inclined towards waiting for some incidents where this metric turned out to be useful as an early signal of badness.
@joshimhoff do you have an opinion?

I agree.

If we are talking about write stalls induced by resource overload, I expect the kvprober shadow write latency histogram will actually capture this fairly well. Right? I base this on:

The motivation is that writing to the memtable,
or creating memtables faster than the system can flush results in write
stalls due to memtables, that create a latency hiccup for all write
traffic. We have observed write stalls that lasted > 100ms.

If we are talking about ~indefinite write stalls induced by disk availability issues on the cloud provider side, I expect that something like #76081 will catch impact if it is significant enough to warrant paging SRE, e.g. if a range becomes unavailable as it did with a ~recent cloud customer. Also we already have controls in place to auto-restart nodes in case of such an incident (and we page SRE on enough restarts).

As always, what SRE should alert on != what the customer should alert on. Any signs of overload at all is a good thing for the customer to alert on.

Contributor

@abarganier abarganier left a comment


From an o11y standpoint, defining an alerting rule was the main point I wanted to bring up. Thanks @tbg for being an advocate!

I am inclined towards waiting for some incidents where this metric turned out to be useful as an early signal of badness.

SGTM from an SRE-specific alert perspective.

what SRE should alert on != what the customer should alert on. Any signs of overload at all is a good thing for the customer to alert on.

I also agree with this - are we sure there's not a conservative alerting rule we can define here to indicate when things are going south? Or is storage.write-stall-nanos dependent enough on cluster configuration & workload that there's no clear "one size fits all"?

If there's some threshold that, in all cases, would indicate that investigation is required, I think an alerting rule would be a great way for us to provide guidance to customers on how to practically make use of this metric. I realize this isn't always the case, so if it doesn't make sense to do so, that's fine too.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)

@tbg tbg requested review from tbg and removed request for tbg June 16, 2022 17:18
@tbg
Member

tbg commented Jun 16, 2022

@sumeerbhola please re-request review when you're back and we can push this over the finish line quickly.

@sumeerbhola
Collaborator Author

CI failure is the same flake as #79109

Collaborator Author

@sumeerbhola sumeerbhola left a comment


If we are talking about write stalls induced by resource overload, I expect the kvprober shadow write latency histogram will actually capture this fairly well. Right?

Yes, it should capture it.

I also agree with this - are we sure there's not a conservative alerting rule we can define here to indicate when things are going south? Or is storage.write-stall-nanos dependent enough on cluster configuration & workload where there's no clear "one size fits all"?

It's very possible that the write stall rate (seconds/second) that is tolerable is dependent on the workload. And that rate doesn't even give a good sense of the latency impact, which a prober or the customer measuring their own latency would do. Based on this discussion thread I am more inclined now towards treating this new metric as something that is useful primarily for diagnosis, and not for alerting.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)

@sumeerbhola sumeerbhola requested review from abarganier and tbg July 6, 2022 14:25
@sumeerbhola sumeerbhola requested a review from a team as a code owner July 6, 2022 14:36
Member

@tbg tbg left a comment


:lgtm: but looks like there's a rebase, could you make sure the rebase happens separately from any additions? For example, do the rebase next, and make any changes prompted by my comments in a separate fixup commit (which we can squash before merge). Thanks!

Reviewed 1 of 9 files at r1, 6 of 9 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @irfansharif, and @sumeerbhola)


-- commits line 78 at r3:
This could use a bit more context, like

Release note (ops change): I/O admission control now reduces the likelihood of storage layer write stalls which would result from memtable flushes becoming a bottleneck by suitably restricting admission tokens. The new cluster setting admission.min_flush_util_fraction disables or tunes this mechanism; it is enabled by default and can be disabled by setting to a large value such as 10. It should not be changed without recommendation by a domain expert.


pkg/storage/engine.go line 922 at r2 (raw file):

	// calls to this method. Hence, this should be used with care, with only one
	// caller, which is currently the admission control subsystem.
	GetInternalIntervalMetrics() *pebble.InternalIntervalMetrics

This seems a little odd, why don't we return a cumulative metric here? The timeseries would then look at the non-neg rate of this and this seems to sidestep the frailty of this method.


pkg/util/admission/granter.go line 65 at r1 (raw file):

Previously, sumeerbhola wrote…

Done

Sorry to nitpick again, but should the explanations be in the setting help and not in a comment? After all, they will be harder to reach if they are only in the source.

@sumeerbhola sumeerbhola force-pushed the flush_tokens branch 2 times, most recently from 5db234a to e302759 on July 13, 2022 14:35
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, @sumeerbhola, and @tbg)


-- commits line 78 at r3:

Previously, tbg (Tobias Grieger) wrote…

This could use a bit more context, like

Release note (ops change): I/O admission control now reduces the likelihood of storage layer write stalls which would result from memtable flushes becoming a bottleneck by suitably restricting admission tokens. The new cluster setting admission.min_flush_util_fraction disables or tunes this mechanism; it is enabled by default and can be disabled by setting to a large value such as 10. It should not be changed without recommendation by a domain expert.

Done

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, and @tbg)


pkg/storage/engine.go line 922 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This seems a little odd, why don't we return a cumulative metric here? The timeseries would then look at the non-neg rate of this and this seems to sidestep the frailty of this method.

It's the histogram (which we don't currently use) -- see the following comment from pebble/metrics.go.

// InternalIntervalMetrics exposes metrics about internal subsystems, that can
// be useful for deep observability purposes, and for higher-level admission
// control systems that are trying to estimate the capacity of the DB. These
// are experimental and subject to change, since they expose internal
// implementation details, so do not rely on these without discussion with the
// Pebble team.
// These represent the metrics over the interval of time from the last call to
// retrieve these metrics. These are not cumulative, unlike Metrics. The main
// challenge in making these cumulative is the hdrhistogram.Histogram, which
// does not have the ability to subtract a histogram from a preceding metric
// retrieval.

pkg/util/admission/granter.go line 65 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sorry to nitpick again, but should the explanations be in the setting help and not in a comment? After all, they will be harder to reach if they are only in the source.

Done (duplicated more of the code comment into the setting help).

// typically the more persistent bottleneck than flushes for the following
// reason:
// There is a dedicated flush thread. In comparison, with a write amp of W
// and maximum compaction concurrency of C, only approximately C/W threads
Contributor


For my own learning, how did you arrive at the C/W figure for the number of L0 readers?

}
flushTokensFloat := flushUtilTargetFraction * smoothedNumFlushTokens
if flushTokensFloat < float64(math.MaxInt64) {
numFlushTokens = int64(flushTokensFloat)
Contributor


Similar to the number of available CPU slots, should we export the number of flush tokens and compaction byte tokens as metrics?

// 0.16 to 0.9 in a single interval, when compaction tokens are not limited,
// hence we have set flushUtilIgnoreThreshold to a very low value. If we've
// erred towards it being too low, we run the risk of computing incorrect
// tokens. If we've erred towards being too high, we run the risk of giving
Contributor


For my own understanding: If we're too low, we're risking underutilization?

tick: 0, setAvailableIOTokens: unlimited

# Flush loop utilization is high enough.
set-state l0-bytes=10000 l0-added=11000 l0-files=1 l0-sublevels=1 flush-bytes=1000 flush-work-sec=2 flush-idle-sec=10 print-only-first-tick=true
Contributor


[nit] These tests would be easier to read if the set-state directive kept all the previous values and overwrote only those stated (set-state write-stall-count=1 in this example)

Contributor


With these test cases below, what most specifically is of interest? The fact that multiplier observes steady decrements with each write stall? If there was a way to only emit the delta from the last printout that would make these tests easier to follow.

Member

@tbg tbg left a comment


Reviewed 10 of 10 files at r4, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @sumeerbhola)


pkg/storage/engine.go line 922 at r2 (raw file):

Previously, sumeerbhola wrote…

It's the histogram (which we don't currently use) -- see the following comment from pebble/metrics.go.

// InternalIntervalMetrics exposes metrics about internal subsystems, that can
// be useful for deep observability purposes, and for higher-level admission
// control systems that are trying to estimate the capacity of the DB. These
// are experimental and subject to change, since they expose internal
// implementation details, so do not rely on these without discussion with the
// Pebble team.
// These represent the metrics over the interval of time from the last call to
// retrieve these metrics. These are not cumulative, unlike Metrics. The main
// challenge in making these cumulative is the hdrhistogram.Histogram, which
// does not have the ability to subtract a histogram from a preceding metric
// retrieval.

I don't understand, isn't the hdrhistogram returned in *InternalIntervalMetrics cumulative, then? Looking at the impl of GetInternalIntervalMetrics() it looks like we're just attaching a long-lived hdrhistogram, so no resetting is taking place.
However, we are resetting flushWriteThroughput but this seems unnecessary, since the caller knows how to subtract these from each other (it's just a collection of cumulative fields). So I don't see what is blocking this method from just being "canonical" in returning cumulatives entirely.

Besides, one can take deltas of hdrhistograms:

	var g, h *hdrhistogram.Histogram // assumed to use same parameters
	snapg, snaph := g.Export(), h.Export()
	for i := range snapg.Counts {
		snaph.Counts[i] -= snapg.Counts[i]
	}
	delta := hdrhistogram.Import(snaph) // delta == h-g
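The snapshot-subtraction idea in that sketch generalizes; here is a self-contained, hypothetical stand-in (plain count vectors rather than the real hdrhistogram types) showing how two cumulative snapshots with the same bucket layout yield an interval histogram:

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for an exported histogram: a vector of
// per-bucket counts. It mimics hdrhistogram's Snapshot.Counts for illustration.
type snapshot struct {
	counts []int64
}

// delta returns h - g bucket-wise, i.e. the histogram of events that occurred
// between the two cumulative snapshots. Both must share the bucket layout.
func delta(g, h snapshot) snapshot {
	out := snapshot{counts: make([]int64, len(h.counts))}
	for i := range h.counts {
		out.counts[i] = h.counts[i] - g.counts[i]
	}
	return out
}

func main() {
	g := snapshot{counts: []int64{1, 2, 3}} // earlier cumulative snapshot
	h := snapshot{counts: []int64{4, 4, 9}} // later cumulative snapshot
	fmt.Println(delta(g, h).counts)
}
```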

Member

@tbg tbg left a comment


I'm okay with fixing up the PebbleIntervalMetrics separately.

In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables, faster than the system can flush them results in
memtable-induced write stalls, which create a latency hiccup for all
write traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate at which the flush loop can operate.
  This relies on the recently added pebble.InternalIntervalMetrics.
- The peak rate causes 100% utilization of the single flush thread,
  and that is potentially too high to prevent write stalls (depending
  on how long it takes to do a single flush). So we multiply the
  smoothed peak rate by a utilization-target-fraction which is
  dynamically adjusted and by default is constrained to the interval
  [0.5, 1.5]. There is additive increase and decrease of this
  fraction:
  - High usage of tokens and no write stalls cause an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used
    if there are multiple write stalls, so that the probing falls
    more in the region where there are no write stalls.

Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.

For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.

There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.

The scheme can be disabled by increasing the cluster setting
admission.min_flush_util_fraction, which defaults to 0.5 (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
10.

The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data and the compaction
  backlog was not very heavy, so there was less queueing for the
  limited disk bandwidth (though it was still usually saturated).
- later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast,
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.

In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer
to the admission control layer (where we have the capability to
reorder work).

Fixes cockroachdb#77357

Release note (ops change): I/O admission control now reduces the
likelihood of storage layer write stalls, which can be caused by
memtable flushes becoming a bottleneck. This is done by limiting
write tokens based on flush throughput. Consequently, write tokens
are now limited both by flush throughput and by compaction throughput
out of L0. This behavior is enabled by default. The new cluster
setting admission.min_flush_util_fraction, defaulting to 0.5, can be
used to disable or tune flush-throughput-based admission tokens.
Setting it to a value much greater than 1, say 10, will disable
flush-based tokens. Tuning the behavior, without disabling it, should
be done only on the recommendation of a domain expert.
@sumeerbhola (Collaborator, Author) left a comment:
I'm okay with fixing up the PebbleIntervalMetrics separately.
Ack

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, and @tbg)


pkg/storage/engine.go line 922 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I don't understand, isn't the hdrhistogram returned in *InternalIntervalMetrics cumulative, then? Looking at the impl of GetInternalIntervalMetrics() it looks like we're just attaching a long-lived hdrhistogram, so no resetting is taking place.
However, we are resetting flushWriteThroughput, but this seems unnecessary, since the caller knows how to subtract these from each other (it's just a collection of cumulative fields). So I don't see anything blocking this method from just being "canonical" in returning cumulatives entirely.

Besides, one can take deltas of hdrhistograms:

	var g, h *hdrhistogram.Histogram // assumed to use same parameters
	snapg, snaph := g.Export(), h.Export()
	for i := range snapg.Counts {
		snaph.Counts[i] -= snapg.Counts[i]
	}
	delta := hdrhistogram.Import(snaph) // delta == h-g

No, the hdrhistogram is not cumulative, since once we grab d.mu.log.metrics we set it to nil:

		m.LogWriter.SyncLatencyMicros = d.mu.log.metrics.SyncLatencyMicros
		d.mu.log.metrics = nil

Thanks for the pointer regarding deltas -- I should have looked more closely at the implementation (the sub-bucket stuff threw me off and I thought it was doing data dependent buckets, and not deterministic buckets based on the initial parameters).


pkg/util/admission/granter.go line 2285 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

For my own learning, how did you arrive at the C/W figure for the number of L0 readers?

I've modified this comment
// There is a dedicated flush thread. With a maximum compaction concurrency
// of C, we have up to C threads dedicated to handling the write-amp of W
// (caused by rewriting the same data). So C/(W-1) threads on average are
// reading the original data (that will be rewritten W-1 times). Since L0
// can have multiple overlapping files, and intra-L0 compactions are usually
// avoided, we can assume (at best) that the original data (in L0) is being
// read only when compacting to levels lower than L0. That is, C/(W-1)
// threads are reading from L0 to compact to levels lower than L0. Since W
// can be 20+ and C defaults to 3 (we plan to dynamically adjust C but one
// can expect C to be <= 10), C/(W-1) < 1. So the main reason we are
// considering flush tokens is transient flush bottlenecks, and workloads
// where W is small.
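As a quick check of the arithmetic in this comment, with the default C=3 and an assumed write-amp of W=21:

```go
package main

import "fmt"

func main() {
	// From the comment above: C compaction threads handle write-amp W, so on
	// average C/(W-1) threads are reading the original data from L0.
	c, w := 3.0, 21.0 // C defaults to 3; W can be 20+
	fmt.Printf("%.2f\n", c/(w-1)) // well below 1 full thread
}
```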


pkg/util/admission/granter.go line 2364 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Similar to the number of available CPU slots, should we export the number of flush tokens and compaction byte tokens as metrics?

If someone finds it useful. To understand what is really happening with these tokens, I needed to look at multiple numbers, hence the log statement printed every 15s.


pkg/util/admission/granter.go line 2375 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

For my own understanding: If we're too low, we're risking underutilization?

This is talking about flushUtilIgnoreThreshold. If that is too low, we may have cases of intFlushUtilization > flushUtilIgnoreThreshold that result in poor-quality token computation. Whether that poor quality errs towards fewer tokens than appropriate or more tokens than appropriate, I don't know.


pkg/util/admission/testdata/io_load_listener line 320 at r6 (raw file):
It's sometimes hard to understand a datadriven test if each directive is not self contained -- one has to look at the whole test history. Currently one has to look at the previous directive, since many of these values are cumulative. One could make a case to take the delta of these cumulative values in the directive, to make it more self contained. I'm going to leave such debatable changes for a future PR.

With these test cases below, what most specifically is of interest? The fact that multiplier observes steady decrements with each write stall? If there was a way to only emit the delta from the last printout that would make these tests easier to follow.

The lack of comments was an oversight. I've added them now.

@sumeerbhola (Collaborator, Author) left a comment:
TFTRs!

I've squashed the commits -- I plan to merge when CI is green.


@sumeerbhola (Collaborator, Author) commented:
bors r=tbg,irfansharif

@craig (craig bot) commented on Jul 14, 2022:

Build failed (retrying...):

@craig (craig bot) commented on Jul 14, 2022:

Build failed:

@sumeerbhola (Collaborator, Author) commented:

TestWorkload flaked -- #78478

@sumeerbhola (Collaborator, Author) commented:

bors retry

@craig (craig bot) commented on Jul 14, 2022:

Build failed (retrying...):

@craig (craig bot) commented on Jul 14, 2022:

Build succeeded:

Successfully merging this pull request may close these issues:

- admission control: consider write stalls due to memtable count