
admission,storage: introduce flush tokens to constrain write admission#82440

Merged
craig[bot] merged 1 commit into cockroachdb:master from sumeerbhola:flush_tokens on Jul 14, 2022

Conversation

@sumeerbhola
Collaborator

@sumeerbhola sumeerbhola commented Jun 4, 2022

In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables, faster than the system can flush them results in
memtable-induced write stalls, which create a latency hiccup for all
write traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):

  • Measure and smooth the peak rate at which the flush loop can operate.
    This relies on the recently added pebble.InternalIntervalMetrics.
  • The peak rate causes 100% utilization of the single flush thread,
    and that is potentially too high to prevent write stalls (depending
    on how long it takes to do a single flush). So we multiply the
    smoothed peak rate by a utilization-target-fraction which is
    dynamically adjusted and by default is constrained to the interval
    [0.5, 1.5]. There is additive increase and decrease of this
    fraction:
    • High usage of tokens and no write stalls cause an additive increase.
    • Write stalls cause an additive decrease. A small multiplier is used
      if there are multiple write stalls, so that the probing falls
      more in the region where there are no write stalls.
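The additive increase/decrease above can be sketched roughly as follows. This is a minimal illustration, not the actual CockroachDB code: the 0.025-style increment and the [0.5, 1.5] bounds come from the description, while the function shape and the multiplier of 2 for repeated stalls are assumptions.

```go
package main

import "fmt"

// Illustrative constants from the description above; the exact CockroachDB
// values and code structure may differ.
const (
	minUtilFraction = 0.5
	maxUtilFraction = 1.5
	increment       = 0.025
)

// adjustUtilTargetFraction nudges the utilization-target-fraction: additive
// increase on high token usage without stalls, additive decrease (with a
// small multiplier for repeated stalls) when write stalls occur.
func adjustUtilTargetFraction(cur float64, writeStalls int, highTokenUsage bool) float64 {
	switch {
	case writeStalls > 1:
		cur -= 2 * increment // multiple stalls: probe deeper into the stall-free region
	case writeStalls == 1:
		cur -= increment
	case highTokenUsage:
		cur += increment
	}
	if cur < minUtilFraction {
		cur = minUtilFraction
	}
	if cur > maxUtilFraction {
		cur = maxUtilFraction
	}
	return cur
}

func main() {
	f := 1.0
	f = adjustUtilTargetFraction(f, 0, true) // no stalls, high usage: increase
	fmt.Printf("%.3f\n", f)
	f = adjustUtilTargetFraction(f, 2, false) // repeated stalls: larger decrease
	fmt.Printf("%.3f\n", f)
}
```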

Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.

For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.
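The combination can be sketched as below. Only the min rule, the 15s computation interval, and the 250ms distribution interval come from the text; the function name, inputs, and example values are hypothetical.

```go
package main

import "fmt"

// computeTokensPerTick sketches the token handout described above: writes to
// L0 get the minimum of compaction-based and flush-based byte tokens, and the
// 15s token budget is released in 250ms slices to reduce burstiness.
// Names and inputs are illustrative, not the actual CockroachDB code.
func computeTokensPerTick(smoothedCompactionTokens, smoothedFlushTokens, utilTargetFraction float64) float64 {
	flushTokens := smoothedFlushTokens * utilTargetFraction
	total := smoothedCompactionTokens
	if flushTokens < total {
		total = flushTokens
	}
	const ticksPerInterval = 15.0 / 0.25 // 60 ticks of 250ms per 15s interval
	return total / ticksPerInterval
}

func main() {
	// E.g. 400MB of compaction tokens, 300MB of flush tokens, fraction 0.875.
	perTick := computeTokensPerTick(400<<20, 300<<20, 0.875)
	fmt.Printf("%.0f bytes per 250ms tick\n", perTick)
}
```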

There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.

The scheme can be disabled by increasing the cluster setting
admission.min_flush_util_fraction, which defaults to 0.5 (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
10.

The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:

  • early-life: when the store had 10-20GB of data and the compaction
    backlog was not very heavy, so there was less queueing for the
    limited disk bandwidth (it was still usually saturated).
  • later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.

In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer
to the admission control layer (where we have the capability to reorder
work).

Screenshots for early-life: Flush tokens were turned off at 22:32:30. Prior to that the flush utilization-target-fraction was 0.625.
Screen Shot 2022-06-03 at 6 35 14 PM
Screen Shot 2022-06-03 at 6 35 28 PM

Screenshots for later-life: Flush tokens were turned off at 22:03:20. Prior to that the flush utilization-target-fraction was 0.875.
Screen Shot 2022-06-03 at 6 07 50 PM
Screen Shot 2022-06-03 at 6 08 03 PM

These results were produced by running

roachprod create -n 2 --clouds aws  --aws-machine-type=c5d.9xlarge --local-ssd=false --aws-ebs-volume-type=gp2 sumeer-io
roachprod put sumeer-io:1 cockroach ./cockroach
roachprod put sumeer-io:2 workload ./workload
roachprod start sumeer-io --env "COCKROACH_ROCKSDB_CONCURRENCY=8"
roachprod run sumeer-io:2 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=1024 --splits=1000 --duration=30m0s --read-percent=0 --min-block-bytes=4096 --max-block-bytes=4096 {pgurl:1-1}

Fixes #77357

Release note (ops change): Write tokens are now also limited based on
flush throughput, so as to reduce storage layer write stalls. This
behavior is enabled by default. The cluster setting
admission.min_flush_util_fraction, defaulting to 0.5, can be used to
disable or tune flush throughput based admission tokens, for writes
to a store. Setting to a value much greater than 1, say 10, will
disable flush based tokens.

@sumeerbhola sumeerbhola requested a review from a team June 4, 2022 01:15
@sumeerbhola sumeerbhola requested review from a team as code owners June 4, 2022 01:15
@cockroach-teamcity
Member

This change is Reviewable

Member

@tbg tbg left a comment


This looks pretty good. The IO granter code is getting quite large, though I'm hesitant to suggest breaking it down in this PR.

Could you reword the release note a bit? It's a little choppy. Also, make sure to point out that the setting comes with a default of 0.5 (I guess 50 since we're not doing fractions) and so this protection is on by default.


// MinFlushUtilizationPercentage is a lower-bound on the dynamically adjusted
// flush utilization target that attempts to reduce write stalls. Set it to a
// high percentage, like 1000, to effectively disable flush based tokens.
Member


is using a "0-100" percentage consistent with other knobs or should this be a fractional number?

var MinFlushUtilizationPercentage = settings.RegisterIntSetting(
settings.SystemOnly,
"admission.min_flush_util_percent",
"when computing flush tokens, this percentage is used as a lower bound on the dynamic"+
Member


Can you give an example here as well? I think it's fairly cryptic as is. Nobody will know what to do with this knob; would setting it to 1 be too much? What are acceptable low/high values? In which direction should the knob be moved if write stalls occur (down)?

Maybe

A lower bound on the multiplier which is applied to the observed peak memtable flush rate which factors into IO admission control to reduce the number of write stalls. A large value (such as 1000) effectively disables admission control based on flush rate.

smoothedIntIngestedAccountedL0BytesFraction float64 // '1.0' means: all ingested bytes went to L0
smoothedTotalNumByteTokens float64 // smoothed history of token bucket sizes
// Smoothed history of byte tokens calculated based on compactions out of L0.
smoothedCompactionByteTokens float64
Member


note to self: this is renamed from smoothedTotalNumByteTokens.

// Smoothed history of byte tokens calculated based on compactions out of L0.
smoothedCompactionByteTokens float64

// Number of flush tokens, before multiplying by target fraction.
Member


// Smoothed history of flush tokens calculated based on memtable flushes, before multiplying by target fraction.

smoothedNumFlushTokens = alpha*intFlushTokens + (1-alpha)*prev.smoothedNumFlushTokens
}
const flushUtilTargetFractionIncrement = 0.025
// Have we used >= 90% of the tokens.
Member


// Have we used, over the last (15s) cycle, more than 90% of the tokens we would give out for the next cycle?

// Else avoid overflow by using the previously set unlimitedTokens. This
// should not really happen.
}
// Else intFlushUtilization is too low. We don't want to make token
Member


// Else intFlushUtilization is too low. We don't want to make token determination based on a very low utilization, so we hand out unlimited tokens. Note that ...

res.flushUtilTargetFraction, ib(int64(res.smoothedNumFlushTokens*res.flushUtilTargetFraction)))
p.Printf("admitting ")
if n := res.ioLoadListenerState.totalNumByteTokens; n < unlimitedTokens {
p.Printf("%s", ib(n))
Member


while we're here, can we make this a rate? So something like "%s/s", ib(float64(n)/float64(intervalLength*time.Second))?

p.Printf("compacted %s [≈%s]; ", ib(res.aux.intL0CompactedBytes), ib(res.smoothedIntL0CompactedBytes))

p.Printf("compacted %s [≈%s], ", ib(res.aux.intL0CompactedBytes), ib(res.smoothedIntL0CompactedBytes))
p.Printf("flush tokens %s with multiplier %.3f [≈%s]; ", ib(int64(res.aux.intFlushTokens)),
Member


This isn't the right place for the multiplier, no? I would expect something here like

p.Printf("flushed %s [≈%s]; ", ib(int64(res.aux.intFlushTokens)), ib(res.smoothedIntFlushTokens))

so basically just printing the flush stats.

Then, in the "admitting" section, it should say whether the token count was derived from L0 or from flushes:

admitting 4.02mb/s due to L0 growth with L0 penalty: X, Y, flush multiplier %.3f
admitting 4.02mb/s due to memtable flush with L0 penalty: X, Y, flush multiplier %.3f

@sumeerbhola
Collaborator Author

No regression in tpccbench/nodes=3/cpu=16 with this change. Both this and master ended with MAX WAREHOUSES = 3460

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

Reworded the release note.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)


pkg/kv/kvserver/metrics.go line 492 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Consider adding an alerting rule for this metric. I assume SRE (@joshimhoff) would like to alert on write stalls eventually.

func createAndRegisterUnavailableRangesRule(
	ctx context.Context, ruleRegistry *metric.RuleRegistry,
) {
	expr := "(sum by(instance, cluster) (ranges_unavailable)) > 0"
	var annotations []metric.LabelPair
	annotations = append(annotations, metric.LabelPair{
		Name:  proto.String("summary"),
		Value: proto.String("Instance {{ $labels.instance }} has {{ $value }} unavailable ranges"),
	})
	recommendedHoldDuration := 10 * time.Minute
	help := "This check detects when the number of ranges with less than quorum replicas live are non-zero for too long"
	rule, err := metric.NewAlertingRule(
		unavailableRangesRuleName,
		expr,
		annotations,
		nil,
		recommendedHoldDuration,
		help,
		true,
	)
	MaybeAddRuleToRegistry(ctx, err, unavailableRangesRuleName, rule, ruleRegistry)
}

Are these for alerts based on prometheus metrics? I am not sure what threshold is reasonable for SRE. I don't think there should be any aggregation here. I am inclined towards waiting for some incidents where this metric turned out to be useful as an early signal of badness.
@joshimhoff do you have an opinion?


pkg/util/admission/granter.go line 61 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

is using a "0-100" percentage consistent with other knobs or should this be a fractional number?

Changed to a fraction.


pkg/util/admission/granter.go line 65 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you give an example here as well? I think as is fairly cryptic. Nobody will know what to do with this knob; would setting it to 1 be too much? What are acceptable low/high values? In which direction should the knob be moved if write stalls occur (down)?

Maybe

A lower bound on the multiplier which is applied to the observed peak memtable flush rate which factors into IO admission control to reduce the number of write stalls. A large value (such as 1000) effectively disables admission control based on flush rate.

Done


pkg/util/admission/granter.go line 1635 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

note to self: this is renamed from smoothedTotalNumByteTokens.

Ack


pkg/util/admission/granter.go line 1637 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// Smoothed history of flush tokens calculated based on memtable flushes, before multiplying by target fraction.

Done


pkg/util/admission/granter.go line 2073 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// Have we used, over the last (15s) cycle, more than 90% of the tokens we would give out for the next cycle?

Done


pkg/util/admission/granter.go line 2111 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// Else intFlushUtilization is too low. We don't want to make token determination based on a very low utilization, so we hand out unlimited tokens. Note that ...

Done


pkg/util/admission/granter.go line 2233 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This isn't the right place for the multiplier, no? I would expect something here like

p.Printf("flushed %s [≈%s]; ", ib(int64(res.aux.intFlushTokens)), ib(res.smoothedIntFlushTokens))

so basically just printing the flush stats.

Then, in the "admitting" section, it should say whether the token count was derived from L0 or from flushes:

admitting 4.02mb/s due to L0 growth with L0 penalty: X, Y, flush multiplier %.3f
admitting 4.02mb/s due to memtable flush with L0 penalty: X, Y, flush multiplier %.3f

Done


pkg/util/admission/granter.go line 2237 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

while we're here, can we make this a rate? So something like "%s/s", ib(float64(n)/float64(intervalLength*time.Second))?

Done

@joshimhoff
Collaborator

joshimhoff commented Jun 14, 2022

I am inclined towards waiting for some incidents where this metric turned out to be useful as an early signal of badness.
@joshimhoff do you have an opinion?

I agree.

If we are talking about write stalls induced by resource overload, I expect the kvprober shadow write latency histogram will actually capture this fairly well. Right? I base this on:

The motivation is that writing to the memtable,
or creating memtables faster than the system can flush results in write
stalls due to memtables, that create a latency hiccup for all write
traffic. We have observed write stalls that lasted > 100ms.

If we are talking about ~indefinite write stalls induced by disk availability issues on the cloud provider side, I expect that something like #76081 will catch impact if it is significant enough to warrant paging SRE, e.g. if a range becomes unavailable as it did with a ~recent cloud customer. Also we already have controls in place to auto-restart nodes in case of such an incident (and we page SRE on enough restarts).

As always, what SRE should alert on != what the customer should alert on. Any signs of overload at all is a good thing for the customer to alert on.

Contributor

@abarganier abarganier left a comment


From an o11y standpoint, defining an alerting rule was the main point I wanted to bring up. Thanks @tbg for being an advocate!

I am inclined towards waiting for some incidents where this metric turned out to be useful as an early signal of badness.

SGTM from an SRE-specific alert perspective.

what SRE should alert on != what the customer should alert on. Any signs of overload at all is a good thing for the customer to alert on.

I also agree with this - are we sure there's not a conservative alerting rule we can define here to indicate when things are going south? Or is storage.write-stall-nanos dependent enough on cluster configuration & workload that there's no clear "one size fits all"?

If there's some threshold that, in all cases, would indicate that investigation is required, I think an alerting rule would be a great way for us to provide guidance to customers on how to practically make use of this metric. I realize this isn't always the case, so if it doesn't make sense to do so, that's fine too.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)

@tbg tbg requested review from tbg and removed request for tbg June 16, 2022 17:18
@tbg
Member

tbg commented Jun 16, 2022

@sumeerbhola please re-request review when you're back and we can push this over the finish line quickly.

@sumeerbhola
Collaborator Author

CI failure is the same flake as #79109

Collaborator Author

@sumeerbhola sumeerbhola left a comment


If we are talking about write stalls induced by resource overload, I expect the kvprober shadow write latency histogram will actually capture this fairly well. Right?

Yes, it should capture it.

I also agree with this - are we sure there's not a conservative alerting rule we can define here to indicate when things are going south? Or is storage.write-stall-nanos dependent enough on cluster configuration & workload where there's no clear "one size fits all"?

It's very possible that the write stall rate (seconds/second) that is tolerable is dependent on the workload. And that rate doesn't even give a good sense of the latency impact, which a prober or the customer measuring their own latency would do. Based on this discussion thread I am more inclined now towards treating this new metric as something that is useful primarily for diagnosis, and not for alerting.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)

@sumeerbhola sumeerbhola requested review from abarganier and tbg July 6, 2022 14:25
@sumeerbhola sumeerbhola requested a review from a team as a code owner July 6, 2022 14:36
Member

@tbg tbg left a comment


:lgtm: but looks like there's a rebase, could you make sure the rebase happens separately from any additions? For example, do the rebase next, and make any changes prompted by my comments in a separate fixup commit (which we can squash before merge). Thanks!

Reviewed 1 of 9 files at r1, 6 of 9 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @irfansharif, and @sumeerbhola)


-- commits line 78 at r3:
This could use a bit more context, like

Release note (ops change): I/O admission control now reduces the likelihood of storage layer write stalls which would result from memtable flushes becoming a bottleneck by suitably restricting admission tokens. The new cluster setting admission.min_flush_util_fraction disables or tunes this mechanism; it is enabled by default and can be disabled by setting to a large value such as 10. It should not be changed without recommendation by a domain expert.


pkg/storage/engine.go line 922 at r2 (raw file):

	// calls to this method. Hence, this should be used with care, with only one
	// caller, which is currently the admission control subsystem.
	GetInternalIntervalMetrics() *pebble.InternalIntervalMetrics

This seems a little odd, why don't we return a cumulative metric here? The timeseries would then look at the non-neg rate of this and this seems to sidestep the frailty of this method.


pkg/util/admission/granter.go line 65 at r1 (raw file):

Previously, sumeerbhola wrote…

Done

Sorry to nitpick again, but should the explanations be in the setting help and not in a comment? After all, they will be harder to reach if they are only in the source.

@sumeerbhola sumeerbhola force-pushed the flush_tokens branch 2 times, most recently from 5db234a to e302759 on July 13, 2022 14:35
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, @sumeerbhola, and @tbg)


-- commits line 78 at r3:

Previously, tbg (Tobias Grieger) wrote…

This could use a bit more context, like

Release note (ops change): I/O admission control now reduces the likelihood of storage layer write stalls which would result from memtable flushes becoming a bottleneck by suitably restricting admission tokens. The new cluster setting admission.min_flush_util_fraction disables or tunes this mechanism; it is enabled by default and can be disabled by setting to a large value such as 10. It should not be changed without recommendation by a domain expert.

Done

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, and @tbg)


pkg/storage/engine.go line 922 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This seems a little odd, why don't we return a cumulative metric here? The timeseries would then look at the non-neg rate of this and this seems to sidestep the frailty of this method.

It's the histogram (which we don't currently use) -- see the following comment from pebble/metrics.go.

// InternalIntervalMetrics exposes metrics about internal subsystems, that can
// be useful for deep observability purposes, and for higher-level admission
// control systems that are trying to estimate the capacity of the DB. These
// are experimental and subject to change, since they expose internal
// implementation details, so do not rely on these without discussion with the
// Pebble team.
// These represent the metrics over the interval of time from the last call to
// retrieve these metrics. These are not cumulative, unlike Metrics. The main
// challenge in making these cumulative is the hdrhistogram.Histogram, which
// does not have the ability to subtract a histogram from a preceding metric
// retrieval.

pkg/util/admission/granter.go line 65 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sorry to nitpick again, but should the explanations be in the setting help and not in a comment? After all, they will be harder to reach if they are only in the source.

Done (duplicated more of the code comment into the setting help).

// typically the more persistent bottleneck than flushes for the following
// reason:
// There is a dedicated flush thread. In comparison, with a write amp of W
// and maximum compaction concurrency of C, only approximately C/W threads
Contributor


For my own learning, how did you arrive at the C/W figure for the number of L0 readers?

}
flushTokensFloat := flushUtilTargetFraction * smoothedNumFlushTokens
if flushTokensFloat < float64(math.MaxInt64) {
numFlushTokens = int64(flushTokensFloat)
Contributor


Similar to the number of available CPU slots, should we export the number of flush tokens and compaction byte tokens as metrics?

// 0.16 to 0.9 in a single interval, when compaction tokens are not limited,
// hence we have set flushUtilIgnoreThreshold to a very low value. If we've
// erred towards it being too low, we run the risk of computing incorrect
// tokens. If we've erred towards being too high, we run the risk of giving
Contributor


For my own understanding: If we're too low, we're risking underutilization?

tick: 0, setAvailableIOTokens: unlimited

# Flush loop utilization is high enough.
set-state l0-bytes=10000 l0-added=11000 l0-files=1 l0-sublevels=1 flush-bytes=1000 flush-work-sec=2 flush-idle-sec=10 print-only-first-tick=true
Contributor


[nit] These tests would be easier to read if the set-state directive kept all the previous values and overwrote only those stated (set-state write-stall-count=1 in this example)

Contributor


With these test cases below, what most specifically is of interest? The fact that multiplier observes steady decrements with each write stall? If there was a way to only emit the delta from the last printout that would make these tests easier to follow.

Member

@tbg tbg left a comment


Reviewed 10 of 10 files at r4, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @sumeerbhola)


pkg/storage/engine.go line 922 at r2 (raw file):

Previously, sumeerbhola wrote…

It's the histogram (which we don't currently use) -- see the following comment from pebble/metrics.go.

// InternalIntervalMetrics exposes metrics about internal subsystems, that can
// be useful for deep observability purposes, and for higher-level admission
// control systems that are trying to estimate the capacity of the DB. These
// are experimental and subject to change, since they expose internal
// implementation details, so do not rely on these without discussion with the
// Pebble team.
// These represent the metrics over the interval of time from the last call to
// retrieve these metrics. These are not cumulative, unlike Metrics. The main
// challenge in making these cumulative is the hdrhistogram.Histogram, which
// does not have the ability to subtract a histogram from a preceding metric
// retrieval.

I don't understand, isn't the hdrhistogram returned in *InternalIntervalMetrics cumulative, then? Looking at the impl of GetInternalIntervalMetrics() it looks like we're just attaching a long-lived hdrhistogram, so no resetting is taking place.
However, we are resetting flushWriteThroughput but this seems unnecessary, since the caller knows how to subtract these from each other (it's just a collection of cumulative fields). So I don't see what is blocking this method from just being "canonical" in returning cumulatives entirely.

Besides, one can take deltas of hdrhistograms:

	var g, h *hdrhistogram.Histogram // assumed to use same parameters
	snapg, snaph := g.Export(), h.Export()
	for i := range snapg.Counts {
		snaph.Counts[i] -= snapg.Counts[i]
	}
	delta := hdrhistogram.Import(snaph) // delta == h-g
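The snapshot-subtraction idea in that sketch generalizes; here is a self-contained, hypothetical stand-in (plain count vectors rather than the real hdrhistogram types) showing how two cumulative snapshots with the same bucket layout yield an interval histogram:

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for an exported histogram: a vector of
// per-bucket counts. It mimics hdrhistogram's Snapshot.Counts for illustration.
type snapshot struct {
	counts []int64
}

// delta returns h - g bucket-wise, i.e. the histogram of events that occurred
// between the two cumulative snapshots. Both must share the bucket layout.
func delta(g, h snapshot) snapshot {
	out := snapshot{counts: make([]int64, len(h.counts))}
	for i := range h.counts {
		out.counts[i] = h.counts[i] - g.counts[i]
	}
	return out
}

func main() {
	g := snapshot{counts: []int64{1, 2, 3}} // earlier cumulative snapshot
	h := snapshot{counts: []int64{4, 4, 9}} // later cumulative snapshot
	fmt.Println(delta(g, h).counts)
}
```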

Member

@tbg tbg left a comment


I'm okay with fixing up the PebbleIntervalMetrics separately.

In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables, faster than the system can flush them results in
memtable-induced write stalls, which create a latency hiccup for all
write traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate at which the flush loop can operate.
  This relies on the recently added pebble.InternalIntervalMetrics.
- The peak rate causes 100% utilization of the single flush thread,
  and that is potentially too high to prevent write stalls (depending
  on how long it takes to do a single flush). So we multiply the
  smoothed peak rate by a utilization-target-fraction which is
  dynamically adjusted and by default is constrained to the interval
  [0.5, 1.5]. There is additive increase and decrease of this
  fraction:
  - High usage of tokens and no write stalls cause an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used
    if there are multiple write stalls, so that the probing falls
    more in the region where there are no write stalls.

Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.

For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.

There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.

The scheme can be disabled by increasing the cluster setting
admission.min_flush_util_fraction, which defaults to 0.5 (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
10.

The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data and the compaction
  backlog was not very heavy, so there was less queueing for the
  limited disk bandwidth (though it was still usually saturated).
- later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast,
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.

In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer
to the admission control layer (where we have the capability to
reorder work).

Fixes cockroachdb#77357

Release note (ops change): I/O admission control now reduces the
likelihood of storage layer write stalls, which can be caused by
memtable flushes becoming a bottleneck. This is done by limiting
write tokens based on flush throughput. Consequently, write tokens
are now limited both by flush throughput and by compaction throughput
out of L0. This behavior is enabled by default. The new cluster
setting admission.min_flush_util_fraction, defaulting to 0.5, can be
used to disable or tune flush-throughput-based admission tokens.
Setting it to a value much greater than 1, say 10, will disable
flush-based tokens. Tuning the behavior, without disabling it, should
be done only on the recommendation of a domain expert.
@sumeerbhola (Collaborator, Author) left a comment:
I'm okay with fixing up the PebbleIntervalMetrics separately.
Ack

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, and @tbg)


pkg/storage/engine.go line 922 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I don't understand, isn't the hdrhistogram returned in *InternalIntervalMetrics cumulative, then? Looking at the impl of GetInternalIntervalMetrics() it looks like we're just attaching a long-lived hdrhistogram, so no resetting is taking place.
However, we are resetting flushWriteThroughput, but this seems unnecessary, since the caller knows how to subtract these from each other (it's just a collection of cumulative fields). So I don't see anything blocking this method from just being "canonical" in returning cumulatives entirely.

Besides, one can take deltas of hdrhistograms:

	var g, h *hdrhistogram.Histogram // assumed to use same parameters
	snapg, snaph := g.Export(), h.Export()
	for i := range snapg.Counts {
		snaph.Counts[i] -= snapg.Counts[i]
	}
	delta := hdrhistogram.Import(snaph) // delta == h-g

No, the hdrhistogram is not cumulative, since once we grab d.mu.log.metrics we set it to nil:

		m.LogWriter.SyncLatencyMicros = d.mu.log.metrics.SyncLatencyMicros
		d.mu.log.metrics = nil

Thanks for the pointer regarding deltas -- I should have looked more closely at the implementation (the sub-bucket stuff threw me off and I thought it was doing data dependent buckets, and not deterministic buckets based on the initial parameters).


pkg/util/admission/granter.go line 2285 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

For my own learning, how did you arrive at the C/W figure for the number of L0 readers?

I've modified this comment
// There is a dedicated flush thread. With a maximum compaction concurrency
// of C, we have up to C threads dedicated to handling the write-amp of W
// (caused by rewriting the same data). So C/(W-1) threads on average are
// reading the original data (that will be rewritten W-1 times). Since L0
// can have multiple overlapping files, and intra-L0 compactions are usually
// avoided, we can assume (at best) that the original data (in L0) is being
// read only when compacting to levels lower than L0. That is, C/(W-1)
// threads are reading from L0 to compact to levels lower than L0. Since W
// can be 20+ and C defaults to 3 (we plan to dynamically adjust C but one
// can expect C to be <= 10), C/(W-1) < 1. So the main reason we are
// considering flush tokens is transient flush bottlenecks, and workloads
// where W is small.
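As a quick check of the arithmetic in this comment, with the default C=3 and an assumed write-amp of W=21:

```go
package main

import "fmt"

func main() {
	// From the comment above: C compaction threads handle write-amp W, so on
	// average C/(W-1) threads are reading the original data from L0.
	c, w := 3.0, 21.0 // C defaults to 3; W can be 20+
	fmt.Printf("%.2f\n", c/(w-1)) // well below 1 full thread
}
```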


pkg/util/admission/granter.go line 2364 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Similar to the number of available CPU slots, should we export the number of flush tokens and compaction byte tokens as metrics?

If someone finds it useful. To understand what is really happening with these tokens, I needed to look at multiple numbers, hence the log statement printed every 15s.


pkg/util/admission/granter.go line 2375 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

For my own understanding: If we're too low, we're risking underutilization?

This is talking about flushUtilIgnoreThreshold. If that is too low, we may have cases of intFlushUtilization > flushUtilIgnoreThreshold that result in poor-quality token computation. Whether that poor quality errs towards fewer tokens than appropriate or more tokens than appropriate, I don't know.


pkg/util/admission/testdata/io_load_listener line 320 at r6 (raw file):
It's sometimes hard to understand a datadriven test if each directive is not self contained -- one has to look at the whole test history. Currently one has to look at the previous directive, since many of these values are cumulative. One could make a case to take the delta of these cumulative values in the directive, to make it more self contained. I'm going to leave such debatable changes for a future PR.

With these test cases below, what most specifically is of interest? The fact that multiplier observes steady decrements with each write stall? If there was a way to only emit the delta from the last printout that would make these tests easier to follow.

The lack of comments was an oversight. I've added them now.

@sumeerbhola (Collaborator, Author) left a comment:
TFTRs!

I've squashed the commits -- I plan to merge when CI is green.


@sumeerbhola (Collaborator, Author) commented:
bors r=tbg,irfansharif

@craig (craig bot) commented on Jul 14, 2022:

Build failed (retrying...):

@craig (craig bot) commented on Jul 14, 2022:

Build failed:

@sumeerbhola (Collaborator, Author) commented:

TestWorkload flaked -- #78478

@sumeerbhola (Collaborator, Author) commented:

bors retry

@craig (craig bot) commented on Jul 14, 2022:

Build failed (retrying...):

@craig (craig bot) commented on Jul 14, 2022:

Build succeeded:

Successfully merging this pull request may close these issues:

- admission control: consider write stalls due to memtable count