admission,storage: introduce flush tokens to constrain write admission #82440
craig[bot] merged 1 commit into cockroachdb:master
Conversation
tbg
left a comment
This looks pretty good. The IO granter code is getting quite large, though I'm hesitant to suggest breaking it down in this PR.
Could you reword the release note a bit? It's a little choppy. Also, make sure to point out that the setting comes with a default of 0.5 (I guess 50 since we're not doing fractions) and so this protection is on by default.
pkg/util/admission/granter.go
Outdated
// MinFlushUtilizationPercentage is a lower-bound on the dynamically adjusted
// flush utilization target that attempts to reduce write stalls. Set it to a
// high percentage, like 1000, to effectively disable flush based tokens.
is using a "0-100" percentage consistent with other knobs or should this be a fractional number?
pkg/util/admission/granter.go
Outdated
var MinFlushUtilizationPercentage = settings.RegisterIntSetting(
	settings.SystemOnly,
	"admission.min_flush_util_percent",
	"when computing flush tokens, this percentage is used as a lower bound on the dynamic"+
Can you give an example here as well? I think it's fairly cryptic as is. Nobody will know what to do with this knob; would setting it to 1 be too much? What are acceptable low/high values? In which direction should the knob be moved if write stalls occur (down)?
Maybe
A lower bound on the multiplier which is applied to the observed peak memtable flush rate which factors into IO admission control to reduce the number of write stalls. A large value (such as 1000) effectively disables admission control based on flush rate.
smoothedIntIngestedAccountedL0BytesFraction float64 // '1.0' means: all ingested bytes went to L0
smoothedTotalNumByteTokens float64 // smoothed history of token bucket sizes
// Smoothed history of byte tokens calculated based on compactions out of L0.
smoothedCompactionByteTokens float64
note to self: this is renamed from smoothedTotalNumByteTokens.
pkg/util/admission/granter.go
Outdated
// Smoothed history of byte tokens calculated based on compactions out of L0.
smoothedCompactionByteTokens float64

// Number of flush tokens, before multiplying by target fraction.
// Smoothed history of flush tokens calculated based on memtable flushes, before multiplying by target fraction.
pkg/util/admission/granter.go
Outdated
smoothedNumFlushTokens = alpha*intFlushTokens + (1-alpha)*prev.smoothedNumFlushTokens
}
const flushUtilTargetFractionIncrement = 0.025
// Have we used >= 90% of the tokens.
// Have we used, over the last (15s) cycle, more than 90% of the tokens we would give out for the next cycle?
// Else avoid overflow by using the previously set unlimitedTokens. This
// should not really happen.
}
// Else intFlushUtilization is too low. We don't want to make token
// Else intFlushUtilization is too low. We don't want to make token determination based on a very low utilization, so we hand out unlimited tokens. Note that ...
pkg/util/admission/granter.go
Outdated
res.flushUtilTargetFraction, ib(int64(res.smoothedNumFlushTokens*res.flushUtilTargetFraction)))
p.Printf("admitting ")
if n := res.ioLoadListenerState.totalNumByteTokens; n < unlimitedTokens {
	p.Printf("%s", ib(n))
while we're here, can we make this a rate? So something like "%s/s", ib(float64(n)/float64(intervalLength*time.Second))?
pkg/util/admission/granter.go
Outdated
p.Printf("compacted %s [≈%s]; ", ib(res.aux.intL0CompactedBytes), ib(res.smoothedIntL0CompactedBytes))

p.Printf("compacted %s [≈%s], ", ib(res.aux.intL0CompactedBytes), ib(res.smoothedIntL0CompactedBytes))
p.Printf("flush tokens %s with multiplier %.3f [≈%s]; ", ib(int64(res.aux.intFlushTokens)),
This isn't the right place for the multiplier, no? I would expect something here like
p.Printf("flushed %s [≈%s]; ", ib(int64(res.aux.intFlushTokens)), ib(res.smoothedIntFlushTokens))
so basically just printing the flush stats.
Then, in the "admitting" section, it should say whether the token count was derived from L0 or from flushes:
admitting 4.02mb/s due to L0 growth with L0 penalty: X, Y, flush multiplier %.3f
admitting 4.02mb/s due to memtable flush with L0 penalty: X, Y, flush multiplier %.3f
No regression in
585b9b4 to 46675ac (compare)
46675ac to 78f2dc2 (compare)
sumeerbhola
left a comment
TFTR!
Reworded the release note.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)
pkg/kv/kvserver/metrics.go line 492 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Consider adding an alerting rule for this metric. I assume SRE (@joshimhoff) would like to alert on write stalls eventually.
cockroach/pkg/kv/kvserver/metric_rules.go
Lines 57 to 79 in af221d4
Are these for alerts based on Prometheus metrics? I am not sure what threshold is reasonable for SRE. I don't think there should be any aggregation here. I am inclined towards waiting for some incidents where this metric turned out to be useful as an early signal of badness.
@joshimhoff do you have an opinion?
pkg/util/admission/granter.go line 61 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
is using a "0-100" percentage consistent with other knobs or should this be a fractional number?
Changed to a fraction.
pkg/util/admission/granter.go line 65 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you give an example here as well? I think it's fairly cryptic as is. Nobody will know what to do with this knob; would setting it to 1 be too much? What are acceptable low/high values? In which direction should the knob be moved if write stalls occur (down)?
Maybe
A lower bound on the multiplier which is applied to the observed peak memtable flush rate which factors into IO admission control to reduce the number of write stalls. A large value (such as 1000) effectively disables admission control based on flush rate.
Done
pkg/util/admission/granter.go line 1635 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
note to self: this is renamed from smoothedTotalNumByteTokens.
Ack
pkg/util/admission/granter.go line 1637 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// Smoothed history of flush tokens calculated based on memtable flushes, before multiplying by target fraction.
Done
pkg/util/admission/granter.go line 2073 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// Have we used, over the last (15s) cycle, more than 90% of the tokens we would give out for the next cycle?
Done
pkg/util/admission/granter.go line 2111 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// Else intFlushUtilization is too low. We don't want to make token determination based on a very low utilization, so we hand out unlimited tokens. Note that ...
Done
pkg/util/admission/granter.go line 2233 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't the right place for the multiplier, no? I would expect something here like
p.Printf("flushed %s [≈%s]; ", ib(int64(res.aux.intFlushTokens)), ib(res.smoothedIntFlushTokens))so basically just printing the flush stats.
Then, in the "admitting" section, it should say whether the token count was derived from L0 or from flushes:
admitting 4.02mb/s due to L0 growth with L0 penalty: X, Y, flush multiplier %.3f
admitting 4.02mb/s due to memtable flush with L0 penalty: X, Y, flush multiplier %.3f
Done
pkg/util/admission/granter.go line 2237 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
while we're here, can we make this a rate? So something like
"%s/s", ib(float64(n)/float64(intervalLength*time.Second))?
Done
I agree. If we are talking about write stalls induced by resource overload, I expect the kvprober shadow write latency histogram will actually capture this fairly well. Right? I base this on:
If we are talking about ~indefinite write stalls induced by disk availability issues on the cloud provider side, I expect that something like #76081 will catch impact if it is significant enough to warrant paging SRE, e.g. if a range becomes unavailable as it did with a ~recent cloud customer. Also we already have controls in place to auto-restart nodes in case of such an incident (and we page SRE on enough restarts). As always, what SRE should alert on != what the customer should alert on. Any signs of overload at all is a good thing for the customer to alert on.
abarganier
left a comment
From an o11y standpoint, defining an alerting rule was the main point I wanted to bring up. Thanks @tbg for being an advocate!
I am inclined towards waiting for some incidents where this metric turned out to be useful as an early signal of badness.
SGTM from an SRE-specific alert perspective.
what SRE should alert on != what the customer should alert on. Any signs of overload at all is a good thing for the customer to alert on.
I also agree with this - are we sure there's not a conservative alerting rule we can define here to indicate when things are going south? Or is storage.write-stall-nanos dependent enough on cluster configuration & workload where there's no clear "one size fits all"?
If there's some threshold that, in all cases, would indicate that investigation is required, I think an alerting rule would be a great way for us to provide guidance to customers on how to practically make use of this metric. I realize this isn't always the case, so if it doesn't make sense to do so, that's fine too.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)
@sumeerbhola please re-request review when you're back and we can push this over the finish line quickly.
CI failure is the same flake as #79109
sumeerbhola
left a comment
If we are talking about write stalls induced by resource overload, I expect the kvprober shadow write latency histogram will actually capture this fairly well. Right?
Yes, it should capture it.
I also agree with this - are we sure there's not a conservative alerting rule we can define here to indicate when things are going south? Or is storage.write-stall-nanos dependent enough on cluster configuration & workload where there's no clear "one size fits all"?
It's very possible that the write stall rate (seconds/second) that is tolerable is dependent on the workload. And that rate doesn't even give a good sense of the latency impact, which a prober or the customer measuring their own latency would do. Based on this discussion thread I am more inclined now towards treating this new metric as something that is useful primarily for diagnosis, and not for alerting.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @joshimhoff, and @tbg)
tbg
left a comment
But it looks like there's a rebase; could you make sure the rebase happens separately from any additions? For example, do the rebase next, and make any changes prompted by my comments in a separate fixup commit (which we can squash before merge). Thanks!
Reviewed 1 of 9 files at r1, 6 of 9 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @irfansharif, and @sumeerbhola)
-- commits line 78 at r3:
This could use a bit more context, like
Release note (ops change): I/O admission control now reduces the likelihood of storage layer write stalls which would result from memtable flushes becoming a bottleneck by suitably restricting admission tokens. The new cluster setting admission.min_flush_util_fraction disables or tunes this mechanism; it is enabled by default and can be disabled by setting to a large value such as 10. It should not be changed without recommendation by a domain expert.
pkg/storage/engine.go line 922 at r2 (raw file):
// calls to this method. Hence, this should be used with care, with only one
// caller, which is currently the admission control subsystem.
GetInternalIntervalMetrics() *pebble.InternalIntervalMetrics
This seems a little odd, why don't we return a cumulative metric here? The timeseries would then look at the non-neg rate of this and this seems to sidestep the frailty of this method.
pkg/util/admission/granter.go line 65 at r1 (raw file):
Previously, sumeerbhola wrote…
Done
Sorry to nitpick again, but should the explanations be in the setting help and not in a comment? After all, they will be harder to reach if they are only in the source.
5db234a to e302759 (compare)
sumeerbhola
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, @sumeerbhola, and @tbg)
Previously, tbg (Tobias Grieger) wrote…
This could use a bit more context, like
Release note (ops change): I/O admission control now reduces the likelihood of storage layer write stalls which would result from memtable flushes becoming a bottleneck by suitably restricting admission tokens. The new cluster setting admission.min_flush_util_fraction disables or tunes this mechanism; it is enabled by default and can be disabled by setting to a large value such as 10. It should not be changed without recommendation by a domain expert.
Done
sumeerbhola
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, and @tbg)
pkg/storage/engine.go line 922 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This seems a little odd, why don't we return a cumulative metric here? The timeseries would then look at the non-neg rate of this and this seems to sidestep the frailty of this method.
It's the histogram (which we don't currently use) -- see the following comment from pebble/metrics.go.
// InternalIntervalMetrics exposes metrics about internal subsystems, that can
// be useful for deep observability purposes, and for higher-level admission
// control systems that are trying to estimate the capacity of the DB. These
// are experimental and subject to change, since they expose internal
// implementation details, so do not rely on these without discussion with the
// Pebble team.
// These represent the metrics over the interval of time from the last call to
// retrieve these metrics. These are not cumulative, unlike Metrics. The main
// challenge in making these cumulative is the hdrhistogram.Histogram, which
// does not have the ability to subtract a histogram from a preceding metric
// retrieval.
pkg/util/admission/granter.go line 65 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Sorry to nitpick again, but should the explanations be in the setting help and not in a comment? After all, they will be harder to reach if they are only in the source.
Done (duplicated more of the code comment into the setting help).
pkg/util/admission/granter.go
Outdated
// typically the more persistent bottleneck than flushes for the following
// reason:
// There is a dedicated flush thread. In comparison, with a write amp of W
// and maximum compaction concurrency of C, only approximately C/W threads
For my own learning, how did you arrive at the C/W figure for the number of L0 readers?
}
flushTokensFloat := flushUtilTargetFraction * smoothedNumFlushTokens
if flushTokensFloat < float64(math.MaxInt64) {
	numFlushTokens = int64(flushTokensFloat)
Similar to the number of available CPU slots, should we export the number of flush tokens and compaction byte tokens as metrics?
// 0.16 to 0.9 in a single interval, when compaction tokens are not limited,
// hence we have set flushUtilIgnoreThreshold to a very low value. If we've
// erred towards it being too low, we run the risk of computing incorrect
// tokens. If we've erred towards being too high, we run the risk of giving
For my own understanding: If we're too low, we're risking underutilization?
tick: 0, setAvailableIOTokens: unlimited

# Flush loop utilization is high enough.
set-state l0-bytes=10000 l0-added=11000 l0-files=1 l0-sublevels=1 flush-bytes=1000 flush-work-sec=2 flush-idle-sec=10 print-only-first-tick=true
[nit] These tests would be easier to read if the set-state directive kept all the previous values and overwrote only those stated (set-state write-stall-count=1 in this example)
With these test cases below, what most specifically is of interest? The fact that multiplier observes steady decrements with each write stall? If there was a way to only emit the delta from the last printout that would make these tests easier to follow.
tbg
left a comment
Reviewed 10 of 10 files at r4, 1 of 1 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @sumeerbhola)
pkg/storage/engine.go line 922 at r2 (raw file):
Previously, sumeerbhola wrote…
It's the histogram (which we don't currently use) -- see the following comment from pebble/metrics.go.
// InternalIntervalMetrics exposes metrics about internal subsystems, that can
// be useful for deep observability purposes, and for higher-level admission
// control systems that are trying to estimate the capacity of the DB. These
// are experimental and subject to change, since they expose internal
// implementation details, so do not rely on these without discussion with the
// Pebble team.
// These represent the metrics over the interval of time from the last call to
// retrieve these metrics. These are not cumulative, unlike Metrics. The main
// challenge in making these cumulative is the hdrhistogram.Histogram, which
// does not have the ability to subtract a histogram from a preceding metric
// retrieval.
I don't understand, isn't the hdrhistogram returned in *InternalIntervalMetrics cumulative, then? Looking at the impl of GetInternalIntervalMetrics() it looks like we're just attaching a long-lived hdrhistogram, so no resetting is taking place.
However, we are resetting flushWriteThroughput but this seems unnecessary, since the caller knows how to subtract these from each other (it's just a collection of cumulative fields). So I don't see what is blocking this method from just being "canonical" in returning cumulatives entirely.
Besides, one can take deltas of hdrhistograms:
var g, h *hdrhistogram.Histogram // assumed to use same parameters
snapg, snaph := g.Export(), h.Export()
for i := range snapg.Counts {
snaph.Counts[i] -= snapg.Counts[i]
}
delta := hdrhistogram.Import(snaph) // delta == h-g
tbg
left a comment
I'm okay with fixing up the PebbleIntervalMetrics separately.
9951fa8 to 53e8c73 (compare)
In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables faster than the system can flush them, results in
write stalls due to memtables, which create a latency hiccup for all
write traffic. We have observed write stalls that lasted > 100ms.
The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate that the flush loop can operate on.
This relies on the recently added pebble.InternalIntervalMetrics.
- The peak rate causes 100% utilization of the single flush thread,
and that is potentially too high to prevent write stalls (depending
on how long it takes to do a single flush). So we multiply the
smoothed peak rate by a utilization-target-fraction which is
dynamically adjusted and by default is constrained to the interval
[0.5, 1.5]. There is additive increase and decrease of this
fraction:
- High usage of tokens and no write stalls cause an additive increase.
- Write stalls cause an additive decrease. A small multiplier is used
if there are multiple write stalls, so that the probing falls
more in the region where there are no write stalls.
Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.
For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.
There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.
The scheme can be disabled by increasing the cluster setting
admission.min_flush_util_fraction, which defaults to 0.5 (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
10.
The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data, when the compaction
backlog was not very heavy, so there was less queueing for the
limited disk bandwidth (it was still usually saturated).
- later-life: when the store had around 150GB of data.
In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.
In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer
to the admission control layer (where we have the capability to reorder
work).
Fixes cockroachdb#77357
Release note (ops change): I/O admission control now reduces the
likelihood of storage layer write stalls, which can be caused by
memtable flushes becoming a bottleneck. This is done by limiting
write tokens based on flush throughput, so as to reduce storage
layer write stalls. Consequently, write tokens are now limited
both by flush throughput, and by compaction throughput out of L0.
This behavior is enabled by default. The new cluster setting
admission.min_flush_util_fraction, defaulting to 0.5, can be used to
disable or tune flush throughput based admission tokens. Setting
it to a value much greater than 1, say 10, will disable flush
based tokens. Tuning the behavior, without disabling it, should
be done only on the recommendation of a domain expert.
53e8c73 to 6cad2d5 (compare)
sumeerbhola
left a comment
I'm okay with fixing up the PebbleIntervalMetrics separately.
Ack
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, and @tbg)
pkg/storage/engine.go line 922 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't understand, isn't the hdrhistogram returned in *InternalIntervalMetrics cumulative, then? Looking at the impl of GetInternalIntervalMetrics() it looks like we're just attaching a long-lived hdrhistogram, so no resetting is taking place.
However, we are resetting flushWriteThroughput but this seems unnecessary, since the caller knows how to subtract these from each other (it's just a collection of cumulative fields). So I don't see what is blocking this method from just being "canonical" in returning cumulatives entirely. Besides, one can take deltas of hdrhistograms:
var g, h *hdrhistogram.Histogram // assumed to use same parameters
snapg, snaph := g.Export(), h.Export()
for i := range snapg.Counts {
	snaph.Counts[i] -= snapg.Counts[i]
}
delta := hdrhistogram.Import(snaph) // delta == h-g
No, the hdrhistogram is not cumulative, since once we grab d.mu.log.metrics we set it to nil:
m.LogWriter.SyncLatencyMicros = d.mu.log.metrics.SyncLatencyMicros
d.mu.log.metrics = nil
Thanks for the pointer regarding deltas -- I should have looked more closely at the implementation (the sub-bucket stuff threw me off and I thought it was doing data dependent buckets, and not deterministic buckets based on the initial parameters).
pkg/util/admission/granter.go line 2285 at r6 (raw file):
Previously, irfansharif (irfan sharif) wrote…
For my own learning, how did you arrive at the C/W figure for the number of L0 readers?
I've modified this comment
// There is a dedicated flush thread. With a maximum compaction concurrency
// of C, we have up to C threads dedicated to handling the write-amp of W
// (caused by rewriting the same data). So C/(W-1) threads on average are
// reading the original data (that will be rewritten W-1 times). Since L0
// can have multiple overlapping files, and intra-L0 compactions are usually
// avoided, we can assume (at best) that the original data (in L0) is being
// read only when compacting to levels lower than L0. That is, C/(W-1)
// threads are reading from L0 to compact to levels lower than L0. Since W
// can be 20+ and C defaults to 3 (we plan to dynamically adjust C but one
// can expect C to be <= 10), C/(W-1) < 1. So the main reason we are
// considering flush tokens is transient flush bottlenecks, and workloads
// where W is small.
pkg/util/admission/granter.go line 2364 at r6 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Similar to the number of available CPU slots, should we export the number of flush tokens and compaction byte tokens as metrics?
If someone finds it useful. To understand what is really happening with these tokens, I needed to look at multiple numbers, hence the log statement printed every 15s.
pkg/util/admission/granter.go line 2375 at r6 (raw file):
Previously, irfansharif (irfan sharif) wrote…
For my own understanding: If we're too low, we're risking underutilization?
This is talking about flushUtilIgnoreThreshold. If that is too low we may have cases of intFlushUtilization > flushUtilIgnoreThreshold that result in poor quality token computation. Whether that poor quality errs towards fewer tokens than appropriate or more tokens than appropriate, I don't know.
pkg/util/admission/testdata/io_load_listener line 320 at r6 (raw file):
It's sometimes hard to understand a datadriven test if each directive is not self contained -- one has to look at the whole test history. Currently one has to look at the previous directive, since many of these values are cumulative. One could make a case to take the delta of these cumulative values in the directive, to make it more self contained. I'm going to leave such debatable changes for a future PR.
With these test cases below, what most specifically is of interest? The fact that multiplier observes steady decrements with each write stall? If there was a way to only emit the delta from the last printout that would make these tests easier to follow.
The lack of comments was an oversight. I've added them now.
sumeerbhola
left a comment
TFTRs!
I've squashed the commits -- I plan to merge when CI is green.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @irfansharif, and @tbg)
bors r=tbg,irfansharif
Build failed (retrying...):
Build failed:
TestWorkload flaked -- #78478
bors retry
Build failed (retrying...):
Build succeeded:
In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables faster than the system can flush them, results in
write stalls due to memtables, which create a latency hiccup for all
write traffic. We have observed write stalls that lasted > 100ms.
The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate that the flush loop can operate on.
  This relies on the recently added pebble.InternalIntervalMetrics.
- The peak rate causes 100% utilization of the single flush thread,
  and that is potentially too high to prevent write stalls (depending
  on how long it takes to do a single flush). So we multiply the
  smoothed peak rate by a utilization-target-fraction which is
  dynamically adjusted and by default is constrained to the interval
  [0.5, 1.5]. There is additive increase and decrease of this
  fraction:
  - High usage of tokens and no write stalls cause an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used
    if there are multiple write stalls, so that the probing falls
    more in the region where there are no write stalls.
Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.
For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.
There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.
The scheme can be disabled by increasing the cluster setting
admission.min_flush_util_fraction, which defaults to 0.5 (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
10.
The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data, when the compaction
  backlog was not very heavy, so there was less queueing for the
  limited disk bandwidth (it was still usually saturated).
- later-life: when the store had around 150GB of data.
In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.
In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer
to the admission control layer (where we have the capability to reorder
work).
Screenshots for early-life: Flush tokens were turned off at 22:32:30. Prior to that the flush utilization-target-fraction was 0.625.


Screenshots for later-life: Flush tokens were turned off at 22:03:20. Prior to that the flush utilization-target-fraction was 0.875.


These results were produced by running
Fixes #77357
Release note (ops change): Write tokens are now also limited based on
flush throughput, so as to reduce storage layer write stalls. This
behavior is enabled by default. The cluster setting
admission.min_flush_util_fraction, defaulting to 0.5, can be used to
disable or tune flush throughput based admission tokens, for writes
to a store. Setting it to a value much greater than 1, say 10, will
disable flush based tokens.