
kvserver: check L0 sub-levels on allocation#78608

Merged
craig[bot] merged 1 commit into cockroachdb:master from kvoli:220323.readamp-allocate
Apr 11, 2022

Conversation

@kvoli
Contributor

@kvoli kvoli commented Mar 28, 2022

Previously, the only store health signal used as a hard allocation and
rebalancing constraint was disk capacity. This patch introduces L0
sub-levels as an additional constraint, to avoid allocating and
rebalancing replicas to stores that are unhealthy, as indicated by a
high number of L0 sub-levels.

A store's sub-level count must exceed both (1) the threshold and (2) a
cluster mean watermark (10% above the mean) in order to be considered
unhealthy. The mean check ensures that a cluster full of moderately
high read amplification stores can still make progress, while
still ensuring that positively skewed distributions exclude the positive
tail.
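The dual-gate check described above can be sketched as follows. This is an illustrative sketch, not the actual CockroachDB code: the names `storeUnhealthy`, `l0SublevelThreshold`, and `meanWatermarkFactor` are hypothetical, chosen to mirror the settings discussed in this PR.

```go
package main

import "fmt"

const (
	l0SublevelThreshold = 20   // mirrors the default of kv.allocator.l0_sublevels_threshold
	meanWatermarkFactor = 1.10 // watermark is 10% above the cluster mean
)

// storeUnhealthy reports whether a store should be treated as unhealthy:
// its L0 sub-level count must exceed BOTH the fixed threshold and the
// cluster mean watermark. The second gate prevents a cluster where every
// store has moderately high read amplification from excluding everything.
func storeUnhealthy(sublevels int, clusterMean float64) bool {
	return float64(sublevels) > float64(l0SublevelThreshold) &&
		float64(sublevels) > clusterMean*meanWatermarkFactor
}

func main() {
	// A store at 25 sub-levels in a cluster averaging 5 is in the
	// positive tail of the distribution and gets excluded.
	fmt.Println(storeUnhealthy(25, 5.0)) // true
	// If the whole cluster averages 24, no store is singled out,
	// so allocation and rebalancing can still make progress.
	fmt.Println(storeUnhealthy(25, 24.0)) // false
}
```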

A simulation of the effect on candidate exclusion under different L0
sub-level distributions, using the mean as an additional check vs.
percentiles, can be found here:
https://gist.github.com/kvoli/be27efd4662e89e8918430a9c7117858

The L0 sub-level value for each store descriptor is smoothed to be the
maximum over a 5-minute window:

[figure: L0 sub-level count for a store, smoothed as the maximum over a trailing 5-minute window]

The threshold corresponds to the cluster setting
kv.allocator.l0_sublevels_threshold: the number of L0 sub-levels above
which a candidate store may be excluded as a target for rebalancing, or
for both rebalancing and allocation of replicas.

The enforcement of this threshold can be applied at 4 different
levels of strictness, configured by the cluster setting
kv.allocator.l0_sublevels_threshold_enforce.

The 4 levels are:

block_none: L0 sub-levels are ignored entirely.
block_none_log: L0 sub-levels are logged if the threshold is exceeded.

Both levels below also log, as above.

block_rebalance_to: L0 sub-levels are considered when excluding stores
as rebalance targets.
block_all: L0 sub-levels are considered when excluding stores as both
rebalance targets and allocation targets.

By default, kv.allocator.l0_sublevels_threshold is 20, which
corresponds to admission control's threshold, above which it begins
limiting admission of work to a store based on store health. The default
enforcement level of kv.allocator.l0_sublevels_threshold_enforce is
block_none_log.
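The four enforcement levels above gate two distinct decisions: whether a store may be a rebalance target and whether it may be an allocation target. A minimal sketch of that gating, using hypothetical names that mirror the setting values rather than the actual CockroachDB identifiers:

```go
package main

import "fmt"

type enforcement int

const (
	blockNone        enforcement = iota // ignore L0 sub-levels entirely
	blockNoneLog                        // log only when the threshold is exceeded
	blockRebalanceTo                    // exclude unhealthy stores as rebalance targets
	blockAll                            // exclude as rebalance AND allocation targets
)

// excluded reports whether an unhealthy store should be excluded for the
// given action under the configured enforcement level. Note that no level
// causes an unhealthy store to shed replicas it already holds.
func excluded(level enforcement, isRebalance bool) bool {
	switch level {
	case blockRebalanceTo:
		return isRebalance
	case blockAll:
		return true
	default: // blockNone, blockNoneLog: never exclude, at most log
		return false
	}
}

func main() {
	fmt.Println(excluded(blockNoneLog, true))      // false: log only
	fmt.Println(excluded(blockRebalanceTo, true))  // true: no rebalancing to the store
	fmt.Println(excluded(blockRebalanceTo, false)) // false: allocation still allowed
	fmt.Println(excluded(blockAll, false))         // true: allocation also blocked
}
```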

resolves #73714

Release justification: low risk, high benefit during high read
amplification scenarios, where an operator may want to prevent
rebalancing to stores with high read amplification, to avoid adding
fuel to the fire.

Release note (ops change): introduce cluster settings
kv.allocator.l0_sublevels_threshold and
kv.allocator.l0_sublevels_threshold_enforce, which enable excluding
stores as targets for allocation and rebalancing of replicas when they
have high read amplification, indicated by the number of sub-levels
in level 0 of the store's LSM. When both
kv.allocator.l0_sublevels_threshold and the cluster average are
exceeded, the action corresponding to
kv.allocator.l0_sublevels_threshold_enforce is taken. block_none
will exclude no candidate stores, block_none_log will exclude no
candidates but log an event, block_rebalance_to will exclude
candidate stores from being targets of rebalance actions, and block_all
will exclude candidate stores from being targets of both allocation and
rebalancing. By default, kv.allocator.l0_sublevels_threshold is set to
20 and kv.allocator.l0_sublevels_threshold_enforce is set to
block_none_log.

@kvoli kvoli added the do-not-merge label (bors won't merge a PR with this label) Mar 28, 2022
@kvoli kvoli self-assigned this Mar 28, 2022
@cockroach-teamcity
Member

This change is Reviewable

@kvoli kvoli force-pushed the 220323.readamp-allocate branch 9 times, most recently from 027c043 to c9cd413, on March 28, 2022 21:44
@kvoli kvoli requested a review from aayushshah15 March 28, 2022 21:45
@kvoli kvoli force-pushed the 220323.readamp-allocate branch 12 times, most recently from 00c06d4 to ec1084d, on March 29, 2022 21:28
@kvoli kvoli marked this pull request as ready for review March 29, 2022 22:09
@kvoli kvoli requested a review from a team as a code owner March 29, 2022 22:09
@kvoli kvoli removed the do-not-merge label (bors won't merge a PR with this label) Mar 29, 2022
Member

@tbg tbg left a comment


Very partial review (just came here to see the high level picture).
I think there's an opportunity - somewhere where folks will find it so maybe on the cluster setting comment - to link to this issue because not everyone may be mentally connecting "L0Sublevels" and "Read-Amp".

I also wonder if in light of recent events we also need to consider widening the heuristic from "read amp" to generally high L0 file counts (if you have 500k files in L0, even with a read-amp of 5, things can be pretty bad). But I assume you have considered that, would be nice to read about the rationale somewhere.

settings.SystemOnly,
"kv.allocator.l0_sublevels_threshold_enforce",
"the level of enforcement when a candidate disk has read amplification exceeding"+
"`kv.allocator.l0_sublevels_threshold`; disabled will take no log will take no action,"+
Member


something is garbled here.

Contributor Author


Updated to be clearer.

// excluded when a candidate disk is considered unhealthy.
type storeHealthEnforcement int64

const (
Member


Generally the verbiage here and on the const declarations is a bit ambiguous, what are "actions to be included or excluded"? Maybe phrase it in terms of how many safeguards are applied. Off = apply no safeguards, log = apply no safeguards but log when safeguards might make sense, rebalanceOnly = apply safeguards when picking rebalance target, allocate = apply during rebalancing and .. I'm not exactly sure what the other cases are, if a store goes unhealthy, would we be proactively cleaning replicas off of it? I don't think so, but am not sure what you mean then. Is this referring to pure up/downreplication? Could be clearer.

Contributor Author


Generally the verbiage here and on the const declarations is a bit ambiguous, what are "actions to be included or excluded"? Maybe phrase it in terms of how many safeguards are applied. Off = apply no safeguards, log = apply no safeguards but log when safeguards might make sense, rebalanceOnly = apply safeguards when picking rebalance target, allocate = apply during rebalancing

You're right. I've updated this to be clearer about the action taken at each level of enforcement. Now the distinction should be clear between excluding candidates as rebalance targets vs. excluding candidates as targets entirely (for both allocation and rebalancing).

I'm not exactly sure what the other cases are, if a store goes unhealthy, would we be proactively cleaning replicas off of it? I don't think so,

You're right, this patch explicitly excludes removing replicas based on read amplification of the store. I've updated with a comment where this is done to make this explicit.


// maxL0SublevelThreshold: if the number of l0 sublevels of a store descriptor
// is greater than this value, it will never be used as a rebalance target.
maxL0SublevelThreshold = 20
Member


I think it says 10 on the PR release note, just a heads up to make sure the numbers match at the end.

Contributor Author


I believe this is 20 in the release note; I've updated this to be clearer and reworked the language.

@kvoli kvoli force-pushed the 220323.readamp-allocate branch from ec1084d to e83ad40 on March 30, 2022 14:33
Contributor

@aayushshah15 aayushshah15 left a comment


:lgtm_strong:

Though I'd urge us to reconsider backporting this to any of the existing release branches (I see that the PR has a backport-21.1.X label, is that intentional?). Even for backporting to 22.1, this is quite a risky patch for this point in the release and our CI + nightly suite doesn't do much to build confidence towards these kinds of changes (for instance, the issue with the interaction between high read amp stores and the computation of equivalence classes wouldn't have been caught until things went south on a production cluster).

On a general level, I'm concerned that we haven't precisely defined what our expectation is from the rebalancing subsystem when we do have one of these incidents with high read-amp. If we enable the final storeHealthBlockAll level added in this PR, the system will indiscriminately move replicas off of a high read-amp node until the read-amp goes down. Depending on the cause behind the high read-amp, that would mean that essentially a vast majority of a nodes replicas will be shed off. Then, if the read-amp does drop, the system will start moving a bunch of replicas back to this now-healthy looking node. The core issue here seems to be that there is a lot of hysteresis between the actions of the rebalancing subsystem and its eventual effect on read-amp, and thus, it's unclear to me if the KV allocator (as it currently exists) is the way to solve this problem. @sumeerbhola, do you have any thoughts on these concerns?

Lately, every single issue/escalation that's come up has had different root-causes (a lot of which were straight-up bugs in pebble/admission-control/KV), and it almost seems like the system opaquely moving replicas back and forth would've made things worse in those instances.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @kvoli, @lidorcarmel, @nvanbenschoten, @shralex, @sumeerbhola, and @tbg)


pkg/kv/kvserver/allocator_scorer.go, line 1369 at r18 (raw file):

			// this stage, in additon to hard checks and validation.
			// TODO(kvoli,ayushshah15): Refactor this to make it harder to
			// inadvertantly break the invariant above,

inadvertently*

@kvoli kvoli force-pushed the 220323.readamp-allocate branch from 3acf9c9 to a1f4ca6 on April 8, 2022 18:46
Contributor Author

@kvoli kvoli left a comment


I think being off by default (log only) and having a mean check is enough insurance to avoid any poor outcomes in backporting this to 22.1.

If we enable the final storeHealthBlockAll level added in this PR, the system will indiscriminately move replicas off of a high read-amp node until the read-amp goes down.

This is not the case. It will not cause high read amplification stores to shed replicas. See the tests that assert on this:

https://github.com/kvoli/cockroach/blob/a1f4ca6e392338002f65ad4e7a73dbba50d6041f/pkg/kv/kvserver/allocator_test.go#L4346-L4367
https://github.com/kvoli/cockroach/blob/a1f4ca6e392338002f65ad4e7a73dbba50d6041f/pkg/kv/kvserver/store_rebalancer_test.go#L1564-L1545

I agree with the sentiment that it is not reasonable to assume that we have sufficient test coverage to understand every interaction yet. However, given that this patch is a constraint and not fundamentally changing the ranking or notion of balance, I think we can be sufficiently assured it is okay, in addition to the mean check, which prevents configurations where no progress can be made.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @kvoli, @lidorcarmel, @nvanbenschoten, @shralex, @sumeerbhola, and @tbg)


pkg/kv/kvserver/allocator_scorer.go, line 1369 at r18 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

inadvertently*

Updated.

@kvoli
Contributor Author

kvoli commented Apr 8, 2022

TFTRs! 🤠

@kvoli
Contributor Author

kvoli commented Apr 8, 2022

bors r+

@craig
Contributor

craig bot commented Apr 8, 2022

Build failed (retrying...):

@kvoli
Contributor Author

kvoli commented Apr 8, 2022

bor r-

@aayushshah15
Contributor

I think being off by default (log only) and having a mean check is enough insurance to avoid any poor outcomes in backporting this to 22.1.

I mostly agree that the PR in its current incantation (i.e. with storeHealthBlockTo disabled) doesn't seem very risky, but being "off by default" doesn't mean much. If there are issues on a production cluster, operators / TSEs will reach for it. If we want it to be truly "log only" then we should remove the cluster setting and have it be statically log only.

This is not the case. It will not cause high read amplification stores to shed replicas.

Well, yes, but just because this PR doesn't yet let storeHealthBlockTo be used doesn't mean the machinery isn't there. The point I was hoping to make is that we shouldn't be backporting machinery for an approach we're not fully sure is a feasible one.

@shralex
Contributor

shralex commented Apr 9, 2022

I think we want it to be log-only for now, while we do more testing. Assuming sufficient testing, avoiding sending more work to already overloaded nodes doesn't seem like a very risky notion. It's a bit hard to discuss pros and cons of an algorithm that wasn't designed or implemented (shedding replicas and moving them back based on read amp). Of course thrashing would need to be avoided by such a design. But it still seems worthwhile doing, see e.g., https://cockroachlabs.slack.com/archives/CJCR55PUG/p1648781917543139?thread_ts=1648744729.730979&cid=CJCR55PUG

@aayushshah15
Contributor

It's a bit hard to discuss pros and cons of an algorithm that wasn't designed or implemented (shedding replicas and moving them back based on read amp).

It is designed and implemented in this PR, just statically disabled, so I don't think it's hard to discuss the pros and cons here.

But it still seems worthwhile doing

For sure, it is a problem that needs to be solved. I'm just questioning using our current decentralized allocator to do it.

@craig
Contributor

craig bot commented Apr 9, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Apr 9, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Apr 9, 2022

Build failed:

@kvoli
Contributor Author

kvoli commented Apr 9, 2022

bors r-

@kvoli
Contributor Author

kvoli commented Apr 9, 2022

It's a bit hard to discuss pros and cons of an algorithm that wasn't designed or implemented (shedding replicas and moving them back based on read amp).

It is designed and implemented in this PR, just statically disabled, so I don't think it's hard to discuss the pros and cons here.

It won't shed them in this PR, even with the highest enforcement setting block_all or the second highest block_rebalance_to. Look at the tests that assert this: they set up stores that are uniform on all other stats, give one or more stores holding voters >100 L0 sub-levels, and assert that no action is taken when block_all or block_rebalance_to is set, despite there being candidates with low read amp:

kvoli/cockroach@a1f4ca6/pkg/kv/kvserver/allocator_test.go#L4346-L4367
kvoli/cockroach@a1f4ca6/pkg/kv/kvserver/store_rebalancer_test.go#L1564-L1545

Unless I'm missing something? No matter what setting, it won't shed replicas due to L0 sub-levels - that was what we agreed from the beginning.

Unrelated but I'm going to remove the cluster version gate and target the successor to when L0 sublevels started to be gossiped - that way it can backport to 22.1.1 or 22.1.2.

@kvoli
Contributor Author

kvoli commented Apr 11, 2022

bors r+

@craig
Contributor

craig bot commented Apr 11, 2022

Build succeeded:

@blathers-crl

blathers-crl bot commented Apr 11, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 83deaaa to blathers/backport-release-22.1-78608: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.



Development

Successfully merging this pull request may close these issues.

kvserver: store with high read amplification should not be a target of rebalancing

7 participants