
kvserver: actuate load-based replica rebalancing under heterogeneous localities#65379

Merged
craig[bot] merged 7 commits into cockroachdb:master from aayushshah15:20210502_multiTierLoadBasedRebalancing
Sep 8, 2021

Conversation

@aayushshah15
Contributor

@aayushshah15 aayushshah15 commented May 18, 2021

This commit teaches the StoreRebalancer to make load-based rebalancing
decisions that are meaningful within the context of the replication constraints
placed on the ranges being relocated and the set of stores that can legally
receive replicas for such ranges.

Previously, the StoreRebalancer would compute the QPS underfull and overfull
thresholds based on the overall average QPS being served by all stores in the
cluster. Notably, this included stores that were in replication zones that
would not satisfy required constraints for the range being considered for
rebalancing. This meant that the store rebalancer would effectively never be
able to rebalance ranges within the stores inside heavily loaded replication
zones (since all the valid stores would be above the overfull thresholds).
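The fix described here can be illustrated with a minimal Go sketch. This is not the actual CockroachDB code: the `store` struct, `satisfies` field, and `overfullThreshold` helper are hypothetical stand-ins for the idea of computing load signals only over stores that can legally hold the range, rather than over every store in the cluster.

```go
package main

import "fmt"

type store struct {
	id        int
	qps       float64
	satisfies bool // would this store satisfy the range's constraints?
}

// overfullThreshold computes the mean QPS over only the eligible stores
// and adds fractional headroom, mirroring the PR's idea of computing
// load signals relative to other comparable stores.
func overfullThreshold(stores []store, headroom float64) (mean, threshold float64) {
	var sum float64
	var n int
	for _, s := range stores {
		if !s.satisfies {
			continue // ignore stores the range could never live on
		}
		sum += s.qps
		n++
	}
	if n == 0 {
		return 0, 0
	}
	mean = sum / float64(n)
	return mean, mean * (1 + headroom)
}

func main() {
	stores := []store{
		{1, 3000, true}, {2, 2800, true}, {3, 2600, true}, // hot, constrained region
		{4, 100, false}, {5, 120, false}, // cold region; constraints exclude it
	}
	mean, thresh := overfullThreshold(stores, 0.25)
	fmt.Printf("mean=%.0f overfull=%.0f\n", mean, thresh)
}
```

Under the old behavior, averaging over all five stores would drag the mean down to roughly 1724, leaving every store in the hot region "overfull" and untouchable; computed over the three eligible stores, the mean is 2800 and rebalancing within the hot region becomes possible.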

This patch is a move away from the bespoke relocation logic in the
StoreRebalancer. Instead, the StoreRebalancer now relies on the rebalancing
logic used by the replicateQueue, which already has the machinery to compute
load-based signals for candidates relative to other comparable stores. The
main difference here is that the StoreRebalancer uses this machinery to
promote convergence of QPS across stores, whereas the replicateQueue uses it
to promote convergence of range counts. A series of preceding commits in this
patchset generalize the existing replica rebalancing logic, and this commit
teaches the StoreRebalancer to use it.

This generalization also addresses another key limitation (see #62992) of the
StoreRebalancer regarding its inability to make partial improvements to a
range. Previously, if the StoreRebalancer couldn't move a range entirely
off of overfull stores, it would give up and not even move the subset of
replicas it could. This is no longer the case.

Resolves #61883
Resolves #62992
Resolves #31135

/cc @cockroachdb/kv

Release justification: fixes a set of major limitations behind numerous support escalations

Release note (performance improvement): QPS-based rebalancing is now
aware of different constraints placed on different replication zones. This
means that heterogeneously loaded replication zones (for instance, regions)
will achieve a more even distribution of QPS within the stores inside each
such zone.

@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15
Contributor Author

This needs a bit more love. Please hold off on reviewing.

@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch 2 times, most recently from 4358675 to c0fd11a on May 25, 2021 04:06
@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch 4 times, most recently from b236586 to 3d8720d on May 25, 2021 09:26
@aayushshah15 aayushshah15 requested review from andreimatei and nvb May 25, 2021 09:26
@aayushshah15
Contributor Author

Note: This patch does not deal with constraints-aware lease rebalancing (StoreRebalancer.chooseLeaseToTransfer()). That is a simpler change, but it should probably go in its own PR.

There are still some flakes I'm in the process of narrowing down, but this patch should be ready for feedback now. I'd suggest reviewing it commit by commit (if y'all weren't already going to do that). If it's still hard to review, please let me know if there's anything I can do to make it easier.

Contributor

@andreimatei andreimatei left a comment


Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 846 at r2 (raw file):

// candidate for having a replica removed from it given the candidate store
// list. This method returns true if there are any ranges that are on stores
// that lie _outside of the [underfullThreshold, overfullThreshold] window_ for

I think the comment you've added is not quite right; it doesn't reference the one particular store that this method deals with.


pkg/kv/kvserver/allocator_scorer.go, line 1383 at r2 (raw file):

}

func overfullRangeThreshold(options scorerOptions, mean float64) float64 {

should all these functions error or fatal if the field they each want in options isn't set?


pkg/kv/kvserver/allocator_scorer.go, line 1391 at r2 (raw file):

}

func overfullQPSThreshold(options scorerOptions, mean float64) float64 {

nit: I liked this function more with mean factored out of the math.Min.
Same for underfullQPSThreshold below
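The shape the reviewer is suggesting can be sketched as follows. This is a hedged illustration, not the actual `allocator_scorer.go` code: the `minQPSThresholdDifference` constant and its value are assumptions, standing in for a minimum absolute slack that keeps lightly loaded clusters from oscillating when the fractional threshold works out to only a few QPS.

```go
package main

import (
	"fmt"
	"math"
)

// Assumed minimum absolute slack around the mean (hypothetical value).
const minQPSThresholdDifference = 100.0

type scorerOptions struct {
	qpsRebalanceThreshold float64
}

// overfullQPSThreshold adds the larger of the fractional and the absolute
// slack to the mean, with the mean factored out of the max, per the nit.
func overfullQPSThreshold(options scorerOptions, mean float64) float64 {
	return mean + math.Max(mean*options.qpsRebalanceThreshold, minQPSThresholdDifference)
}

// underfullQPSThreshold is the symmetric lower bound.
func underfullQPSThreshold(options scorerOptions, mean float64) float64 {
	return mean - math.Max(mean*options.qpsRebalanceThreshold, minQPSThresholdDifference)
}

func main() {
	opts := scorerOptions{qpsRebalanceThreshold: 0.25}
	fmt.Println(overfullQPSThreshold(opts, 1000)) // fractional slack dominates: 1250
	fmt.Println(underfullQPSThreshold(opts, 200)) // absolute floor dominates: 100
}
```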


pkg/kv/kvserver/allocator_scorer.go, line 1399 at r2 (raw file):

}

// rebalanceFromConvergesScore returns a 1 iff rebalancing a replica away from

nit: s/a 1 iff/1 if :P


pkg/kv/kvserver/allocator_scorer.go, line 1400 at r2 (raw file):

// rebalanceFromConvergesScore returns a 1 iff rebalancing a replica away from
// `sc` will _not_ converge its range count towards the mean of stores in `sl`.

You've left one "range count" in there. You probably want to adapt it to the new "options"


pkg/kv/kvserver/allocator_scorer.go, line 1409 at r2 (raw file):

) int {
	if options.qpsRebalanceThreshold > 0 {
		// if `qpsRebalanceThreshold` is set, we disable the `convergesScore`.

Can you expand this comment? Why doesn't convergence apply to qps ?
Same in the function below


pkg/kv/kvserver/allocator_scorer.go, line 461 at r3 (raw file):

}

// rankedCandidateListForRemoval creates a candidate list of all existing

nit: I think it'd help if the comment on this method talked a bit about the context. So, someone has decided that one of the replicas of a range needs to go, and this function orders the replicas by load?

It'd also help to comment that sl contains the replicas. Perhaps rename to replicas.


pkg/kv/kvserver/allocator_scorer.go, line 491 at r3 (raw file):

		})
	}
	if options.deterministic {

if you can, please add a comment to options.deterministic and explain what that field is about


pkg/kv/kvserver/allocator_scorer.go, line 498 at r3 (raw file):

	// We compute the converges and balance scores only in relation to stores that
	// are the top candidates for removal based on diversity (i.e. only among
	// candidates that are non-diverse relative to the rest of the replicas).

Can you expand on this comment, and spell out that, in the example below, we want 1 and 2 to be scored higher than 3 and 4? And then could you explain in prose how come the scoring will end up like that?

Another thing that would help is expanding the comments on the worst method. Right now it says something about "sharing the lowest constraint score", but I don't understand what that means.


pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):

	// likely to be picked for removal, even if one of those is on a store that is
	// better or worse than the other.
	candidates = candidates.worst()

can this be a copy() ?


pkg/kv/kvserver/store_rebalancer.go, line 209 at r2 (raw file):

}

func (sr *StoreRebalancer) rebalanceStore(

consider giving this method a comment


pkg/kv/kvserver/store_rebalancer.go, line 216 at r2 (raw file):

		qpsRebalanceThreshold: qpsRebalanceThreshold.Get(&sr.st.SV),
	}
	qpsMinThreshold := underfullQPSThreshold(options, storeList.candidateQueriesPerSecond.mean)

consider putting a comment about these thresholds - how we only look at stores over the MaxThreshold and how we only transfer leases to stores below MinThreshold (right?)

@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch from 3d8720d to 0ecaf1f on May 25, 2021 21:09
Contributor

@nvb nvb left a comment


Exciting to see this!

I am curious to get your take on the backportability of the changes here.

Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 2 of 2 files at r3, 7 of 7 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/allocator_scorer.go, line 1361 at r1 (raw file):

// the mean.
func rebalanceFromConvergesScore(sl StoreList, sc roachpb.StoreCapacity) int {
	if rebalanceConvergesOnMean(sl, sc, sc.RangeCount-1) {

minor nit: I think this would be slightly easier to read and contrast with rebalanceToConvergesScore with if !rebalanceConvergesOnMean(...) { return 1 } return 0. Up to you.


pkg/kv/kvserver/allocator_scorer.go, line 184 at r2 (raw file):

		return -(2 + float64(o.convergesScore-c.convergesScore)/10.0)
	}
	if !scoresAlmostEqual(float64(c.balanceScore), float64(o.balanceScore)) {

Any reason to cast to float64 in this line and the next? It seems like we should be using int arithmetic as long as possible, like we do with convergesScore.


pkg/kv/kvserver/allocator_scorer.go, line 851 at r2 (raw file):

// NB: If the given `options` have `qpsRebalanceThreshold` set, this method
// makes its determination based on QPS. Otherwise, we fall back on using range
// count as a signal.

So we're only ever considering either QPS or range count, but not both? If so, let's be more explicit about that.


pkg/kv/kvserver/allocator_scorer.go, line 888 at r2 (raw file):

// If we reached this point, we're happy with the range where it is.

Let's add this here as well.


pkg/kv/kvserver/allocator_scorer.go, line 1347 at r2 (raw file):

//
// If the given options have qpsRebalanceThreshold set, we use that for
// computing the balanceScore.

"Otherwise ..."


pkg/kv/kvserver/allocator_scorer.go, line 1349 at r2 (raw file):

// computing the balanceScore.
//
// TODO(aayush): It would be nice to be able to compose the two dimensions of

Have we thought about what effect this PR will have if we merge it without addressing this TODO? Will we no longer make any effort to balance based on range count? Are we relying on some callers setting qpsRebalanceThreshold and others not?


pkg/kv/kvserver/allocator_scorer.go, line 1351 at r2 (raw file):

// TODO(aayush): It would be nice to be able to compose the two dimensions of
// balanceScore (QPS or RangeCount) and allow the `options` to simply specify an
// order of precedence. Callers who care about the balancing out QPS across

s/out/of/


pkg/kv/kvserver/allocator_scorer.go, line 1383 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

should all these functions error or fatal if the field they each want in options isn't set?

+1


pkg/kv/kvserver/allocator_scorer.go, line 1409 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Can you expand this comment? Why doesn't convergence apply to qps ?
Same in the function below

I'm also interested in this.


pkg/kv/kvserver/allocator_scorer.go, line 429 at r3 (raw file):

		diversityScore := diversityAllocateScore(s, existingStoreLocalities)
		balanceScore := balanceScore(candidateStores, s.Capacity, options)
		var convergesScore int

Why did we remove this? I would have expected it to use rebalanceToConvergesScore.


pkg/kv/kvserver/allocator_scorer.go, line 514 at r3 (raw file):

	// has more data constrained to it, both replicas 1 and 2 would be equally
	// likely to be picked for removal, even if one of those is on a store that is
	// better or worse than the other.

The implicit assumption here that tripped me up for a while is that, by comparing against region B and C, both replicas in region A will get a discrete balance score of moreThanMean (or overfull), and so their balance will look equivalent. If the balance score was some kind of higher-resolution proportion, then we wouldn't run into this problem, because regardless of the denominator, the replica with more QPS would get a higher score.

Do you mind weaving something about this into the comment?


pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can this be a copy() ?

Or at least a pre-allocated list.


pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):

	// likely to be picked for removal, even if one of those is on a store that is
	// better or worse than the other.
	candidates = candidates.worst()

So this means that only the worst candidates will be returned, right? So this function is no longer returning a ranked list of all stores in the input, it's returning a ranked list of the stores in the input with the worst diversity score. I think that's ok, given how the return value is used, but is that what you were expecting? Do we need to change the contract here?


pkg/kv/kvserver/allocator_scorer_test.go, line 1533 at r2 (raw file):

	}{
		{sEmpty, underfull},
		{sMean, moreThanMean},

Do we want to add a test case that results in lessThanMean?


pkg/kv/kvserver/allocator_scorer_test.go, line 1556 at r2 (raw file):

		expBalanceScore balanceStatus
	}{
		{0, 1},

Want to use your new enums here?


pkg/kv/kvserver/allocator_test.go, line 1363 at r2 (raw file):

			// We don't expect any QPS based rebalancing when all stores are serving
			// the same QPS.
			testStores: allStoresEqual,

nit: it's worth being explicit: expectRebalance: false,


pkg/kv/kvserver/allocator_test.go, line 1446 at r3 (raw file):

		{
			StoreID:  5,
			Node:     roachpb.NodeDescriptor{NodeID: 5, Locality: region("a")},

nit: any reason this is store 5 but has the same region as store 1? This tripped me up.


pkg/kv/kvserver/store_pool.go, line 657 at r4 (raw file):

}

// excludeInvalid takes a store list and removes stores that would be explicitly invalid

Want to pull this rename into a separate commit? That's probably better than having a complex change mixed in with a refactor.


pkg/kv/kvserver/store_rebalancer.go, line 209 at r4 (raw file):

// NB: The StoreRebalancer only cares about the convergence of QPS across
// stores, not the convergence of range count. So, we don't use the allocator's
// `scorerOptions` here, which set the range count rebalance threshold.

"Instead, we use our own scorerOptions, which sets the qps rebalance threshold."


pkg/kv/kvserver/store_rebalancer.go, line 400 at r4 (raw file):

		}

		if shouldNotMoveAway(ctx, replWithStats, localDesc, now, minQPS) {

Should we inline this call, now that it's not used down below?


pkg/kv/kvserver/store_rebalancer.go, line 650 at r4 (raw file):

	ctx context.Context, rbCtx rangeRebalanceContext,
) (finalVoterTargets, finalNonVoterTargets []roachpb.ReplicaDescriptor) {
	options := scorerOptions{

sr.scorerOptions()?


pkg/kv/kvserver/store_rebalancer_test.go, line 255 at r4 (raw file):

		expectedRebalancedVoters, expectedRebalancedNonVoters []roachpb.StoreID
	}{
		{

We seem to be deleting a lot of test cases. Is this because we're expecting the test coverage provided by TestChooseRangeToRebalance to be subsumed by TestChooseRangeToRebalanceRandom? I wonder if we should keep both. Randomized and table-driven tests that exercise similar logic can live side-by-side. In fact, they're a great combination. One finds new kinds of failures and one protects against regressions on known test cases.


pkg/kv/kvserver/store_rebalancer_test.go, line 391 at r4 (raw file):

	ctx context.Context, allStores, deadStores []*roachpb.StoreDescriptor, meanQPS float64,
) {
	var summary bytes.Buffer

nit: strings.Builder


pkg/kv/kvserver/store_rebalancer_test.go, line 617 at r4 (raw file):

		},
		// Two replicas are in the hot region, both on relatively heavily loaded
		// nodes. We one of those replicas to be moved to a less busy store within

"We expect one"


pkg/kv/kvserver/store_rebalancer_test.go, line 652 at r4 (raw file):

			expRebalancedVoters: []roachpb.StoreID{8, 6, 3},
			qps:                 50,
		},

Do you think it's worth adding some test cases that use constraints that look exactly like we'll generate in multi-region databases (for zone and region survival)?

@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch 8 times, most recently from aad90bf to fe7cd1e on June 7, 2021 07:44
@shermanCRL
Contributor

Pinging for next steps here. Get it baking on master and then look at a 21.1 backport?

@aayushshah15
Contributor Author

Pinging for next steps here.

This needs a few more review cycles. I have not been able to spend any time on this in the last little bit, but hope to in the next couple weeks.

@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch from fe7cd1e to 41be519 on August 22, 2021 20:24
@aayushshah15 aayushshah15 requested a review from a team as a code owner August 22, 2021 20:24
@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch 3 times, most recently from a62013f to 70184ba on August 26, 2021 04:58
This commit renames `StoreList`'s `filter()` method to `excludeInvalid()` as
the existing name was ambiguous.

Release justification: Fixes high priority bug

Release note: None
This commit augments `TransferLeaseTarget()` by adding a mode that picks
the best lease transfer target that would lead to QPS convergence across
the stores that have a replica for a given range.

This commit implements a strategy that predicates lease transfer decisions on
whether they would serve to reduce the QPS delta between existing replicas'
stores.

Resolves cockroachdb#31135

Release justification: Fixes high priority bug

Release note (bug fix): Previously, the store rebalancer was unable to
rebalance leases for hot ranges that received a disproportionate amount
of traffic relative to the rest of the cluster. This often caused
prolonged single-node hotspots in workloads that produced such hot
ranges. This bug is now fixed.
@aayushshah15 aayushshah15 force-pushed the 20210502_multiTierLoadBasedRebalancing branch from 072fd29 to d61f474 on September 8, 2021 05:48
Contributor Author

@aayushshah15 aayushshah15 left a comment


Thanks a lot for working through this patch with me. I'll merge on green CI to get it in before the branch cut.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 1432 at r24 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, I think I had this backwards in the pseudo-code I provided. I meant for it to look like:

minCandidate := ... find candidate with lowest QPS ...
if leaseholderStoreQPS-minCandidate.qps > leaseholderReplQPS {
    return minCandidate
}
return roachpb.ReplicaDescriptor{}

but maybe this is easier to reason about as:

minCandidate := ... find candidate with lowest QPS ...
if leaseholderStoreQPS-leaseholderReplQPS > minCandidate.qps {
    return minCandidate
}
return roachpb.ReplicaDescriptor{}

The reasoning behind this is that given a leaseholder's total qps and qps for a single range, it is always beneficial to transfer the range to the coldest candidate if the range's own qps is smaller than the difference between the leaseholder store and the candidate store. This will always drive down the difference between those two stores, which should always drive down the difference between the store serving the highest QPS and the store serving the lowest QPS. I think that works out to be equivalent to what you have here, but I may be missing a set of cases.

Makes sense. One difference between the two approaches is in the case where there are two candidates with the lowest QPS, but seems like your expression does the right thing in that scenario anyway. Done.
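The heuristic settled on in this thread can be sketched in Go. This is an illustrative reconstruction of the pseudocode above, not the merged implementation; `bestLeaseTransferTarget` and its parameters are hypothetical names. The idea: transfer the lease to the coldest candidate only if shedding the range's own QPS still leaves the leaseholder's store hotter than that candidate, so the transfer strictly narrows the delta between the two stores.

```go
package main

import (
	"fmt"
	"math"
)

// bestLeaseTransferTarget returns the store ID of the candidate with the
// lowest QPS, and whether transferring the lease to it would reduce the
// QPS delta between the leaseholder's store and that candidate.
func bestLeaseTransferTarget(
	leaseholderStoreQPS, leaseholderReplQPS float64, candidateQPS map[int]float64,
) (int, bool) {
	minID, minQPS := 0, math.Inf(1)
	for id, qps := range candidateQPS {
		if qps < minQPS {
			minID, minQPS = id, qps
		}
	}
	// Transfer only if the leaseholder, minus this range's QPS, would
	// still be hotter than the coldest candidate currently is.
	if minID != 0 && leaseholderStoreQPS-leaseholderReplQPS > minQPS {
		return minID, true
	}
	return 0, false
}

func main() {
	// Leaseholder store serves 1000 QPS, 200 of which come from this range.
	if id, ok := bestLeaseTransferTarget(1000, 200, map[int]float64{2: 300, 3: 700}); ok {
		fmt.Println("transfer lease to store", id) // store 2: 800 > 300
	}
	// If the range itself accounts for 800 QPS, no candidate qualifies.
	if _, ok := bestLeaseTransferTarget(1000, 800, map[int]float64{2: 300, 3: 700}); !ok {
		fmt.Println("no beneficial transfer")
	}
}
```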


pkg/kv/kvserver/allocator_test.go, line 393 at r49 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should createTestAllocator call this function? return createTestAllocatorWithKnobs(numNodes, deterministic, nil).

Done.


pkg/kv/kvserver/store_rebalancer.go, line 228 at r21 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it's worth filing an issue about and explaining the limitation. Beyond that, I agree that we don't need to solve it now.

Filed: #69901

@aayushshah15
Contributor Author

bors r+

Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 1 of 7 files at r51, 4 of 6 files at r61, 8 of 8 files at r62.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/allocator.go, line 1453 at r62 (raw file):

		leaseholderReplQPS, _ := stats.avgQPS()
		currentDelta := getQPSDelta(storeQPSMap, existing)
		bestOption := getCandidateWithMinQPS(storeQPSMap, existing)

Is the leaseholder included in existing? If so, can it be returned from bestOption? Are we just relying on the (leaseholderStoreQPS-leaseholderReplQPS) > storeQPSMap[bestOption.StoreID] evaluating to false in that case? I think that seems fine as long as leaseholderReplQPS > 0, but consider being more explicit about that. (this can be in a separate PR if you decide to do something here).

@craig
Contributor

craig bot commented Sep 8, 2021

Build succeeded:

@craig craig bot merged commit fa9acb4 into cockroachdb:master Sep 8, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Sep 20, 2021
…geneous localities"

This reverts cockroachdb#65379

Release justification: reverting a patch that introduced a regression

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Sep 28, 2021
In cockroachdb#65379 we changed
`balanceScore()` to classify stores in a cluster into 4 buckets:
underfull, less-than-mean, more-than-mean and overfull. This
allowed for the possibility of thrashing around the mean (i.e. replicas
could ping-pong between stores classified as less-than-mean and
more-than-mean).

This patch moves balanceScore back to its previous incarnation, which
only divided resulting stores into 3 buckets.

Release justification: Fixes regression introduced in a previous patch

Release note: None
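The thrashing the revert describes can be sketched as a three-bucket classification. This is an illustrative Go example, not the actual `balanceScore()`; the bucket names and 25% threshold are assumptions. The point is that with a single band around the mean there is no bucket boundary at the mean itself, so two stores sitting just on either side of it score the same and replicas stop ping-ponging between them.

```go
package main

import "fmt"

type balanceStatus int

const (
	underfull balanceStatus = iota
	aroundTheMean
	overfull
)

// balanceScore classifies a store's QPS relative to the mean using a
// single band of width ±(mean*threshold) around the mean, the
// three-bucket scheme this revert restores.
func balanceScore(qps, mean, threshold float64) balanceStatus {
	if qps < mean-mean*threshold {
		return underfull
	}
	if qps > mean+mean*threshold {
		return overfull
	}
	return aroundTheMean
}

func main() {
	mean := 1000.0
	// 990 and 1010 straddle the mean but land in the same bucket, so no
	// rebalance is triggered between the two stores serving them.
	for _, qps := range []float64{600, 990, 1010, 1400} {
		fmt.Println(qps, balanceScore(qps, mean, 0.25))
	}
}
```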