kvserver: actuate load-based replica rebalancing under heterogeneous localities#65379
Conversation
This needs a bit more love. Please hold off on reviewing.
Force-pushed 4358675 to c0fd11a (Compare)
Force-pushed b236586 to 3d8720d (Compare)
Note: This patch does not deal with constraints-aware lease rebalancing (

There are still some flakes I'm in the process of narrowing down, but this patch should be ready for feedback now. I'd suggest reviewing it commit by commit (if y'all weren't already going to do that). If it's still hard to review, please let me know if there's anything I can do to make it easier.
andreimatei left a comment
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)
pkg/kv/kvserver/allocator_scorer.go, line 846 at r2 (raw file):
// candidate for having a replica removed from it given the candidate store
// list. This method returns true if there are any ranges that are on stores
// that lie _outside of the [underfullThreshold, overfullThreshold] window_ for
I think the comment you've added is not quite right; it doesn't reference the one particular store that this method deals with.
pkg/kv/kvserver/allocator_scorer.go, line 1383 at r2 (raw file):
}

func overfullRangeThreshold(options scorerOptions, mean float64) float64 {
should all these functions error or fatal if the field they each want in `options` isn't set?
pkg/kv/kvserver/allocator_scorer.go, line 1391 at r2 (raw file):
}

func overfullQPSThreshold(options scorerOptions, mean float64) float64 {
nit: I liked this function more with mean factored out of the math.Min.
Same for underfullQPSThreshold below
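For context on the threshold helpers under discussion, here is a minimal sketch of what they plausibly compute; the floor constant and exact shape are assumptions for illustration, not necessarily the PR's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// scorerOptions mirrors the field discussed in the review; other fields omitted.
type scorerOptions struct {
	qpsRebalanceThreshold float64
}

// minQPSThresholdDifference is an illustrative absolute floor so that very
// small means don't produce a degenerately narrow [underfull, overfull] window.
const minQPSThresholdDifference = 100.0

// overfullQPSThreshold: a store is considered overfull once its QPS exceeds
// the mean by the configured fraction (or by the absolute floor, if larger).
func overfullQPSThreshold(opts scorerOptions, mean float64) float64 {
	return mean + math.Max(mean*opts.qpsRebalanceThreshold, minQPSThresholdDifference)
}

// underfullQPSThreshold is the symmetric lower bound of the window.
func underfullQPSThreshold(opts scorerOptions, mean float64) float64 {
	return mean - math.Max(mean*opts.qpsRebalanceThreshold, minQPSThresholdDifference)
}

func main() {
	opts := scorerOptions{qpsRebalanceThreshold: 0.25}
	fmt.Println(overfullQPSThreshold(opts, 1000))  // 1250
	fmt.Println(underfullQPSThreshold(opts, 1000)) // 750
}
```

This also shows why factoring `mean` out of the `math.Max`/`math.Min` changes readability but not behavior, per the nit above.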
pkg/kv/kvserver/allocator_scorer.go, line 1399 at r2 (raw file):
}

// rebalanceFromConvergesScore returns a 1 iff rebalancing a replica away from
nit: s/a 1 iff/1 if :P
pkg/kv/kvserver/allocator_scorer.go, line 1400 at r2 (raw file):
// rebalanceFromConvergesScore returns a 1 iff rebalancing a replica away from
// `sc` will _not_ converge its range count towards the mean of stores in `sl`.
You've left one "range count" in there. You probably want to adapt it to the new "options".
pkg/kv/kvserver/allocator_scorer.go, line 1409 at r2 (raw file):
) int {
	if options.qpsRebalanceThreshold > 0 {
		// if `qpsRebalanceThreshold` is set, we disable the `convergesScore`.
Can you expand this comment? Why doesn't convergence apply to QPS?
Same in the function below
pkg/kv/kvserver/allocator_scorer.go, line 461 at r3 (raw file):
}

// rankedCandidateListForRemoval creates a candidate list of all existing
nit: I think it'd help if the comment on this method talked a bit about the context. So, someone has decided that one of the replicas of a range needs to go, and this function orders the replicas by load?
It'd also help to comment that sl contains the replicas. Perhaps rename to replicas.
pkg/kv/kvserver/allocator_scorer.go, line 491 at r3 (raw file):
})
}
if options.deterministic {
if you can, please add a comment to options.deterministic and explain what that field is about
pkg/kv/kvserver/allocator_scorer.go, line 498 at r3 (raw file):
// We compute the converges and balance scores only in relation to stores that
// are the top candidates for removal based on diversity (i.e. only among
// candidates that are non-diverse relative to the rest of the replicas).
Can you expand on this comment, and spell out that, in the example below, we want 1 and 2 to be scored higher than 3 and 4? And then could you explain in prose how come the scoring will end up like that?
Another thing that would help is expanding the comments on the worst method. Right now it says something about "sharing the lowest constraint score", but I don't understand what that means.
pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):
// likely to be picked for removal, even if one of those is on a store that is
// better or worse than the other.
candidates = candidates.worst()
can this be a copy() ?
pkg/kv/kvserver/store_rebalancer.go, line 209 at r2 (raw file):
}

func (sr *StoreRebalancer) rebalanceStore(
consider giving this method a comment
pkg/kv/kvserver/store_rebalancer.go, line 216 at r2 (raw file):
	qpsRebalanceThreshold: qpsRebalanceThreshold.Get(&sr.st.SV),
}
qpsMinThreshold := underfullQPSThreshold(options, storeList.candidateQueriesPerSecond.mean)
consider putting a comment about these thresholds - how we only look at stores over the MaxThreshold and how we only transfer leases to stores below MinThreshold (right?)
Force-pushed 3d8720d to 0ecaf1f (Compare)
nvb left a comment
Exciting to see this!
I am curious to get your take on the backportability of the changes here.
Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 2 of 2 files at r3, 7 of 7 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)
pkg/kv/kvserver/allocator_scorer.go, line 1361 at r1 (raw file):
// the mean.
func rebalanceFromConvergesScore(sl StoreList, sc roachpb.StoreCapacity) int {
	if rebalanceConvergesOnMean(sl, sc, sc.RangeCount-1) {
minor nit: I think this would be slightly easier to read and contrast with rebalanceToConvergesScore with if !rebalanceConvergesOnMean(...) { return 1 } return 0. Up to you.
pkg/kv/kvserver/allocator_scorer.go, line 184 at r2 (raw file):
	return -(2 + float64(o.convergesScore-c.convergesScore)/10.0)
}
if !scoresAlmostEqual(float64(c.balanceScore), float64(o.balanceScore)) {
Any reason to cast to float64 in this line and the next? It seems like we should be using int arithmetic as long as possible, like we do with convergesScore.
pkg/kv/kvserver/allocator_scorer.go, line 851 at r2 (raw file):
// NB: If the given `options` have `qpsRebalanceThreshold` set, this method
// makes its determination based on QPS. Otherwise, we fall back on using range
// count as a signal.
So we're only ever considering either QPS or range count, but not both? If so, let's be more explicit about that.
pkg/kv/kvserver/allocator_scorer.go, line 888 at r2 (raw file):
// If we reached this point, we're happy with the range where it is.
Let's add this here as well.
pkg/kv/kvserver/allocator_scorer.go, line 1347 at r2 (raw file):
//
// If the given options have qpsRebalanceThreshold set, we use that for
// computing the balanceScore.
"Otherwise ..."
pkg/kv/kvserver/allocator_scorer.go, line 1349 at r2 (raw file):
// computing the balanceScore.
//
// TODO(aayush): It would be nice to be able to compose the two dimensions of
Have we thought about what effect this PR will have if we merge it without addressing this TODO? Will we no longer make any effort to balance based on range count? Are we relying on some callers setting qpsRebalanceThreshold and others not?
pkg/kv/kvserver/allocator_scorer.go, line 1351 at r2 (raw file):
// TODO(aayush): It would be nice to be able to compose the two dimensions of
// balanceScore (QPS or RangeCount) and allow the `options` to simply specify an
// order of precedence. Callers who care about the balancing out QPS across
s/out/of/
pkg/kv/kvserver/allocator_scorer.go, line 1383 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
should all these functions error or fatal if the field they each want in `options` isn't set?
+1
pkg/kv/kvserver/allocator_scorer.go, line 1409 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Can you expand this comment? Why doesn't convergence apply to qps ?
Same in the function below
I'm also interested in this.
pkg/kv/kvserver/allocator_scorer.go, line 429 at r3 (raw file):
diversityScore := diversityAllocateScore(s, existingStoreLocalities)
balanceScore := balanceScore(candidateStores, s.Capacity, options)
var convergesScore int
Why did we remove this? I would have expected it to use rebalanceToConvergesScore.
pkg/kv/kvserver/allocator_scorer.go, line 514 at r3 (raw file):
// has more data constrained to it, both replicas 1 and 2 would be equally
// likely to be picked for removal, even if one of those is on a store that is
// better or worse than the other.
The implicit assumption here that tripped me up for a while is that, by comparing against region B and C, both replicas in region A will get a discrete balance score of moreThanMean (or overfull), and so their balance will look equivalent. If the balance score was some kind of higher-resolution proportion, then we wouldn't run into this problem, because regardless of the denominator, the replica with more QPS would get a higher score.
Do you mind weaving something about this into the comment?
pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
can this be a `copy()`?
Or at least a pre-allocated list.
pkg/kv/kvserver/allocator_scorer.go, line 515 at r3 (raw file):
// likely to be picked for removal, even if one of those is on a store that is
// better or worse than the other.
candidates = candidates.worst()
So this means that only the worst candidates will be returned, right? So this function is no longer returning a ranked list of all stores in the input, it's returning a ranked list of the stores in the input with the worst diversity score. I think that's ok, given how the return value is used, but is that what you were expecting? Do we need to change the contract here?
pkg/kv/kvserver/allocator_scorer_test.go, line 1533 at r2 (raw file):
}{
	{sEmpty, underfull},
	{sMean, moreThanMean},
Do we want to add a test case that results in lessThanMean?
pkg/kv/kvserver/allocator_scorer_test.go, line 1556 at r2 (raw file):
	expBalanceScore balanceStatus
}{
	{0, 1},
Want to use your new enums here?
pkg/kv/kvserver/allocator_test.go, line 1363 at r2 (raw file):
// We don't expect any QPS based rebalancing when all stores are serving
// the same QPS.
testStores: allStoresEqual,
nit: it's worth being explicit: expectRebalance: false,
pkg/kv/kvserver/allocator_test.go, line 1446 at r3 (raw file):
{
	StoreID: 5,
	Node:    roachpb.NodeDescriptor{NodeID: 5, Locality: region("a")},
nit: any reason this is store 5 but has the same region as store 1? This tripped me up.
pkg/kv/kvserver/store_pool.go, line 657 at r4 (raw file):
}

// excludeInvalid takes a store list and removes stores that would be explicitly invalid
Want to pull this rename into a separate commit? That's probably better than having a complex change mixed in with a refactor.
pkg/kv/kvserver/store_rebalancer.go, line 209 at r4 (raw file):
// NB: The StoreRebalancer only cares about the convergence of QPS across
// stores, not the convergence of range count. So, we don't use the allocator's
// `scorerOptions` here, which set the range count rebalance threshold.
"Instead, we use our own scorerOptions, which sets the qps rebalance threshold."
pkg/kv/kvserver/store_rebalancer.go, line 400 at r4 (raw file):
}
if shouldNotMoveAway(ctx, replWithStats, localDesc, now, minQPS) {
Should we inline this call, now that it's not used down below?
pkg/kv/kvserver/store_rebalancer.go, line 650 at r4 (raw file):
	ctx context.Context, rbCtx rangeRebalanceContext,
) (finalVoterTargets, finalNonVoterTargets []roachpb.ReplicaDescriptor) {
	options := scorerOptions{
sr.scorerOptions()?
pkg/kv/kvserver/store_rebalancer_test.go, line 255 at r4 (raw file):
	expectedRebalancedVoters, expectedRebalancedNonVoters []roachpb.StoreID
}{
	{
We seem to be deleting a lot of test cases. Is this because we're expecting the test coverage provided by TestChooseRangeToRebalance to be subsumed by TestChooseRangeToRebalanceRandom? I wonder if we should keep both. Randomized and table-driven tests that exercise similar logic can live side-by-side. In fact, they're a great combination. One finds new kinds of failures and one protects against regressions on known test cases.
pkg/kv/kvserver/store_rebalancer_test.go, line 391 at r4 (raw file):
	ctx context.Context, allStores, deadStores []*roachpb.StoreDescriptor, meanQPS float64,
) {
	var summary bytes.Buffer
nit: strings.Builder
pkg/kv/kvserver/store_rebalancer_test.go, line 617 at r4 (raw file):
},
// Two replicas are in the hot region, both on relatively heavily loaded
// nodes. We one of those replicas to be moved to a less busy store within
"We expect one"
pkg/kv/kvserver/store_rebalancer_test.go, line 652 at r4 (raw file):
	expRebalancedVoters: []roachpb.StoreID{8, 6, 3},
	qps:                 50,
},
Do you think it's worth adding some test cases that use constraints that look exactly like we'll generate in multi-region databases (for zone and region survival)?
Force-pushed aad90bf to fe7cd1e (Compare)
Pinging for next steps here. Get it baking on master and then look at a 21.1 backport?
This needs a few more review cycles. I have not been able to spend any time on this in the last little bit, but hope to in the next couple weeks.
Force-pushed fe7cd1e to 41be519 (Compare)
Force-pushed a62013f to 70184ba (Compare)
This commit renames `StoreList`'s `filter()` method to `excludeInvalid()`, as the existing name was ambiguous.

Release justification: Fixes high priority bug
Release note: None
This commit augments `TransferLeaseTarget()` by adding a mode that picks the best lease transfer target that would lead to QPS convergence across the stores that have a replica for a given range. It implements a strategy that predicates lease transfer decisions on whether they would serve to reduce the QPS delta between existing replicas' stores.

Resolves cockroachdb#31135

Release justification: Fixes high priority bug
Release note (bug fix): Previously, the store rebalancer was unable to rebalance leases for hot ranges that received a disproportionate amount of traffic relative to the rest of the cluster. This often led to prolonged single-node hotspots in certain workloads. This bug is now fixed.
Force-pushed 072fd29 to d61f474 (Compare)
aayushshah15 left a comment
Thanks a lot for working through this patch with me. I'll merge on green CI to get it in before the branch cut.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 1432 at r24 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, I think I had this backwards in the pseudo-code I provided. I meant for it to look like:
minCandidate := ... find candidate with lowest QPS ...
if leaseholderStoreQPS-minCandidate.qps > leaseholderReplQPS {
    return minCandidate
}
return roachpb.ReplicaDescriptor{}

but maybe this is easier to reason about as:

minCandidate := ... find candidate with lowest QPS ...
if leaseholderStoreQPS-leaseholderReplQPS > minCandidate.qps {
    return minCandidate
}
return roachpb.ReplicaDescriptor{}

The reasoning behind this is that given a leaseholder's total qps and qps for a single range, it is always beneficial to transfer the range to the coldest candidate if the range's own qps is smaller than the difference between the leaseholder store and the candidate store. This will always drive down the difference between those two stores, which should always drive down the difference between the store serving the highest QPS and the store serving the lowest QPS. I think that works out to be equivalent to what you have here, but I may be missing a set of cases.
Makes sense. One difference between the two approaches is in the case where there are two candidates with the lowest QPS, but seems like your expression does the right thing in that scenario anyway. Done.
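The heuristic settled on in this exchange can be sketched as a small self-contained function; the type and function names here are illustrative stand-ins, not the PR's actual identifiers:

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for a (store, QPS) pair.
type candidate struct {
	storeID int32
	qps     float64
}

// bestLeaseTransferTarget transfers the lease to the coldest candidate iff the
// leaseholder's store QPS minus the range's own QPS still exceeds that
// candidate's QPS. Moving the lease then strictly narrows the delta between
// the two stores.
func bestLeaseTransferTarget(
	leaseholderStoreQPS, leaseholderReplQPS float64, candidates []candidate,
) (candidate, bool) {
	if len(candidates) == 0 {
		return candidate{}, false
	}
	coldest := candidates[0]
	for _, c := range candidates[1:] {
		if c.qps < coldest.qps {
			coldest = c
		}
	}
	if leaseholderStoreQPS-leaseholderReplQPS > coldest.qps {
		return coldest, true
	}
	return candidate{}, false
}

func main() {
	cands := []candidate{{storeID: 2, qps: 400}, {storeID: 3, qps: 250}}
	// The leaseholder's store serves 1000 QPS and the range contributes 300:
	// 1000-300 = 700 > 250, so moving the lease to store 3 is worthwhile.
	if target, ok := bestLeaseTransferTarget(1000, 300, cands); ok {
		fmt.Println(target.storeID) // 3
	}
}
```

Note how the two candidates-with-equal-minimum-QPS case mentioned above falls out naturally: whichever tied candidate is found first wins, and the inequality is unaffected.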
pkg/kv/kvserver/allocator_test.go, line 393 at r49 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should `createTestAllocator` call this function? `return createTestAllocatorWithKnobs(numNodes, deterministic, nil)`.
Done.
pkg/kv/kvserver/store_rebalancer.go, line 228 at r21 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think it's worth filing an issue about and explaining the limitation. Beyond that, I agree that we don't need to solve it now.
Filed: #69901
bors r+
nvb left a comment
Reviewed 1 of 7 files at r51, 4 of 6 files at r61, 8 of 8 files at r62.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @andreimatei)
pkg/kv/kvserver/allocator.go, line 1453 at r62 (raw file):
leaseholderReplQPS, _ := stats.avgQPS()
currentDelta := getQPSDelta(storeQPSMap, existing)
bestOption := getCandidateWithMinQPS(storeQPSMap, existing)
Is the leaseholder included in existing? If so, can it be returned from bestOption? Are we just relying on the (leaseholderStoreQPS-leaseholderReplQPS) > storeQPSMap[bestOption.StoreID] evaluating to false in that case? I think that seems fine as long as leaseholderReplQPS > 0, but consider being more explicit about that. (this can be in a separate PR if you decide to do something here).
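A sketch of the two helpers referenced here may make the concern concrete; the names follow the quoted code but the signatures are assumptions. If the leaseholder itself is the coldest store, the caller's `(leaseholderStoreQPS - leaseholderReplQPS) > minQPS` check fails whenever `leaseholderReplQPS > 0`, so no self-transfer occurs:

```go
package main

import (
	"fmt"
	"math"
)

// getQPSDelta measures the spread of QPS across the stores holding replicas.
func getQPSDelta(storeQPSMap map[int32]float64) float64 {
	min, max := math.MaxFloat64, -math.MaxFloat64
	for _, q := range storeQPSMap {
		min = math.Min(min, q)
		max = math.Max(max, q)
	}
	return max - min
}

// getCandidateWithMinQPS picks the coldest store among the replicas' stores.
func getCandidateWithMinQPS(storeQPSMap map[int32]float64) int32 {
	best, minQPS := int32(-1), math.MaxFloat64
	for id, q := range storeQPSMap {
		if q < minQPS {
			best, minQPS = id, q
		}
	}
	return best
}

func main() {
	qps := map[int32]float64{1: 900, 2: 300, 3: 500}
	fmt.Println(getQPSDelta(qps))            // 600
	fmt.Println(getCandidateWithMinQPS(qps)) // 2
}
```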
Build succeeded:
…geneous localities"

This reverts cockroachdb#65379

Release justification: reverting a patch that introduced a regression
Release note: None
In cockroachdb#65379 we changed `balanceScore()` to classify stores in a cluster into 4 buckets: underfull, less-than-mean, more-than-mean, and overfull. This allowed for the possibility of thrashing around the mean (i.e. replicas could ping-pong between stores classified as less-than-mean and more-than-mean).

This patch moves `balanceScore()` back to its previous incarnation, which only divided stores into 3 buckets.

Release justification: Fixes regression introduced in a previous patch
Release note: None
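The 3-bucket classification the revert restores can be sketched as follows; the bucket names follow the commit message, but the threshold math is illustrative rather than the repository's exact code:

```go
package main

import "fmt"

type balanceStatus int

const (
	underfull balanceStatus = iota
	aroundTheMean
	overfull
)

// balanceScore classifies a store's QPS into one of 3 buckets relative to the
// cluster mean. Everything between the underfull and overfull thresholds is
// treated as a single "around the mean" bucket, so replicas no longer
// ping-pong between stores just above and just below the mean.
func balanceScore(qps, mean, threshold float64) balanceStatus {
	overfullThreshold := mean * (1 + threshold)
	underfullThreshold := mean * (1 - threshold)
	switch {
	case qps > overfullThreshold:
		return overfull
	case qps < underfullThreshold:
		return underfull
	default:
		return aroundTheMean
	}
}

func main() {
	fmt.Println(balanceScore(1500, 1000, 0.25)) // 2 (overfull)
	fmt.Println(balanceScore(1100, 1000, 0.25)) // 1 (around the mean)
	fmt.Println(balanceScore(600, 1000, 0.25))  // 0 (underfull)
}
```

With 4 buckets, the 1100-QPS store above would instead land in a distinct more-than-mean bucket, making it look like a rebalance source relative to a 900-QPS store, which is precisely the thrashing described.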
This commit teaches the `StoreRebalancer` to make load-based rebalancing
decisions that are meaningful within the context of the replication constraints
placed on the ranges being relocated and the set of stores that can legally
receive replicas for such ranges.

Previously, the `StoreRebalancer` would compute the QPS underfull and overfull
thresholds based on the overall average QPS being served by all stores in the
cluster. Notably, this included stores that were in replication zones that
would not satisfy required constraints for the range being considered for
rebalancing. This meant that the store rebalancer would effectively never be
able to rebalance ranges within the stores inside heavily loaded replication
zones (since all the valid stores would be above the overfull thresholds).

This patch is a move away from the bespoke relocation logic in the
`StoreRebalancer`. Instead, we have the `StoreRebalancer` rely on the
rebalancing logic used by the `replicateQueue` that already has the machinery
to compute load-based signals for candidates relative to other comparable
stores. The main difference here is that the `StoreRebalancer` uses this
machinery to promote convergence of QPS across stores, whereas the
`replicateQueue` uses it to promote convergence of range counts. A series of
preceding commits in this patchset generalize the existing replica rebalancing
logic, and this commit teaches the `StoreRebalancer` to use it.

This generalization also addresses another key limitation (see #62992) of the
`StoreRebalancer` regarding its inability to make partial improvements to a
range. Previously, if the `StoreRebalancer` couldn't move a range entirely
off of overfull stores, it would give up and not even move the subset of
replicas it could. This is no longer the case.
Resolves #61883
Resolves #62992
Resolves #31135
/cc @cockroachdb/kv
Release justification: fixes a set of major limitations behind numerous support escalations
Release note (performance improvement): QPS-based rebalancing is now
aware of different constraints placed on different replication zones. This
means that heterogeneously loaded replication zones (for instance, regions)
will achieve a more even distribution of QPS within the stores inside each
such zone.