kvserver: stop miscomputing equivalence classes during replica rebalancing#73597
Conversation
8826737 to
dcde60c
Compare
|
First commit from #73614 |
8db1993 to
1544b9a
Compare
|
@nvanbenschoten I've removed the commit that strips node ids out like we discussed. This PR should now be ready for a look. |
1544b9a to
8050265
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 3 files at r1, 3 of 3 files at r4, 1 of 1 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator_scorer.go, line 915 at r6 (raw file):
diversityScore: diversityScore, } if !cand.less(existing) && cand.store.StoreID != existing.store.StoreID {
Should we wait until we get all the way here in this loop before continuing? Would it make more sense to move this up to the top of the loop?
pkg/kv/kvserver/allocator_scorer.go, line 967 at r6 (raw file):
// replicas in terms of diversity and disk fullness, check whether the // existing replicas' stats are divergent enough to justify a rebalance. if len(sl.stores) > 0 && options.shouldRebalanceBasedOnThresholds(ctx, existing.store, sl) {
When do we expect to hit the len(sl.stores) > 0 case? Is that when we don't have an equivalence class for an existing store?
8050265 to
5bccf47
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/allocator_scorer.go, line 915 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we wait until we get all the way here in this loop before continuing? Would it make more sense to move this up to the top of the loop?
moved up.
pkg/kv/kvserver/allocator_scorer.go, line 967 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
When do we expect to hit the
len(sl.stores) > 0case? Is that when we don't have an equivalence class for an existing store?
we'd expect len(candidateSL) > 0 if there are any replacement candidates for the existing store that are at least as good in terms of diversity. Otherwise, we expect that slice to be empty. I moved the length check inside the shouldRebalanceBasedOnThresholds methods.
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r7, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
The existing nomenclature was confusing and was making it harder for cockroachdb#72296 to explain its approach. Release note: None
Previously, while computing the stores that belong inside the same equivalence class, we were incorrectly including an existing store in the list of candidate stores. For instance, an equivalence class that should look like the following: ``` eqClass: existing: [n1, n2] candidates: [n3, n4] ``` was instead being computed as the following: ``` eqClass: existing: [n1, n2] candidates: [n1, n3, n4] ``` This was throwing things off in the logic implemented by cockroachdb#72296 Release note: None
5bccf47 to
c5f374a
Compare
|
TFTR bors r+ |
|
Build succeeded: |
| // Only process replacement candidates, not existing stores. | ||
| if store.StoreID == existing.store.StoreID { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Apologies for the after-merge comments, why not filter full existingStores in here~? (I'm just learning cockroachdb code, maybe not right - - ~ thanks
For example: existing is n1 ,existingStores is [n1, n2, n3], equivalenceClasses.candidates still have chance to be [n2, n3] if we only filter existing here?
although line 1021 https://github.com/cockroachdb/cockroach/pull/73597/files#diff-272ccd4799645a1875c052c6cdafe1192be7d7be30035475226f679d79465662R1021 can filter those equivalenceClass
But may this cause the maybe valuable "n1 -> [n4]" rebalanceOption doesn't be considered?(because n1->[n2,n3] is more "better" in this loop but be filter in later loop) 🤔
kvserver: fix bug with computing equivalence class
Previously, while computing the stores that belong inside the same equivalence
class, we were incorrectly including an existing store in the list of candidate
stores.
For instance, an equivalence class that should look like the following:
was instead being computed as the following:
This was throwing things off in the logic implemented by #72296
Release note: None