
kvserver: rebalance ranges to minimize QPS delta among stores #72296

Merged
craig[bot] merged 1 commit into cockroachdb:master from aayushshah15:preventRebalanceThrashing
Jan 29, 2022

Conversation

@aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Nov 1, 2021

kvserver: rebalance ranges to minimize QPS delta among stores

This commit fixes the regression(s) introduced by
#65379 where we observed replica
thrashing in various workloads (#70396 and #71244).

The following is a description of the differences between the QPS based
rebalancing scheme used in the previous implementation of the store rebalancer
(release-21.2 and before) and the "new" implementation (22.1 and beyond).

lease rebalancing
release 21.2 and before
QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level
average QPS and computes underfull and overfull thresholds based off of this
average. For each range that the local store has a lease for, the store
rebalancer goroutine checks whether transferring said range's lease away will
bring the local store's QPS below the underfull threshold. If so, it ignores
the range and moves on to the next one. Otherwise, it iterates through the
stores of all the non-leaseholder voting replicas (in ascending order of their
QPS) and checks whether it would be reasonable to transfer the lease away to
such a store. It ensures that the receiving store would not become overfull
after the lease transfer. It checks that the receiving store doesn't have a
replica that's lagging behind the current leaseholder. It checks that the
receiving store is not in violation of lease preferences. Finally, it ensures
that the lease is not on the local store because of access locality
considerations (i.e. because of follow-the-workload).

All of this was bespoke logic that lived in the store rebalancer (using none of
the Allocator's machinery).
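The cluster-level threshold computation described above can be sketched roughly as follows. This is an illustrative Go snippet, not the actual allocator code; the constant name minQPSThresholdDifference is borrowed from the review discussion below, and the defensive math.Max mirrors the comment requested there.

```go
package main

import (
	"fmt"
	"math"
)

// minQPSThresholdDifference guards against small cluster means producing
// razor-thin overfull/underfull bands (name taken from the PR discussion).
const minQPSThresholdDifference = 100

// qpsThresholds computes the overfull and underfull cutoffs around the
// cluster-wide mean QPS. rebalanceThreshold is a fraction (e.g. 0.05 for
// +/-5%). A simplified sketch of the pre-22.1 scheme, not the exact code.
func qpsThresholds(meanQPS, rebalanceThreshold float64) (overfull, underfull float64) {
	// Defensively take the max so the band never collapses below the
	// minimum absolute difference, even for small means.
	delta := math.Max(meanQPS*rebalanceThreshold, minQPSThresholdDifference)
	return meanQPS + delta, meanQPS - delta
}

func main() {
	over, under := qpsThresholds(5000, 0.05)
	fmt.Println(over, under) // 5250 4750
}
```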

master and this commit
In #65379, we moved this decision making into the Allocator by adding a new
mode in Allocator.TransferLeaseTarget that tries to determine whether
transferring the lease to another voting replica would reduce the qps delta
between the hottest and the coldest stores in the replica set. This commit adds
some padding to this logic by ensuring that the qps difference between the
store relinquishing the lease and the store receiving the lease is at least
200qps. Furthermore, it ensures that the store receiving the lease won't become
significantly hotter than the current leaseholder.
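The two guards just described can be sketched minimally as below, assuming the 200 QPS constant from this commit message. The function and parameter names are illustrative, not the patch's actual identifiers.

```go
package main

import "fmt"

// minQPSDifferenceForLeaseTransfers is the minimum QPS gap between the
// current leaseholder's store and a candidate before a transfer is
// considered (200 per the commit message; a reserved cluster setting in
// the actual patch).
const minQPSDifferenceForLeaseTransfers = 200

// shouldTransferLease sketches the guards described above: transfer only
// if the source store is hotter than the candidate by at least the
// minimum difference, and the candidate would not end up significantly
// hotter than the source after receiving the lease's QPS.
func shouldTransferLease(srcQPS, dstQPS, leaseQPS, maxOvershoot float64) bool {
	if srcQPS-dstQPS < minQPSDifferenceForLeaseTransfers {
		return false // stores are too close; avoid thrashing on workload jitter
	}
	// Project the post-transfer state and reject transfers that would
	// make the receiving store significantly hotter than the source.
	if (dstQPS+leaseQPS)-(srcQPS-leaseQPS) > maxOvershoot {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldTransferLease(1500, 1000, 100, 500)) // true
	fmt.Println(shouldTransferLease(1100, 1000, 100, 500)) // false: gap < 200
}
```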

replica rebalancing
release 21.2 and before
QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing
logic. We first compute a cluster level QPS average, overfull and underfull
thresholds. Based on these thresholds we try to move replicas away from
overfull stores and onto stores that are underfull, all while ensuring that the
receiving stores would not become overfull after the rebalance. A critical
assumption that the store rebalancer made (and still does, in the approach
implemented by this commit) is that follower replicas serve the same traffic as
the leaseholder.

master and this commit
The approach implemented by #65379 and refined by this commit tries to leverage
machinery in the Allocator that makes rebalancing decisions that converge load
based statistics per equivalence class. Previously, this machinery was only
used for range count based replica rebalancing (performed by the
replicateQueue) but not for qps-based rebalancing. This commit implements a
similar approach to what we do now for lease rebalancing, which is to determine
whether a rebalance action would reduce the qps delta between the hottest and
the coldest store in the equivalence class. This commit adds some safeguards
around this logic by ensuring that the store relinquishing the replica and the
store receiving it differ by at least 200 qps. Furthermore, it ensures that the
replica rebalance would not significantly switch the relative dispositions of
the two stores.
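A rough sketch of this convergence check, again using the 200 QPS floor; the store type and function below are invented for illustration and elide the real scorer's equivalence-class plumbing.

```go
package main

import "fmt"

// store is a minimal stand-in for a store descriptor; only QPS matters here.
type store struct {
	id  int
	qps float64
}

// rebalanceTarget sketches the check described above: within an
// equivalence class, move a replica from the hottest store to the
// coldest one only if their QPS differs by at least minDiff (200 in
// this patch). Returns (fromID, toID, true) or (0, 0, false).
func rebalanceTarget(candidates []store, minDiff float64) (from, to int, ok bool) {
	if len(candidates) < 2 {
		return 0, 0, false
	}
	hottest, coldest := candidates[0], candidates[0]
	for _, s := range candidates[1:] {
		if s.qps > hottest.qps {
			hottest = s
		}
		if s.qps < coldest.qps {
			coldest = s
		}
	}
	if hottest.qps-coldest.qps < minDiff {
		return 0, 0, false // delta too small to be worth moving a replica
	}
	return hottest.id, coldest.id, true
}

func main() {
	stores := []store{{1, 900}, {2, 400}, {3, 650}}
	fmt.Println(rebalanceTarget(stores, 200)) // 1 2 true
}
```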

An important thing to note with the 21.2 implementation of the store rebalancer
is that it was making all of its decisions based on cluster-level QPS averages.
This behaves poorly in heterogeneously sized / loaded clusters where some
localities are designed to receive more traffic than others. In such clusters,
heavily loaded localities can always be considered "overfull". This usually
means that all stores in such localities would be above the "overfull"
threshold in the cluster. The logic described above would effectively not do
anything since there are no underfull stores to move replicas to.
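To make the failure mode concrete, here is a small example with invented numbers: zone configs pin a workload's replicas to locality A, and every store in locality A sits above the cluster-level overfull threshold, so the pre-22.1 logic sees no underfull destination among the valid candidates.

```go
package main

import "fmt"

// overfullStores returns, for the given per-store QPS values, which of
// the candidate stores are above the cluster-level overfull threshold
// (mean * (1 + threshold)), as in the <=21.2 scheme.
func overfullStores(all, candidates []float64, threshold float64) []bool {
	var sum float64
	for _, q := range all {
		sum += q
	}
	overfull := (sum / float64(len(all))) * (1 + threshold)
	res := make([]bool, len(candidates))
	for i, q := range candidates {
		res[i] = q > overfull
	}
	return res
}

func main() {
	// Invented numbers: locality A (pinned by zone configs) vs locality B.
	localityA := []float64{900, 920, 880}
	localityB := []float64{100, 110, 90}
	all := append(append([]float64{}, localityA...), localityB...)

	// Cluster mean is 500, so the 5% overfull threshold is 525. Every
	// store in locality A exceeds it, leaving the old logic with no
	// underfull destination even though the A stores differ among
	// themselves.
	fmt.Println(overfullStores(all, localityA, 0.05)) // [true true true]
}
```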

Manual testing
This patch has been stress tested with the follower reads roachtests (~250 iterations of
follower-reads/survival=region/locality=global/reads=strong and 100 iterations of
follower-reads/survival=zone/locality=regional/reads=exact-staleness). It has also been
stress tested with the rebalance/by-load roachtests (100 iterations for both ..leases and
..replicas tests). I also manually ran a TPCC 10K run with a small ramp (something we
know triggers #31135) a few times and
saw average QPS converge among stores fairly quickly.
tpcc-with-low-ramp

Release note (performance improvement): A set of bugs that rendered QPS-based
lease and replica rebalancing in CRDB 21.2 and prior ineffective under
heterogeneously loaded cluster localities has been fixed. Additionally, a
limitation which prevented CRDB from effectively alleviating extreme QPS hotspots
from nodes has also been fixed.

@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the preventRebalanceThrashing branch 24 times, most recently from 7a15a96 to 1b21828 on December 6, 2021 20:16
@aayushshah15 aayushshah15 force-pushed the preventRebalanceThrashing branch 2 times, most recently from 560d92d to b2a89f9 on December 7, 2021 19:29
@aayushshah15 aayushshah15 force-pushed the preventRebalanceThrashing branch 2 times, most recently from 49754af to 1f38a11 on January 16, 2022 05:59
Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 1467 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we be using overfullQPSThreshold here to avoid duplicating logic?

Done.


pkg/kv/kvserver/allocator.go, line 1491 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider a small comment here mentioning why we defensively use math.Max.

Done.


pkg/kv/kvserver/allocator.go, line 1496 at r2 (raw file):
How do you feel about the organization in this revision?

Writing that out makes me wonder — should we be comparing minQPSDifferenceForLeaseTransfers against diff or diffIgnoringLease?

Doing it against diff doesn't guarantee that the difference between these two stores after the lease transfer will be lower, but doing it against diffIgnoringLease does guarantee that.

Consider

leaseReplQPS: 100 
leaseStoreQPS: 120
bestOptionQPS: 100 
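Working through those numbers (a sketch of the arithmetic only; the helper function is invented, and the variable names follow the comment above):

```go
package main

import "fmt"

// leaseTransferDeltas computes, for the scenario in the comment above,
// (diff, diffIgnoringLease, postTransferDelta): the current QPS gap
// between the two stores, that gap ignoring the lease's own QPS, and
// the gap that would exist after the lease transfer.
func leaseTransferDeltas(leaseReplQPS, leaseStoreQPS, bestOptionQPS float64) (diff, diffIgnoringLease, postDelta float64) {
	diff = leaseStoreQPS - bestOptionQPS
	diffIgnoringLease = diff - leaseReplQPS
	postDelta = (bestOptionQPS + leaseReplQPS) - (leaseStoreQPS - leaseReplQPS)
	return diff, diffIgnoringLease, postDelta
}

func main() {
	// Numbers from the comment above: transferring this lease would grow
	// the gap between the stores from 20 to 180, which is why gating the
	// minimum-difference check on diffIgnoringLease (negative here) is
	// the safer choice.
	fmt.Println(leaseTransferDeltas(100, 120, 100)) // 20 -80 180
}
```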

pkg/kv/kvserver/allocator_scorer.go, line 119 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is sd always the existing member of eqClass?

It is, I should've simplified this in #73951. Done now.


pkg/kv/kvserver/allocator_scorer.go, line 125 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is sd always one of the members of eqClass.candidates?

yes, renamed.


pkg/kv/kvserver/allocator_scorer.go, line 131 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This doesn't give much of an indication about what the difference means or why it is used. Could you reframe some of this in terms of what this method is telling us? What are removalCandidates? What is sd? Is sd one of the removalCandidates?

Should I be thinking of this as removalConvergesScoreTheMost? As in, sd is the best candidate of all removalCandidates to remove?

Yes, I added that in the comment.


pkg/kv/kvserver/allocator_scorer.go, line 132 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should removalCandidates be a StoreList?

changed to a StoreList for consistency with the others.


pkg/kv/kvserver/allocator_scorer.go, line 216 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Not including the existing store, right?

Clarified.


pkg/kv/kvserver/allocator_scorer.go, line 226 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's try to avoid saying something is identical. If it's identical, then it's redundant and should be removed. So what's the difference? Why do both of these methods need to exist. Even if their implementation is the same (other than the input types) for rangeCountScorerOptions, what do they each mean semantically?

Done.


pkg/kv/kvserver/allocator_scorer.go, line 247 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Related to a question elsewhere, are we assuming that each replica is currently receiving the same level of traffic and making decisions based on this current state, or are we assuming that if we were to transfer the lease to any given replica, it would serve this level of traffic, and making decisions based on this predicted state? I think those two things may be identical in practice, but they are saying different things.

Laid out the two assumptions directly. LMK if you think we need to add something more.


pkg/kv/kvserver/allocator_scorer.go, line 338 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we care about ties? In the case where multiple stores have the same maximum qps, should they all return -1?

It doesn't really matter in this case, since we'd just randomly pick one of the stores with the maximum QPS rather than the first one (from the removalCandidateStores slice), but I changed it for the sake of clarity.


pkg/kv/kvserver/allocator_scorer.go, line 834 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you mention the relationship between these two lists? Are they always the same length? Are the candidates always in the same order?

Done. LMK if you dislike the comment.


pkg/kv/kvserver/allocator_scorer.go, line 842 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you give some explanation for why these two consts are exactly double the minQPSThresholdDifference?

I set it to 200 because in the previous implementation, the minimum difference between an underfull and an overfull store would have been at least 2 * minQPSThresholdDifference. 100 also seemed a little too low to avoid sensitivity to normal jitter in the workload (though I did test with it set to 100, to ensure it at least didn't cause thrashing in the follower reads roachtest).

Since none of this is very principled, I've changed this to a Reserved cluster setting for now.


pkg/kv/kvserver/allocator_scorer.go, line 853 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could the fact that these are larger than minQPSDifferenceFor{Replica,Lease}Transfers cause divergence or thrashing? While reading the code above, I was assuming that they were smaller, so that we would perform certain changes, but wouldn't want to revert them.

minQPSDifferenceForTransfers compares the QPS of the sending and receiving stores ignoring the QPS of the replica and lease being transferred. The Overshoot constant applies to the QPS of the sending and receiving stores after the transfer (not ignoring the QPS of the replica or lease).

I might be missing your point, but the way I've been looking at these is that the maxQPSTransferOvershoot disallows transfers of extremely hot leases/ranges in favor of colder ranges/leases.


pkg/kv/kvserver/allocator_scorer.go, line 857 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does this belong as a method on equivalenceClass or on qpsScorerOptions? Right now we're defining a rebalancing policy as a property of the data, instead of passing data to the rebalancing policy. That feels a little backwards. It also leaks this policy to all users of equivalenceClass, when it really only makes sense in the context of a qpsScorerOptions.

Done, let me know if you dislike the new structure.


pkg/kv/kvserver/allocator_scorer.go, line 864 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: give this a capacity hint.

done.


pkg/kv/kvserver/allocator_scorer.go, line 871 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Is this variable useful? Can it be inlined in the coldestCandidate computation?

done.


pkg/kv/kvserver/allocator_scorer.go, line 885 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This logic looks very similar to (but so much easier to read than) the logic you were touching in Allocator.TransferLeaseTarget. Are there meaningful differences between the two? Could they be unified (which might be easier if this were a method on qpsScorerOptions and not equivalenceClass)?

Unified.


pkg/kv/kvserver/allocator_scorer.go, line 901 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Same question about using overfullQPSThreshold for this.

Unified the two implementations, so this is gone.


pkg/kv/kvserver/allocator_scorer.go, line 938 at r2 (raw file):

Is this when perReplicaQPS > 2*qpsDifference?

I was having trouble seeing that relationship. I think you're correct that this should never be the case (for the same reason as in TransferLeaseTarget).


pkg/kv/kvserver/allocator_scorer.go, line 947 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we keep this if !needRebalance condition?

Added it back.


pkg/kv/kvserver/store_rebalancer.go, line 575 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could we say more here? This seems like an important understanding that was hard-won, so we should make sure it's written down while it's still fresh in our minds. Why can this cause thrashing? Why is this the right thing to do? I think you've mentioned in the past that this is an important consideration if we're performing a rebalance in order to later transfer the lease.

It's probably also worth mentioning that this could actually be underestimating their load if the followers are serving more traffic than the leaseholder.

Added a better explanation.


pkg/kv/kvserver/store_rebalancer.go, line 596 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this an independent change? I remember discussing the logic below causing issues in some cases, but don't recall the details.

Separated out into its own patch.


pkg/kv/kvserver/store_rebalancer.go, line 636 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mention what the ok return variable indicates and when it might be false.

pulled this out into a separate PR.


pkg/kv/kvserver/store_rebalancer_test.go, line 439 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

According to the comment, shouldn't this be 5?

Oops, fixed now. Test wasn't failing because s5 was projected to become significantly hotter than s1 after the rebalance.

@aayushshah15 aayushshah15 force-pushed the preventRebalanceThrashing branch 4 times, most recently from 83af0c0 to e6f0051 on January 17, 2022 09:10
Contributor Author

@aayushshah15 aayushshah15 left a comment


Hey @nvanbenschoten, I have some concerns implementing the version gating strategy that we discussed offline. Reverting the commits from #65379 on master has become fairly unclean due to the work that has landed in the interim.

From our discussion, my understanding was that we wanted to revert #65379, introduce a version gate, and then unrevert #65379 on top of that (which then would compose cleanly with this PR). Both the revert and then the unrevert are going to be fairly unclean. It feels to me that we're better off separately re-introducing the old StoreRebalancer or building some confidence that we can run this new StoreRebalancer in mixed version settings. I'm wondering if you have any strong preferences here (either about doing the unclean revert, or about any of the other choices we have).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)

Contributor

@nvb nvb left a comment


This is getting close! Thanks for all the work here.

If the revert is now unclean then I don't think it's worth performing it. So I agree that we'd be better off separately re-introducing the old StoreRebalancer.

Reviewed 1 of 5 files at r4, 6 of 6 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 131 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Yes, I added that in the comment.

Thanks for the comment. How do you feel about also changing the name to make this more clear? Something like removalMaximallyConvergesScore?


pkg/kv/kvserver/allocator_scorer.go, line 247 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Laid out the two assumptions directly. LMK if you think we need to add something more.

The two assumptions seem to contradict each other. The first says that all replicas receive the same amount of traffic at all times while the second implies that the position of the lease impacts the replica load. Do we need to change the first to say that "Every replica of a range currently receives the same level of traffic"?


pkg/kv/kvserver/allocator_scorer.go, line 834 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Done. LMK if you dislike the comment.

Sounds good to me, though I'd get rid of "just".


pkg/kv/kvserver/allocator_scorer.go, line 853 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

minQPSDifferenceForTransfers compares the QPS of the sending and receiving stores ignoring the QPS of the replica and lease being transferred. The Overshoot constant applies to the QPS of the sending and receiving stores after the transfer (not ignoring the QPS of the replica or lease).

I might be missing your point, but the way I've been looking at these is that the maxQPSTransferOvershoot disallows transfers of extremely hot leases/ranges in favor of colder ranges/leases.

I see, thanks for the explanation.


pkg/kv/kvserver/allocator_scorer.go, line 857 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Done, let me know if you dislike the new structure.

This is much better. Thanks.


pkg/kv/kvserver/allocator_scorer.go, line 885 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Unified.

Very nice!


pkg/kv/kvserver/allocator_scorer.go, line 309 at r7 (raw file):

		return false
	case none:
	default:

Should we fatal on this branch? Also, should we remove all the early returns and instead return noRebalanceReason == none below? That limits the room for bugs in this logic and makes things a little more declarative.


pkg/kv/kvserver/allocator_scorer.go, line 347 at r7 (raw file):

	eqClass equivalenceClass, candidate roachpb.StoreDescriptor,
) int {
	bestTarget, noRebalanceReason := o.getRebalanceTargetToMinimizeDelta(eqClass)

nit: s/noRebalanceReason/declineReason/g unless there's a good reason to use a different name.


pkg/kv/kvserver/allocator_scorer.go, line 873 at r7 (raw file):

)

type declineReason int

Let's give this a comment, along with a short comment on each variant explaining why they prevent rebalancing.


pkg/kv/kvserver/allocator_scorer.go, line 876 at r7 (raw file):

const (
	none declineReason = iota

nit: none continued to surprise me as the "good" case. Consider a different name that doesn't require readers to know the enum type's name to contextualize. For instance, shouldRebalance.


pkg/kv/kvserver/allocator_scorer.go, line 889 at r7 (raw file):

// coldest store in the equivalence class.
func bestStoreToMinimizeQPSDelta(
	qps, rebalanceThreshold, minRequiredQPSDiff float64,

nit: qps is ambiguous. Is there a better name we could give this to make its role more clear?


pkg/kv/kvserver/allocator_scorer.go, line 1738 at r7 (raw file):

}

func overfullQPSThreshold(rebalanceThreshold float64, mean float64) float64 {

Why the asymmetry between this function and the others around here? If we need to construct a qpsScorerOptions at a caller, that seems preferable to this.

Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 247 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The two assumptions seem to contradict each other. The first says that all replicas receive the same amount of traffic at all times while the second implies that the position of the lease impacts the replica load. Do we need to change the first to say that "Every replica of a range currently receives the same level of traffic"?

How do you feel about the latest revision? I created #75630 to track the work to improve our simulation for TransferLeaseTarget.


pkg/kv/kvserver/allocator_scorer.go, line 834 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Sounds good to me, though I'd get rid of "just".

Done.


pkg/kv/kvserver/allocator_scorer.go, line 309 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we fatal on this branch? Also, should we remove all the early returns and instead return noRebalanceReason == none below? That limits the room for bugs in this logic and makes things a little more declarative.

Done.


pkg/kv/kvserver/allocator_scorer.go, line 347 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: s/noRebalanceReason/declineReason/g unless there's a good reason to use a different name.

Done.


pkg/kv/kvserver/allocator_scorer.go, line 873 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's give this a comment, along with a short comment on each variant explaining why they prevent rebalancing.

Done.


pkg/kv/kvserver/allocator_scorer.go, line 876 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: none continued to surprise me as the "good" case. Consider a different name that doesn't require readers to know the enum type's name to contextualize. For instance, shouldRebalance.

Done.


pkg/kv/kvserver/allocator_scorer.go, line 889 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: qps is ambiguous. Is there a better name we could give this to make its role more clear?

LMK if you're unhappy with what's in the latest revision.


pkg/kv/kvserver/allocator_scorer.go, line 1738 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why the asymmetry between this function and the others around here? If we need to construct a qpsScorerOptions at a caller, that seems preferable to this.

Done.

Contributor

@nvb nvb left a comment


:lgtm_strong:

Reviewed 3 of 3 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 876 at r7 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Done.

Thanks, these are very helpful.

@aayushshah15
Contributor Author

thanks for the reviews!!

bors r+

@craig
Contributor

craig bot commented Jan 28, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 29, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 29, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 29, 2022

Build failed:

@aayushshah15
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jan 29, 2022

Build failed:

@aayushshah15
Contributor Author

TestLogic seems flaky lately.

bors r+

@craig
Contributor

craig bot commented Jan 29, 2022

Build succeeded:
