kvserver: fix bug obstructing range merges on multi-store clusters #65584
Conversation
Force-pushed 63b7f41 to fe8d968.
lunevalex
left a comment
LGTM
pkg/kv/kvserver/allocator.go, line 758 at r1 (raw file):
// AllocateVoter returns a suitable store for a new allocation of a voting
// replica with the required attributes. Nodes already accommodating existing
nit: update this comment
pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):
existingStoreLocalities map[roachpb.StoreID]roachpb.Locality,
isNodeValidForRoutineReplicaTransfer func(context.Context, roachpb.NodeID) bool,
allowMultipleReplsPerNode bool,
nit: add a comment here about why allowMultipleReplsPerNode is needed
nvb
left a comment
Reviewed 7 of 7 files at r1.
pkg/kv/kvserver/allocator.go, line 733 at r1 (raw file):
existingNonVoters, a.scorerOptions(), false, /* allowMultipleReplsPerNode */
Could you add a comment explaining why we pass false here?
pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):
Previously, lunevalex wrote…
nit: add a comment here about why `allowMultipleReplsPerNode` is needed
Agreed. We should also add comments to each of the conditions below.
pkg/kv/kvserver/allocator_scorer.go, line 422 at r1 (raw file):
var candidates candidateList
for _, s := range candidateStores.stores {
	if storeHasReplica(s.StoreID, roachpb.MakeReplicaSet(existingReplicas).ReplicationTargets()) {
nit: This looks strange. Do you happen to know why `storeHasReplica` takes a `[]roachpb.ReplicationTarget` but `nodeHasReplica` takes a `[]roachpb.ReplicaDescriptor`?
pkg/kv/kvserver/allocator_test.go, line 2692 at r1 (raw file):
a.storePool.getLocalitiesByStore(existingRepls), a.storePool.isNodeReadyForRoutineReplicaTransfer, false,
nit: false /* allowMultipleReplsPerNode */ here and below
pkg/kv/kvserver/client_relocate_range_test.go, line 559 at r1 (raw file):
	{NodeID: 3, StoreID: 5},
}...)
// Now, ask `AdminRelocateRange()` to move some of these replicas laterally.
s/some/all/
pkg/kv/kvserver/client_relocate_range_test.go, line 585 at r1 (raw file):
},
)
// Ensure that, in case a caller of `AdminRelocateRange` tries to place 2
Should we test what the behavior is when we ask AdminRelocateRange to place 2 non-voting replicas on the same node? And also 1 voting + 1 non-voting replica?
pkg/kv/kvserver/store_rebalancer.go, line 789 at r1 (raw file):
partialNonVoterTargets, options, true, /* allowMultipleReplsPerNode */
Could you add a comment explaining why we pass true here?
Force-pushed 9421b5c to 832c2fa.
aayushshah15
left a comment
pkg/kv/kvserver/allocator.go, line 733 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a comment explaining why we pass `false` here?
Done.
pkg/kv/kvserver/allocator.go, line 758 at r1 (raw file):
Previously, lunevalex wrote…
nit: update this comment
In this case, we actually rule out existing nodes because Allocate{Non}Voter is only used for upreplication.
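For context, a plausible shape for the updated doc comment (illustrative wording, not the merged text):

```go
// AllocateVoter returns a suitable store for a new allocation of a voting
// replica with the required attributes. Stores on nodes that already hold a
// replica of the range are ruled out: AllocateVoter is only used for
// upreplication, and placing a second replica on a node adds no fault
// tolerance.
```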
pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Agreed. We should also add comments to each of the conditions below.
Done, let me know if they seem lacking.
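For readers without the diff handy, a simplified sketch of the gating being discussed; apart from the identifiers quoted above (`candidateStores`, `storeHasReplica`, `nodeHasReplica`, `allowMultipleReplsPerNode`), the structure and comments are assumptions:

```go
var candidates candidateList
for _, s := range candidateStores.stores {
	// A store that already holds a replica of the range can never be a
	// valid allocation target.
	if storeHasReplica(s.StoreID, roachpb.MakeReplicaSet(existingReplicas).ReplicationTargets()) {
		continue
	}
	// Other stores on a node that holds a replica are valid targets only
	// for callers (like AdminRelocateRange) that may need to move a
	// replica laterally between two stores on the same node.
	// Upreplication passes allowMultipleReplsPerNode=false, since a
	// second replica on the same node adds no fault tolerance.
	if !allowMultipleReplsPerNode && nodeHasReplica(s.Node.NodeID, existingReplicas) {
		continue
	}
	// ... score s and append it to candidates ...
}
```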
pkg/kv/kvserver/allocator_scorer.go, line 422 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: This looks strange. Do you happen to know why `storeHasReplica` takes a `[]roachpb.ReplicationTarget` but `nodeHasReplica` takes a `[]roachpb.ReplicaDescriptor`?
We don't seem to have good consistency around which of these allocator utility methods take a ReplicaDescriptor and which ones take a ReplicationTarget. I think we should just embed a ReplicationTarget in a ReplicaDescriptor -- it's something I've meant to do but just never got around to doing.
I cleaned up storeHasReplica and nodeHasReplica to at least be mutually consistent.
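As a sketch of the mutually consistent shape described here (an approximation, with both helpers taking `[]roachpb.ReplicationTarget`):

```go
// storeHasReplica returns true if any of the given replicas lives on the
// given store.
func storeHasReplica(storeID roachpb.StoreID, existing []roachpb.ReplicationTarget) bool {
	for _, t := range existing {
		if t.StoreID == storeID {
			return true
		}
	}
	return false
}

// nodeHasReplica returns true if any of the given replicas lives on any
// store of the given node.
func nodeHasReplica(nodeID roachpb.NodeID, existing []roachpb.ReplicationTarget) bool {
	for _, t := range existing {
		if t.NodeID == nodeID {
			return true
		}
	}
	return false
}
```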
pkg/kv/kvserver/allocator_test.go, line 2692 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: `false /* allowMultipleReplsPerNode */` here and below
Done.
pkg/kv/kvserver/client_relocate_range_test.go, line 559 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/some/all/
Done.
pkg/kv/kvserver/client_relocate_range_test.go, line 585 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we test what the behavior is when we ask AdminRelocateRange to place 2 non-voting replicas on the same node? And also 1 voting + 1 non-voting replica?
Done.
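A condensed sketch of the kind of assertion this adds; the harness, store IDs, and exact targets are assumptions, not the test as merged:

```go
import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
	"github.com/stretchr/testify/require"
)

func TestRelocateRejectsTwoReplicasOnSameNode(t *testing.T) {
	ctx := context.Background()
	// Two in-memory stores per node, so lateral moves are possible.
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
		ServerArgs: base.TestServerArgs{
			StoreSpecs: []base.StoreSpec{{InMemory: true}, {InMemory: true}},
		},
	})
	defer tc.Stopper().Stop(ctx)
	key := tc.ScratchRange(t)

	// Asking AdminRelocateRange for two non-voters on different stores of
	// the same node should fail, courtesy of the AdminChangeReplicas check.
	err := tc.Servers[0].DB().AdminRelocateRange(
		ctx, key,
		[]roachpb.ReplicationTarget{{NodeID: 2, StoreID: 3}}, // voters
		[]roachpb.ReplicationTarget{
			{NodeID: 1, StoreID: 1},
			{NodeID: 1, StoreID: 2}, // second non-voter on the same node
		},
	)
	require.Error(t, err)
}
```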
pkg/kv/kvserver/store_rebalancer.go, line 789 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a comment explaining why we pass `true` here?
Actually, I think I had a bug here (though it would have been benign because of the safeguard inside AdminChangeReplicas()). We shouldn't pass true here since the store rebalancer calls allocateTargetFromList with a partial list of relocation targets that it just picked out. It should never be picking out multiple targets that are on the same node.
nvb
left a comment
Reviewed 5 of 5 files at r2.
pkg/kv/kvserver/allocator_scorer.go, line 412 at r2 (raw file):
only disregard only
Some redundancy here.
Also, want to bold nodes in the previous sentence like you do for stores?
Force-pushed 57c7c50 to 6a620b8.
Previously, we had a safeguard inside `allocateTargetFromList()` (the allocator method used for finding the best target to allocate a new replica) that ensured it would never return a target store on the same node as one of the existing replicas. This check was well-intentioned, but it became outdated after we started allowing rebalances between stores on the same node in #51567.

The aforementioned check is no longer correct since:

1. Callers of `AdminRelocateRange()` (currently, the merge queue and the `StoreRebalancer`) must have the ability to move replicas laterally.
2. We have a more precise check inside `AdminChangeReplicas` that guards against adding 2 replicas of a range on the same node. This check precisely allows the cases where we're rebalancing laterally.

As a result of this check, clusters that use multiple stores per node were more likely to have their range merges fail, because the merge queue would fail in its attempts to collocate ranges that required lateral movement of replicas across stores. @adityamaru noticed error messages of the following nature flooding the logs on a cluster while debugging an incident:

```
none of the remaining voters [n1,s2] are legal additions to (n5,s18):1,(n9,s33):2,(n1,s1):3
```

This patch fixes this behavior by allowing `AdminRelocateRange()` to disable the check, because it may often find itself executing relocations that require moving replicas laterally within a given node. Note that we do not allow _upreplication_ logic to disable this check, since we do not want to upreplicate across stores on the same node.

Release note (bug fix): A bug that made it less likely for range merges to succeed on clusters using multiple stores per node is now fixed.
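To make the "more precise check inside `AdminChangeReplicas`" concrete, here is an illustrative sketch of that safeguard's logic (names and structure are assumptions, not the actual implementation):

```go
import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/errors"
)

// checkCanAddOnNode sketches the node-level safeguard: adding a replica on a
// node that already has one is legal only if the same change also removes
// that node's existing replica, i.e. a lateral move between two stores on
// the same node.
func checkCanAddOnNode(
	existing []roachpb.ReplicaDescriptor,
	add roachpb.ReplicationTarget,
	removals []roachpb.ReplicationTarget,
) error {
	for _, e := range existing {
		if e.NodeID != add.NodeID {
			continue
		}
		beingRemoved := false
		for _, r := range removals {
			if r.NodeID == add.NodeID {
				beingRemoved = true
				break
			}
		}
		if !beingRemoved {
			return errors.Errorf("node %d already has a replica of the range", add.NodeID)
		}
	}
	return nil
}
```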
Force-pushed 6a620b8 to 7ad39c7.
aayushshah15
left a comment
Thanks for the reviews!
bors r+
pkg/kv/kvserver/allocator_scorer.go, line 412 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
only disregard only
Some redundancy here.
Also, want to bold nodes in the previous sentence like you do for stores?
Fixed and done.
This is not a particularly clean backport to 20.2, but we know of customers running large multi-store deployments with 20.2. Does it seem worth the risk to backport this to 20.2?
Build succeeded.
How unclean would the backport be? Is the code around here so different that it's a completely new change? Do you have concerns that it could introduce 20.2-only bugs?

In the commit message, you say that "clusters that use multiple stores per node were more likely to have their range merges fail because the merge queue would fail in its attempt to collocate ranges that required lateral movement of replicas across stores". Is the log message the only negative outcome of this? Does this result in a range merge getting stuck indefinitely? Are there known mitigations if a user does end up in this state?
The allocator code is different because of all the non-voting replicas work, which included a fair few refactors. This patch is small enough that it's not too bad to do, but I would probably end up doing it as a completely new change, since otherwise it's too easy to make assumptions that weren't true in 20.2.
Yes, the log message would be the only user-observable symptom. The range merges won't get stuck, but unless some other form of rebalance activity removes the need to perform a lateral relocation, the ranges will not get merged -- which will cause the LHS replica to sit in the merge queue's purgatory.

As I type this out, I do feel like I should just fix it in 20.2 as well (if you're not opposed to it, given that the symptoms are basically ~"some percentage of merges will fail"). I don't immediately see the potential for 20.2-only bugs.