
kvserver: fix bug obstructing range merges on multi-store clusters #65584

Merged
craig[bot] merged 1 commit into cockroachdb:master from aayushshah15:20210521_mergeFailures on May 31, 2021

Conversation

@aayushshah15
Contributor

@aayushshah15 aayushshah15 commented May 21, 2021

Previously, we had a safeguard inside `allocateTargetFromList()` (the allocator
method used for finding the best target to allocate a new replica) that ensured
it would never return a target store on the same node as one of the existing
replicas. This check was well-intentioned, but it became outdated once we
started allowing rebalances between stores on the same node in #51567.

The aforementioned check is no longer correct, since:

  1. Callers of `AdminRelocateRange()` (currently, the merge queue and the
    `StoreRebalancer`) must have the ability to move replicas laterally.
  2. We have a more precise check inside `AdminChangeReplicas` that guards
    against adding 2 replicas of a range on the same node, while still
    allowing the cases where we're rebalancing laterally.

As a result of this check, clusters that use multiple stores per node were
more likely to have their range merges fail, because the merge queue would fail
in its attempt to collocate ranges that required lateral movement of replicas
across stores. @adityamaru noticed error messages of the following nature
flooding the logs on a cluster while debugging an incident:

```
none of the remaining voters [n1,s2] are legal additions to (n5,s18):1,(n9,s33):2,(n1,s1):3
```

This patch fixes the behavior by allowing `AdminRelocateRange()` to disable
this check, because it may often find itself trying to execute relocations
that require moving replicas laterally within a given node. Note that we do
not allow upreplication logic to disable this check, since we do not want to
upreplicate across stores on the same node.

Release note (bug fix): A bug that made it less likely for range merges to
succeed on clusters using multiple stores per node is now fixed.

@aayushshah15 aayushshah15 requested review from lunevalex and nvb May 21, 2021 23:14
@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20210521_mergeFailures branch 6 times, most recently from 63b7f41 to fe8d968 Compare May 23, 2021 07:34

@lunevalex lunevalex left a comment


LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 758 at r1 (raw file):

// AllocateVoter returns a suitable store for a new allocation of a voting
// replica with the required attributes. Nodes already accommodating existing

nit: update this comment


pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):

	existingStoreLocalities map[roachpb.StoreID]roachpb.Locality,
	isNodeValidForRoutineReplicaTransfer func(context.Context, roachpb.NodeID) bool,
	allowMultipleReplsPerNode bool,

nit: add a comment here about why allowMultipleReplsPerNode is needed

Contributor

@nvb nvb left a comment


Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @lunevalex)


pkg/kv/kvserver/allocator.go, line 733 at r1 (raw file):

		existingNonVoters,
		a.scorerOptions(),
		false, /* allowMultipleReplsPerNode */

Could you add a comment explaining why we pass false here?


pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):

Previously, lunevalex wrote…

nit: add a comment here about why allowMultipleReplsPerNode is needed

Agreed. We should also add comments to each of the conditions below.


pkg/kv/kvserver/allocator_scorer.go, line 422 at r1 (raw file):

	var candidates candidateList
	for _, s := range candidateStores.stores {
		if storeHasReplica(s.StoreID, roachpb.MakeReplicaSet(existingReplicas).ReplicationTargets()) {

nit: This looks strange. Do you happen to know why storeHasReplica takes a []roachpb.ReplicationTarget but nodeHasReplica takes a []roachpb.ReplicaDescriptor?


pkg/kv/kvserver/allocator_test.go, line 2692 at r1 (raw file):

				a.storePool.getLocalitiesByStore(existingRepls),
				a.storePool.isNodeReadyForRoutineReplicaTransfer,
				false,

nit: false /* allowMultipleReplsPerNode */ here and below


pkg/kv/kvserver/client_relocate_range_test.go, line 559 at r1 (raw file):

		{NodeID: 3, StoreID: 5},
	}...)
	// Now, ask `AdminRelocateRange()` to move some of these replicas laterally.

s/some/all/


pkg/kv/kvserver/client_relocate_range_test.go, line 585 at r1 (raw file):

		},
	)
	// Ensure that, in case a caller of `AdminRelocateRange` tries to place 2

Should we test what the behavior is when we ask AdminRelocateRange to place 2 non-voting replicas on the same node? And also 1 voting + 1 non-voting replica?


pkg/kv/kvserver/store_rebalancer.go, line 789 at r1 (raw file):

			partialNonVoterTargets,
			options,
			true, /* allowMultipleReplsPerNode */

Could you add a comment explaining why we pass true here?

@aayushshah15 aayushshah15 force-pushed the 20210521_mergeFailures branch 2 times, most recently from 9421b5c to 832c2fa Compare May 28, 2021 01:09
Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 733 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a comment explaining why we pass false here?

Done.


pkg/kv/kvserver/allocator.go, line 758 at r1 (raw file):

Previously, lunevalex wrote…

nit: update this comment

In this case, we actually rule out existing nodes because Allocate{Non}Voter is only used for upreplication.


pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Agreed. We should also add comments to each of the conditions below.

Done, let me know if they seem lacking.


pkg/kv/kvserver/allocator_scorer.go, line 422 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: This looks strange. Do you happen to know why storeHasReplica takes a []roachpb.ReplicationTarget but nodeHasReplica takes a []roachpb.ReplicaDescriptor?

We don't seem to have good consistency around which of these allocator utility methods take a ReplicaDescriptor and which ones take a ReplicationTarget. I think we should just embed a ReplicationTarget in a ReplicaDescriptor -- it's something I've meant to do but just never got around to doing.

I cleaned up storeHasReplica and nodeHasReplica to at least be mutually consistent.


pkg/kv/kvserver/allocator_test.go, line 2692 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: false /* allowMultipleReplsPerNode */ here and below

Done.


pkg/kv/kvserver/client_relocate_range_test.go, line 559 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/some/all/

Done.


pkg/kv/kvserver/client_relocate_range_test.go, line 585 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we test what the behavior is when we ask AdminRelocateRange to place 2 non-voting replicas on the same node? And also 1 voting + 1 non-voting replica?

Done.


pkg/kv/kvserver/store_rebalancer.go, line 789 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a comment explaining why we pass true here?

Actually, I think I had a bug here (though it would have been benign because of the safeguard inside `AdminChangeReplicas()`). We shouldn't pass true here, since the store rebalancer calls `allocateTargetFromList` with a partial list of relocation targets that it just picked out. It should never be picking out multiple targets on the same node.

Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 5 of 5 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @lunevalex)


pkg/kv/kvserver/allocator_scorer.go, line 412 at r2 (raw file):

only disregard only

Some redundancy here.

Also, want to bold nodes in the previous sentence like you do for stores?

@aayushshah15 aayushshah15 force-pushed the 20210521_mergeFailures branch 2 times, most recently from 57c7c50 to 6a620b8 Compare May 31, 2021 02:59
@aayushshah15 aayushshah15 force-pushed the 20210521_mergeFailures branch from 6a620b8 to 7ad39c7 Compare May 31, 2021 03:37
Contributor Author

@aayushshah15 aayushshah15 left a comment


Thanks for the reviews!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lunevalex and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 412 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
only disregard only

Some redundancy here.

Also, want to bold nodes in the previous sentence like you do for stores?

Fixed and done.

@aayushshah15
Contributor Author

This is not a particularly clean backport to 20.2, but we know of customers running large multi-store deployments with 20.2. Does it seem worth the risk to backport this to 20.2?

@craig
Contributor

craig bot commented May 31, 2021

Build succeeded.

@craig craig bot merged commit a64b26d into cockroachdb:master May 31, 2021
@nvb
Contributor

nvb commented Jun 2, 2021

This is not a particularly clean backport to 20.2, but we know of customers running large multi-store deployments with 20.2. Does it seem worth the risk to backport this to 20.2?

How unclean would the backport be? Is the code around here so different that it's a completely new change? Do you have concerns that it could introduce 20.2-only bugs?

In the commit message, you say that "clusters that use multiple stores per node were more likely to have their range merges fail because the merge queue would fail in its attempt to collocate ranges that required lateral movement of replicas across stores". Is the log message the only negative outcome of this? Does this result in a range merge getting stuck indefinitely? Are there known mitigations if a user does end up in this state?

@aayushshah15
Contributor Author

How unclean would the backport be?

The allocator code is different because of all the non-voting replicas work, which included a fair few refactors. This patch is small enough that it's not too bad to do, but I would probably end up doing it as a completely new change, since otherwise it's too easy to make assumptions that weren't true in 20.2.

Is the log message the only negative outcome of this? Does this result in a range merge getting stuck indefinitely? Are there known mitigations if a user does end up in this state?

Yes, the log message would be the only user-observable symptom. The range merges won't get stuck, but unless some other form of rebalancing activity removes the need to perform a lateral relocation, the ranges will not get merged -- which will cause the LHS replica to sit in the mergeQueue's purgatory for 10 minutes before the merge is retried. I don't see how we (as operators) would be able to mitigate this: we could mess around with the mergeQueue's cluster settings, but they only let us do so much.

As I type this out, I do feel like I should just fix it in 20.2 as well (if you're not opposed to it, given that the symptoms are basically ~"some percentage of merges will fail"). I don't immediately see the potential for 20.2-only bugs.
