
kvserver: fix bug obstructing range merges on multi-store clusters #65584

Merged
craig[bot] merged 1 commit into cockroachdb:master from aayushshah15:20210521_mergeFailures on May 31, 2021

Conversation

@aayushshah15
Contributor

@aayushshah15 aayushshah15 commented May 21, 2021

Previously, we had a safeguard inside `allocateTargetFromList()` (the allocator
method used for finding the best target to allocate a new replica) that ensured
it would never return a target store on the same node as one of the existing
replicas. This check was well-intentioned, but it became outdated once we
started allowing rebalances between stores on the same node in #51567.

The aforementioned check is no longer correct, since:

  1. Callers of `AdminRelocateRange()` (currently, the merge queue and the
    `StoreRebalancer`) must have the ability to move replicas laterally.
  2. We have a more precise check inside `AdminChangeReplicas` that guards
    against adding 2 replicas of a range on the same node, while still
    allowing the cases where we're rebalancing laterally.

As a result of this check, clusters that use multiple stores per node were
more likely to have their range merges fail, because the merge queue would fail
in its attempt to collocate ranges that required lateral movement of replicas
across stores. @adityamaru noticed error messages of the following nature
flooding the logs on a cluster while debugging an incident:

```
none of the remaining voters [n1,s2] are legal additions to (n5,s18):1,(n9,s33):2,(n1,s1):3
```

This patch fixes the behavior by allowing `AdminRelocateRange()` to disable
this check, because it may often find itself trying to execute relocations
that require moving replicas laterally within a given node. Note that we do
not allow upreplication logic to disable this check, since we do not want to
upreplicate across stores on the same node.

Release note (bug fix): A bug that made it less likely for range merges to
succeed on clusters using multiple stores per node is now fixed.

@aayushshah15 aayushshah15 requested review from lunevalex and nvb May 21, 2021 23:14
@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20210521_mergeFailures branch 6 times, most recently from 63b7f41 to fe8d968 Compare May 23, 2021 07:34

@lunevalex lunevalex left a comment


LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 758 at r1 (raw file):

// AllocateVoter returns a suitable store for a new allocation of a voting
// replica with the required attributes. Nodes already accommodating existing

nit: update this comment


pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):

	existingStoreLocalities map[roachpb.StoreID]roachpb.Locality,
	isNodeValidForRoutineReplicaTransfer func(context.Context, roachpb.NodeID) bool,
	allowMultipleReplsPerNode bool,

nit: add a comment here about why allowMultipleReplsPerNode is needed

Contributor

@nvb nvb left a comment


Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @lunevalex)


pkg/kv/kvserver/allocator.go, line 733 at r1 (raw file):

		existingNonVoters,
		a.scorerOptions(),
		false, /* allowMultipleReplsPerNode */

Could you add a comment explaining why we pass false here?


pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):

Previously, lunevalex wrote…

nit: add a comment here about why allowMultipleReplsPerNode is needed

Agreed. We should also add comments to each of the conditions below.


pkg/kv/kvserver/allocator_scorer.go, line 422 at r1 (raw file):

	var candidates candidateList
	for _, s := range candidateStores.stores {
		if storeHasReplica(s.StoreID, roachpb.MakeReplicaSet(existingReplicas).ReplicationTargets()) {

nit: This looks strange. Do you happen to know why storeHasReplica takes a []roachpb.ReplicationTarget but nodeHasReplica takes a []roachpb.ReplicaDescriptor?


pkg/kv/kvserver/allocator_test.go, line 2692 at r1 (raw file):

				a.storePool.getLocalitiesByStore(existingRepls),
				a.storePool.isNodeReadyForRoutineReplicaTransfer,
				false,

nit: false /* allowMultipleReplsPerNode */ here and below


pkg/kv/kvserver/client_relocate_range_test.go, line 559 at r1 (raw file):

		{NodeID: 3, StoreID: 5},
	}...)
	// Now, ask `AdminRelocateRange()` to move some of these replicas laterally.

s/some/all/


pkg/kv/kvserver/client_relocate_range_test.go, line 585 at r1 (raw file):

		},
	)
	// Ensure that, in case a caller of `AdminRelocateRange` tries to place 2

Should we test what the behavior is when we ask AdminRelocateRange to place 2 non-voting replicas on the same node? And also 1 voting + 1 non-voting replica?


pkg/kv/kvserver/store_rebalancer.go, line 789 at r1 (raw file):

			partialNonVoterTargets,
			options,
			true, /* allowMultipleReplsPerNode */

Could you add a comment explaining why we pass true here?

@aayushshah15 aayushshah15 force-pushed the 20210521_mergeFailures branch 2 times, most recently from 9421b5c to 832c2fa Compare May 28, 2021 01:09
Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 733 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a comment explaining why we pass false here?

Done.


pkg/kv/kvserver/allocator.go, line 758 at r1 (raw file):

Previously, lunevalex wrote…

nit: update this comment

In this case, we actually rule out existing nodes because Allocate{Non}Voter is only used for upreplication.


pkg/kv/kvserver/allocator_scorer.go, line 417 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Agreed. We should also add comments to each of the conditions below.

Done, let me know if they seem lacking.


pkg/kv/kvserver/allocator_scorer.go, line 422 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: This looks strange. Do you happen to know why storeHasReplica takes a []roachpb.ReplicationTarget but nodeHasReplica takes a []roachpb.ReplicaDescriptor?

We don't seem to have good consistency around which of these allocator utility methods take a ReplicaDescriptor and which ones take a ReplicationTarget. I think we should just embed a ReplicationTarget in a ReplicaDescriptor -- it's something I've meant to do but just never got around to doing.

I cleaned up storeHasReplica and nodeHasReplica to at least be mutually consistent.


pkg/kv/kvserver/allocator_test.go, line 2692 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: false /* allowMultipleReplsPerNode */ here and below

Done.


pkg/kv/kvserver/client_relocate_range_test.go, line 559 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/some/all/

Done.


pkg/kv/kvserver/client_relocate_range_test.go, line 585 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we test what the behavior is when we ask AdminRelocateRange to place 2 non-voting replicas on the same node? And also 1 voting + 1 non-voting replica?

Done.


pkg/kv/kvserver/store_rebalancer.go, line 789 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a comment explaining why we pass true here?

Actually, I think I had a bug here (though it would have been benign because of the safeguard inside `AdminChangeReplicas()`). We shouldn't pass true here, since the store rebalancer calls `allocateTargetFromList` with a partial list of relocation targets that it just picked out. It should never be picking out multiple targets on the same node.

Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 5 of 5 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @lunevalex)


pkg/kv/kvserver/allocator_scorer.go, line 412 at r2 (raw file):

only disregard only

Some redundancy here.

Also, want to bold nodes in the previous sentence like you do for stores?

@aayushshah15 aayushshah15 force-pushed the 20210521_mergeFailures branch 2 times, most recently from 57c7c50 to 6a620b8 Compare May 31, 2021 02:59
@aayushshah15 aayushshah15 force-pushed the 20210521_mergeFailures branch from 6a620b8 to 7ad39c7 Compare May 31, 2021 03:37
Contributor Author

@aayushshah15 aayushshah15 left a comment


Thanks for the reviews!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lunevalex and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 412 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
only disregard only

Some redundancy here.

Also, want to bold nodes in the previous sentence like you do for stores?

Fixed and done.

@aayushshah15
Contributor Author

This is not a particularly clean backport to 20.2, but we know of customers running large multi-store deployments with 20.2. Does it seem worth the risk to backport this to 20.2?

@craig
Contributor

craig bot commented May 31, 2021

Build succeeded.

@craig craig bot merged commit a64b26d into cockroachdb:master May 31, 2021
@nvb
Contributor

nvb commented Jun 2, 2021

This is not a particularly clean backport to 20.2, but we know of customers running large multi-store deployments with 20.2. Does it seem worth the risk to backport this to 20.2?

How unclean would the backport be? Is the code around here so different that it's a completely new change? Do you have concerns that it could introduce 20.2-only bugs?

In the commit message, you say that "clusters that use multiple stores per node were more likely to have their range merges fail because the merge queue would fail in its attempt to collocate ranges that required lateral movement of replicas across stores". Is the log message the only negative outcome of this? Does this result in a range merge getting stuck indefinitely? Are there known mitigations if a user does end up in this state?

@aayushshah15
Contributor Author

How unclean would the backport be?

The allocator code is different because of all the non-voting replicas work, which included a fair few refactors. This patch is small enough that it's not too bad to do, but I would probably end up doing it as a completely new change, since otherwise it's too easy to make assumptions that weren't true in 20.2.

Is the log message the only negative outcome of this? Does this result in a range merge getting stuck indefinitely? Are there known mitigations if a user does end up in this state?

Yes, the log message would be the only user-observable symptom. The range merges won't get stuck, but unless some other form of rebalancing activity removes the need to perform a lateral relocation, the ranges will not get merged -- which will cause the LHS replica to sit in the mergeQueue's purgatory for 10 minutes before the merge is retried. I don't see how we (as operators) would be able to mitigate this: we could mess around with the mergeQueue's cluster settings, but they only let us do so much.

As I type this out, I do feel like I should just fix it in 20.2 as well (if you're not opposed to it, given that the symptoms are basically ~"some percentage of merges will fail"). I don't immediately see the potential for 20.2-only bugs.
