
kvserver: teach replicateQueue to replace dead/decommissioning non-voters #61682

Merged
craig[bot] merged 2 commits into cockroachdb:master from
aayushshah15:replace-decommissioning-dead-non-voters
Mar 22, 2021

Conversation

@aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Mar 9, 2021

The PR is very similar to #59403.

Previously, non-voters on dead or decommissioning nodes would not get
removed or replaced. This commit fixes this behavior.

Resolves #61626

Release justification: fixes limitation where dead/decommissioning
non-voters would not be removed or replaced
Release note: None


@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch 2 times, most recently from 0c8c92d to b2bba4e Compare March 9, 2021 05:06
@aayushshah15 aayushshah15 requested a review from nvb March 9, 2021 05:08
@aayushshah15
Contributor Author

Note that I only have an "integration" test for testing removal/replacement of non-voters from decommissioning nodes. I haven't yet figured out a good way to test removing or replacing non-voters from dead nodes. I'll update the PR with a good test for that case, but it should be ready for a look aside from that.

@andreimatei andreimatei added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 9, 2021
@andreimatei andreimatei self-requested a review March 9, 2021 20:40
Contributor

@nvb nvb left a comment


Reviewed 2 of 2 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/allocator.go, line 114 at r1 (raw file):

// These are the possible allocator actions.
const (
	_ AllocatorAction = iota

Do you have a sense for why we keep actions and priorities separate? It seems like a recipe for bugs. There's an N to 1 mapping between the two concepts, right? So then would a (AllocatorAction).Priority method be appropriate to avoid bugs in ComputeAction?

(no need to hold up this PR on the discussion)
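The suggestion above can be sketched as a method on the action enum. This is a hypothetical illustration, not CockroachDB's actual code: the action names mirror the PR, but the Go types and numeric values are illustrative stand-ins.

```go
package main

// AllocatorAction is a stand-in for the real type in
// pkg/kv/kvserver/allocator.go. The priorities below are made up for
// illustration; only their relative ordering matters here.
type AllocatorAction int

const (
	_ AllocatorAction = iota
	AllocatorAddVoter
	AllocatorReplaceDeadVoter
	AllocatorReplaceDeadNonVoter
	AllocatorRemoveDecommissioningVoter
)

// Priority encodes the N-to-1 action-to-priority mapping in one place, so
// ComputeAction can simply return action, action.Priority() and can never
// pair an action with the wrong priority.
func (a AllocatorAction) Priority() float64 {
	switch a {
	case AllocatorReplaceDeadVoter:
		return 12000
	case AllocatorAddVoter:
		return 10000
	case AllocatorReplaceDeadNonVoter:
		return 7000
	case AllocatorRemoveDecommissioningVoter:
		return 5000
	default:
		return 0
	}
}
```

With this shape, a mismatched action/priority pair becomes unrepresentable at the call site rather than a bug waiting to happen in ComputeAction.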


pkg/kv/kvserver/allocator.go, line 572 at r1 (raw file):

	}
	liveNonVoters, deadNonVoters := a.storePool.liveAndDeadReplicas(nonVoterReplicas)
	if haveNonVoters == neededNonVoters && len(deadNonVoters) > 0 {

Let's add comments like we have above that explain in prose what each of these cases means.


pkg/kv/kvserver/replicate_queue.go, line 367 at r2 (raw file):

		return false, nil

	// Add replicas

nit: punctuation on each of these.


pkg/kv/kvserver/replicate_queue.go, line 468 at r2 (raw file):

func getRemoveIdx(repls, deadOrDecommissioningRepls []roachpb.ReplicaDescriptor) (removeIdx int) {
	removeIdx = -1
	if len(deadOrDecommissioningRepls) == 0 {

Aren't we already checking this outside of the function? If so, consider making this a precondition of the method and panicking if it is not upheld.

Or even better, just take a single deadOrDecommissioningRepl roachpb.ReplicaDescriptor arg so that callers aren't tricked into thinking that anyone other than the first deadOrDecommissioned replica is considered.
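The single-replica variant being proposed might look like the following sketch. `ReplicaDescriptor` here is a minimal stand-in for `roachpb.ReplicaDescriptor`, with only the fields the example needs:

```go
package main

// ReplicaDescriptor is an illustrative stand-in for
// roachpb.ReplicaDescriptor; the real type carries more fields.
type ReplicaDescriptor struct {
	NodeID  int32
	StoreID int32
}

// getRemoveIdx takes exactly one dead or decommissioning replica, making it
// explicit that only that replica is considered for removal. It returns the
// index of that replica within repls, or -1 if it is not present.
func getRemoveIdx(repls []ReplicaDescriptor, deadOrDecommissioningRepl ReplicaDescriptor) int {
	for i, r := range repls {
		if r.StoreID == deadOrDecommissioningRepl.StoreID {
			return i
		}
	}
	return -1
}
```

Compared with the slice-taking version, the signature itself documents that callers pick the replica to remove before calling.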


pkg/kv/kvserver/replicate_queue.go, line 496 at r2 (raw file):

	dryRun bool,
) (requeue bool, _ error) {
	desc, zone := repl.DescAndZone()

It's not entirely new logic, but the repeated calls to repl.DescAndZone() scare me. It seems error prone to call allocator.ComputeAction with one zone and descriptor and then carry out the action with a separate zone and descriptor. Does that concern you at all?


pkg/kv/kvserver/replicate_queue.go, line 928 at r2 (raw file):

	decommissioningReplica := decommissioningReplicas[0]

	if targetType == voterTarget {

We have logic like this in a few places, but I wonder whether we want it. Is there any reason to bake in this assumption that non-voting replicas will never hold the lease? That will probably never change, but is there a downside to gating calls to maybeTransferLeaseAway? Isn't it a quick no-op if the target isn't the leaseholder?


pkg/kv/kvserver/replicate_queue_test.go, line 398 at r2 (raw file):

		require.Eventually(t, func() bool {
			return check(tc, scratchRange)
		}, 15*time.Second, time.Second)

Is a 1 second check a little slow? I'd imagine we'd want to run it every 10 ms or something much faster.


pkg/kv/kvserver/replicate_queue_test.go, line 398 at r2 (raw file):

		require.Eventually(t, func() bool {
			return check(tc, scratchRange)
		}, 15*time.Second, time.Second)

s/15*time.Second/testutils.DefaultSucceedsSoonDuration/


pkg/kv/kvserver/replicate_queue_test.go, line 409 at r2 (raw file):

	}

	t.Run("replace", func(t *testing.T) {

Could you give these two subtests comments explaining what they're testing and why they're different?

Contributor

@andreimatei andreimatei left a comment


I've got nothing beyond what Nathan said.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 114 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you have a sense for why we keep actions and priorities separate? It seems like a recipe for bugs. There's an N to 1 mapping between the two concepts, right? So then would a (AllocatorAction).Priority method be appropriate to avoid bugs in ComputeAction?

(no need to hold up this PR on the discussion)

+1, and maybe we can get rid of the numerical values of priorities while we're at it, assuming they're meaningless


pkg/kv/kvserver/allocator.go, line 127 at r2 (raw file):

	AllocatorReplaceDecommissioningNonVoter
	AllocatorRemoveDecommissioningVoter
	AllocatorRemoveDecommissioningNonVoter

Is there a good reason why there are distinct actions RemoveDead* versus RemoveDecommissioning*? It would seem to me that a removal is a removal. Perhaps the different priorities are the only reason?


pkg/kv/kvserver/replicate_queue.go, line 928 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We have logic like this in a few places, but I wonder whether we want it. Is there any reason to bake in this assumption that non-voting replicas will never hold the lease? That will probably never change, but is there a downside to gating calls to maybeTransferLeaseAway? Isn't it a quick no-op if the target isn't the leaseholder?

+1
The "maybe" in maybeTransferLeaseAway sounds like it could handle any pre-conditions.

@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch 3 times, most recently from 2909e5a to 38b7949 Compare March 11, 2021 02:42
Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 114 at r1 (raw file):
The numerical values of the priorities (or something like it) I think do need to exist because of this sort of logic: https://github.com/cockroachdb/cockroach/blob/819d931c6c768f9252cf89b8bc46db72fc1a3175/pkg/kv/kvserver/allocator.go#L483

This logic makes the replicateQueue prioritize ranges that are missing, say, 3 voters over ranges that are missing 1 voter.

then would a (AllocatorAction).Priority method be appropriate to avoid bugs in ComputeAction?

Done. LMK if you dislike what I've done.


pkg/kv/kvserver/allocator.go, line 572 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's add comments like we have above that explain in prose what each of these cases means.

Done.


pkg/kv/kvserver/allocator.go, line 127 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is there a good reason why there's distinct actions RemoveDead* versus RemoveDecomissioning* ? It would seem to me that a removal is a removal. Perhaps the different priorities are the only reason?

We could've had different priorities but the same AllocatorAction, so that's not it. The differences are encapsulated in removeDead and removeDecommissioning. They're fairly minor: slightly different logging, different metric being updated, one tries to maybeTransferLeaseAway while the other can assume that the replica being removed doesn't have the lease.


pkg/kv/kvserver/replicate_queue.go, line 367 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: punctuation on each of these.

Fixed.


pkg/kv/kvserver/replicate_queue.go, line 468 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Aren't we already checking this outside of the function? If so, consider making this a precondition of the method and panicking if it is not upheld.

Or even better, just take a single deadOrDecommissioningRepl roachpb.ReplicaDescriptor arg so that callers aren't tricked into thinking that anyone other than the first deadOrDecommissioned replica is considered.

Done.


pkg/kv/kvserver/replicate_queue.go, line 496 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's not entirely new logic, but the repeated calls to repl.DescAndZone() scare me. It seems error prone to call allocator.ComputeAction with one zone and descriptor and then carry out the action with a separate zone and descriptor. Does that concern you at all?

So the hazard here is just that we might have decided to add / remove / rebalance based on one version of the range descriptor but then execute upon that decision based on another version? This can only happen if we raced with the mergeQueue or the StoreRebalancer, which is definitely possible. However, I don't think anything bad can really happen in such a circumstance since:

  1. Those callers won't add / remove replicas (except for short-lived learners, which should be copacetic), they will just rebalance a range.
  2. The exact replication change that is executed is still governed by the {Allocate, Rebalance, Remove}{Voter, NonVoter} set of methods, so we know that we're not making a decision that could, for instance, drive the range into unavailability or something.
  3. The replicateQueue requeues ranges back to be processed again after it executes upon an allocator action. If we raced with something and somehow ended up making a suboptimal rebalancing decision, we will correct it on the next attempt.

Does that make sense? Do you think this reasoning should end up in the comments?
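The snapshot concern being debated can be sketched in miniature. Everything here is an illustrative stand-in (the types are not `roachpb.RangeDescriptor` or the real zone config, and the function names are hypothetical): the point is simply that deciding and executing against one snapshot removes the race window between two separate `DescAndZone()` reads.

```go
package main

// rangeDesc and zoneConfig are hypothetical stand-ins for the descriptor
// and zone config types discussed above.
type rangeDesc struct{ generation int64 }
type zoneConfig struct{ numReplicas int }

type action int

const (
	noop action = iota
	addVoter
)

// computeAction is a toy decision function: add a voter if the range is
// under-replicated relative to the zone config.
func computeAction(zone *zoneConfig, haveVoters int) action {
	if haveVoters < zone.numReplicas {
		return addVoter
	}
	return noop
}

// processOneChange takes desc and zone as parameters instead of re-reading
// them, so the decision and its execution are guaranteed to see the same
// snapshot. Returning desc.generation stands in for the generation check
// that the replica-change RPC would perform against the authoritative copy.
func processOneChange(desc *rangeDesc, zone *zoneConfig, haveVoters int) (action, int64) {
	a := computeAction(zone, haveVoters)
	return a, desc.generation
}
```

With this plumbing, a stale snapshot still produces a single self-consistent decision, and staleness is caught once, downstream, rather than manifesting as a decision made on one descriptor and executed on another.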


pkg/kv/kvserver/replicate_queue.go, line 928 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

+1
The "maybe" in maybeTransferLeaseAway sounds like it could handle any pre-conditions.

Yeah we make this check right at the beginning of maybeTransferLeaseAway anyway. Removed.


pkg/kv/kvserver/replicate_queue_test.go, line 398 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is a 1 second check a little slow? I'd imagine we'd want to run it every 10 ms or something much faster.

I was hoping for the infrequent checks to make it less likely to flake under race. I bumped it to 100ms.


pkg/kv/kvserver/replicate_queue_test.go, line 398 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/15*time.Second/testutils.DefaultSucceedsSoonDuration/

Done.


pkg/kv/kvserver/replicate_queue_test.go, line 409 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you give these two subtests comments explaining what they're testing and why they're different.

Done. LMK if they seem lacking.

@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch from 38b7949 to 3f98b51 Compare March 11, 2021 04:05
Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 1 of 3 files at r3, 2 of 2 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/allocator.go, line 395 at r1 (raw file):

	if a.storePool == nil {
		// Do nothing if storePool is nil for some unittests.
		return AllocatorNoop, 0

I'm very positive on most of this change! But I'm not sure it improved this particular function. The elimination of short-circuiting makes this more difficult to follow. Now every case needs to ask the question: "what happens after this?" I'd encourage us to revert back to the old style and just have a bit of return action, action.Priority() repetition.


pkg/kv/kvserver/replicate_queue.go, line 496 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

So the hazard here is just that we might have decided to add / remove / rebalance based on one version of the range descriptor but then execute upon that decision based on another version? This can only happen if we raced with the mergeQueue or the StoreRebalancer, which is definitely possible. However, I don't think anything bad can really happen in such a circumstance since:

  1. Those callers won't add / remove replicas (except for short-lived learners, which should be copacetic), they will just rebalance a range.
  2. The exact replication change that is executed is still governed by the {Allocate, Rebalance, Remove}{Voter, NonVoter} set of methods, so we know that we're not making a decision that could, for instance, drive the range into unavailability or something.
  3. the replicateQueue requeues ranges back to be processed again after it executes upon an allocator action. If we raced with something and somehow ended up making a suboptimal rebalancing decision, we will correct it on the next attempt.

Does that make sense? Do you think this reasoning should end up in the comments?

Well, I think the point stands that it took a fairly long message to convince me that it isn't subtly broken 😃 And the next reader is going to probably think the same thing, at which point they may be right because things may have changed in very distant parts of the system. This kind of logic is such a red flag because its correctness can't be reasoned about locally. It requires the caller to ask - "what could have happened between the first time I looked and this and the next time I looked at it? Could I have raced with a concurrent modification? If so, in which ways can this state be different now from what it looked like when I decided to take some action?".

Do we gain anything from this structure? Would the fix be as straightforward as plumbing the existing desc and zone down to each of these functions?

Also, no need to make any changes for this here.

@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch 3 times, most recently from 4af8e3e to bcd5f1b Compare March 11, 2021 20:27
…voters

This commit teaches the allocator to emit actions to repair a range by
removing/replacing dead or decommissioning non-voters.

Release justification: needed for non-voting replicas
Release note: None
@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch from bcd5f1b to 7eedfde Compare March 13, 2021 00:22
Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 395 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm very positive on most of this change! But I'm not sure it improved this particular function. The elimination of short-circuiting makes this more difficult to follow. Now every case needs to ask the question: "what happens after this?" I'd encourage us to revert back to the old style and just have a bit of return action, action.Priority() repetition.

Done.


pkg/kv/kvserver/replicate_queue.go, line 496 at r2 (raw file):

I think the point stands that it took a fairly long message to convince me that it isn't subtly broken 😃

Yeah, very fair.

Would the fix be as straightforward as plumbing the existing desc and zone down to each of these functions?

I think so, yes. With the plumbing, if we used a stale descriptor, it would get rejected when AdminChangeReplicas checks it against the version in KV.

Also, no need to make any changes for this here.

Added a TODO for now.

Contributor Author

@aayushshah15 aayushshah15 left a comment


I'll update the PR with a good test for that case, but it should be ready for a look aside from that.

I've added this test. PTAL.

I recognize that this bad boy is blocking the beta, I'll try and get the PR through CI and backport as soon as I get LGTM on the test.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)

@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch from 7eedfde to 1a53081 Compare March 13, 2021 01:16
@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch from 1a53081 to df793fd Compare March 13, 2021 07:01
@aayushshah15 aayushshah15 requested a review from a team as a code owner March 13, 2021 07:01
@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch from df793fd to 8c32012 Compare March 13, 2021 07:03
@aayushshah15
Contributor Author

I've added this test. PTAL.

This was not a good test; I rewrote it in a way that should be non-flaky.

@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch from 8c32012 to 3c0f783 Compare March 13, 2021 07:39
The commit is very similar to cockroachdb#59403.

Previously, non-voters on dead or decommissioning nodes would not get
removed or replaced. This commit fixes this behavior.

Release justification: fixes limitation where dead/decommissioning
non-voters would not be removed or replaced
Release note: None
@aayushshah15 aayushshah15 force-pushed the replace-decommissioning-dead-non-voters branch from 3c0f783 to f514c3e Compare March 13, 2021 07:48
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 22, 2021
…ldQueue()`

Before this commit, the `replicateQueue` would refuse to queue up a
replica into the queue (in its `shouldQueue` method) for the rebalancing
case unless it could verify that a voting replica needed to be
rebalanced. This was an unfortunate oversight since it meant that unless
there was a voting replica to be rebalanced, non-voters would not get
rebalanced by the queue.

This commit fixes this bug.

Noticed while debugging a flakey test for cockroachdb#61682

Release justification: bug fix
Release note: None
Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 5 of 5 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

@aayushshah15
Contributor Author

TFTRs!

bors r+

aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 22, 2021
@craig
Contributor

craig bot commented Mar 22, 2021

Build failed (retrying...):

@craig
Contributor

craig bot commented Mar 22, 2021

Build succeeded:

@craig craig bot merged commit 7899744 into cockroachdb:master Mar 22, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 22, 2021

Labels

release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: remove/replace decommissioning and dead non-voters

4 participants