kvserver: gracefully handle races during learner removals#79405
Conversation
4d1607f to
2559fcc
Compare
|
cc @tbg, feel free to ignore this, but I wanted to add you in case you see something obviously wrong with doing what we're doing in this patch. |
|
Looks good! Not formally signing off because I have late night brain mush but I did leave some suggestions and can give a real review when the next turnaround comes. |
394efdf to
1efaf60
Compare
nvb
left a comment
There was a problem hiding this comment.
I did leave some suggestions
Hm, I'm not seeing these suggestions.
Reviewed 1 of 2 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @tbg)
pkg/kv/kvserver/replica_command.go, line 2068 at r2 (raw file):
return nil, errors.AssertionFailedf("cannot transition from %s to VOTER_OUTGOING", prevTyp) } else { rDesc, _, _ = updatedDesc.SetReplicaType(chg.target.NodeID, chg.target.StoreID, roachpb.VOTER_OUTGOING)
Is this case still valid? Or does the check in execReplicationChangesForVoters that switches from a internalChangeTypeRemoveLearner to a internalChangeTypeDemoteVoterToLearner make it dead code?
pkg/kv/kvserver/replica_command.go, line 2210 at r2 (raw file):
learnerAlreadyRemoved := true check := func(kvDesc *roachpb.RangeDescriptor) bool {
nit: could we name this return arg?
pkg/kv/kvserver/replica_command.go, line 2230 at r2 (raw file):
for _, repl := range kvDesc.Replicas().Descriptors() { if repl.StoreID == chgs[0].target.StoreID { learnerAlreadyRemoved = false
nit: break
tbg
left a comment
There was a problem hiding this comment.
Sorry hadn't hit the submit button
pkg/kv/kvserver/replica_command.go
Outdated
| returnDesc = desc | ||
| return nil | ||
| } | ||
| if chgs.isSingleLearnerRemoval() && learnerAlreadyRemoved { |
pkg/kv/kvserver/replica_command.go
Outdated
|
|
||
| descKey := keys.RangeDescriptorKey(referenceDesc.StartKey) | ||
|
|
||
| learnerAlreadyRemoved := true |
There was a problem hiding this comment.
This seems a bit iffy tbh. Relying on a closure that is invoked a few call stacks down and its ordering here is fairly intransparent. I spent around ten minutes trying to think of an elegant way to refactor these checks, but came up short. But I do feel sort of strongly that we should avoid the opaque mutation to learnerAlreadyRemoved. Perhaps you can pull out a closure that we invoke in both places? Something like
isIdempotentLearnerRemoval := func(actual *roachpb.RangeDescriptor, storeID roachpb.StoreID) bool {
// ...
}You can invoke that in check, and again when short-circuiting the txn.
But maybe we can do one better, and return a skip bool from check and pass it through from conditionalGetDescValueFromDB?
desc, dbDescValue, skip, err := conditionalGetDescValueFromDB(
ctx, txn, referenceDesc.StartKey, false /* forUpdate */, check)
if err != nil { ... }
if skip {
// The new descriptor already reflects what we needed to get done.
return nil
}If this fits all existing uses, that would be a nice solution.
There was a problem hiding this comment.
I tried the second option you suggested, how do you feel about it?
1efaf60 to
03d2b33
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 2068 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this case still valid? Or does the check in
execReplicationChangesForVotersthat switches from ainternalChangeTypeRemoveLearnerto ainternalChangeTypeDemoteVoterToLearnermake it dead code?
Since VOTER_OUTGOINGs are not a thing anymore, and neither are non-atomic replication changes, this area as a whole could use a proper cleanup. I considered adding a clean-up commit at the end, but there are a few test flakes I'd need to iron out. Do you mind if we do this cleanup separately? I'd also like to backport this, so it's probably better to keep this patch small.
pkg/kv/kvserver/replica_command.go, line 2210 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: could we name this return arg?
Done.
pkg/kv/kvserver/replica_command.go, line 2230 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
break
Done.
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @tbg)
pkg/kv/kvserver/replica_command.go, line 2068 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Since
VOTER_OUTGOINGs are not a thing anymore, and neither are non-atomic replication changes, this area as a whole could use a proper cleanup. I considered adding a clean-up commit at the end, but there are a few test flakes I'd need to iron out. Do you mind if we do this cleanup separately? I'd also like to backport this, so it's probably better to keep this patch small.
Yes, let's absolutely keep this separate.
pkg/kv/kvserver/replica_command.go, line 2710 at r3 (raw file):
// The supplied `check` method verifies whether the descriptor fetched from kv // matches expectations and returns a "skip" hint that the caller can use to // optionally elide work that doesn't need to be done.
We should mention here that if a function returns skip = true, matched should also be true. Maybe even assert that below.
Or we could change this contract to require the opposite (skip = true => matches = false). It feels somewhat strange that we're pretending something matched when it did not. If we could express the condition "did not match, but that's ok", we'd be saying exactly what we mean.
@tbg might have thoughts.
pkg/kv/kvserver/replica_command.go, line 2747 at r3 (raw file):
existingDescKV, err := get(ctx, descKey) if err != nil { return nil, nil, skip, errors.Wrap(err, "fetching current range descriptor value")
nit: return false directly so that readers don't need to check whether skip has already been set. Same thing a few lines down.
03d2b33 to
44052fd
Compare
Previously, at the end of a replication change, if the `ChangeReplicas` request found that the (demoted) learner replica it was trying to remove from the range had already been removed (presumably because it raced with the mergeQueue, the `StoreRebalancer`, or something else), it would error out. This was unfortunate, because, for all practical purposes, the replication change _had succeeded_. We can now gracefully handle this instead by no-oping if we detect that the replica we were trying to remove has already been removed. Release note: None
44052fd to
8b34184
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 2710 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should mention here that if a function returns
skip = true,matchedshould also be true. Maybe even assert that below.Or we could change this contract to require the opposite (
skip = true => matches = false). It feels somewhat strange that we're pretending something matched when it did not. If we could express the condition "did not match, but that's ok", we'd be saying exactly what we mean.@tbg might have thoughts.
I went with the second option you proposed. How do you feel about it? Did you want me to also assert that when skip==true that matched must be false?
pkg/kv/kvserver/replica_command.go, line 2747 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: return
falsedirectly so that readers don't need to check whetherskiphas already been set. Same thing a few lines down.
Done.
tbg
left a comment
There was a problem hiding this comment.
LGTM to me, letting Nathan put the 👔 on it
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2710 at r3 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I went with the second option you proposed. How do you feel about it? Did you want me to also assert that when
skip==truethatmatchedmust befalse?
Looks good to me, thanks for chewing through this.
tbg
left a comment
There was a problem hiding this comment.
Reviewed 1 of 2 files at r1, 4 of 5 files at r2, 1 of 1 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
|
TFTRs bors r+ |
|
I think we could reasonably wait to backport this until the first or second patch release since the race this PR fixes isn't nearly as likely as #79379. |
|
Build succeeded: |
…locateRange` Previously, after cockroachdb#79405, `maybeLeaveAtomicChangeReplicasAndRemoveLearners()` could return a range descriptor that contained learner replicas. This was because of the idempotency introduced in cockroachdb#79405, which relaxed how/when we failed replication changes due to descriptor mismatch. The hazard was that, while we would ensure that the learner we were trying to remove was indeed gone, we wouldn't do enough to ensure that there weren't new learners that were added to the range due to other concurrent `AdminRelocateRange` calls. This was a violation of its contract and could sometimes fail `AdminRelocateRange` requests. This commit makes it such that when `maybeLeaveAtomicChangeReplicasAndRemoveLearners` returns a non-error response, it guarantees that _all_ the learner replicas have been removed according to the returned range descriptor. Release note: None
This commit is an alternative to cockroachdb#80205. In cockroachdb#79405, we relaxed the check that `AdminChangeReplicas` performs comparing the expected range descriptor state against the range descriptor in KV when we were performing single learner removals. This was because these learner removals are performed at the end of a replication change (i.e. after its already completed) and thus, could be made idempotent. The above change messed with a subtle part of the contract of `maybeLeaveAtomicChangeReplicasAndRemoveLearners()` since now it was no longer guaranteed to always successfully return with a range desc that did not contain learners (due to intervening replication changes that were now no longer detected because of the relaxation mentioned above). This commit fixes this regression by making `maybeLeaveAtomicChangeReplicasAndRemoveLearners` check at the end if we've raced with another replication change. If so, we return an error instead of incorrectly returning a successful response with a descriptor that could still have learners or be in a joint config. Release note: None
Previously, at the end of a replication change, if the
ChangeReplicasrequestfound that the (demoted) learner replica it was trying to remove from the range
had already been removed (presumably because it raced with the mergeQueue, the
StoreRebalancer, or something else), it would error out. This wasunfortunate, because, for all practical purposes, the replication change had
succeeded.
We can now gracefully handle this instead by no-oping if we detect that the
replica we were trying to remove has already been removed.
Relates to #79118
Release note: None
Jira issue: CRDB-14809