kvserver: ensure cleanup of all abandoned learners before an AdminRelocateRange#80205
kvserver: ensure cleanup of all abandoned learners before an AdminRelocateRange#80205aayushshah15 wants to merge 1 commit intocockroachdb:masterfrom
AdminRelocateRange#80205Conversation
|
@nvanbenschoten, this PR is just a proposal and not something I'm super sold on. What do you think? |
…locateRange` Previously, after cockroachdb#79405, `maybeLeaveAtomicChangeReplicasAndRemoveLearners()` could return a range descriptor that contained learner replicas. This was because of the idempotency introduced in cockroachdb#79405, which relaxed how/when we failed replication changes due to descriptor mismatch. The hazard was that, while we would ensure that the learner we were trying to remove was indeed gone, we wouldn't do enough to ensure that there weren't new learners that were added to the range due to other concurrent `AdminRelocateRange` calls. This was a violation of its contract and could sometimes fail `AdminRelocateRange` requests. This commit makes it such that when `maybeLeaveAtomicChangeReplicasAndRemoveLearners` returns a non-error response, it guarantees that _all_ the learner replicas have been removed according to the returned range descriptor. Release note: None
3aae6de to
f4d79e9
Compare
This commit is an alternative to cockroachdb#80205. In cockroachdb#79405, we relaxed the check that `AdminChangeReplicas` performs comparing the expected range descriptor state against the range descriptor in KV when we were performing single learner removals. This was because these learner removals are performed at the end of a replication change (i.e. after its already completed) and thus, could be made idempotent. The above change messed with a subtle part of the contract of `maybeLeaveAtomicChangeReplicasAndRemoveLearners()` since now it was no longer guaranteed to always successfully return with a range desc that did not contain learners (due to intervening replication changes that were now no longer detected because of the relaxation mentioned above). This commit fixes this regression by making `maybeLeaveAtomicChangeReplicasAndRemoveLearners` check at the end if we've raced with another replication change. If so, we return an error instead of incorrectly returning a successful response with a descriptor that could still have learners or be in a joint config. Release note: None
nvb
left a comment
There was a problem hiding this comment.
I think I like this approach better than #80205. That PR feels like a partial regression of #79405, so I don't understand the benefit of a state with both of those PRs vs. a state with neither.
However, this PR does pose a risk of getting stuck in the loop if some other actor continuously adds learners.
I wonder if there's a third option that scopes down the call to maybeLeaveAtomicChangeReplicasAndRemoveLearners at the end of changeReplicasImpl. If we only want the demoted learners to be removed in this case, why are we requiring that all learners be removed?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go line 1324 at r1 (raw file):
// Now the config isn't joint any more, but we may have demoted some voters // into learners. These learners should go as well. for {
Now that #79379 has landed, Should we be checking hasOutstandingLearnerSnapshotInFlight inside of this loop?
pkg/kv/kvserver/replica_command.go line 1338 at r1 (raw file):
var err error log.VEventf(ctx, 2, `removing learner replica %v from %v`, removalTarget, desc) desc, err = execChangeReplicasTxn(
Having seen fallout already, I'm somewhat wary that #79405 might have other consequences for callers that are removing a single learner replica and expect execChangeReplicasTxn to be a true compare-and-swap on the full descriptor. Do you think that's a concern? If so, do you think there's merit in using changeReplicasTxnArgs to scope these limited compare-and-swap fast-paths to just the callers that want them, instead of inferring them from the shape of the internalReplicationChanges?
Previously, after #79405,
maybeLeaveAtomicChangeReplicasAndRemoveLearners()could return a rangedescriptor that contained learner replicas. This was because of the idempotency
introduced in #79405, which relaxed how/when we failed replication changes due
to descriptor mismatch. The hazard was that, while we would ensure that the
learner we were trying to remove was indeed gone, we wouldn't do enough to
ensure that there weren't new learners that were added to the range due to
other concurrent
AdminRelocateRangecalls.This was a violation of its contract and could sometimes fail
AdminRelocateRangerequests.This commit makes it such that when
maybeLeaveAtomicChangeReplicasAndRemoveLearnersreturns a non-error response, it guarantees that all the learner replicas
have been removed according to the returned range descriptor.
Resolves #79887
Release note: None