kvserver: lease transfer in JOINT configuration by shralex · Pull Request #74077 · cockroachdb/cockroach

shralex · 2021-12-20T06:32:21Z

Previously:

Removing a leaseholder was not allowed.
A VOTER_INCOMING node wasn't able to accept the lease.

Because of (1), users needed to transfer the lease before removing
the leaseholder. Because of (2), when relocating a range from the
leaseholder A to a new node B, there was no possibility to transfer
the lease to B before it was fully added as VOTER. Adding it as a
voter, however, could degrade fault tolerance. For example, if A
and B are in region R1, C in region R2 and D in R3, and we had
(A, C, D), and now adding B to the cluster to replace A results in
the intermediate configuration (A, B, C, D) the failure of R1 would
make the cluster unavailable since no quorum can be established.
Since B can't be added before A is removed, the system would
transfer the lease out to C, remove A and add B, and then transfer
the lease again to B. This resulted a temporary migration of leases
out of their preferred region, imbalance of lease count and degraded
performance.

The PR fixes this, by (1) allowing removing the leaseholder, and
transferring the lease right before we exit the JOINT config. And (2),
allowing a VOTER_INCOMING to accept the lease.

Release note (performance improvement): Fixes a limitation which meant
that, upon adding a new node to the cluster, lease counts among existing
nodes could diverge until the new node was fully upreplicated.

Here are a few experiments that demonstrate the benefit of the feature.
1.

roachprod create local -n 4 // if not already created and staged
roachprod put local cockroach
roachprod start local:1-3 --racks=3 // add 3 servers in 3 different racks
cockroach workload init kv --splits=10000
roachprod start local:4 --racks=3 // add a 4th server in one of the racks

Without the change (master):

With the change:

We can see that without the patch the number of leases on server 0 (black line) goes all the way to 0 before it goes back up and that the number of leases in other racks goes up, both undesirable. With the patch both things are no longer happening.

Same as 1, but with a leaseholder preference of rack 0:

ALTER RANGE default CONFIGURE ZONE USING lease_preferences='[[+rack=0]]';

Without the change (master):

With the change:

We can see that without the change the number of leaseholders in racks 1 and 2 together (not in preferred region) grows from 300 to 1000, then goes back to 40. With the fix it doesn’t grow at all.

cockroach-teamcity · 2021-12-20T06:32:27Z

This change is

nvb

It's exciting to see this! When you get a chance, let's try out the repro steps from #67740 (comment) and see how things behave with this change.

@tbg should probably also give this a pass when it gets closer, as he was the author of a lot of this joint configuration code. I also left a comment for him in replica_raft.go.

Reviewed 15 of 17 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @shralex)

-- commits, line 14 at r1:
nit: s/a/A/

-- commits, line 29 at r1:
Consider adding a release note to this commit. It's a valuable change that users will care about!

pkg/kv/db.go, line 629 at r1 (raw file):

	err := db.Run(ctx, b)
	if err == nil {
		log.Infof(ctx, "Transferring lease to StoreID %s succeeded", target.String())

Was this just added for debugging purposes, or did you intend for us to keep it?

pkg/kv/kvserver/replica_command.go, line 985 at r1 (raw file):

	details string,
	chgs roachpb.ReplicationChanges,
	repl *Replica,

Isn't this already the method receiver?

pkg/kv/kvserver/replica_command.go, line 1172 at r1 (raw file):

// NON_VOTER, and VOTER_FULL only.
func maybeLeaveAtomicChangeReplicas(
	ctx context.Context, s *Store, desc *roachpb.RangeDescriptor, r *Replica,

Instead of adding this as a parameter, what do you think about making maybeLeaveAtomicChangeReplicas and maybeLeaveAtomicChangeReplicasAndRemoveLearners methods on *Replica? That way, we can actually get rid of all arguments except for the context.

pkg/kv/kvserver/replica_command.go, line 1177 at r1 (raw file):

	// with leaving a joint state when it's in one, so make sure we don't call
	// it if we're not.
	if !desc.Replicas().InAtomicReplicationChange(){

It looks like your gofmt is having issues. You should double-check that you have https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher set up correctly.

pkg/kv/kvserver/replica_command.go, line 1184 at r1 (raw file):

	voters := desc.Replicas().VoterDescriptors()
	if r != nil {

The fact that we can ignore this check seems like a source of bugs. The need for this is coming from the call to maybeLeaveAtomicChangeReplicasAndRemoveLearners on the RHS range in mergeQueue.process, right? Did you run into issues with the proposed idea to remove that call entirely and enter the AdminRelocateRange branch when the RHS is in a joint config instead?

pkg/kv/kvserver/replica_command.go, line 1185 at r1 (raw file):

	voters := desc.Replicas().VoterDescriptors()
	if r != nil {
		// Current leaseholder is being removed

Let's add more to this comment. We can discuss why we're looking in the voters slice, what it means for the current replica (which we know to be the leaseholder) to be absent from it, and what we will do in that case (transfer the lease).

pkg/kv/kvserver/replica_command.go, line 1203 at r1 (raw file):

				transferLeaseOptions{
					goal:                               followTheWorkload,
					checkTransferLeaseSource:           true,

Consider adding a comment about why we set this to true. Something about having already filtered out the current leaseholder.

pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):

					checkTransferLeaseSource:           true,
					checkCandidateFullness:             false,
					disableLeaseCountConvergenceChecks: true,

Could you add a comment about why we set this? I don't understand the full flow in Allocator.TransferLeaseTarget, but it's interesting to me that we have to set this here, but not elsewhere. For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway. Should we be doing so? @aayushshah15 do you know what this is about?

pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):

			if target == (roachpb.ReplicaDescriptor{}) {
				log.Infof(ctx, "could not find a better lease transfer target for r%d",
					desc.RangeID)

Should we return an error in this case? If the lease can't be transferred away from the current leaseholder, then there's no use trying to perform the config change which we know is going to fail.

Also, regardless of whether we log (log.Warningf) or return an error, let's add more information about why we're trying to transfer the lease and what the fact that we can't find a target means. It's not a good situation, and we will want to be alerted of it if a customer cluster ever gets stuck in this state.

pkg/kv/kvserver/replica_command.go, line 1213 at r1 (raw file):

					desc.RangeID)
			} else {
				log.Infof(ctx, "current leaseholder %v is being removed. Transferring lease to %v",

minor nit: s/. T/, t/

Also, s/is being removed/is being removed through an atomic replication change/

pkg/kv/kvserver/replica_command.go, line 2787 at r1 (raw file):

			}

			ops, err := s.relocateOne(ctx, &rangeDesc, voterTargets, nonVoterTargets)

I'm surprised we didn't need to revert this part of the change because of single-step config changes. What will happen if I have a config that looks like {1 [L], 2, 3} and I issue an AdminRelocateRange that drops the config down to {2 [L], 3}?

pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):

		prefix = false

		// The following deals with removing a leaseholder. A voter can be removed in two ways.

Consider pulling this into a separate commit/PR, separate from the rest of the behavior changes. This shouldn't have any visible effect, but it would be nice to validate that. A separate commit also has the benefit of being "bisectable" in case there's any rare fallout that we need to track back to this change.

Ideally, we could even test this by manually issuing each step of the joint config process (enter joint, transfer lease, exit joint), but it doesn't look like we have any testing that invokes execChangeReplicasTxn directly. @tbg are you aware of any tests at this level?

EDIT: maybe we can do something around TestLeaseHolderRemoveSelf to test this?

pkg/kv/kvserver/replicate_queue.go, line 1288 at r1 (raw file):

	// to disregard the existing lease counts on candidates.
	checkCandidateFullness bool
	// when true, ignores lease count convergence considerations

nit: missing punctuation.

pkg/roachpb/metadata.go, line 488 at r1 (raw file):

}

func (r ReplicaDescriptor) IsAnyVoter() bool {

This will need a comment. It will probably look like the two above.

pkg/roachpb/metadata_replicas.go, line 527 at r1 (raw file):

	if !ok {
		return errReplicaNotFound
	} else if !repDesc.IsVoterNewConfig() {

This should probably also go in its own commit/PR, with some testing, for the same reasons as listed above.

pkg/testutils/testcluster/testcluster.go, line 886 at r1 (raw file):

) {
	if err := tc.TransferRangeLease(rangeDesc, dest); err != nil {
		t.Fatalf(`couldn not transfer lease for range %s error is %+v`, rangeDesc, err)

"could not" or "couldn't"

pkg/testutils/testcluster/testcluster.go, line 897 at r1 (raw file):

) {
	testutils.SucceedsSoon(t, func() error {
		if _, err := tc.RemoveVoters(rangeDesc.StartKey.AsRawKey(), src); err != nil {

I'm surprised this works. Are we entering a joint config here?

pkg/kv/kvserver/client_lease_test.go, line 640 at r1 (raw file):

	tc.TransferRangeLeaseOrFatal(t, rhsDesc, tc.Target(0))

	// Check that the lease is on 1

nit: punctuation needed at the end of a lot of these comments.

pkg/kv/kvserver/client_lease_test.go, line 654 at r1 (raw file):

	require.Equal(t, tc.Target(2), leaseHolder)

	gossipLiveness(t, tc)

Why is this needed?

pkg/kv/kvserver/client_lease_test.go, line 657 at r1 (raw file):

	// Relocate range 3 -> 4.
	_, err = db.Exec("ALTER RANGE " + rhsDesc.RangeID.String() + " RELOCATE FROM 3 TO 4")

Isn't not very common to see SQL used in these tests to perform this kind of change. It's more common to see kv.DB.AdminChangeReplicas used, or even better, one of the TestCluster helper methods. Maybe we need a new helper method?

pkg/kv/kvserver/client_lease_test.go, line 665 at r1 (raw file):

	require.Equal(t, tc.Target(3), leaseHolder)

	// Double check that lease moved directly 3 -> 4

This is neat!

pkg/kv/kvserver/client_raft_test.go, line 3803 at r1 (raw file):

nit: stray line. That may go away with a gofmt.

pkg/kv/kvserver/client_relocate_range_test.go, line 239 at r1 (raw file):

		})
	}

tiny nit: based on the rest of this test, it seems like we wanted this new line for consistency.

nvb · 2021-12-21T02:42:13Z

I remembered another comment I meant to leave here. We'll need to think about mixed version clusters when making this change. We shouldn't begin a joint configuration that would require the removal of the leaseholder until after all nodes in the cluster are running v22.1.

tbg

Gave a shallow pass, generally looks good. Sad that all the messy lease stuff is now intertwining with the replication changes but I suppose that's what we signed up for. Still, if there's a way to clean that up I would be in favor of that. Maybe all of the lease handling doesn't have to be inline but can be pulled out into something compact that we can call in the right place.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @shralex)

pkg/kv/db.go, line 629 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Was this just added for debugging purposes, or did you intend for us to keep it?

These should go away, we shouldn't be logging in *kv.DB. If anything, these logs would be expected at the server that executes the actual transfer.

Also, general nits:

log messages start lowercase
don't need to call target.String, just target should do fine (%s verb invokes String())
we usually refer to stores as s%d in logging

pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):

				fallthrough
			case decideWithoutStats:
				if !opts.disableLeaseCountConvergenceChecks &&

was this just missing & you noticed while writing this PR?

pkg/kv/kvserver/replica_command.go, line 1218 at r1 (raw file):

				if err != nil {
					return nil, err
				}

Are we sure the lease is now where it should be? There's obviously the case where for some reason the lease gets transferred somewhere else by a third party. Could there be races with lease preferences where the new leaseholder immediately sends the lease back, and we never manage to do the replication change below? Similarly, are there persistent error cases where the lease transfer (or one of the steps above) fail? I assume there are no new concerns here since we were never previously able to remove the leaseholder and so nothing has really changed. But I am vaguely worried that what we do now isn't 100% equivalent and that we need a fail-safe (which the migration forces us to have anyway).

pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider pulling this into a separate commit/PR, separate from the rest of the behavior changes. This shouldn't have any visible effect, but it would be nice to validate that. A separate commit also has the benefit of being "bisectable" in case there's any rare fallout that we need to track back to this change.

Ideally, we could even test this by manually issuing each step of the joint config process (enter joint, transfer lease, exit joint), but it doesn't look like we have any testing that invokes execChangeReplicasTxn directly. @tbg are you aware of any tests at this level?

EDIT: maybe we can do something around TestLeaseHolderRemoveSelf to test this?

I don't think there's direct testing of this, unfortunately. Perhaps a reasonable way to get what we want is by inserting testing knobs between the steps that we can use to check on the replication change at various stages.

pkg/kv/kvserver/replica_raft.go, line 323 at r1 (raw file):

		// and then transfer lease directly to v4, but this would change the number of replicas to 4,
		// and if region1 goes down, we loose a quorum. Instead, we move to a joint config where v1
		// (VOTER_DEMOTED_LEARNER) transfer the lease to v4 (VOTER_INCOMING) directly.

DEMOTING

pkg/kv/kvserver/replicate_queue.go, line 1288 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: missing punctuation.

capitalization too and usually (though redundantly) the comment would say

// disableLeaseCountConvergenceChecks, when true, ...

pkg/kv/kvserver/client_raft_test.go, line 3826 at r1 (raw file):

	require.NoError(t, err)

	// Check that lease moved to server 2

.

aayushshah15

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)

pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):

For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway

I dont fully follow why we need this new option. disableLeaseCountConvergenceChecks as its currently written makes it such that we're forced to use followTheWorkload or return an empty result.

pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we return an error in this case? If the lease can't be transferred away from the current leaseholder, then there's no use trying to perform the config change which we know is going to fail.

Also, regardless of whether we log (log.Warningf) or return an error, let's add more information about why we're trying to transfer the lease and what the fact that we can't find a target means. It's not a good situation, and we will want to be alerted of it if a customer cluster ever gets stuck in this state.

+1 I'm not sure we need to call into TransferLeaseTarget at all here. Zone configuration lease preferences must be compatible with the constraints placed on the range. So we should never be trying to replace the current leaseholder with another replica that is in violation of those lease preferences. Given all this, I think we should be able to directly transfer the lease to the VOTER_INCOMING that is replacing the current leaseholder.

Another thing to note is that part of AdminRelocateRange's contract is that it must transfer the lease to the first voting replica in the input slice. This is done here at the end:

cockroach/pkg/kv/kvserver/replica_command.go

Lines 2791 to 2802 in 527263a

    
           if len(ops) == 0 { 
        
           	if len(voterTargets) > 0 { 
        
           		// NB: we may need to transfer even if there are no ops, to make 
        
           		// sure the attempt is made to make the first target the final 
        
           		// leaseholder. If there are ops, lease transfer will happen during joint config exit, 
        
           		// if leader is removed. 
        
           		if err := transferLease(voterTargets[0]); err != nil { 
        
           			return rangeDesc, err 
        
           		} 
        
           	} 
        
           	return rangeDesc, ctx.Err() 
        
           }

This further confuses this call into TransferLeaseTarget for me.

pkg/kv/kvserver/replica_command.go, line 2795 at r1 (raw file):

					// NB: we may need to transfer even if there are no ops, to make
					// sure the attempt is made to make the first target the final
					// leaseholder. If there are ops, lease transfer will happen during joint config exit,

nit: this addition is a bit out of place here, and wrapping.

pkg/kv/kvserver/replica_command.go, line 2796 at r1 (raw file):

					// sure the attempt is made to make the first target the final
					// leaseholder. If there are ops, lease transfer will happen during joint config exit,
					// if leader is removed.

s/leader/leaseholder

pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):

		prefix = false

		// The following deals with removing a leaseholder. A voter can be removed in two ways.

note that comments should be wrapped at 80 chars (see: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/181371303/Go+Golang+coding+guidelines#Line-Length) here and elsewhere.

aayushshah15 · 2021-12-21T17:18:01Z

We'll need to think about mixed version clusters when making this change. We shouldn't begin a joint configuration that would require the removal of the leaseholder until after all nodes in the cluster are running v22.1.

To add to this, we will also still need the removed allocator logic under mixed-version states.

aayushshah15 · 2021-12-21T17:20:04Z

pkg/kv/kvserver/replicate_queue.go, line 543 at r1 (raw file):

		// See about transferring the lease away if we're about to remove the
		// leaseholder.
		done, err := rq.maybeTransferLeaseAway(

This call should also probably go away.

shralex

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

-- commits, line 29 at r1:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider adding a release note to this commit. It's a valuable change that users will care about!

Sure, I'll add before this is merged, once we converge.

pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

was this just missing & you noticed while writing this PR?

I reverted this change, as the same can be achieved by setting checkTransferLeaseSource to false

pkg/kv/kvserver/replica_command.go, line 1177 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It looks like your gofmt is having issues. You should double-check that you have https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher set up correctly.

thanks! I set it up previously but it somehow disappeared...

pkg/kv/kvserver/replica_command.go, line 1203 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider adding a comment about why we set this to true. Something about having already filtered out the current leaseholder.

actually I think it should be false because of the lease count checks in the TransferLeaseTarget method

pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway

I dont fully follow why we need this new option. disableLeaseCountConvergenceChecks as its currently written makes it such that we're forced to use followTheWorkload or return an empty result.

What I was observing is that when the lease counts are roughly similar, this method would return an empty result. I now removed this new option, since checkTransferLeaseSource=false would achieve what I needed here.

pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Updated this to return an error.

I think we should be able to directly transfer the lease to the VOTER_INCOMING that is replacing the current leaseholder

It might be better to make this code a bit more general, such that it works even if there are multiple incoming nodes, or none. Also, there are cases where the joining node is not the best choice. Running the selection logic addresses both issues.

pkg/kv/kvserver/client_lease_test.go, line 654 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this needed?

Without this, the liveness status of candidate nodes is unknown, and they are excluded from consideration when we're trying to decide where to transfer the lease.

…descriptor Previosly, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards #74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior. Release note: None

Previously, VOTER_INCOMING replicas joining the cluster weren't allowed to receive the lease, even though there is no actual problem with doing so. This change removes the restriction, as a step towards #74077 Release note: None

…descriptor Previosly, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards #74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior. Release note: None

Previously, VOTER_INCOMING replicas joining the cluster weren't allowed to receive the lease, even though there is no actual problem with doing so. This change removes the restriction, as a step towards cockroachdb#74077 Release note: None

…descriptor Previosly, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards cockroachdb#74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior. Release note: None

74479: backupccl: replace memory accumulator with bound account r=dt a=adityamaru This change replaces the memory accumulator with a raw bound account. The memory accumulator provides a threadsafe wrapper around the bound account, and some redundant resource pooling semantics. Both of these are currently not needed by the sstSink that is being memory monitored. This change deletes the memory accumulator. Release note: None 74546: kvserver: allow VOTER_INCOMING to receive the lease r=shralex a=shralex Previously, VOTER_INCOMING replicas joining the cluster weren't allowed to receive the lease, even though there is no actual problem with doing so. This change removes the restriction, as a step towards #74077 Release note: None 74549: kvserver: determine if leaseholder is removed using proposed replica … r=shralex a=shralex …descriptor Previosly, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards #74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior. Release note: None Co-authored-by: Aditya Maru <adityamaru@gmail.com> Co-authored-by: shralex <shralex@gmail.com>

nvb

Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

pkg/sql/logictest/logic.go, line 1449 at r9 (raw file):

Previously, shralex wrote…

I agree we should investigate and improve this, but perhaps as a follow-up patch ? since this seems like a separate (or at least separable) issue.

I'm usually very pro-follow-up patch, but in this case, I think the flakiness is concerning enough to try to understand now. It may hint at a race that's going to make all of this unstable in practice. For instance, if we're not accounting for the learner snapshot when determining whether the voter_incoming can become the leaseholder, we may frequently roll back the entire config change. Learning a bit more about what's going on would help de-risk this change. If it is a testing-only flake then we can save the deflaking for a follow-up patch.

shralex

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

pkg/kv/kvserver/replica_command.go, line 2764 at r9 (raw file):

Previously, shralex wrote…

Done.

Done.

pkg/sql/logictest/logic.go, line 1449 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm usually very pro-follow-up patch, but in this case, I think the flakiness is concerning enough to try to understand now. It may hint at a race that's going to make all of this unstable in practice. For instance, if we're not accounting for the learner snapshot when determining whether the voter_incoming can become the leaseholder, we may frequently roll back the entire config change. Learning a bit more about what's going on would help de-risk this change. If it is a testing-only flake then we can save the deflaking for a follow-up patch.

Sounds good

shralex

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

pkg/sql/logictest/logic.go, line 1449 at r9 (raw file):

Previously, shralex wrote…

Sounds good

Raft progress, which we use to determine whether the replica may need a snapshot, is managed by Raft, which expects one more log update after the snapshot before marking progress as "replicating". Discussed this with @tbg and we decided to go with a simpler solution - to consider any replica in the VOTER_INCOMING state as one that we shouldn't filter out based on "may need a snapshot", since we only move to that state after it got a snapshot.

nvb

Reviewed 2 of 3 files at r10, 12 of 12 files at r11, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

pkg/kv/kvserver/allocator.go, line 1948 at r11 (raw file):

func replicaMayNeedSnapshot(raftStatus *raft.Status, replica roachpb.ReplicaDescriptor) bool {
	// When adding replicas, we only move them from LEARNER to VOTER_INCOMING after
	// they applied the snapshot

Do you mind pointing at the code that does this: initializeRaftLearners.

also, nit: punctuation.

pkg/kv/kvserver/replica_command.go, line 3101 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's somewhat disconcerting to see so many calls to IsActive(EnableLeaseHolderRemoval). The cluster version state is an atomic that can be changed at any time, so each of these calls needs to be the one that's prepared to see the active status switch from false to true. Furthermore, the cluster version is monotonic on a single node, but different nodes can see this active status switch at different times. Are we confident that we're handling things correctly in each case?

As a general design principle with version upgrades like this, it's helpful to try to find a single "gate" where the version can be checked. This eliminates the number of states that the system can be in midway through a version upgrade. With this change, we may need two:

a best-effort, graceful one in the replicateQueue to decide whether to attempt a leaseholder-removing replication change or whether to use the old logic

a strict one in Replica.propose to decide whether a leaseholder can enter a joint config that removes itself. This one is currently missing.

I guess we may also need one more somewhere in relocateReplicas to decide whether to attempt a leaseholder-removing replication change or whether to use the old logic.

I think this is my last remaining comment!

pkg/sql/logictest/logic.go, line 1449 at r9 (raw file):

Previously, shralex wrote…

Raft progress, which we use to determine whether the replica may need a snapshot, is managed by Raft, which expects one more log update after the snapshot before marking progress as "replicating". Discussed this with @tbg and we decided to go with a simpler solution - to consider any replica in the VOTER_INCOMING state as one that we shouldn't filter out based on "may need a snapshot", since we only move to that state after it got a snapshot.

That's a good idea. Glad we found a reasonable way to do this.

nvb

The latest round of changes

Reviewed 3 of 3 files at r12, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

shralex

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Do we understand the case(s) where TransferLeaseTarget would ever return an empty target when checkTransferLeaseSource = false? If it is somehow possible for TransferLeaseTarget to produce an empty target, then that seems like something we need to address because as Nathan was saying above, this means that the range would potentially never be able to get out of its joint state. We should have a log.Fatal() if this returns an empty target.

It looks like TransferLeaseTarget could return an empty target if all the other replicas of the range are on suspect or draining stores. The suspect state shouldn't be permanent (though it could last for a while), and the draining state is essentially permanent unless the node is restarted. We could perhaps fix this by having the caller ask TransferLeaseTarget to consider all candidate stores (i.e. use storeFilterNone instead of storeFilterSuspect).

Also, what do you think about lifting this call to TransferLeaseTarget into an appropriately named helper that has comments outlining why we need to ensure that it produces a different lease target?

This method is pretty complex and depends on many flags, so my confidence isn't very high. But with the current set of flags I'm providing (which I picked to do the least possible filtering of candidates), two cases where an empty set could still be returned seem to be suspect/unknown/draining nodes and nodes needing a snapshot. I added another test to make sure that if we have a preferred region but nodes aren't available there, this would not stop the lease from transferring somewhere. I also made sure that incoming nodes aren't filtered due to the snapshot heuristic. As for suspect/unknown/draining nodes, it seems to me that it is better to stay in the joint config until this is fixed for some node then transfer the lease to a node in this state, since this specific node isn't guaranteed to recover, and then we'd be waiting for re-election.

pkg/kv/kvserver/replica_command.go, line 1218 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could there be races with lease preferences where the new leaseholder immediately sends the lease back, and we never manage to do the replication change below?

I think the way this should work is that lease transfers to a voter in the process of being removed (VOTER_DEMOTING or VOTER_OUTGOING) should be disallowed, so once we move the lease away, nothing should be able to move it back.

Right, this check is done in CheckCanReceiveLease

aayushshah15

except for that one comment. It's exciting to see this land!

Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):

As for suspect/unknown/draining nodes, it seems to me that it is better to stay in the joint config until this is fixed for some node then transfer the lease to a node in this state, since this specific node isn't guaranteed to recover, and then we'd be waiting for re-election.

A range not being able to exit the joint config will wedge splits, merges, and all rebalancing on the range (including load-based rebalancing). I'm particularly worried about wedging rebalancing on the range since that affects our ability to get the range off of suspect stores.

Maybe its all fine but it adds an additional consideration that should at least be properly documented here. Did we consider pulling this call out into a helper that documents this discussion?

shralex

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

As for suspect/unknown/draining nodes, it seems to me that it is better to stay in the joint config until this is fixed for some node then transfer the lease to a node in this state, since this specific node isn't guaranteed to recover, and then we'd be waiting for re-election.

A range not being able to exit the joint config will wedge splits, merges, and all rebalancing on the range (including load-based rebalancing). I'm particularly worried about wedging rebalancing on the range since that affects our ability to get the range off of suspect stores.

Maybe its all fine but it adds an additional consideration that should at least be properly documented here. Did we consider pulling this call out into a helper that documents this discussion?

This seems like an inherent risk with this whole approach. I pulled this code into a separate method, please let me know if this is what you had in mind. One thing we could perhaps do is call TransferLeaseTarget before entering the joint config to see if there are suitable candidates and not enter the config if there aren't. It does not guarantee that we'll be able to exit, but perhaps reduces the vulnerability. wdyt ? if you like this idea, any suggestion where to make this call ?

Previously: 1. Removing a leaseholder was not allowed. 2. A VOTER_INCOMING node wasn't able to accept the lease. Because of (1), users needed to transfer the lease before removing the leaseholder. Because of (2), when relocating a range from the leaseholder A to a new node B, there was no possibility to transfer the lease to B before it was fully added as VOTER. Adding it as a voter, however, could degrade fault tolerance. For example, if A and B are in region R1, C in region R2 and D in R3, and we had (A, C, D), and now adding B to the cluster to replace A results in the intermediate configuration (A, B, C, D) the failure of R1 would make the cluster unavailable since no quorum can be established. Since B can't be added before A is removed, the system would transfer the lease out to C, remove A and add B, and then transfer the lease again to B. This resulted a temporary migration of leases out of their preferred region, imbalance of lease count and degraded performance. The PR fixes this, by (1) allowing removing the leaseholder, and transferring the lease right before we exit the JOINT config. And (2), allowing a VOTER_INCOMING to accept the lease. Release note (performance improvement): Fixes a limitation which meant that, upon adding a new node to the cluster, lease counts among existing nodes could diverge until the new node was fully upreplicated.

aayushshah15

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)

pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):

Previously, shralex wrote…

This seems like an inherent risk with this whole approach. I pulled this code into a separate method, please let me know if this is what you had in mind. One thing we could perhaps do is call TransferLeaseTarget before entering the joint config to see if there are suitable candidates and not enter the config if there aren't. It does not guarantee that we'll be able to exit, but perhaps reduces the vulnerability. wdyt ? if you like this idea, any suggestion where to make this call ?

Thanks, that is what I had in mind. We could do what you're suggesting (by faking the new incoming replica and passing that into TransferLeaseTarget), if so I'd suggest doing it where we're checking if we're removing the leaseholder before creating the learner replicas (and if so, if the version gate allows us to do so). We could also do that as a follow up patch.

shralex · 2022-02-18T23:34:05Z

bors r+

craig · 2022-02-19T00:32:43Z

Build succeeded:

GitHub CI (Cockroach)

shralex requested a review from a team as a code owner December 20, 2021 06:32

shralex requested review from aayushshah15 and nvb December 20, 2021 06:32

shralex force-pushed the remove_leaseholder branch from 304a581 to df65c65 Compare December 20, 2021 06:55

shralex linked an issue Dec 20, 2021 that may be closed by this pull request

kvserver: lease counts diverge when a new node is added to a cluster #67740

Closed

shralex force-pushed the remove_leaseholder branch 2 times, most recently from 2a227eb to 527263a Compare December 20, 2021 07:05

nvb reviewed Dec 20, 2021

View reviewed changes

tbg requested a review from nvb December 21, 2021 10:33

tbg reviewed Dec 21, 2021

View reviewed changes

tbg mentioned this pull request Dec 21, 2021

server: decommission->decommissioned transition causes abrupt loss of service, need to introduce final wait #72754

Open

aayushshah15 suggested changes Dec 21, 2021

View reviewed changes

shralex force-pushed the remove_leaseholder branch from 527263a to 4182acc Compare January 5, 2022 01:21

shralex commented Jan 5, 2022

View reviewed changes

shralex mentioned this pull request Jan 5, 2022

kvserver: determine if leaseholder is removed using proposed replica … #74438

Closed

shralex mentioned this pull request Jan 5, 2022

kvserver: allow VOTER_INCOMING to receive the lease #74439

Closed

shralex mentioned this pull request Jan 6, 2022

kvserver: allow VOTER_INCOMING to receive the lease #74546

Merged

shralex mentioned this pull request Jan 6, 2022

kvserver: determine if leaseholder is removed using proposed replica … #74549

Merged

nvb reviewed Feb 14, 2022

View reviewed changes

shralex commented Feb 14, 2022

View reviewed changes

shralex commented Feb 17, 2022

View reviewed changes

nvb reviewed Feb 17, 2022

View reviewed changes

nvb approved these changes Feb 17, 2022

View reviewed changes

shralex commented Feb 18, 2022

View reviewed changes

aayushshah15 approved these changes Feb 18, 2022

View reviewed changes

shralex commented Feb 18, 2022

View reviewed changes

aayushshah15 approved these changes Feb 18, 2022

View reviewed changes

cockroach-teamcity mentioned this pull request Feb 19, 2022

kvserver: lease transfer in JOINT configuration cockroachdb/docs#13074

Closed

erikgrinaker mentioned this pull request Feb 20, 2022

roachtest: kv50/rangelookups/relocate/nodes=8 failed [removing leaseholder not allowed] #76813

Closed

adityamaru mentioned this pull request Feb 21, 2022

roachtest: restoreTPCCInc/nodes=10 failed #76840

Closed

erikgrinaker mentioned this pull request Feb 21, 2022

ccl/importccl: TestMultiNodeExportStmt failed #76809

Closed

This was referenced Feb 22, 2022

kvserver: add new errors to IsExpectedRelocateError #76891

Merged

kvserver: consider VOTER_INCOMING for lease transfer even if current … #76974

Merged

msbutler mentioned this pull request Feb 25, 2022

ccl/backupccl: TestBackupRestoreDataDriven failed #76808

Closed

msbutler mentioned this pull request Mar 9, 2022

ccl/backupccl: TestBackupRestoreDataDriven failed #77264

Closed

shralex mentioned this pull request Mar 15, 2022

kvserver: transfer lease in JOINT config directly to a VOTER_INCOMING… #77841

Merged

shralex mentioned this pull request Mar 26, 2022

release-22.1: Backport of 77841 and 78438 (lease transfer in joint config) #78502

Merged

ajwerner mentioned this pull request Mar 28, 2022

rowexec: TestUncertaintyErrorIsReturned is flakey #78610

Closed

nvb mentioned this pull request Feb 8, 2023

backupccl: restore fails when issuing an AdminSplit #96746

Closed

This was referenced May 16, 2024

kvserver: scatter with rf=1 doesn't randomize replica placement #124171

Closed

kvserver: rebalance ranges with one voter using joint configurations #124284

Merged

nvb mentioned this pull request May 20, 2024

storage: allow removing leaseholder in ChangeReplicas #40333

Closed

blathers-crl bot mentioned this pull request May 20, 2024

release-24.1: kvserver: rebalance ranges with one voter using joint configurations #124453

Merged

This was referenced May 21, 2024

release-23.2: kvserver: rebalance ranges with one voter using joint configurations #124487

Merged

release-23.1: kvserver: rebalance ranges with one voter using joint configurations #124500

Merged

	if len(ops) == 0 {
	if len(voterTargets) > 0 {
	// NB: we may need to transfer even if there are no ops, to make
	// sure the attempt is made to make the first target the final
	// leaseholder. If there are ops, lease transfer will happen during joint config exit,
	// if leader is removed.
	if err := transferLease(voterTargets[0]); err != nil {
	return rangeDesc, err
	}
	}
	return rangeDesc, ctx.Err()
	}

Conversation

shralex commented Dec 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Dec 20, 2021

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

nvb commented Dec 21, 2021

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

aayushshah15 left a comment

Choose a reason for hiding this comment

Uh oh!

aayushshah15 commented Dec 21, 2021

Uh oh!

aayushshah15 commented Dec 21, 2021

Uh oh!

shralex left a comment

Choose a reason for hiding this comment

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

shralex left a comment

Choose a reason for hiding this comment

Uh oh!

shralex left a comment

Choose a reason for hiding this comment

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

shralex left a comment

Choose a reason for hiding this comment

Uh oh!

aayushshah15 left a comment

Choose a reason for hiding this comment

Uh oh!

shralex left a comment

Choose a reason for hiding this comment

Uh oh!

aayushshah15 left a comment

Choose a reason for hiding this comment

Uh oh!

shralex commented Feb 18, 2022

Uh oh!

craig bot commented Feb 19, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

shralex commented Dec 20, 2021 •

edited

Loading