kvserver: lease transfer in JOINT configuration#74077
kvserver: lease transfer in JOINT configuration#74077craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
304a581 to
df65c65
Compare
2a227eb to
527263a
Compare
nvb
left a comment
There was a problem hiding this comment.
It's exciting to see this! When you get a chance, let's try out the repro steps from #67740 (comment) and see how things behave with this change.
@tbg should probably also give this a pass when it gets closer, as he was the author of a lot of this joint configuration code. I also left a comment for him in replica_raft.go.
Reviewed 15 of 17 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @shralex)
-- commits, line 14 at r1:
nit: s/a/A/
-- commits, line 29 at r1:
Consider adding a release note to this commit. It's a valuable change that users will care about!
pkg/kv/db.go, line 629 at r1 (raw file):
err := db.Run(ctx, b) if err == nil { log.Infof(ctx, "Transferring lease to StoreID %s succeeded", target.String())
Was this just added for debugging purposes, or did you intend for us to keep it?
pkg/kv/kvserver/replica_command.go, line 985 at r1 (raw file):
details string, chgs roachpb.ReplicationChanges, repl *Replica,
Isn't this already the method receiver?
pkg/kv/kvserver/replica_command.go, line 1172 at r1 (raw file):
// NON_VOTER, and VOTER_FULL only. func maybeLeaveAtomicChangeReplicas( ctx context.Context, s *Store, desc *roachpb.RangeDescriptor, r *Replica,
Instead of adding this as a parameter, what do you think about making maybeLeaveAtomicChangeReplicas and maybeLeaveAtomicChangeReplicasAndRemoveLearners methods on *Replica? That way, we can actually get rid of all arguments except for the context.
pkg/kv/kvserver/replica_command.go, line 1177 at r1 (raw file):
// with leaving a joint state when it's in one, so make sure we don't call // it if we're not. if !desc.Replicas().InAtomicReplicationChange(){
It looks like your gofmt is having issues. You should double-check that you have https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher set up correctly.
pkg/kv/kvserver/replica_command.go, line 1184 at r1 (raw file):
voters := desc.Replicas().VoterDescriptors() if r != nil {
The fact that we can ignore this check seems like a source of bugs. The need for this is coming from the call to maybeLeaveAtomicChangeReplicasAndRemoveLearners on the RHS range in mergeQueue.process, right? Did you run into issues with the proposed idea to remove that call entirely and enter the AdminRelocateRange branch when the RHS is in a joint config instead?
pkg/kv/kvserver/replica_command.go, line 1185 at r1 (raw file):
voters := desc.Replicas().VoterDescriptors() if r != nil { // Current leaseholder is being removed
Let's add more to this comment. We can discuss why we're looking in the voters slice, what it means for the current replica (which we know to be the leaseholder) to be absent from it, and what we will do in that case (transfer the lease).
pkg/kv/kvserver/replica_command.go, line 1203 at r1 (raw file):
transferLeaseOptions{ goal: followTheWorkload, checkTransferLeaseSource: true,
Consider adding a comment about why we set this to true. Something about having already filtered out the current leaseholder.
pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):
checkTransferLeaseSource: true, checkCandidateFullness: false, disableLeaseCountConvergenceChecks: true,
Could you add a comment about why we set this? I don't understand the full flow in Allocator.TransferLeaseTarget, but it's interesting to me that we have to set this here, but not elsewhere. For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway. Should we be doing so? @aayushshah15 do you know what this is about?
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
if target == (roachpb.ReplicaDescriptor{}) { log.Infof(ctx, "could not find a better lease transfer target for r%d", desc.RangeID)
Should we return an error in this case? If the lease can't be transferred away from the current leaseholder, then there's no use trying to perform the config change which we know is going to fail.
Also, regardless of whether we log (log.Warningf) or return an error, let's add more information about why we're trying to transfer the lease and what the fact that we can't find a target means. It's not a good situation, and we will want to be alerted of it if a customer cluster ever gets stuck in this state.
pkg/kv/kvserver/replica_command.go, line 1213 at r1 (raw file):
desc.RangeID) } else { log.Infof(ctx, "current leaseholder %v is being removed. Transferring lease to %v",
minor nit: s/. T/, t/
Also, s/is being removed/is being removed through an atomic replication change/
pkg/kv/kvserver/replica_command.go, line 2787 at r1 (raw file):
} ops, err := s.relocateOne(ctx, &rangeDesc, voterTargets, nonVoterTargets)
I'm surprised we didn't need to revert this part of the change because of single-step config changes. What will happen if I have a config that looks like {1 [L], 2, 3} and I issue an AdminRelocateRange that drops the config down to {2 [L], 3}?
pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):
prefix = false // The following deals with removing a leaseholder. A voter can be removed in two ways.
Consider pulling this into a separate commit/PR, separate from the rest of the behavior changes. This shouldn't have any visible effect, but it would be nice to validate that. A separate commit also has the benefit of being "bisectable" in case there's any rare fallout that we need to track back to this change.
Ideally, we could even test this by manually issuing each step of the joint config process (enter joint, transfer lease, exit joint), but it doesn't look like we have any testing that invokes execChangeReplicasTxn directly. @tbg are you aware of any tests at this level?
EDIT: maybe we can do something around TestLeaseHolderRemoveSelf to test this?
pkg/kv/kvserver/replicate_queue.go, line 1288 at r1 (raw file):
// to disregard the existing lease counts on candidates. checkCandidateFullness bool // when true, ignores lease count convergence considerations
nit: missing punctuation.
pkg/roachpb/metadata.go, line 488 at r1 (raw file):
} func (r ReplicaDescriptor) IsAnyVoter() bool {
This will need a comment. It will probably look like the two above.
pkg/roachpb/metadata_replicas.go, line 527 at r1 (raw file):
if !ok { return errReplicaNotFound } else if !repDesc.IsVoterNewConfig() {
This should probably also go in its own commit/PR, with some testing, for the same reasons as listed above.
pkg/testutils/testcluster/testcluster.go, line 886 at r1 (raw file):
) { if err := tc.TransferRangeLease(rangeDesc, dest); err != nil { t.Fatalf(`couldn not transfer lease for range %s error is %+v`, rangeDesc, err)
"could not" or "couldn't"
pkg/testutils/testcluster/testcluster.go, line 897 at r1 (raw file):
) { testutils.SucceedsSoon(t, func() error { if _, err := tc.RemoveVoters(rangeDesc.StartKey.AsRawKey(), src); err != nil {
I'm surprised this works. Are we entering a joint config here?
pkg/kv/kvserver/client_lease_test.go, line 640 at r1 (raw file):
tc.TransferRangeLeaseOrFatal(t, rhsDesc, tc.Target(0)) // Check that the lease is on 1
nit: punctuation needed at the end of a lot of these comments.
pkg/kv/kvserver/client_lease_test.go, line 654 at r1 (raw file):
require.Equal(t, tc.Target(2), leaseHolder) gossipLiveness(t, tc)
Why is this needed?
pkg/kv/kvserver/client_lease_test.go, line 657 at r1 (raw file):
// Relocate range 3 -> 4. _, err = db.Exec("ALTER RANGE " + rhsDesc.RangeID.String() + " RELOCATE FROM 3 TO 4")
Isn't not very common to see SQL used in these tests to perform this kind of change. It's more common to see kv.DB.AdminChangeReplicas used, or even better, one of the TestCluster helper methods. Maybe we need a new helper method?
pkg/kv/kvserver/client_lease_test.go, line 665 at r1 (raw file):
require.Equal(t, tc.Target(3), leaseHolder) // Double check that lease moved directly 3 -> 4
This is neat!
pkg/kv/kvserver/client_raft_test.go, line 3803 at r1 (raw file):
}
nit: stray line. That may go away with a gofmt.
pkg/kv/kvserver/client_relocate_range_test.go, line 239 at r1 (raw file):
}) }
tiny nit: based on the rest of this test, it seems like we wanted this new line for consistency.
|
I remembered another comment I meant to leave here. We'll need to think about mixed version clusters when making this change. We shouldn't begin a joint configuration that would require the removal of the leaseholder until after all nodes in the cluster are running v22.1. |
tbg
left a comment
There was a problem hiding this comment.
Gave a shallow pass, generally looks good. Sad that all the messy lease stuff is now intertwining with the replication changes but I suppose that's what we signed up for. Still, if there's a way to clean that up I would be in favor of that. Maybe all of the lease handling doesn't have to be inline but can be pulled out into something compact that we can call in the right place.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @shralex)
pkg/kv/db.go, line 629 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Was this just added for debugging purposes, or did you intend for us to keep it?
These should go away, we shouldn't be logging in *kv.DB. If anything, these logs would be expected at the server that executes the actual transfer.
Also, general nits:
- log messages start lowercase
- don't need to call
target.String, justtargetshould do fine (%sverb invokesString()) - we usually refer to stores as
s%din logging
pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):
fallthrough case decideWithoutStats: if !opts.disableLeaseCountConvergenceChecks &&
was this just missing & you noticed while writing this PR?
pkg/kv/kvserver/replica_command.go, line 1218 at r1 (raw file):
if err != nil { return nil, err }
Are we sure the lease is now where it should be? There's obviously the case where for some reason the lease gets transferred somewhere else by a third party. Could there be races with lease preferences where the new leaseholder immediately sends the lease back, and we never manage to do the replication change below? Similarly, are there persistent error cases where the lease transfer (or one of the steps above) fail? I assume there are no new concerns here since we were never previously able to remove the leaseholder and so nothing has really changed. But I am vaguely worried that what we do now isn't 100% equivalent and that we need a fail-safe (which the migration forces us to have anyway).
pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider pulling this into a separate commit/PR, separate from the rest of the behavior changes. This shouldn't have any visible effect, but it would be nice to validate that. A separate commit also has the benefit of being "bisectable" in case there's any rare fallout that we need to track back to this change.
Ideally, we could even test this by manually issuing each step of the joint config process (enter joint, transfer lease, exit joint), but it doesn't look like we have any testing that invokes
execChangeReplicasTxndirectly. @tbg are you aware of any tests at this level?EDIT: maybe we can do something around
TestLeaseHolderRemoveSelfto test this?
I don't think there's direct testing of this, unfortunately. Perhaps a reasonable way to get what we want is by inserting testing knobs between the steps that we can use to check on the replication change at various stages.
pkg/kv/kvserver/replica_raft.go, line 323 at r1 (raw file):
// and then transfer lease directly to v4, but this would change the number of replicas to 4, // and if region1 goes down, we loose a quorum. Instead, we move to a joint config where v1 // (VOTER_DEMOTED_LEARNER) transfer the lease to v4 (VOTER_INCOMING) directly.
DEMOTING
pkg/kv/kvserver/replicate_queue.go, line 1288 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: missing punctuation.
capitalization too and usually (though redundantly) the comment would say
// disableLeaseCountConvergenceChecks, when true, ...
pkg/kv/kvserver/client_raft_test.go, line 3826 at r1 (raw file):
require.NoError(t, err) // Check that lease moved to server 2
.
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)
pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):
For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway
I dont fully follow why we need this new option. disableLeaseCountConvergenceChecks as its currently written makes it such that we're forced to use followTheWorkload or return an empty result.
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we return an error in this case? If the lease can't be transferred away from the current leaseholder, then there's no use trying to perform the config change which we know is going to fail.
Also, regardless of whether we log (
log.Warningf) or return an error, let's add more information about why we're trying to transfer the lease and what the fact that we can't find a target means. It's not a good situation, and we will want to be alerted of it if a customer cluster ever gets stuck in this state.
+1 I'm not sure we need to call into TransferLeaseTarget at all here. Zone configuration lease preferences must be compatible with the constraints placed on the range. So we should never be trying to replace the current leaseholder with another replica that is in violation of those lease preferences. Given all this, I think we should be able to directly transfer the lease to the VOTER_INCOMING that is replacing the current leaseholder.
Another thing to note is that part of AdminRelocateRange's contract is that it must transfer the lease to the first voting replica in the input slice. This is done here at the end:
cockroach/pkg/kv/kvserver/replica_command.go
Lines 2791 to 2802 in 527263a
TransferLeaseTarget for me.
pkg/kv/kvserver/replica_command.go, line 2795 at r1 (raw file):
// NB: we may need to transfer even if there are no ops, to make // sure the attempt is made to make the first target the final // leaseholder. If there are ops, lease transfer will happen during joint config exit,
nit: this addition is a bit out of place here, and wrapping.
pkg/kv/kvserver/replica_command.go, line 2796 at r1 (raw file):
// sure the attempt is made to make the first target the final // leaseholder. If there are ops, lease transfer will happen during joint config exit, // if leader is removed.
s/leader/leaseholder
pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):
prefix = false // The following deals with removing a leaseholder. A voter can be removed in two ways.
note that comments should be wrapped at 80 chars (see: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/181371303/Go+Golang+coding+guidelines#Line-Length) here and elsewhere.
To add to this, we will also still need the removed allocator logic under mixed-version states. |
|
pkg/kv/kvserver/replicate_queue.go, line 543 at r1 (raw file):
This call should also probably go away. |
527263a to
4182acc
Compare
shralex
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider adding a release note to this commit. It's a valuable change that users will care about!
Sure, I'll add before this is merged, once we converge.
pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
was this just missing & you noticed while writing this PR?
I reverted this change, as the same can be achieved by setting checkTransferLeaseSource to false
pkg/kv/kvserver/replica_command.go, line 1177 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It looks like your gofmt is having issues. You should double-check that you have https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher set up correctly.
thanks! I set it up previously but it somehow disappeared...
pkg/kv/kvserver/replica_command.go, line 1203 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider adding a comment about why we set this to
true. Something about having already filtered out the current leaseholder.
actually I think it should be false because of the lease count checks in the TransferLeaseTarget method
pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway
I dont fully follow why we need this new option.
disableLeaseCountConvergenceChecksas its currently written makes it such that we're forced to usefollowTheWorkloador return an empty result.
What I was observing is that when the lease counts are roughly similar, this method would return an empty result. I now removed this new option, since checkTransferLeaseSource=false would achieve what I needed here.
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Updated this to return an error.
I think we should be able to directly transfer the lease to the VOTER_INCOMING that is replacing the current leaseholder
It might be better to make this code a bit more general, such that it works even if there are multiple incoming nodes, or none. Also, there are cases where the joining node is not the best choice. Running the selection logic addresses both issues.
pkg/kv/kvserver/client_lease_test.go, line 654 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this needed?
Without this, the liveness status of candidate nodes is unknown, and they are excluded from consideration when we're trying to decide where to transfer the lease.
…descriptor Previosly, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards #74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior. Release note: None
Previously, VOTER_INCOMING replicas joining the cluster weren't allowed to receive the lease, even though there is no actual problem with doing so. This change removes the restriction, as a step towards #74077 Release note: None
Previously, VOTER_INCOMING replicas joining the cluster weren't allowed to receive the lease, even though there is no actual problem with doing so. This change removes the restriction, as a step towards #74077 Release note: None
…descriptor Previosly, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards #74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior. Release note: None
Previously, VOTER_INCOMING replicas joining the cluster weren't allowed to receive the lease, even though there is no actual problem with doing so. This change removes the restriction, as a step towards cockroachdb#74077 Release note: None
…descriptor Previosly, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards cockroachdb#74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior. Release note: None
74479: backupccl: replace memory accumulator with bound account r=dt a=adityamaru This change replaces the memory accumulator with a raw bound account. The memory accumulator provides a threadsafe wrapper around the bound account, and some redundant resource pooling semantics. Both of these are currently not needed by the sstSink that is being memory monitored. This change deletes the memory accumulator. Release note: None 74546: kvserver: allow VOTER_INCOMING to receive the lease r=shralex a=shralex Previously, VOTER_INCOMING replicas joining the cluster weren't allowed to receive the lease, even though there is no actual problem with doing so. This change removes the restriction, as a step towards #74077 Release note: None 74549: kvserver: determine if leaseholder is removed using proposed replica … r=shralex a=shralex …descriptor Previosly, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards #74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior. Release note: None Co-authored-by: Aditya Maru <adityamaru@gmail.com> Co-authored-by: shralex <shralex@gmail.com>
nvb
left a comment
There was a problem hiding this comment.
Reviewed all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
pkg/sql/logictest/logic.go, line 1449 at r9 (raw file):
Previously, shralex wrote…
I agree we should investigate and improve this, but perhaps as a follow-up patch ? since this seems like a separate (or at least separable) issue.
I'm usually very pro-follow-up patch, but in this case, I think the flakiness is concerning enough to try to understand now. It may hint at a race that's going to make all of this unstable in practice. For instance, if we're not accounting for the learner snapshot when determining whether the voter_incoming can become the leaseholder, we may frequently roll back the entire config change. Learning a bit more about what's going on would help de-risk this change. If it is a testing-only flake then we can save the deflaking for a follow-up patch.
shralex
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
pkg/kv/kvserver/replica_command.go, line 2764 at r9 (raw file):
Previously, shralex wrote…
Done.
Done.
pkg/sql/logictest/logic.go, line 1449 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm usually very pro-follow-up patch, but in this case, I think the flakiness is concerning enough to try to understand now. It may hint at a race that's going to make all of this unstable in practice. For instance, if we're not accounting for the learner snapshot when determining whether the voter_incoming can become the leaseholder, we may frequently roll back the entire config change. Learning a bit more about what's going on would help de-risk this change. If it is a testing-only flake then we can save the deflaking for a follow-up patch.
Sounds good
shralex
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
pkg/sql/logictest/logic.go, line 1449 at r9 (raw file):
Previously, shralex wrote…
Sounds good
Raft progress, which we use to determine whether the replica may need a snapshot, is managed by Raft, which expects one more log update after the snapshot before marking progress as "replicating". Discussed this with @tbg and we decided to go with a simpler solution - to consider any replica in the VOTER_INCOMING state as one that we shouldn't filter out based on "may need a snapshot", since we only move to that state after it got a snapshot.
nvb
left a comment
There was a problem hiding this comment.
Reviewed 2 of 3 files at r10, 12 of 12 files at r11, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
pkg/kv/kvserver/allocator.go, line 1948 at r11 (raw file):
func replicaMayNeedSnapshot(raftStatus *raft.Status, replica roachpb.ReplicaDescriptor) bool { // When adding replicas, we only move them from LEARNER to VOTER_INCOMING after // they applied the snapshot
Do you mind pointing at the code that does this: initializeRaftLearners.
also, nit: punctuation.
pkg/kv/kvserver/replica_command.go, line 3101 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It's somewhat disconcerting to see so many calls to
IsActive(EnableLeaseHolderRemoval). The cluster version state is an atomic that can be changed at any time, so each of these calls needs to be the one that's prepared to see the active status switch fromfalsetotrue. Furthermore, the cluster version is monotonic on a single node, but different nodes can see this active status switch at different times. Are we confident that we're handling things correctly in each case?As a general design principle with version upgrades like this, it's helpful to try to find a single "gate" where the version can be checked. This eliminates the number of states that the system can be in midway through a version upgrade. With this change, we may need two:
- a best-effort, graceful one in the
replicateQueueto decide whether to attempt a leaseholder-removing replication change or whether to use the old logic- a strict one in
Replica.proposeto decide whether a leaseholder can enter a joint config that removes itself. This one is currently missing.I guess we may also need one more somewhere in
relocateReplicasto decide whether to attempt a leaseholder-removing replication change or whether to use the old logic.
I think this is my last remaining comment!
pkg/sql/logictest/logic.go, line 1449 at r9 (raw file):
Previously, shralex wrote…
Raft progress, which we use to determine whether the replica may need a snapshot, is managed by Raft, which expects one more log update after the snapshot before marking progress as "replicating". Discussed this with @tbg and we decided to go with a simpler solution - to consider any replica in the VOTER_INCOMING state as one that we shouldn't filter out based on "may need a snapshot", since we only move to that state after it got a snapshot.
That's a good idea. Glad we found a reasonable way to do this.
nvb
left a comment
There was a problem hiding this comment.
Reviewed 3 of 3 files at r12, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
shralex
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Do we understand the case(s) where
TransferLeaseTargetwould ever return an empty target whencheckTransferLeaseSource = false? If it is somehow possible forTransferLeaseTargetto produce an empty target, then that seems like something we need to address because as Nathan was saying above, this means that the range would potentially never be able to get out of its joint state. We should have alog.Fatal()if this returns an empty target.It looks like
TransferLeaseTargetcould return an empty target if all the other replicas of the range are onsuspectordrainingstores. Thesuspectstate shouldn't be permanent (though it could last for a while), and thedrainingstate is essentially permanent unless the node is restarted. We could perhaps fix this by having the caller askTransferLeaseTargetto consider all candidate stores (i.e. usestoreFilterNoneinstead ofstoreFilterSuspect).Also, what do you think about lifting this call to
TransferLeaseTargetinto an appropriately named helper that has comments outlining why we need to ensure that it produces a different lease target?
This method is pretty complex and depends on many flags, so my confidence isn't very high. But with the current set of flags I'm providing (which I picked to do the least possible filtering of candidates), two cases where an empty set could still be returned seem to be suspect/unknown/draining nodes and nodes needing a snapshot. I added another test to make sure that if we have a preferred region but nodes aren't available there, this would not stop the lease from transferring somewhere. I also made sure that incoming nodes aren't filtered due to the snapshot heuristic. As for suspect/unknown/draining nodes, it seems to me that it is better to stay in the joint config until this is fixed for some node then transfer the lease to a node in this state, since this specific node isn't guaranteed to recover, and then we'd be waiting for re-election.
pkg/kv/kvserver/replica_command.go, line 1218 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could there be races with lease preferences where the new leaseholder immediately sends the lease back, and we never manage to do the replication change below?
I think the way this should work is that lease transfers to a voter in the process of being removed (
VOTER_DEMOTINGorVOTER_OUTGOING) should be disallowed, so once we move the lease away, nothing should be able to move it back.
Right, this check is done in CheckCanReceiveLease
aayushshah15
left a comment
There was a problem hiding this comment.
except for that one comment. It's exciting to see this land!
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
As for suspect/unknown/draining nodes, it seems to me that it is better to stay in the joint config until this is fixed for some node then transfer the lease to a node in this state, since this specific node isn't guaranteed to recover, and then we'd be waiting for re-election.
A range not being able to exit the joint config will wedge splits, merges, and all rebalancing on the range (including load-based rebalancing). I'm particularly worried about wedging rebalancing on the range since that affects our ability to get the range off of suspect stores.
Maybe its all fine but it adds an additional consideration that should at least be properly documented here. Did we consider pulling this call out into a helper that documents this discussion?
shralex
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
As for suspect/unknown/draining nodes, it seems to me that it is better to stay in the joint config until this is fixed for some node then transfer the lease to a node in this state, since this specific node isn't guaranteed to recover, and then we'd be waiting for re-election.
A range not being able to exit the joint config will wedge splits, merges, and all rebalancing on the range (including load-based rebalancing). I'm particularly worried about wedging rebalancing on the range since that affects our ability to get the range off of suspect stores.
Maybe its all fine but it adds an additional consideration that should at least be properly documented here. Did we consider pulling this call out into a helper that documents this discussion?
This seems like an inherent risk with this whole approach. I pulled this code into a separate method, please let me know if this is what you had in mind. One thing we could perhaps do is call TransferLeaseTarget before entering the joint config to see if there are suitable candidates and not enter the config if there aren't. It does not guarantee that we'll be able to exit, but perhaps reduces the vulnerability. wdyt ? if you like this idea, any suggestion where to make this call ?
Previously: 1. Removing a leaseholder was not allowed. 2. A VOTER_INCOMING node wasn't able to accept the lease. Because of (1), users needed to transfer the lease before removing the leaseholder. Because of (2), when relocating a range from the leaseholder A to a new node B, there was no possibility to transfer the lease to B before it was fully added as VOTER. Adding it as a voter, however, could degrade fault tolerance. For example, if A and B are in region R1, C in region R2 and D in R3, and we had (A, C, D), and now adding B to the cluster to replace A results in the intermediate configuration (A, B, C, D) the failure of R1 would make the cluster unavailable since no quorum can be established. Since B can't be added before A is removed, the system would transfer the lease out to C, remove A and add B, and then transfer the lease again to B. This resulted a temporary migration of leases out of their preferred region, imbalance of lease count and degraded performance. The PR fixes this, by (1) allowing removing the leaseholder, and transferring the lease right before we exit the JOINT config. And (2), allowing a VOTER_INCOMING to accept the lease. Release note (performance improvement): Fixes a limitation which meant that, upon adding a new node to the cluster, lease counts among existing nodes could diverge until the new node was fully upreplicated.
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Previously, shralex wrote…
This seems like an inherent risk with this whole approach. I pulled this code into a separate method, please let me know if this is what you had in mind. One thing we could perhaps do is call TransferLeaseTarget before entering the joint config to see if there are suitable candidates and not enter the config if there aren't. It does not guarantee that we'll be able to exit, but perhaps reduces the vulnerability. wdyt ? if you like this idea, any suggestion where to make this call ?
Thanks, that is what I had in mind. We could do what you're suggesting (by faking the new incoming replica and passing that into TransferLeaseTarget), if so I'd suggest doing it where we're checking if we're removing the leaseholder before creating the learner replicas (and if so, if the version gate allows us to do so). We could also do that as a follow up patch.
|
bors r+ |
|
Build succeeded: |
Previously:
Because of (1), users needed to transfer the lease before removing
the leaseholder. Because of (2), when relocating a range from the
leaseholder A to a new node B, there was no possibility to transfer
the lease to B before it was fully added as VOTER. Adding it as a
voter, however, could degrade fault tolerance. For example, if A
and B are in region R1, C in region R2 and D in R3, and we had
(A, C, D), and now adding B to the cluster to replace A results in
the intermediate configuration (A, B, C, D) the failure of R1 would
make the cluster unavailable since no quorum can be established.
Since B can't be added before A is removed, the system would
transfer the lease out to C, remove A and add B, and then transfer
the lease again to B. This resulted a temporary migration of leases
out of their preferred region, imbalance of lease count and degraded
performance.
The PR fixes this, by (1) allowing removing the leaseholder, and
transferring the lease right before we exit the JOINT config. And (2),
allowing a VOTER_INCOMING to accept the lease.
Release note (performance improvement): Fixes a limitation which meant
that, upon adding a new node to the cluster, lease counts among existing
nodes could diverge until the new node was fully upreplicated.
Here are a few experiments that demonstrate the benefit of the feature.
1.
Without the change (master):

With the change:

We can see that without the patch the number of leases on server 0 (black line) goes all the way to 0 before it goes back up and that the number of leases in other racks goes up, both undesirable. With the patch both things are no longer happening.
ALTER RANGE default CONFIGURE ZONE USING lease_preferences='[[+rack=0]]';
Without the change (master):

With the change:

We can see that without the change the number of leaseholders in racks 1 and 2 together (not in preferred region) grows from 300 to 1000, then goes back to 40. With the fix it doesn’t grow at all.