kvserver: stop transferring leases to replicas that may need snapshots #69696
Conversation
Force-pushed from 6fc38bd to df44b32
nvb
left a comment
This approach seems very reasonable.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):
ctx context.Context, conf roachpb.SpanConfig, leaseRepl *Replica,
It would be nice to test this behavior. To do that, you might want to replace the *Replica type with an interface. Even an inline interface like we have in findRemoveVoter would be enough to make this easy to test in unit tests.
pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):
ctx context.Context, conf roachpb.SpanConfig, leaseRepl *Replica,
nit: move this next to leaseStoreID.
pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):
ctx context.Context, conf roachpb.SpanConfig, leaseRepl *Replica,
nit: add an // optional comment.
pkg/kv/kvserver/allocator.go, line 1787 at r1 (raw file):
// a snapshot iff it is in `StateReplicate`. However, even this is racey
// because we can still possibly have an ill-timed log truncation between
// when we make this determination and when we act on it.
Could you add to this comment why we don't make this as strict as replicaIsBehind? It has to do with needing to transfer leases to lagging followers in some cases, right? Didn't we have an issue about that at one point that led to the removal of this protection?
Edit: here's the commit that added the protection 83128c9, and here's the commit that removed the protection 371738c.
pkg/kv/kvserver/allocator.go, line 1802 at r1 (raw file):
raftStatus *raft.Status, replicas []roachpb.ReplicaDescriptor,
) (result []roachpb.ReplicaDescriptor) {
	result = make([]roachpb.ReplicaDescriptor, 0, len(replicas))
nit: Does this need to re-allocate? Can't we re-use the provided slice?
pkg/kv/kvserver/allocator.go, line 1807 at r1 (raw file):
if !replicaMayNeedSnapshot(raftStatus, repl.ReplicaID) {
	result = append(result, repl)
}
Do you think it's worth logging on an else branch when we filter out a candidate?
Force-pushed from aedd24d to 8e136af
aayushshah15
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It would be nice to test this behavior. To do that, you might want to replace the `*Replica` type with an interface. Even an inline interface like we have in `findRemoveVoter` would be enough to make this easy to test in unit tests.
thanks for the suggestion, done.
pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move this next to `leaseStoreID`.
done.
pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: add an `// optional` comment.
done.
pkg/kv/kvserver/allocator.go, line 1787 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add to this comment why we don't make this as strict as `replicaIsBehind`? It has to do with needing to transfer leases to lagging followers in some cases, right? Didn't we have an issue about that at one point that led to the removal of this protection?
Edit: here's the commit that added the protection 83128c9, and here's the commit that removed the protection 371738c.
Yeah, I found it a bit odd how we removed the protection from the replicateQueue but did not from the StoreRebalancer. The only rationale I can think of is that we'd likely hurt the range's QPS in the short term by transferring it to a lagging replica.
pkg/kv/kvserver/allocator.go, line 1802 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: Does this need to re-allocate? Can't we re-use the provided slice?
done.
pkg/kv/kvserver/allocator.go, line 1807 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you think it's worth logging on an else branch when we filter out a candidate?
done.
Force-pushed from 8e136af to 25835a8
nvb
left a comment
Reviewed 2 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
done.
I'm wondering whether it's worth keeping this optional just for the sake of a few tests. Allowing an input to be optional leads to more branching and harder code to reason about. And if we're going to change that, we might as well consolidate these two args to make it clear that they are in reference to the same object.
What do you think about replacing this with:
    leaseRepl interface {
    	StoreID() roachpb.StoreID
    	RaftStatus() *raft.Status
    },

Then in tests, you can pass some mock like:

    type mockLeaseRepl roachpb.StoreID

    func (m mockLeaseRepl) StoreID() roachpb.StoreID { return roachpb.StoreID(m) }
    func (m mockLeaseRepl) RaftStatus() *raft.Status { return nil }

EDIT: or something like the mock you already added below.
pkg/kv/kvserver/allocator.go, line 1802 at r3 (raw file):
ctx context.Context, raftStatus *raft.Status, replicas []roachpb.ReplicaDescriptor,
) []roachpb.ReplicaDescriptor {
	var filled = 0
nit: `filled := 0` is slightly more idiomatic.
pkg/kv/kvserver/allocator.go, line 1814 at r3 (raw file):
	continue
}
replicas[filled] = replicas[i]
Why do you need `i` or `replicas[i]`? Isn't `replicas[filled] = repl` enough?
pkg/kv/kvserver/allocator_test.go, line 1387 at r3 (raw file):
}

type mockRepl struct {
Let's give this a comment hinting at the fact that this is a mock to satisfy the interface of TransferLeaseTarget.
pkg/kv/kvserver/allocator_test.go, line 1389 at r3 (raw file):
type mockRepl struct {
	replicationFactor     uint64
	replsInNeedOfSnapshot atomic.Value
Why the use of atomics here? Isn't the new test single-threaded? If this was just a `[]roachpb.ReplicaID`, the code below would be simpler.
Force-pushed from 25835a8 to b4fab78
aayushshah15
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm wondering whether it's worth keeping this optional just for the sake of a few tests. Allowing an input to be optional leads to more branching and harder code to reason about. And if we're going to change that, we might as well consolidate these two args to make it clear that they are in reference to the same object.
What do you think about replacing this with:
What do you think about replacing this with:

    leaseRepl interface {
    	StoreID() roachpb.StoreID
    	RaftStatus() *raft.Status
    },

Then in tests, you can pass some mock like:

    type mockLeaseRepl roachpb.StoreID

    func (m mockLeaseRepl) StoreID() roachpb.StoreID { return roachpb.StoreID(m) }
    func (m mockLeaseRepl) RaftStatus() *raft.Status { return nil }

EDIT: or something like the mock you already added below.
done.
pkg/kv/kvserver/allocator.go, line 1802 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: `filled := 0` is slightly more idiomatic.
done.
pkg/kv/kvserver/allocator.go, line 1814 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why do you need `i` or `replicas[i]`? Isn't `replicas[filled] = repl` enough?
no reason, done.
pkg/kv/kvserver/allocator_test.go, line 1387 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's give this a comment hinting at the fact that this is a mock to satisfy the interface of `TransferLeaseTarget`.
done.
pkg/kv/kvserver/allocator_test.go, line 1389 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why the use of atomics here? Isn't the new test single-threaded? If this was just a `[]roachpb.ReplicaID`, the code below would be simpler.
you're right, done.
Force-pushed from b4fab78 to be00815
erikgrinaker
left a comment
but get @nvanbenschoten's take as well.
Reviewed 2 of 3 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 1320 at r5 (raw file):
existing, _ = a.storePool.liveAndDeadReplicas(existing, false /* includeSuspectStores */)
// Only proceed with the lease transfer if we are also the raft leader (we
Are there any meaningful failure scenarios where we may need to transfer the lease without being able to get Raft leadership? I can't think of any off the bat, but this seems like the most risky aspect of this change.
pkg/kv/kvserver/allocator.go, line 1805 at r5 (raw file):
"not considering [n%d, s%d] as a potential candidate for a lease transfer"+
	" because the replica may be waiting for a snapshot",
Maybe we want to vary this message depending on whether we're the leader or not? I could see it being confusing to see this message when in fact the problem is that we're not the leader, and the replica is replicating just fine.
nvb
left a comment
Reviewed 2 of 3 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator_test.go, line 1335 at r5 (raw file):
}
for i := uint64(1); i <= r.replicationFactor; i++ {
	raftStatus.Progress[i] = tracker.Progress{State: tracker.StateReplicate}
nit: this would be easier to read if you switched `replsInNeedOfSnapshot` to a set (`map[roachpb.ReplicaID]struct{}`) and wrote the logic here as:
    for i := uint64(1); i <= r.replicationFactor; i++ {
    	state := tracker.StateReplicate
    	if _, ok := r.replsInNeedOfSnapshot[roachpb.ReplicaID(i)]; ok {
    		state = tracker.StateSnapshot
    	}
    	raftStatus.Progress[i] = tracker.Progress{State: state}
    }

A linear flow like this avoids mutability, which makes it easier to reason about and more immediately obvious what effect the presence of a replica in `replsInNeedOfSnapshot` has.
Force-pushed from be00815 to 7a58202
aayushshah15
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 1320 at r5 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Are there any meaningful failure scenarios where we may need to transfer the lease without being able to get Raft leadership? I can't think of any off the bat, but this seems like the most risky aspect of this change.
maybeTransferRaftLeadershipToLeaseholderLocked() sends a MsgTransferLeader to the range's raft group (and is called on every raft tick), so my understanding is that if we're persistently in a scenario where leaseholdership and leadership are uncollocated, it must imply that the underlying raft group is actually wedged. In such cases, the range should already be kaput and any lease transfer should also just fail / get stuck.
Does this line up with your / @nvanbenschoten's intuition?
pkg/kv/kvserver/allocator.go, line 1805 at r5 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
    "not considering [n%d, s%d] as a potential candidate for a lease transfer"+
    	" because the replica may be waiting for a snapshot",

Maybe we want to vary this message depending on whether we're the leader or not? I could see it being confusing to see this message when in fact the problem is that we're not the leader, and the replica is replicating just fine.
Done.
pkg/kv/kvserver/allocator_test.go, line 1335 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: this would be easier to read if you switched `replsInNeedOfSnapshot` to a set (`map[roachpb.ReplicaID]struct{}`) and wrote the logic here as:

    for i := uint64(1); i <= r.replicationFactor; i++ {
    	state := tracker.StateReplicate
    	if _, ok := r.replsInNeedOfSnapshot[roachpb.ReplicaID(i)]; ok {
    		state = tracker.StateSnapshot
    	}
    	raftStatus.Progress[i] = tracker.Progress{State: state}
    }

A linear flow like this avoids mutability, which makes it easier to reason about and more immediately obvious what effect the presence of a replica in `replsInNeedOfSnapshot` has.
done.
nvb
left a comment
Reviewed 1 of 2 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 1320 at r5 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
`maybeTransferRaftLeadershipToLeaseholderLocked()` sends a `MsgTransferLeader` to the range's raft group (and is called on every raft tick), so my understanding is that if we're persistently in a scenario where leaseholdership and leadership are uncollocated, it must imply that the underlying raft group is actually wedged. In such cases, the range should already be kaput and any lease transfer should also just fail / get stuck. Does this line up with your / @nvanbenschoten's intuition?
Yes, this matches my intuition. maybeTransferRaftLeadershipToLeaseholderLocked does require the leaseholder to be caught up on the Raft log to be able to assume leadership, but I can't see that causing the kinds of problems we saw in 371738c, because the leaseholder is the only replica able to propose (a serious amount of) writes. So it seems unlikely, even under sustained load, that it would be persistently behind on the log.
But I think it's worth posing this question to the #kv slack channel, just in case anyone spots anything that we're missing. Basically just whether anyone notices any undesirable consequences of creating this dependency and preventing follower replicas that hold a range lease from transferring that lease away.
erikgrinaker
left a comment
Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 1320 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, this matches my intuition. `maybeTransferRaftLeadershipToLeaseholderLocked` does require the leaseholder to be caught up on the Raft log to be able to assume leadership, but I can't see that causing the kinds of problems we saw in 371738c, because the leaseholder is the only replica able to propose (a serious amount of) writes. So it seems unlikely, even under sustained load, that it would be persistently behind on the log. But I think it's worth posing this question to the #kv slack channel, just in case anyone spots anything that we're missing. Basically just whether anyone notices any undesirable consequences of creating this dependency and preventing follower replicas that hold a range lease from transferring that lease away.
Yes, that's how I understand it as well, and I can't think of any meaningful exceptions. This is probably the sort of problem we'll have to rely on randomized tests to catch, but worth asking around as @nvanbenschoten suggests.
This commit disallows the `replicateQueue` from initiating lease transfers to replicas that may be in need of a raft snapshot. Note that the `StoreRebalancer` already has a stronger form of this check since it disallows lease transfers to replicas that are lagging behind the raft leader (which includes the set of replicas that need a snapshot).

In cases where the raft leader is not the leaseholder, we disallow the `replicateQueue` from any sort of lease transfer until leaseholdership and leadership are collocated. We rely on calls to `maybeTransferRaftLeadershipToLeaseholderLocked()` (called on every raft tick) to make sure that such periods of leadership / leaseholdership misalignment are ephemeral and rare.

Release justification: bug fix

Release note (bug fix): Fixes a bug that can cause prolonged unavailability due to lease transfer to a replica that may be in need of a raft snapshot.
Force-pushed from 7a58202 to b344fb8
TFTRs! I'll merge on CI green. @erikgrinaker pointed out that the SHA selection for 20.2.16 happens off of the state of the release branch on September 6. I suggest that we immediately merge the backport PRs for this patch in order to get a few days' worth of nightly test coverage on those release branches before the SHA selection.
bors r+
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:

You might need to create your backport manually using the backport tool.

error creating merge commit from b344fb8 to blathers/backport-release-20.2-69696: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 20.2.x failed. See errors above.

error creating merge commit from b344fb8 to blathers/backport-release-21.1-69696: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 21.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
This commit disallows the `replicateQueue` from initiating lease transfers to replicas that may be in need of a raft snapshot. Note that the `StoreRebalancer` already has a stronger form of this check since it disallows lease transfers to replicas that are lagging behind the raft leader (which includes the set of replicas that need a snapshot).

In cases where the raft leader is not the leaseholder, we disallow the `replicateQueue` from any sort of lease transfer until leaseholdership and leadership are collocated. We rely on calls to `maybeTransferRaftLeadershipToLeaseholderLocked()` (called on every raft tick) to make sure that such periods of leadership / leaseholdership misalignment are ephemeral and rare.
Alternative to #63507
Release justification: bug fix
Resolves #61604
Release note (bug fix): Fixes a bug that can cause prolonged
unavailability due to lease transfer to a replica that may be in need of
a raft snapshot.