
kvserver: stop transferring leases to replicas that may need snapshots#69696

Merged
craig[bot] merged 1 commit into cockroachdb:master from aayushshah15:20210901_noLeaseTransfersToStateSnapshot on Sep 2, 2021

Conversation

@aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Sep 1, 2021

This commit disallows the replicateQueue from initiating lease
transfers to replicas that may be in need of a raft snapshot. Note that
the StoreRebalancer already has a stronger form of this check since it
disallows lease transfers to replicas that are lagging behind the raft
leader (which includes the set of replicas that need a snapshot).

In cases where the raft leader is not the leaseholder, we disallow the
replicateQueue from any sort of lease transfer until leaseholdership and
leadership are collocated. We rely on calls to
maybeTransferRaftLeadershipToLeaseholderLocked() (called on every raft
tick) to make sure that such periods of leadership / leaseholdership
misalignment are ephemeral and rare.

Alternative to #63507

Release justification: bug fix

Resolves #61604

Release note (bug fix): Fixes a bug that can cause prolonged
unavailability due to a lease transfer to a replica that may be in need
of a raft snapshot.

@aayushshah15 aayushshah15 requested a review from a team as a code owner September 1, 2021 16:00
@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20210901_noLeaseTransfersToStateSnapshot branch from 6fc38bd to df44b32 Compare September 1, 2021 16:01
Contributor

@nvb nvb left a comment


This approach seems very reasonable.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):

	ctx context.Context,
	conf roachpb.SpanConfig,
	leaseRepl *Replica,

It would be nice to test this behavior. To do that, you might want to replace the *Replica type with an interface. Even an inline interface like we have in findRemoveVoter would be enough to make this easy to test in unit tests.


pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):

	ctx context.Context,
	conf roachpb.SpanConfig,
	leaseRepl *Replica,

nit: move this next to leaseStoreID.


pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):

	ctx context.Context,
	conf roachpb.SpanConfig,
	leaseRepl *Replica,

nit: add an // optional comment.


pkg/kv/kvserver/allocator.go, line 1787 at r1 (raw file):

		// a snapshot iff it is in `StateReplicate`. However, even this is racy
		// because we can still possibly have an ill-timed log truncation between
		// when we make this determination and when we act on it.

Could you add to this comment why we don't make this as strict as replicaIsBehind? It has to do with needing to transfer leases to lagging followers in some cases, right? Didn't we have an issue about that at one point that led to the removal of this protection?

Edit: here's the commit that added the protection 83128c9, and here's the commit that removed the protection 371738c.


pkg/kv/kvserver/allocator.go, line 1802 at r1 (raw file):

	raftStatus *raft.Status, replicas []roachpb.ReplicaDescriptor,
) (result []roachpb.ReplicaDescriptor) {
	result = make([]roachpb.ReplicaDescriptor, 0, len(replicas))

nit: Does this need to re-allocate? Can't we re-use the provided slice?


pkg/kv/kvserver/allocator.go, line 1807 at r1 (raw file):

		if !replicaMayNeedSnapshot(raftStatus, repl.ReplicaID) {
			result = append(result, repl)
		}

Do you think it's worth logging on an else branch when we filter out a candidate?

@aayushshah15 aayushshah15 force-pushed the 20210901_noLeaseTransfersToStateSnapshot branch 5 times, most recently from aedd24d to 8e136af Compare September 1, 2021 20:57
Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It would be nice to test this behavior. To do that, you might want to replace the *Replica type with an interface. Even an inline interface like we have in findRemoveVoter would be enough to make this easy to test in unit tests.

thanks for the suggestion, done.


pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move this next to leaseStoreID.

done.


pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: add an // optional comment.

done.


pkg/kv/kvserver/allocator.go, line 1787 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add to this comment why we don't make this as strict as replicaIsBehind? It has to do with needing to transfer leases to lagging followers in some cases, right? Didn't we have an issue about that at one point that led to the removal of this protection?

Edit: here's the commit that added the protection 83128c9, and here's the commit that removed the protection 371738c.

Yeah, I found it a bit odd that we removed the protection from the replicateQueue but not from the StoreRebalancer. The only rationale I can think of is that transferring the lease to a lagging replica would likely hurt the range's QPS in the short term.


pkg/kv/kvserver/allocator.go, line 1802 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Does this need to re-allocate? Can't we re-use the provided slice?

done.


pkg/kv/kvserver/allocator.go, line 1807 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think it's worth logging on an else branch when we filter out a candidate?

done.

@aayushshah15 aayushshah15 force-pushed the 20210901_noLeaseTransfersToStateSnapshot branch from 8e136af to 25835a8 Compare September 1, 2021 21:10
Contributor

@nvb nvb left a comment


Reviewed 2 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

done.

I'm wondering whether it's worth keeping this optional just for the sake of a few tests. Allowing an input to be optional leads to more branching and code that is harder to reason about. And if we're going to change that, we might as well consolidate these two args to make it clear that they refer to the same object.

What do you think about replacing this with:

leaseRepl interface {
    StoreID() roachpb.StoreID
    RaftStatus() *raft.Status
},

Then in tests, you can pass some mock like:

type mockLeaseRepl roachpb.StoreID

func (m mockLeaseRepl) StoreID() roachpb.StoreID { return roachpb.StoreID(m) }
func (m mockLeaseRepl) RaftStatus() *raft.Status { return nil }

EDIT: or something like the mock you already added below.


pkg/kv/kvserver/allocator.go, line 1802 at r3 (raw file):

	ctx context.Context, raftStatus *raft.Status, replicas []roachpb.ReplicaDescriptor,
) []roachpb.ReplicaDescriptor {
	var filled = 0

nit: filled := 0 is slightly more idiomatic.


pkg/kv/kvserver/allocator.go, line 1814 at r3 (raw file):

			continue
		}
		replicas[filled] = replicas[i]

Why do you need i or replicas[i]? Isn't replicas[filled] = repl enough?


pkg/kv/kvserver/allocator_test.go, line 1387 at r3 (raw file):

}

type mockRepl struct {

Let's give this a comment hinting at the fact that this is a mock to satisfy the interface of TransferLeaseTarget.


pkg/kv/kvserver/allocator_test.go, line 1389 at r3 (raw file):

type mockRepl struct {
	replicationFactor     uint64
	replsInNeedOfSnapshot atomic.Value

Why the use of atomics here? Isn't the new test single-threaded? If this was just a []roachpb.ReplicaID, the code below would be simpler.

@aayushshah15 aayushshah15 force-pushed the 20210901_noLeaseTransfersToStateSnapshot branch from 25835a8 to b4fab78 Compare September 2, 2021 15:00
Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 1257 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm wondering whether it's worth keeping this optional just for the sake of a few tests. Allowing an input to be optional leads to more branching and harder code to reason about. And if we're going to change that, we might as well consolidate these two args to make it clear that they are in reference to the same object.

What do you think about replacing this with:

leaseRepl interface {
    StoreID() roachpb.StoreID
    RaftStatus() *raft.Status
},

Then in tests, you can pass some mock like:

type mockLeaseRepl roachpb.StoreID

func (m mockLeaseRepl) StoreID() roachpb.StoreID { return roachpb.StoreID(m) }
func (m mockLeaseRepl) RaftStatus() *raft.Status { return nil }

EDIT: or something like the mock you already added below.

done.


pkg/kv/kvserver/allocator.go, line 1802 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: filled := 0 is slightly more idiomatic.

done.


pkg/kv/kvserver/allocator.go, line 1814 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why do you need i or replicas[i]? Isn't replicas[filled] = repl enough?

no reason, done.


pkg/kv/kvserver/allocator_test.go, line 1387 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's give this a comment hinting at the fact that this is a mock to satisfy the interface of TransferLeaseTarget.

done.


pkg/kv/kvserver/allocator_test.go, line 1389 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why the use of atomics here? Isn't the new test single-threaded? If this was just a []roachpb.ReplicaID, the code below would be simpler.

you're right, done.

@aayushshah15 aayushshah15 force-pushed the 20210901_noLeaseTransfersToStateSnapshot branch from b4fab78 to be00815 Compare September 2, 2021 15:56
Contributor

@erikgrinaker erikgrinaker left a comment


:lgtm: but get @nvanbenschoten's take as well.

Reviewed 2 of 3 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 1320 at r5 (raw file):

	existing, _ = a.storePool.liveAndDeadReplicas(existing, false /* includeSuspectStores */)

	// Only proceed with the lease transfer if we are also the raft leader (we

Are there any meaningful failure scenarios where we may need to transfer the lease without being able to get Raft leadership? I can't think of any off the bat, but this seems like the most risky aspect of this change.


pkg/kv/kvserver/allocator.go, line 1805 at r5 (raw file):

				"not considering [n%d, s%d] as a potential candidate for a lease transfer"+
					" because the replica may be waiting for a snapshot",

Maybe we want to vary this message depending on whether we're the leader or not? I could see it being confusing to see this message when in fact the problem is that we're not the leader, and the replica is replicating just fine.

Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 2 of 3 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator_test.go, line 1335 at r5 (raw file):

	}
	for i := uint64(1); i <= r.replicationFactor; i++ {
		raftStatus.Progress[i] = tracker.Progress{State: tracker.StateReplicate}

nit: this would be easier to read if you switched replsInNeedOfSnapshot to a set (map[roachpb.ReplicaID]struct{}) and wrote the logic here as:

for i := uint64(1); i <= r.replicationFactor; i++ {
    state := tracker.StateReplicate
    if _, ok := r.replsInNeedOfSnapshot[roachpb.ReplicaID(i)]; ok {
        state = tracker.StateSnapshot
    }
    raftStatus.Progress[i] = tracker.Progress{State: state}
}

A linear flow like this avoids mutability, which makes it easier to reason about and more immediately obvious what effect the presence of a replica in the replsInNeedOfSnapshot has.

@aayushshah15 aayushshah15 force-pushed the 20210901_noLeaseTransfersToStateSnapshot branch from be00815 to 7a58202 Compare September 2, 2021 17:42
Contributor Author

@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 1320 at r5 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Are there any meaningful failure scenarios where we may need to transfer the lease without being able to get Raft leadership? I can't think of any off the bat, but this seems like the most risky aspect of this change.

maybeTransferRaftLeadershipToLeaseholderLocked() sends a MsgTransferLeader to the range's raft group (and is called on every raft tick), so my understanding is that if we're persistently in a scenario where leaseholdership and leadership are uncollocated, it must imply that the underlying raft group is actually wedged. In such cases, the range should already be kaput and any lease transfer should also just fail / get stuck.

Does this line up with your / @nvanbenschoten's intuition?


pkg/kv/kvserver/allocator.go, line 1805 at r5 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…
				"not considering [n%d, s%d] as a potential candidate for a lease transfer"+
					" because the replica may be waiting for a snapshot",

Maybe we want to vary this message depending on whether we're the leader or not? I could see it being confusing to see this message when in fact the problem is that we're not the leader, and the replica is replicating just fine.

Done.


pkg/kv/kvserver/allocator_test.go, line 1335 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this would be easier to read if you switched replsInNeedOfSnapshot to a set (map[roachpb.ReplicaID]struct{}) and wrote the logic here as:

for i := uint64(1); i <= r.replicationFactor; i++ {
    state := tracker.StateReplicate
    if _, ok := r.replsInNeedOfSnapshot[roachpb.ReplicaID(i)]; ok {
        state = tracker.StateSnapshot
    }
    raftStatus.Progress[i] = tracker.Progress{State: state}
}

A linear flow like this avoids mutability, which makes it easier to reason about and more immediately obvious what effect the presence of a replica in the replsInNeedOfSnapshot has.

done.

Contributor

@nvb nvb left a comment


Reviewed 1 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 1320 at r5 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

maybeTransferRaftLeadershipToLeaseholderLocked() sends a MsgTransferLeader to the range's raft group (and is called on every raft tick), so my understanding is that if we're persistently in a scenario where leaseholdership and leadership are uncollocated, it must imply that the underlying raft group is actually wedged. In such cases, the range should already be kaput and any lease transfer should also just fail / get stuck.

Does this line up with your / @nvanbenschoten's intuition?

Yes, this matches my intuition. maybeTransferRaftLeadershipToLeaseholderLocked does require the leaseholder to be caught up on the Raft log to be able to assume leadership, but I can't see that causing the kinds of problems we saw in 371738c, because the leaseholder is the only replica able to propose (a serious amount of) writes. So it seems unlikely, even under sustained load, that it would be persistently behind on the log.

But I think it's worth posing this question to the #kv slack channel, just in case anyone spots anything that we're missing. Basically just whether anyone notices any undesirable consequences of creating this dependency and preventing follower replicas that hold a range lease from transferring that lease away.

Contributor

@erikgrinaker erikgrinaker left a comment


Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 1320 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yes, this matches my intuition. maybeTransferRaftLeadershipToLeaseholderLocked does require the leaseholder to be caught up on the Raft log to be able to assume leadership, but I can't see that causing the kinds of problems we saw in 371738c, because the leaseholder is the only replica able to propose (a serious amount of) writes. So it seems unlikely, even under sustained load, that it would be persistently behind on the log.

But I think it's worth posing this question to the #kv slack channel, just in case anyone spots anything that we're missing. Basically just whether anyone notices any undesirable consequences of creating this dependency and preventing follower replicas that hold a range lease from transferring that lease away.

Yes, that's how I understand it as well, and I can't think of any meaningful exceptions. This is probably the sort of problem we'll have to rely on randomized tests to catch, but worth asking around as @nvanbenschoten suggests.

@aayushshah15 aayushshah15 force-pushed the 20210901_noLeaseTransfersToStateSnapshot branch from 7a58202 to b344fb8 Compare September 2, 2021 18:38
@aayushshah15
Contributor Author

aayushshah15 commented Sep 2, 2021

TFTRs

I'll merge on CI green. @erikgrinaker pointed out that the SHA selection for 20.2.16 happens off of the state of the release branch on September 6. I suggest that we immediately merge the backport PRs for this patch in order to get a few days' worth of nightly test coverage on those release branches before the SHA selection.

@aayushshah15
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 2, 2021

Build succeeded:

@craig craig bot merged commit 2a33514 into cockroachdb:master Sep 2, 2021
@blathers-crl

blathers-crl bot commented Sep 2, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from b344fb8 to blathers/backport-release-20.2-69696: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 20.2.x failed. See errors above.


error creating merge commit from b344fb8 to blathers/backport-release-21.1-69696: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.



Development

Successfully merging this pull request may close these issues.

kvserver: lease transferred to uninitialized follower

4 participants