release-22.1: kvserver: don't transfer leases to draining nodes during scatters#80834
Conversation
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
nvb
left a comment
I did not do a pass on some parts of this before they landed on master, so I left a few comments below. Feel free to address them in a separate PR on master and then add a commit to this backport PR.
Also, we saw in https://github.com/cockroachlabs/support/issues/1569 that this fixes a bug where the allocator could skip the excludeReplicasInNeedOfSnapshots protection when lease preferences are present. This backport will fix the bug for release-22.1. We'd like to also fix it for release-21.2. Do you have a sense of the difficulty of a backport of this change to release-21.2? We could do something more targeted if it's too much of a lift.
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 4 of 4 files at r3, 6 of 6 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator.go line 1501 at r2 (raw file):
```go
candidates := make([]roachpb.ReplicaDescriptor, 0, len(existing))
for i := range existing {
	if existing[i].GetType() != roachpb.VOTER_FULL {
```
In #74546, we began allowing VOTER_INCOMING replicas to receive the lease. Should we be doing the same here?
Should we be using CheckCanReceiveLease in this function?
pkg/kv/kvserver/allocator.go line 1579 at r2 (raw file):
```go
// If there are any replicas that do match lease preferences, then we check if
// the existing leaseholder is one of them.
preferred := a.preferredLeaseholders(conf, candidates)
```
Should these preferred replicas be passed through excludeReplicasInNeedOfSnapshots, or is the idea that a later call to ValidLeaseTargets will perform this check?
pkg/kv/kvserver/replicate_queue.go line 1052 at r1 (raw file):
```go
conf,
transferLeaseOptions{
	goal: leaseCountConvergence,
```
Is this changing behavior? Isn't followTheWorkload the zero-value for this enum? Is the change in behavior intentional?
pkg/kv/kvserver/store_rebalancer.go line 489 at r4 (raw file):
```go
replWithStats.repl.leaseholderStats,
true, /* forceDecisionWithoutStats */
transferLeaseOptions{
```
nit: consider adding the ExcludeLeaseRepl field back in so that this call is explicit.
pkg/kv/kvserver/allocator_test.go line 1746 at r4 (raw file):
```go
existing       []roachpb.ReplicaDescriptor
leaseholder    roachpb.StoreID
allowLeaseRepl bool
```
nit: should we use excludeLeaseRepl here as well so that readers of the test don't need to negate the input? Same question below.
aayushshah15
left a comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/allocator.go line 1501 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
In #74546, we began allowing `VOTER_INCOMING` replicas to receive the lease. Should we be doing the same here?
Should we be using `CheckCanReceiveLease` in this function?
I changed it to use the `IsVoterNewConfig()` method that `CheckCanReceiveLease` calls into, for now, since plumbing the range descriptor through into `TransferLeaseTarget()` for a call into `CheckCanReceiveLease` has quite a large footprint in terms of the tests that need to be changed. LMK if you prefer something else.
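To illustrate the check being discussed, here is a minimal sketch in plain Go. The types are simplified stand-ins for `roachpb`, and `isVoterNewConfig` only mirrors the idea behind the real `IsVoterNewConfig()` (a replica may take the lease if it will be a voter in the new config, i.e. `VOTER_FULL` or `VOTER_INCOMING`); none of this is the actual CockroachDB code.

```go
package main

import "fmt"

// Simplified stand-ins for roachpb replica types (illustrative only).
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
	NonVoter
)

type ReplicaDescriptor struct {
	StoreID int
	Type    ReplicaType
}

// isVoterNewConfig sketches the idea behind roachpb's IsVoterNewConfig():
// a replica is lease-eligible if it will be a voter once the in-flight
// config change completes, i.e. VOTER_FULL or VOTER_INCOMING.
func isVoterNewConfig(t ReplicaType) bool {
	return t == VoterFull || t == VoterIncoming
}

func filterLeaseCandidates(existing []ReplicaDescriptor) []ReplicaDescriptor {
	candidates := make([]ReplicaDescriptor, 0, len(existing))
	for _, repl := range existing {
		if isVoterNewConfig(repl.Type) {
			candidates = append(candidates, repl)
		}
	}
	return candidates
}

func main() {
	existing := []ReplicaDescriptor{
		{StoreID: 1, Type: VoterFull},
		{StoreID: 2, Type: VoterIncoming}, // kept: will be a voter
		{StoreID: 3, Type: VoterOutgoing}, // dropped: leaving the range
		{StoreID: 4, Type: NonVoter},      // dropped: cannot hold the lease
	}
	fmt.Println(filterLeaseCandidates(existing))
}
```

Compared to the original `!= roachpb.VOTER_FULL` filter, the only behavioral difference is that incoming voters are no longer excluded.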
pkg/kv/kvserver/allocator.go line 1579 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should these preferred replicas be passed through `excludeReplicasInNeedOfSnapshots`, or is the idea that a later call to `ValidLeaseTargets` will perform this check?
It does seem like a good idea for `leaseholderShouldMoveDueToPreferences()` to not fire if all preferred replicas need snapshots. I've made that change, and it looks like nothing breaks (because, as you said, `ValidLeaseTargets()` was filtering these replicas out anyway).
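The guard described here could look roughly like the following sketch. Store IDs stand in for replica descriptors, and `shouldMoveDueToPreferences` is a hypothetical simplification of `leaseholderShouldMoveDueToPreferences()`, not the real implementation.

```go
package main

import "fmt"

// shouldMoveDueToPreferences sketches the fix: the preference check should
// not fire when every preferred replica still needs a snapshot, because no
// preferred replica could actually take the lease.
func shouldMoveDueToPreferences(leaseholder int, preferred, needsSnapshot []int) bool {
	snap := make(map[int]bool, len(needsSnapshot))
	for _, id := range needsSnapshot {
		snap[id] = true
	}
	// Filter out preferred replicas that are behind and need a snapshot.
	var viable []int
	for _, id := range preferred {
		if !snap[id] {
			viable = append(viable, id)
		}
	}
	if len(viable) == 0 {
		return false // no viable preferred replica: leave the lease alone
	}
	for _, id := range viable {
		if id == leaseholder {
			return false // leaseholder already satisfies the preferences
		}
	}
	return true
}

func main() {
	// Preferred stores {2, 3} both need snapshots: don't move.
	fmt.Println(shouldMoveDueToPreferences(1, []int{2, 3}, []int{2, 3})) // false
	// Store 3 is caught up: move.
	fmt.Println(shouldMoveDueToPreferences(1, []int{2, 3}, []int{2})) // true
}
```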
pkg/kv/kvserver/replicate_queue.go line 1052 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this changing behavior? Isn't `followTheWorkload` the zero-value for this enum? Is the change in behavior intentional?
The thing is that whenever `excludeLeaseRepl` is true, `followTheWorkload` always falls back to `leaseCountConvergence`. I just wanted to make that explicit here for this caller.
It's sort of gross how `TransferLeaseTarget` has become a bit of a monstrosity to use due to all these options subtly interacting with each other. I didn't really attempt to improve that as part of the original PR though.
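The fallback being made explicit can be sketched as follows, with a simplified stand-in for `transferLeaseOptions` (in the real code, as the question notes, `followTheWorkload` is the enum's zero value, which is exactly what makes the interaction easy to miss). `effectiveGoal` is a hypothetical helper, not an actual function in the codebase.

```go
package main

import "fmt"

// Simplified stand-in for the transfer-goal enum; followTheWorkload is the
// zero value, as in the real code.
type transferGoal int

const (
	followTheWorkload transferGoal = iota
	leaseCountConvergence
)

type transferLeaseOptions struct {
	goal             transferGoal
	excludeLeaseRepl bool
}

// effectiveGoal sketches the fallback: follow-the-workload only makes sense
// when the current leaseholder may keep the lease, so excluding the
// leaseholder forces lease-count convergence.
func effectiveGoal(opts transferLeaseOptions) transferGoal {
	if opts.excludeLeaseRepl && opts.goal == followTheWorkload {
		return leaseCountConvergence
	}
	return opts.goal
}

func main() {
	// Excluding the leaseholder silently overrides the zero-value goal.
	fmt.Println(effectiveGoal(transferLeaseOptions{excludeLeaseRepl: true}) == leaseCountConvergence)
	fmt.Println(effectiveGoal(transferLeaseOptions{}) == followTheWorkload)
}
```

Spelling out `goal: leaseCountConvergence` at the call site, as this commit does, documents the behavior the zero value would have produced implicitly.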
pkg/kv/kvserver/store_rebalancer.go line 489 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: consider adding the `ExcludeLeaseRepl` field back in so that this call is explicit.
Done.
pkg/kv/kvserver/allocator_test.go line 1746 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: should we use `excludeLeaseRepl` here as well so that readers of the test don't need to negate the input? Same question below.
Done.
aayushshah15
left a comment
I've opened #81403 to address the review comments you left, PTAL. I'll add that commit to this backport PR once you approve it.
I tried backporting to 21.2 but there have been a bunch of changes in the interim that make it complicated. Even after resolving the conflicts, I don't have a ton of confidence there and we should probably just do a targeted fix for 21.2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
…erLeaseTarget()` This commit addresses miscellaneous review comments from cockroachdb#80834. Release note: None
81279: storage: specify and test iterator visibility semantics r=jbowens a=erikgrinaker

`@jbowens` I split this out from the range key PR in #77417, since we'll need to audit/update existing code for the new point key batch semantics, and it seems cleaner/easier to do this separately. As you know, the test currently fails because existing batch iterators see concurrent batch writes. This only covers point keys, but range keys should have identical semantics, and the tests will be extended for range keys in #77417.

This patch specifies and tests iterator visibility semantics. See the comment on `Engine.NewMVCCIterator` for details.

Release note: None

81403: kvserver: address assorted comments from previous refactor of `TransferLeaseTarget()` r=aayushshah15 a=aayushshah15

This commit addresses miscellaneous review comments from #80834.

Release note: None

81679: sql: enforce nullability when evaluating expressions in index backfills r=ajwerner a=ajwerner

Prior to this change, we never enforced nullability of values when performing index backfills. Up until #76983, all column additions used the legacy schema changer and its column backfill protocol. The column backfiller indeed checks for nullability violations [here](https://github.com/cockroachdb/cockroach/blob/d5313438aaa61a71a5a66c5718963bbb0fd7268b/pkg/sql/backfill/backfill.go#L355-L357).

Release note (bug fix): Prior to this change, virtual computed columns which were marked as NOT NULL could be added to a new secondary index. After this change, attempts to add such columns to a secondary index will result in an error. Note that such invalid columns can still be added to tables. Work to resolve that bug is tracked in #81675.

81777: util/metric: delete Rate r=andreimatei a=andreimatei

This exponentially moving Rate was unused. It was also exporting to Prometheus as a gauge. It seems unlikely that we'll use this in the future; I think we're happy enough with counters and computing rates in the monitoring system. The Rate seems to have come from a time when we liked moving measurements more; since then, we've adapted the Histogram, for example, to include non-windowed state.

Release note: None

81798: sql: enable streamer for lookup joins r=yuzefovich a=yuzefovich

We've just fixed an issue with incorrectly using leaf txns for mutations because of the streamer, so I think it is ok to fully enable the streamer. I expect that there might be some fallout, so I'm curious to see the run of nightlies.

Release note: None

81804: build: use bazel for roachtest stress r=rickystewart a=nicktrav

Currently, the Roachtest Stress TeamCity job uses `make` (via the `mkrelease.sh` script) for building the required binaries. Recently, the pipeline started hanging while building the binary. Rather than investigating and fixing the existing `make`-based pipeline, convert the job to using Bazel for building the binaries. Retain the existing entrypoint to the job, converting it to use [the existing pattern][1] of calling `run_bazel` with the required environment variables, invoking an `_impl.sh` script. The existing logic is moved into the new `_impl.sh` script, with some minor additions to source some additional scripts. Additionally, the TeamCity job itself was updated to pass in the `TARGET_CLOUD` environment variable (set to `gce`), and to remove the need to build within a container (instead deferring to Bazel to manage the build environment).

The diff for the TeamCity pipeline is as follows (for posterity):

```diff
18c18
<
---
> <param name="env.TARGET_CLOUD" value="gce" />
24c24
< <param name="plugin.docker.imageId" value="%builder.dockerImage%" />
---
>
28,29c28,29
< export BUILD_TAG="$(git describe --abbrev=0 --tags --match=v[0-9]*)"
< ./build/teamcity-roachtest-stress.sh]]></param>
---
> build_tag="$(git describe --abbrev=0 --tags --match=v[0-9]*)"
> CLOUD=${TARGET_CLOUD} BUILD_TAG="${build_tag}" ./build/teamcity-roachtest-stress.sh]]></param>
```

Release note: None.

[1]: https://github.com/cockroachdb/cockroach/blob/12198a51408e7333cd4f96b221e6734239479765/build/teamcity/cockroach/nightlies/roachtest_nightly_gce.sh#L10-L11

81813: liveness: run sync disk write in a stopper task r=stevendanna,nicktrav a=erikgrinaker

liveness: move stopper to `NodeLivenessOptions`

Release note: None

liveness: run sync disk write in a stopper task

This patch runs the sync disk write during node heartbeats in a stopper task. The write is done in a goroutine, so that we can respect the caller's context cancellation (even though the write itself won't). However, this could race with engine shutdown when stopping the node, violating the Pebble contract and triggering the race detector. Running it as a stopper task will cause the node to wait for the disk write to complete before closing the engine. Of course, if the disk stalls then node shutdown will now never complete. This is very unfortunate, since stopping the node is often the only mitigation to recover stuck ranges with stalled disks. This is mitigated by Pebble panicking the node on stalled disks, and by Kubernetes and other orchestration tools killing the process after some time.

Touches #81786. Resolves #81511. Resolves #81827.

Release note: None

81828: sql: deflake TestPrimaryKeyDropIndexNotCancelable r=chengxiong-ruan a=ajwerner

The problem was that we weren't atomically setting the boolean, at least, I think. I repro'd another flake of it pretty quickly.

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@aayushshah15 it looks like #79295 landed. Do we want to patch that in here and then get this merged?
Previously, it wasn't very clear how a call into `maybeTransferLeaseAway` would find a lease transfer target. This commit makes a small effort to make this obvious. Release note: None
This commit is a minor refactor of the `Allocator.TransferLeaseTarget` logic in order to make it more readable and to abstract out a new exported `Allocator` method called `ValidLeaseTargets()`. The contract of `ValidLeaseTargets()` is as follows:

```
// ValidLeaseTargets returns a set of candidate stores that are suitable to be
// transferred a lease for the given range.
//
// - It excludes stores that are dead, or marked draining or suspect.
// - If the range has lease_preferences, and there are any non-draining,
// non-suspect nodes that match those preferences, it excludes stores that don't
// match those preferences.
// - It excludes replicas that may need snapshots. If the replica calling this
// method is not the Raft leader (meaning that it doesn't know whether follower
// replicas need a snapshot or not), it produces no results.
```

Previously, there were multiple places where we were performing the logic that's encapsulated by `ValidLeaseTargets()`, which was a potential source of bugs. This is an attempt to unify this logic in one place that's relatively well-tested. This commit is only a refactor, and does not attempt to change any behavior. As such, no existing tests have been changed, with the exception of a subtest inside `TestAllocatorTransferLeaseTargetDraining`. See the comment over that subtest to understand why the behavior change made by this patch is desirable. The next commit in this PR uses this method to fix (at least part of) cockroachdb#74691. Release note: none
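The contract quoted in the commit message above can be sketched as a toy filter. The `Store` fields and `validLeaseTargets` here are simplified, hypothetical stand-ins for the real `StorePool` and Raft plumbing; only the three filtering rules are taken from the source.

```go
package main

import "fmt"

// Store is a toy model of a candidate store; the real code consults the
// StorePool and Raft progress rather than precomputed booleans.
type Store struct {
	ID                    int
	DeadDrainingOrSuspect bool
	MatchesPreference     bool
	NeedsSnapshot         bool
}

func validLeaseTargets(stores []Store, isRaftLeader bool) []int {
	// Rule 1: drop dead, draining, and suspect stores.
	var live []Store
	for _, s := range stores {
		if !s.DeadDrainingOrSuspect {
			live = append(live, s)
		}
	}
	// Rule 2: if any surviving store matches the lease preferences,
	// restrict the candidates to those.
	var preferred []Store
	for _, s := range live {
		if s.MatchesPreference {
			preferred = append(preferred, s)
		}
	}
	if len(preferred) > 0 {
		live = preferred
	}
	// Rule 3: only the Raft leader knows follower progress; a non-leader
	// cannot rule out snapshots, so it produces no targets at all.
	if !isRaftLeader {
		return nil
	}
	var out []int
	for _, s := range live {
		if !s.NeedsSnapshot {
			out = append(out, s.ID)
		}
	}
	return out
}

func main() {
	stores := []Store{
		{ID: 1, DeadDrainingOrSuspect: true, MatchesPreference: true},
		{ID: 2, MatchesPreference: true},
		{ID: 3},
		{ID: 4, MatchesPreference: true, NeedsSnapshot: true},
	}
	fmt.Println(validLeaseTargets(stores, true)) // [2]
}
```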
Previously, `AdminScatter` called with the `RandomizeLeases` option could potentially transfer leases to nodes marked draining. This commit leverages the refactor from the last commit to fix this bug by first filtering the set of candidates down to a set of valid candidates that meet lease preferences and are not marked suspect or draining. Release note (bug fix): Fixes a bug where draining / drained nodes could re-acquire leases during an import or an index backfill.
This commit inverts this boolean and renames it to `excludeLeaseRepl`, which is a lot more intuitive for the callers of `Allocator.TransferLeaseTarget`. Release note: none
Fixes cockroachdb#79566. Fallout from a6a8d5c.
Force-pushed 6caee95 to 1a7b6d2.
Force-pushed 1a7b6d2 to a199a3c.
…erLeaseTarget()` This commit addresses miscellaneous review comments from cockroachdb#80834. Release note: None
Force-pushed a199a3c to cebaac1.
Backport 4/4 commits from #79295.
Backport 1/1 commit from #79581.
Backport 1/1 commit from #81403.
/cc @cockroachdb/release
kvserver: introduce Allocator.ValidLeaseTargets()
This commit is a minor refactor of the `Allocator.TransferLeaseTarget` logic in order to make it more readable and to abstract out a new exported `Allocator` method called `ValidLeaseTargets()`. The contract of `ValidLeaseTargets()` is as follows:

- It excludes stores that are dead, or marked draining or suspect.
- If the range has lease preferences, and there are any non-draining, non-suspect nodes that match those preferences, it excludes stores that don't match those preferences.
- It excludes replicas that may need snapshots. If the replica calling this method is not the Raft leader (meaning that it doesn't know whether follower replicas need a snapshot or not), it produces no results.

Previously, there were multiple places where we were performing the logic that's encapsulated by `ValidLeaseTargets()`, which was a potential source of bugs. This is an attempt to unify this logic in one place that's relatively well-tested. This commit is only a refactor, and does not attempt to change any behavior. As such, no existing tests have been changed, with the exception of a subtest inside `TestAllocatorTransferLeaseTargetDraining`. See the comment over that subtest to understand why the behavior change made by this patch is desirable.
The next commit in this PR uses this method to fix (at least part of) #74691.
Release note: none
kvserver: don't transfer leases to draining nodes during scatters
Previously, `AdminScatter` called with the `RandomizeLeases` option could potentially transfer leases to nodes marked draining. This commit leverages the refactor from the last commit to fix this bug by first filtering the set of candidates down to a set of valid candidates that meet lease preferences and are not marked suspect or draining.
Relates to and fixes a part of #74691.
Release note (bug fix): Fixes a bug where draining / drained nodes could
re-acquire leases during an import or an index backfill.
Release justification: bug fix.