
kvserver: transfer lease when acquiring lease outside preferences#107507

Merged
craig[bot] merged 3 commits into cockroachdb:master from kvoli:230721.lease-preferences
Jul 26, 2023

Conversation

@kvoli
Contributor

@kvoli kvoli commented Jul 24, 2023

First 2 commits are from #107505.

When a leaseholder is lost, any surviving replica may acquire the lease, even if it violates lease preferences. There are two main reasons for this: we need to elect a new Raft leader who will acquire the lease, which is agnostic to lease preferences, and there may not even be any surviving replicas that satisfy the lease preferences at all, so we don't want to keep the range unavailable while we try to figure this out (considering e.g. network timeouts can delay this for many seconds).

However, after acquiring a lease, we rely on the replicate queue to transfer the lease back to a replica that conforms with the preferences, which can take several minutes. In multi-region clusters, this can cause severe latency degradation if the lease is acquired in a remote region.

This PR detects lease preference violations when a replica acquires a new lease, and eagerly enqueues it in the replicate queue for transfer (if possible). If the first processing attempt fails, the replica is sent to purgatory and retried soon after.

Additionally, two roachtests are added for lease preference conformance timing. The first test, .../partial-first-preference-down, takes down one of the preferred locality nodes (1/2). The second test, .../full-first-preference-down, takes down both the preferred locality nodes (2/2).

Resolves #106100.
Epic: none

Release note (bug fix): When losing a leaseholder and using lease preferences, the lease can be acquired by any other replica (regardless of lease preferences) in order to restore availability as soon as possible. The new leaseholder will now immediately check if it violates the lease preferences, and attempt to transfer the lease to a replica that satisfies the preferences if possible.

@kvoli kvoli self-assigned this Jul 24, 2023
@cockroach-teamcity
Member

This change is Reviewable

@kvoli kvoli changed the title from "constraint: tighten inputs to CheckConjunction" to "kvserver: transfer lease when acquiring lease outside preferences" on Jul 24, 2023
@kvoli kvoli force-pushed the 230721.lease-preferences branch from b50b0e7 to fe7db4c on July 24, 2023 23:28
@kvoli kvoli force-pushed the 230721.lease-preferences branch from fe7db4c to d94a20b on July 24, 2023 23:47
@kvoli kvoli marked this pull request as ready for review July 24, 2023 23:51
@kvoli kvoli requested review from a team as code owners July 24, 2023 23:51
@kvoli kvoli requested review from a team, herkolategan and smg260 and removed request for a team July 24, 2023 23:51
@kvoli kvoli force-pushed the 230721.lease-preferences branch from d94a20b to 27ec04c on July 25, 2023 12:42
Contributor

@erikgrinaker erikgrinaker left a comment


Mostly LGTM, except the concern about spurious errors. You should have @andrewbaptist have a look at this too, since he's more familiar with the allocator.

Reviewed 7 of 7 files at r1, 4 of 4 files at r2, 4 of 4 files at r3, 3 of 3 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, @kvoli, and @smg260)


pkg/cmd/roachtest/tests/failover.go line 1773 at r3 (raw file):

	var value float64
	conn := c.Conn(ctx, t.L(), node)
	defer conn.Close()

nit: this isn't necessary afaik, because Conn() returns a sql.DB, whose connections are managed automatically. If it returned a sql.Conn then we'd have to return it to the pool.

https://pkg.go.dev/database/sql#DB.Close


pkg/cmd/roachtest/tests/lease_preferences.go line 126 at r3 (raw file):

				// ...
				// dc=N: n2N-1 n2N
				fmt.Sprintf("--locality=region=fake-region,zone=fake-zone,dc=%d", (node-1)/2+1))

nit: consider install.NumRacksOption() which automatically sets up N racks [0,N) with locality rack and round-robins nodes across them.


pkg/cmd/roachtest/tests/lease_preferences.go line 183 at r3 (raw file):

	require.NoError(t, WaitForReplication(ctx, t, conn, spec.replFactor))
	c.Run(ctx, c.Node(numNodes), fmt.Sprintf(
		`./cockroach workload init kv --splits %d {pgurl:%d}`, spec.ranges, numNodes))

May want to do a scatter here too, with --scatter.


pkg/cmd/roachtest/tests/lease_preferences.go line 200 at r3 (raw file):

	// Wait for the preference conformance with some non-live nodes.
	checkLeasePreferenceConformance(ctx)

I don't think there's anything here that causes us to reacquire leases for the ranges that lost their leaseholders. We only do this if there's traffic on the range. Consider doing a select count(*) from kv to ensure we acquire a lease on all ranges (typically acquired on the gateway node, or the replica closest to it).


pkg/cmd/roachtest/tests/lease_preferences.go line 201 at r3 (raw file):

	// Wait for the preference conformance with some non-live nodes.
	checkLeasePreferenceConformance(ctx)
	// Start back up the stopped nodes, before finishing the test.

This isn't necessary if you set SkipPostValidations: registry.PostValidationNoDeadNodes.


pkg/kv/kvserver/replica_proposal.go line 529 at r4 (raw file):

			preferenceStatus := makeLeasePreferenceStatus(st, r.store.StoreID(), r.store.Attrs(),
				r.store.nodeDesc.Attrs, r.store.nodeDesc.Locality, r.mu.conf.LeasePreferences)
			switch preferenceStatus {

Should we have a default clause to guard against new statuses?


pkg/kv/kvserver/replica_range_lease.go line 116 at r4 (raw file):

// configured preferences, it is enqueued in the replicate queue for
// processing.
var LeaseCheckPreferencesOnAcquisitionEnabled = settings.RegisterBoolSetting(

Consider a TODO to remove this in 24.1.


pkg/kv/kvserver/replicate_queue.go line 89 at r4 (raw file):

	// removal, non-voter addition and rebalancing.
	// See allocatorimpl.AllocatorAction.Priority.
	replicateQueueLeasePreferencePriority = 1001

Would it make sense to enqueue this above upreplication, if we're using a non-zero priority anyway? Presumably these operations are cheap, as long as they can't starve out upreplication on failures.


pkg/kv/kvserver/allocator/plan/replicate.go line 884 at r5 (raw file):

		)
		if err != nil {
			return nil, stats, err

This error can bubble up to callers that don't expect it. From a cursory read of the code, looks like it'll hit adminScatter and ReplicateQueueDryRun. The only reason why we return this error up is to get the replica into purgatory -- is there a better way to do that which won't leak errors for other callers?

@kvoli kvoli force-pushed the 230721.lease-preferences branch from 27ec04c to e283c4a on July 25, 2023 16:06
Contributor Author

@kvoli kvoli left a comment


Thanks for the review. I agree this requires another set of eyes.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @herkolategan, and @smg260)


pkg/cmd/roachtest/tests/failover.go line 1773 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: this isn't necessary afaik, because Conn() returns a sql.DB, whose connections are managed automatically. If it returned a sql.Conn then we'd have to return it to the pool.

https://pkg.go.dev/database/sql#DB.Close

When I don't add this in, I see leaked goroutines after the test passes:

15:58:29 main.go:569: runTests destroying all clusters
15:58:34 leaktest.go:161: Leaked goroutine: goroutine 12 [select, 1 minutes]:
database/sql.(*DB).connectionOpener(0xc004c41d40, {0x84685a8, 0xc004c2e740})
        GOROOT/src/database/sql/sql.go:1224 +0x8d
created by database/sql.OpenDB
        GOROOT/src/database/sql/sql.go:792 +0x18d
Leaked goroutine: goroutine 14 [select, 1 minutes]:
database/sql.(*DB).connectionOpener(0xc003a07d40, {0x84685a8, 0xc001402b00})
        GOROOT/src/database/sql/sql.go:1224 +0x8d
created by database/sql.OpenDB

pkg/cmd/roachtest/tests/lease_preferences.go line 126 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: consider install.NumRacksOption() which automatically sets up N racks [0,N) with locality rack and round-robins nodes across them.

Perhaps premature on my part; I was envisioning some additional tests here with more interesting topologies. Specifying a non-rack locality, like dc, makes this a bit easier.


pkg/cmd/roachtest/tests/lease_preferences.go line 183 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

May want to do a scatter here too, with --scatter.

I do want a scatter here. I thought these were scattered by default for workload imports? Added a --scatter here just to be sure.


pkg/cmd/roachtest/tests/lease_preferences.go line 200 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I don't think there's anything here that causes us to reacquire leases for the ranges that lost their leaseholders. We only do this if there's traffic on the range. Considering doing a select count(*) from kv to ensure we acquire a lease on all ranges (typically acquired on the gateway node, or the replica closest to it).

The test asserts on the preference metrics. If there's no valid leaseholder, the violating/undersatisfied metrics won't be bumped - so the test still passes.

Do you think we get more coverage out of doing this? I'm thinking we may want both?


pkg/cmd/roachtest/tests/lease_preferences.go line 201 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This isn't necessary if you set SkipPostValidations: registry.PostValidationNoDeadNodes.

Ack - I set that originally but thought it might be preferred to bring up the dead nodes anyway. Thinking again, it's not really adding much.

Swapped to SkipPostValidations: registry.PostValidationNoDeadNodes


pkg/kv/kvserver/replica_proposal.go line 529 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Should we have a default clause to guard against new statuses?

Do you mean moving all the statuses, other than leasePreferencesViolating under default?

Added a default catch-all here; currently it catches leasePreferencesUnknown.


pkg/kv/kvserver/replica_range_lease.go line 116 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Consider a TODO to remove this in 24.1.

Done.


pkg/kv/kvserver/replicate_queue.go line 89 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Would it make sense to enqueue this above upreplication, if we're using a non-zero priority anyway? Presumably these operations are cheap, as long as they can't starve out upreplication on failures.

Once it gets processed by the queue, the action is recomputed and only when there's no higher priority action or rebalancing will the lease preference stuff happen.

This would be expedient when these ranges don't need any action performed, but there's up-replication required on other ranges.

The risk I see is that these ranges also require some other higher priority action, e.g. decommissioning - and that this would invert the priority queue ordering for up-replication.


pkg/kv/kvserver/allocator/plan/replicate.go line 884 at r5 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This error can bubble up to callers that don't expect it. From a cursory read of the code, looks like it'll hit adminScatter and ReplicateQueueDryRun. The only reason why we return this error up is to get the replica into purgatory -- is there a better way to do that which won't leak errors for other callers?

Having ReplicateQueueDryRun bubble up this error seems useful to me. You wouldn't want to get different behavior when debugging via a dry run.

Good point on scatter. It won't do anything useful by returning this error on scatter. I've updated to swallow the error on scatter and log (v=3) instead.

@kvoli kvoli requested a review from erikgrinaker July 25, 2023 16:15
Contributor

@erikgrinaker erikgrinaker left a comment


LGTM, but yeah, let's get Andrew's ✅ as well.

Reviewed 11 of 11 files at r6, 4 of 4 files at r7, 3 of 3 files at r8, 5 of 5 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, @kvoli, and @smg260)


pkg/cmd/roachtest/tests/failover.go line 1773 at r3 (raw file):

Previously, kvoli (Austen) wrote…

When I don't add this in, I see leaked goroutines after the test passes:

15:58:29 main.go:569: runTests destroying all clusters
15:58:34 leaktest.go:161: Leaked goroutine: goroutine 12 [select, 1 minutes]:
database/sql.(*DB).connectionOpener(0xc004c41d40, {0x84685a8, 0xc004c2e740})
        GOROOT/src/database/sql/sql.go:1224 +0x8d
created by database/sql.OpenDB
        GOROOT/src/database/sql/sql.go:792 +0x18d
Leaked goroutine: goroutine 14 [select, 1 minutes]:
database/sql.(*DB).connectionOpener(0xc003a07d40, {0x84685a8, 0xc001402b00})
        GOROOT/src/database/sql/sql.go:1224 +0x8d
created by database/sql.OpenDB

Hm ok, that's annoying -- thanks!


pkg/cmd/roachtest/tests/lease_preferences.go line 126 at r3 (raw file):

Previously, kvoli (Austen) wrote…

Perhaps premature on my part, I was envisioning some additional tests here which have more interesting topologies. Specifying a non-rack locality, like dc makes this a bit easier.

I see, that's fine.


pkg/cmd/roachtest/tests/lease_preferences.go line 183 at r3 (raw file):

Previously, kvoli (Austen) wrote…

I do want a scatter here. I thought these were scattered by default for workload imports? Added a --scatter here just to be sure.

I don't believe they are, at least not for a kv init. Other workloads may scatter automatically.


pkg/cmd/roachtest/tests/lease_preferences.go line 200 at r3 (raw file):

Previously, kvoli (Austen) wrote…

The test asserts on the preference metrics. If there's no valid leaseholder, the violating/undersatisfied metrics won't be bumped - so the test still passes.

Do you think we get more coverage out of doing this? I'm thinking we may want both?

I think we'll have to do this, otherwise what are we even testing?

Consider the trivial case, where all leases are on 1 node. If the node dies, and we don't acquire leases anywhere else, then there are 0 valid leases in the cluster, and the test trivially passes.

To test what we want to test here, we have to make sure leases are acquired outside of the preferences, and then moved where we want them to be. We can either do this by running a count(*) on a node outside of the preferences, or explicitly move the leases there with an ALTER RANGE RELOCATE LEASE.

FWIW, on the original issue, I ran some experiments with killing nodes and for some reason this did seem to get leases back to preferred regions more reliably (didn't look into why). When I moved the leases manually, I would always get lease violations that persisted for several minutes. So we may want to have a case for this too, since it appears more reliable, but your call.


pkg/kv/kvserver/replica_proposal.go line 529 at r4 (raw file):

Previously, kvoli (Austen) wrote…

Do you mean moving all the statuses, other than leasePreferencesViolating under default?

Added a default catch all here, currently it catches leasePreferencesUnknown.

No, I meant having explicit cases for all known statuses, and error out in the default case -- to make sure we don't forget to handle any new statuses that may be added later. Like this:

switch ba.RoutingPolicy {
case kvpb.RoutingPolicy_LEASEHOLDER:
	replicaFilter = OnlyPotentialLeaseholders
case kvpb.RoutingPolicy_NEAREST:
	replicaFilter = AllExtantReplicas
default:
	log.Fatalf(ctx, "unknown routing policy: %s", ba.RoutingPolicy)
}

But we could just check for Violating alone here too, I suppose, since that's the only status we care about. Nbd.


pkg/kv/kvserver/replicate_queue.go line 89 at r4 (raw file):

Previously, kvoli (Austen) wrote…

Once it gets processed by the queue, the action is recomputed and only when there's no higher priority action or rebalancing will the lease preference stuff happen.

This would be expedient when these ranges don't need any action performed, but there's up-replication required on other ranges.

The risk I see, is that these ranges also require some other higher priority action e.g. decommissioning etc. - and that this would invert the priority queue ordering for up-replication.

Ok, I trust your and Andrew's judgement on this.


@andrewbaptist andrewbaptist left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @herkolategan, @kvoli, and @smg260)


pkg/kv/kvserver/replica_proposal.go line 533 at r9 (raw file):

				// We could also enqueue the lease when we are a less preferred
				// leaseholder, however the replicate queue will eventually get to it and
				// we already satisfy _some_ preference.

It seems best to enqueue this rather than wait in the less preferred state. I could easily see customers specifying every possible node as part of their constraint definition from best to worst. In that case, we will always be in LessPreferred and never see Violating. For a similar reason, it might just be simpler to remove the difference between these two enums (LessPreferred and Violating). It seems like the only harm of queuing is we do additional work.


pkg/kv/kvserver/replicate_queue.go line 89 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Ok, I trust your and Andrew's judgement on this.

I agree with @kvoli. However, consider moving this constant into allocator.go near the Priority() method (even though it's not used there directly). Since it is intimately tied to those priorities, it is important that anyone who changes them is aware of it.


pkg/kv/kvserver/replicate_queue.go line 822 at r9 (raw file):

	} else if change.Action == allocatorimpl.AllocatorConsiderRebalance &&
		!change.Replica.LeaseViolatesPreferences(ctx) {

There is a "bad" case here where we finish rebalancing and there is a place to put this that doesn't violate lease preferences. We will "busy loop" here waiting for a replica to be created. Take the example with 3 regions, and only 1 node per region. If the preference is set for one region and that region goes down, we will constantly be adding this back. Since this is "low priority" it doesn't block any other actions, but it will potentially stop other rebalancing. Have you tried this case (either manually or in a test)?


@andrewbaptist andrewbaptist left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @herkolategan, @kvoli, and @smg260)


pkg/kv/kvserver/replicate_queue.go line 822 at r9 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

There is a "bad" case here where we finish rebalancing and there is a place to put this that doesn't violate lease preference. We will "busy loop" here waiting for a replica to be created. Take the example with 3 regions, and only 1 node per region. If the preference is set for one region and that region goes down, we will constantly be adding this back. Since this is "low priority" it doesn't block any other actions, but it will potentially stop other rebalancing. Have you tried this case (either manually or in a test?)

Retract: I forgot the purgatory state will handle this. Maybe just add a comment here describing how this relates to purgatory. It is pretty subtle.

@kvoli kvoli force-pushed the 230721.lease-preferences branch from e283c4a to dcca976 on July 25, 2023 19:11
Contributor Author

@kvoli kvoli left a comment


Thanks for the review Erik! I'll wait for Andrew to approve before merging.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @herkolategan, and @smg260)


pkg/cmd/roachtest/tests/lease_preferences.go line 200 at r3 (raw file):

I think we'll have to do this, otherwise what are we even testing?

In the testing I've done, there are always num ranges valid leases - so it was being tested. I didn't check the acquisition mechanism - maybe this won't hold for 23.1.

Consider the trivial case, where all leases are on 1 node. If the node dies, and we don't acquire leases anywhere else, then there are 0 valid leases in the cluster, and the test trivially passes.

See above, I'm assuming that (most of) these leases get acquired somewhere even without intervention. Seems like a shaky assumption, so I'll add in a select count(*) and load a small amount of rows.

When I moved the leases manually, I would always get lease violations that persisted for several minutes. So we may want to have a case for this too, since it appears more reliable, but your call.

Good idea, added a test for this case.


pkg/kv/kvserver/replica_proposal.go line 529 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

No, I meant having explicit cases for all known statuses, and error out in the default case -- to make sure we don't forget to handle any new statuses that may be added later. Like this:

switch ba.RoutingPolicy {
case kvpb.RoutingPolicy_LEASEHOLDER:
replicaFilter = OnlyPotentialLeaseholders
case kvpb.RoutingPolicy_NEAREST:
replicaFilter = AllExtantReplicas
default:
log.Fatalf(ctx, "unknown routing policy: %s", ba.RoutingPolicy)
}

But we could just check for Violating alone here too, I suppose, since that's the only status we care about. Nbd.

Ah yeah, makes sense. Updated to fatal on the default case, and explicitly enumerate the others.


pkg/kv/kvserver/replicate_queue.go line 89 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Ok, I trust your and Andrew's judgement on this.

ack - I'll see what @andrewbaptist thinks.

@kvoli kvoli force-pushed the 230721.lease-preferences branch from dcca976 to 4dd2975 on July 25, 2023 19:29
Contributor Author

@kvoli kvoli left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @herkolategan, and @smg260)


pkg/kv/kvserver/replica_proposal.go line 533 at r9 (raw file):

It seems best to enqueue this rather than wait in the less preferred state.

The reason I chose not to is that it would also compete with lease-preference-violating replicas on mass lease acquisition. I am concerned about the additional work. It doesn't seem worth the risk in a backport; however, it does seem useful to add on master to bake.

it might just be simpler to remove the difference between these two enums (LessPreferred and Violating).

I think the difference is important. These are ordered preferences, not constraints after all. The difference between satisfying some preference and satisfying none is largely dependent on the user's setup.


pkg/kv/kvserver/replicate_queue.go line 89 at r4 (raw file):

Previously, kvoli (Austen) wrote…

ack - I'll see what @andrewbaptist thinks.

There are many places specifying a priority outside the regular allocatorimpl.AllocatorAction enum.

This would need a specific action associated with it, and there are currently no actions associated with leases. I'd prefer if we could defer introducing new action+priority combos until we have a more coherent story with leases and #90110. This struck me as a reasonable middle ground for now.


pkg/kv/kvserver/replicate_queue.go line 822 at r9 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Retract: I forgot the purgatory state will handle this. Maybe just add a comment here describing how this relates to purgatory. It is pretty subtle.

Good idea, added a comment here to describe what would happen, and linking the purgatory error.

@kvoli kvoli requested a review from andrewbaptist July 25, 2023 19:35
Contributor

@erikgrinaker erikgrinaker left a comment


Reviewed 3 of 3 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, @kvoli, and @smg260)


pkg/cmd/roachtest/tests/lease_preferences.go line 200 at r3 (raw file):

In the testing I've done, there are always num ranges valid leases - so It was being tested.

Hm ok. We did add eager lease acquisition in #101098, but that only works for unquiesced ranges, so I wouldn't really expect that to fire here. The other possibilities are that we're using expiration leases (this test doesn't, but maybe it should use MetamorphicLeases), that the replicate queue acquired the leases (it does so once per cycle), or that some other background process was accessing the ranges (e.g. SQL stats scans).

Either way, yeah, let's do a scan to make sure.

load a small amount of rows

We don't need any rows for this to work, a select count(*) needs valid leases on all ranges to know that they're empty.

added a test for this case

Thanks!

@kvoli kvoli force-pushed the 230721.lease-preferences branch 2 times, most recently from 711642c to 5451094 on July 25, 2023 20:18

@andrewbaptist andrewbaptist left a comment


:lgtm:

Thanks for the clarification and changes!

kvoli and others added 3 commits July 25, 2023 22:38
Previously, there were no e2e tests which assert on lease preference
conformance. This commit adds three `lease-preferences` tests, which
assert that lease preferences are conformed to. The time taken to
conform is also reported in the test logs.

The first test, `.../partial-first-preference-down`, takes down one of
the preferred locality nodes (1/2).

The second test, `.../full-first-preference-down`, takes down both the
preferred locality nodes (2/2).

The third test, `.../manual-violating-transfer`, manually transfers
leases to a locality which violates the configured leases preferences.

Informs: cockroachdb#106100
Epic: none

Release note: None
When a leaseholder is lost, any surviving replica may acquire the lease,
even if it violates lease preferences. There are two main reasons for
this: we need to elect a new Raft leader who will acquire the lease,
which is agnostic to lease preferences, and there may not even be any
surviving replicas that satisfy the lease preferences at all, so we
don't want to keep the range unavailable while we try to figure this out
(considering e.g. network timeouts can delay this for many seconds).

However, after acquiring a lease, we rely on the replicate queue to
transfer the lease back to a replica that conforms with the preferences,
which can take several minutes. In multi-region clusters, this can cause
severe latency degradation if the lease is acquired in a remote region.

This patch will detect lease preference violations when a replica
acquires a new lease, and eagerly enqueue it in the replicate queue for
transfer (if possible).

This behavior can be turned off (on by default), by the cluster setting
`kv.lease.check_preferences_on_acquisition.enabled`.

Epic: none
Release note (bug fix): When losing a leaseholder and using lease
preferences, the lease can be acquired by any other replica (regardless
of lease preferences) in order to restore availability as soon as
possible. The new leaseholder will now immediately check if it violates
the lease preferences, and attempt to transfer the lease to a replica
that satisfies the preferences if possible.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
This patch places a replica in the replicate queue purgatory when
it has a lease violating the lease preferences and it's unable to find a
suitable target. This causes the replica to be retried more often.

This will only trigger when replicas are eagerly enqueued (typically
when we acquire a new lease that violates preferences), since we
otherwise don't attempt to enqueue replicas when they don't have a valid
lease transfer target.

This patch also enables requeuing replicas after a successful rebalance,
when the lease violates preferences.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
@kvoli kvoli force-pushed the 230721.lease-preferences branch from 5451094 to 1fac2f8 on July 25, 2023 22:39
@kvoli
Contributor Author

kvoli commented Jul 25, 2023

TYFTRs!

rebased to pick up first two commits.

This will require a manual backport to resolve conflicts, which I'll open after merging.

@kvoli
Contributor Author

kvoli commented Jul 26, 2023

bors r=erikgrinaker,andrewbaptist

@craig craig bot merged commit 68985a2 into cockroachdb:master Jul 26, 2023
@craig
Contributor

craig bot commented Jul 26, 2023

Build succeeded:

kvoli added a commit to kvoli/cockroach that referenced this pull request Aug 1, 2023
In cockroachdb#107507, we began eagerly enqueuing into the replicate queue, when
a replica acquired a new lease which violated lease
preferences. Lease preferences were only considered violated when the
lease itself was valid. In cockroachdb#107507, we saw that it is uncommon, but
possible for an invalid lease to be acquired, violate lease preferences
and not be enqueued as a result. The end result was a leaseholder
violating the applied lease preferences which would not be resolved
until the next scanner cycle.

Update the eager enqueue check to only check that the replica is the
incoming leaseholder when applying the lease, and that the replica
violates the applied lease preferences. The check now applies on any
lease acquisition, where previously it only occurred on the leaseholder
changing.

Note the purgatory error introduced in cockroachdb#107507 still checks that the
lease is valid and owned by the store before proceeding. It is a
condition that the lease must be valid+owned by the store to have a
change planned, so whilst it is possible the lease becomes invalid
somewhere in-between planning, when the replica applies a valid lease,
it will still be enqueued, so purgatory is unnecessary.

Fixes: cockroachdb#107862
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this pull request Aug 2, 2023
kvoli added a commit to kvoli/cockroach that referenced this pull request Aug 2, 2023
craig bot pushed a commit that referenced this pull request Aug 2, 2023
107075: asim: add random predefined cluster config selection r=kvoli a=wenyihu6

This patch takes the first step towards a randomized framework by enabling asim
testing to randomly select a cluster information configuration from a set of
predefined choices. These choices are hardcoded and represent common cluster
configurations.

TestRandomized now takes in `true` for randOptions.cluster to enable random
cluster config selection. This provides two modes for cluster generation:
1. Default: currently set to 3 nodes and 1 store per node
2. Random: randomly select a predefined cluster info

Part of: #106311

Release note: None

107967: kvserver: ignore lease validity when checking lease preferences r=erikgrinaker,andrewbaptist a=kvoli

In #107507, we began eagerly enqueuing a replica into the replicate
queue when it acquired a new lease that violated the lease preferences.
However, lease preferences were only considered violated when the lease
itself was valid. In #107507, we saw that it is uncommon, but possible,
for an invalid lease to be acquired, violate the lease preferences, and
not be enqueued as a result. The end result was a leaseholder violating
the applied lease preferences, which would not be resolved until the
next scanner cycle.

Update the eager enqueue check so that it only verifies that the
replica is the incoming leaseholder when applying the lease and that
the replica violates the applied lease preferences.

Note that the purgatory error introduced in #107507 still checks that
the lease is valid and owned by the store before proceeding. A change
can only be planned while the lease is valid and owned by the store, so
whilst the lease may become invalid at some point during planning, the
replica will still be enqueued when it applies a valid lease; purgatory
is therefore unnecessary.

Fixes: #107862
Release note: None

107984: roachtest: bugfix testeng grafana link missing cluster name r=renatolabs a=annrpom

Previously, the testeng grafana link generated after a roachtest failure directed the user to cockroachlabs.com because the cluster name was missing from the link. This patch fixes the issue by getting the cluster name from `*githubIssues.cluster.name` instead of the `clusterName` in roachtest/cluster.go.

Fixes: #107894

Release note (bug fix): The link to testeng grafana that is generated on a roachtest failure should now properly direct to grafana.

Co-authored-by: wenyihu6 <wenyi.hu@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Annie Pompa <annie@cockroachlabs.com>
kvoli added a commit to kvoli/cockroach that referenced this pull request Aug 3, 2023

Successfully merging this pull request may close these issues.

kvserver: eagerly move leases to preferred regions

4 participants