storage: allow removing leaseholder in ChangeReplicas #40333

@tbg

The replicate queue code is littered with decisions to transfer a lease away simply because the leaseholder should be removed. When the replication factor is low or there are constraints on the range, we may be unable to transfer the lease away until we've added another replica. Adding another replica before removing the leaseholder is precisely what we are trying to avoid in #12768.

TestInitialPartitioning definitely hits this problem (it will fail if rebalancing is always carried out atomically), so it's a good starting point for investigation.

The lease transfer code that the queue uses (centered around allocator.TransferLeaseTarget) is also very hard to reason about, so I'm not even sure whether it sometimes fails to find a target when one exists.

With all that in mind, it'd be nice if we could just issue any replication change, including one that removes the leaseholder. The reason we don't allow this today is that it would either wedge the range (because the lease remains active but the leaseholder is no longer serving the range) or cause potential anomalies (if we allow another node to get the lease without properly invalidating the old one).

The right way to make this work, I think, is to use the fact that the replica change removing the leaseholder is necessarily evaluated and proposed on the leaseholder. If we intercept that request and make sure that it causes the leaseholder to behave as if a TransferLease had been issued (setting its minProposedTS, etc.), then we could treat a lease whose leaseholder store isn't in the descriptor as invalid, even if the epoch is still active. Any member of the range could then obtain the lease immediately after the replica change has gone through.
We'll need to allow leases to be obtained by VOTER_INCOMING replicas, but this is fine.
We could also add a leaseholder hint to the replica change to fix the lease transfer target, which is something the replicate queue (or allocator 2.0) would want to do.
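
To make the proposed rule concrete, here is a minimal Go sketch of the two checks described above: a lease is treated as invalid once its holder's store is no longer in the range descriptor, and VOTER_INCOMING replicas are eligible to acquire the lease. All types and function names here (`Lease`, `RangeDescriptor`, `LeaseValid`, `CanAcquireLease`) are hypothetical simplifications for illustration, not the actual CockroachDB types or API, and the sketch ignores the minProposedTS handling entirely.

```go
package main

// ReplicaType is a hypothetical stand-in for a replica's role in the range.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
	Learner
)

// Replica is a simplified (assumed) replica descriptor.
type Replica struct {
	StoreID int
	Type    ReplicaType
}

// RangeDescriptor is a simplified (assumed) range descriptor.
type RangeDescriptor struct {
	Replicas []Replica
}

// Lease is a simplified (assumed) epoch-based lease.
type Lease struct {
	HolderStoreID int
	EpochLive     bool // whether the holder's liveness epoch is still active
}

// find returns the replica on the given store, if the descriptor has one.
func (d RangeDescriptor) find(storeID int) (Replica, bool) {
	for _, r := range d.Replicas {
		if r.StoreID == storeID {
			return r, true
		}
	}
	return Replica{}, false
}

// LeaseValid sketches the proposed rule: even with a live epoch, a lease is
// invalid if its holder's store is no longer part of the descriptor.
func LeaseValid(l Lease, d RangeDescriptor) bool {
	if !l.EpochLive {
		return false
	}
	_, ok := d.find(l.HolderStoreID)
	return ok
}

// CanAcquireLease sketches the eligibility rule: full voters and incoming
// voters may obtain the lease; other replica types and non-members may not.
func CanAcquireLease(storeID int, d RangeDescriptor) bool {
	r, ok := d.find(storeID)
	if !ok {
		return false
	}
	return r.Type == VoterFull || r.Type == VoterIncoming
}
```

Under these assumptions, once a replica change removes the leaseholder's store from the descriptor, `LeaseValid` returns false for the old lease regardless of its epoch, and any remaining full or incoming voter passes `CanAcquireLease`.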

Jira issue: CRDB-5549

Metadata

Assignees

No one assigned

    Labels

    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    T-kv: KV Team
    X-stale
    no-issue-activity