Skip to content

kv/storage: requests need retries after a LeaseTransfer #8816

@andreimatei

Description

@andreimatei

Here's a nice one: node A executes a TransferLease, transferring to node B. Immediately after, if A receives a request, it will say NotLeaseHolderError and redirect to B (as the post-commit trigger for TransferLease has updated the lease).
B is not aware of this transfer, as it hasn't executed the respective trigger yet, so when it receives the request it will say NotLeaseHolderError and redirect back to A.
The DistSender.sendToReplicas runs through all the replicas and then returns a SendError, which the higher-level sendChunk takes to mean that the range descriptor in its cache is stale, and so it evicts it from the cache and tries again, with infinite retries. This hides the error from clients, but still this is not ideal.

I'm wondering what to do about it, if anything. I think options are:
a) We make the *Admin*TransferLease RPC (not the Raft TransferLease) block until B has applied the new lease. This is doable particularly since I'm currently introducing a command for reading the current lease. This only solves the problem for tests which explicitly do a transfer and then want to send more commands. And currently only tests use lease transfers... But then again, this is not really a problem for tests anyway.
b) We make the DistSender understand this situation and localize the retries in sendToReplicas. This can be done, for example, by having sendToReplicas keep track of the NotLeaseHolderErrors it gets and seeing if nodes contradict each other. This is helped by the fact that I'm currently working to include the current lease in these errors, so the DistSender could assert that this case is happening is the leases reported back by different nodes don't agree and the alleged lease holder by the most recent lease is itself reporting a stale lease.
c) We try to reduce the possibility of a node being oblivious to the fact that it has a lease coming its way by making a transferee aware of the incoming lease before Raft actually applies it, and have it block incoming requests for the respective range for a grace period, until the lease has been applied. Maybe there's a way to peak at the Raft commands a node votes on? And better yet, maybe there's a way to require that a particular node vote on a request before the Raft leader considers a proposal committed (so that a transferer can require the transferee to become aware of a transfer early)?
d) (suggested by @tschottdorf) We try to reduce the possibility of this happening by attempting to move Raft leadership before proposing a LeaseTransfer. This would increase the probability that the transferee applies the LeaseTransfer before the transferer returns control.
This proposal doesn't seem to have a downside, except that it's all best effort?

This would become a real problem in prod if we start using lease transfers on node shutdown and down-replication (which I think we should). It's not currently a problem because leases only move once the old one expires, in which case nodes decide to acquire a lease while serving some request, and that request is blocked until the lease is acquired. So there's never a situation in which a "lease holder" is not aware of the most recent lease. But of course this all has the bigger price of having to wait for a lease to expire until a new holder emerges.

One thing that became apparent while looking at this is that we don't have any visibility into DistSender retries.

@bdarnell any thoughts?

Metadata

Metadata

Assignees

Labels

A-kv-clientRelating to the KV client and the KV interface.C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions