kvserver: DistSender should detect lease expiration and redirect requests #105168
Description
The DistSender makes an educated guess about who the current leaseholder is, sends a request to it, and then simply waits for the response -- typically either a successful response or a NotLeaseHolderError telling it who the leaseholder is. However, in some cases the request never returns, and the DistSender waits indefinitely. Typical cases are:
- Disk stalls.
- Replica stalls.
- Raft reproposals, e.g. due to network partition.
With expiration-based leases, the remote replica will eventually lose its lease in these cases, but the DistSender cache will keep pointing to the stalled replica. Requests also remain stuck, which can cause the entire workload to stall if it has a bounded number of workers/connections that all get stuck.
The DistSender should detect when the remote replica loses its lease, and try to discover a new leaseholder elsewhere, redirecting the requests there when possible. Write requests can't easily be retried though, so we should only cancel them once we confirm that there is in fact a new leaseholder elsewhere.
This would improve resolution of both partial partitions (#103769) and disk/replica stalls (#104262).
Jira issue: CRDB-28907
Epic CRDB-25200