kvcoord: fail-fast when all replicas of a range are unavailable

With #33007, when a range loses quorum, SQL clients will generally see fail-fast behavior: access to the unavailable range immediately results in an error instead of hanging indefinitely (as is the case in 21.2 and earlier). However, when a range has lost all of its replicas (or all of its replicas are unreachable), I believe that DistSender will keep retrying forever:

- look up descriptor (say r1/1 r1/2 r1/3)
- try r1/1 (fail)
- try r1/2 (fail)
- try r1/3 (fail)
- hit a SendError [here]
- eject descriptor & re-lookup, goto beginning

While we do try to be resilient to network blips, there is probably value in a heuristic: once a request has been attempted twice against every possible replica, it's time to give up.

We will want to return a `RangeUnavailableError` in this case (similar to #74500) and have similar SQL UX (#74502).

[here]: https://github.com/cockroachdb/cockroach/blob/d54e0dda66b9b4a7122631ae7c222391322ae6bc/pkg/kv/kvclient/kvcoord/dist_sender.go#L2217-L2219


Jira issue: CRDB-12121

kvcoord: fail-fast when all replicas of a range are unavailable #74503
