kvserver: persistent outage when liveness leaseholder deadlocks #80713

@tbg

Description

Describe the problem

See #79648.

Quoting an internal support issue:

RCA: n1 experienced a disk stall. It held the lease for the liveness range r2, and this leaseholder was cached by all nodes. When n1 locked up, all liveness requests continued to be redirected to n1. Since these requests carried timeouts, they did eventually return from DistSender.Send, but the timeout did not invalidate the cached leaseholder entry, rendering the outage permanent. The liveness range was likely not technically unavailable; nodes simply never figured out to talk to the new leaseholder.
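A minimal sketch of the failure mode and the obvious remedy, not CockroachDB's actual DistSender: the type and function names (`leaseholderCache`, `send`) and the evict-on-timeout policy are illustrative assumptions. Without the eviction step, every retry re-contacts the stalled node and the outage is permanent; with it, the next attempt falls back to the other replicas.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTimeout = errors.New("request timed out")

// leaseholderCache is a hypothetical per-range cache of the
// last-known leaseholder (rangeID -> nodeID).
type leaseholderCache struct {
	mu    sync.Mutex
	cache map[int]int
}

func newLeaseholderCache() *leaseholderCache {
	return &leaseholderCache{cache: map[int]int{}}
}

func (c *leaseholderCache) get(rangeID int) (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.cache[rangeID]
	return n, ok
}

func (c *leaseholderCache) set(rangeID, nodeID int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[rangeID] = nodeID
}

func (c *leaseholderCache) evict(rangeID int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.cache, rangeID)
}

// send tries the cached leaseholder first. The key step sketched here:
// a timeout evicts the cache entry, so the next pass tries the other
// replicas instead of re-contacting the stalled node forever.
func send(c *leaseholderCache, rangeID int, replicas []int,
	rpc func(node int) (leaseholder bool, err error)) (int, error) {
	if n, ok := c.get(rangeID); ok {
		lh, err := rpc(n)
		if err == nil && lh {
			return n, nil
		}
		if errors.Is(err, errTimeout) {
			// Without this eviction, the stale entry makes the outage permanent.
			c.evict(rangeID)
		}
	}
	for _, n := range replicas {
		if lh, err := rpc(n); err == nil && lh {
			c.set(rangeID, n) // remember the new leaseholder
			return n, nil
		}
	}
	return 0, errors.New("range unavailable")
}

func main() {
	const livenessRange = 2
	c := newLeaseholderCache()
	c.set(livenessRange, 1) // all nodes cached n1 as leaseholder before the stall

	rpc := func(node int) (bool, error) {
		if node == 1 {
			return false, errTimeout // n1's disk has stalled; every RPC times out
		}
		return node == 2, nil // the lease has in fact moved to n2
	}

	node, err := send(c, livenessRange, []int{1, 2, 3}, rpc)
	fmt.Println(node, err) // falls back and finds n2
}
```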

To Reproduce

See the above PR (note that the second commit contains a hacky fix; to get a repro, remove it).

Expected behavior

Failover as usual

Additional data / screenshots
Environment:

  • master at the time of writing, and presumably all past versions

Additional context

Persistent total cluster outage, since all nodes failed their heartbeats.
Resolved only when the deadlocked node was brought down.

Jira issue: CRDB-15539

gz#13737

Epic CRDB-19227

gz#19526

Metadata

    Labels

    A-kv-client — Relating to the KV client and the KV interface.
    C-bug — Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    S-2-temp-unavailability — Temp crashes or other availability problems. Can be worked around or resolved by restarting.
    T-kv — KV Team