kvserver: persistent outage when liveness leaseholder deadlocks #80713
Description
Describe the problem
See #79648.
Quoting an internal support issue:
RCA: n1 experienced a disk stall. It held the lease for the liveness range r2, and this information was cached by all nodes. When n1 locked up, all liveness requests continued to get redirected to n1. Since all of these requests had a timeout on them, they returned from DistSender.Send. This did not invalidate the cached leaseholder entry, thus rendering this outage permanent. It is likely that the liveness range wasn't technically unavailable, if only nodes had figured out to talk to the new leaseholder.
To Reproduce
See the PR above (#79648). Note that its second commit contains a hacky fix; remove that commit to get a repro.
Expected behavior
Failover as usual
Additional data / screenshots
Environment:
- master at the time of writing and, presumably, all past versions
Additional context
Persistent total cluster outage since all nodes failed heartbeats.
Resolved only when the deadlocked node was brought down.
Jira issue: CRDB-15539
gz#13737
Epic CRDB-19227
gz#19526