kvserver: persistent outage when liveness leaseholder deadlocks #80713

@tbg

Description

Describe the problem

See #79648.

Quoting an internal support issue:

RCA: n1 experienced a disk stall. It held the lease for the liveness range r2, and this leaseholder was cached by all nodes. When n1 locked up, all liveness requests continued to be redirected to n1. Since these requests carried timeouts, they did eventually return from DistSender.Send, but the timeout did not invalidate the cached leaseholder entry, rendering the outage permanent. The liveness range was likely not technically unavailable; nodes simply never figured out to talk to the new leaseholder.
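A minimal sketch of the failure mode and the obvious remedy, not CockroachDB's actual DistSender: the type and function names (`leaseholderCache`, `send`) and the evict-on-timeout policy are illustrative assumptions. Without the eviction step, every retry re-contacts the stalled node and the outage is permanent; with it, the next attempt falls back to the other replicas.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTimeout = errors.New("request timed out")

// leaseholderCache is a hypothetical per-range cache of the
// last-known leaseholder (rangeID -> nodeID).
type leaseholderCache struct {
	mu    sync.Mutex
	cache map[int]int
}

func newLeaseholderCache() *leaseholderCache {
	return &leaseholderCache{cache: map[int]int{}}
}

func (c *leaseholderCache) get(rangeID int) (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.cache[rangeID]
	return n, ok
}

func (c *leaseholderCache) set(rangeID, nodeID int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[rangeID] = nodeID
}

func (c *leaseholderCache) evict(rangeID int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.cache, rangeID)
}

// send tries the cached leaseholder first. The key step sketched here:
// a timeout evicts the cache entry, so the next pass tries the other
// replicas instead of re-contacting the stalled node forever.
func send(c *leaseholderCache, rangeID int, replicas []int,
	rpc func(node int) (leaseholder bool, err error)) (int, error) {
	if n, ok := c.get(rangeID); ok {
		lh, err := rpc(n)
		if err == nil && lh {
			return n, nil
		}
		if errors.Is(err, errTimeout) {
			// Without this eviction, the stale entry makes the outage permanent.
			c.evict(rangeID)
		}
	}
	for _, n := range replicas {
		if lh, err := rpc(n); err == nil && lh {
			c.set(rangeID, n) // remember the new leaseholder
			return n, nil
		}
	}
	return 0, errors.New("range unavailable")
}

func main() {
	const livenessRange = 2
	c := newLeaseholderCache()
	c.set(livenessRange, 1) // all nodes cached n1 as leaseholder before the stall

	rpc := func(node int) (bool, error) {
		if node == 1 {
			return false, errTimeout // n1's disk has stalled; every RPC times out
		}
		return node == 2, nil // the lease has in fact moved to n2
	}

	node, err := send(c, livenessRange, []int{1, 2, 3}, rpc)
	fmt.Println(node, err) // falls back and finds n2
}
```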

To Reproduce

See the above PR (note that the second commit contains a hacky fix; to get a repro, remove it).

Expected behavior

Failover as usual

Additional data / screenshots
Environment:

  • master at the time of writing, and presumably all past versions

Additional context

Persistent total cluster outage, since all nodes failed their heartbeats.
Resolved only when the deadlocked node was brought down.

Jira issue: CRDB-15539

gz#13737

Epic CRDB-19227

gz#19526

Metadata

    Labels

    A-kv-client — Relating to the KV client and the KV interface.
    C-bug — Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    S-2-temp-unavailability — Temp crashes or other availability problems. Can be worked around or resolved by restarting.
    T-kv — KV Team