storage: restarted node in need of snapshots can be wedged for long time #37906

@tbg

Description

repro steps:

  • set up a three node tpcc-2k cluster (any warehouse count will do, but with more data the problem is more obvious)
  • run tpcc with --tolerate-errors and --wait=false
  • kill one node, but leave everything else running
  • wait a few minutes (maybe 10? more if it doesn't repro)

Then restart the dead node and run a SELECT COUNT(*) on the tpcc tables through it.

Chances are the query will hang for minutes before returning. The reason is that none of the replicas will have been moved off that node while it was down (since there are only three nodes in the cluster), but now all of them need Raft snapshots. These snapshots are requested in no particular order. Some of our COUNT(*) requests (or, even worse, liveness requests, etc.) may hit a replica on the local node that still needs a snapshot. The resulting lease request hangs until the snapshot is applied, which can take a long time.
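To get a feel for why the wait can stretch into minutes, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that is not taken from the actual implementation: snapshots are applied strictly one at a time, in uniformly random order, and each takes the same amount of time. The function name `expectedWait` and the numbers in `main` are hypothetical and are not measurements from this cluster.

```go
package main

import (
	"fmt"
	"time"
)

// expectedWait returns the average time until one particular replica's
// snapshot has been applied, under a toy model: n replicas need snapshots,
// they are applied one at a time in uniformly random order, and each takes
// perSnapshot. In a random ordering the replica of interest sits at
// position (n+1)/2 on average.
func expectedWait(n int, perSnapshot time.Duration) time.Duration {
	if n <= 0 {
		return 0
	}
	return time.Duration(n+1) * perSnapshot / 2
}

func main() {
	// Hypothetical inputs: 1000 replicas behind, half a second per snapshot.
	fmt.Println(expectedWait(1000, 500*time.Millisecond))
}
```

Under this model the average wait grows linearly with the number of replicas that fell behind, which matches the observation that the problem is more obvious with more data.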

We should do something here, such as short-circuiting the lease request (returning a blank NotLeaseholderErr) when we can detect with reasonable accuracy that the node is in need of a snapshot. This isn't trivial to detect on a follower (it would be easier on the leader, but alas). We could also give snapshots for system ranges higher priority. That would address a somewhat orthogonal problem: hung requests on a system range can prevent the node from becoming live in the first place, which causes hard-to-diagnose problems.
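A minimal sketch of both ideas follows, to make the proposal concrete. Everything in it is hypothetical: `replicaState`, `needsSnapshot`, `requestLease`, `snapshotQueue`, and `errNotLeaseholder` are illustrative names, not CockroachDB types or APIs. In particular, it assumes the follower has somehow learned the leader's first retained log index, which is exactly the part that is hard to get on a follower.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// replicaState is a hypothetical, simplified view of a follower replica.
type replicaState struct {
	RangeID          int
	SystemRange      bool   // true for ranges holding system data (e.g. liveness)
	LastIndex        uint64 // last entry in the local Raft log
	LeaderFirstIndex uint64 // oldest entry the leader still retains; 0 means unknown
}

// errNotLeaseholder stands in for a blank "not leaseholder" error.
var errNotLeaseholder = errors.New("not leaseholder: replica is awaiting a snapshot")

// needsSnapshot guesses that a snapshot is required when the next entry the
// follower needs (LastIndex+1) has already been truncated on the leader.
// With no information about the leader it conservatively returns false.
func needsSnapshot(r replicaState) bool {
	return r.LeaderFirstIndex != 0 && r.LastIndex+1 < r.LeaderFirstIndex
}

// requestLease fails fast instead of blocking when the replica probably
// needs a snapshot, so the caller can retry on another replica.
func requestLease(r replicaState) error {
	if needsSnapshot(r) {
		return errNotLeaseholder
	}
	// A real implementation would propose the lease request here.
	return nil
}

// snapshotQueue returns the replicas that need snapshots, system ranges
// first, then by RangeID. The input slice is not modified.
func snapshotQueue(rs []replicaState) []replicaState {
	var out []replicaState
	for _, r := range rs {
		if needsSnapshot(r) {
			out = append(out, r)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].SystemRange != out[j].SystemRange {
			return out[i].SystemRange
		}
		return out[i].RangeID < out[j].RangeID
	})
	return out
}

func main() {
	replicas := []replicaState{
		{RangeID: 52, LastIndex: 10, LeaderFirstIndex: 400},
		{RangeID: 2, SystemRange: true, LastIndex: 7, LeaderFirstIndex: 90},
		{RangeID: 40, LastIndex: 120, LeaderFirstIndex: 100}, // can catch up from the log
	}
	for _, r := range snapshotQueue(replicas) {
		fmt.Println(r.RangeID, requestLease(r))
	}
}
```

The fail-fast path trades accuracy for latency: a false positive only costs the client a redirect, whereas a false negative leaves the request hanging as it does today.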

Metadata

Labels

  • A-kv-distribution: Relating to rebalancing and leasing.
  • A-kv-replication: Relating to Raft, consensus, and coordination.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
  • docs-done
  • docs-known-limitation
