storage: restarted node in need of snapshots can be wedged for long time #37906
Description
Repro steps:
- set up a three node tpcc-2k cluster (any warehouse count will do, but with more data the problem is more obvious)
- run tpcc with --tolerate-errors and --wait=false
- kill one node, but leave everything else running
- wait a few minutes (maybe 10? more if it doesn't repro)
- restart the dead node and run a `SELECT COUNT(*)` on the tpcc tables through it
Chances are this will hang for minutes before returning. The reason is that none of the replicas were moved off the node while it was down (with only three nodes in the cluster, there is nowhere to move them), but now all of them need Raft snapshots, and there is no particular order in which those snapshots are requested. Some of our `COUNT(*)` requests (or, even worse, liveness requests) may hit a replica on the local node that is in need of a snapshot. The lease request will then hang until the snapshot is applied, which can take a long time.
We should do something here, like short-circuiting the lease request (with a blank `NotLeaseHolderError`) if we can detect with reasonable accuracy that the replica is in need of a snapshot. This isn't trivial to detect on a follower (it would be easier on the leader, but alas). We could also give snapshots for system ranges higher priority, to address the somewhat orthogonal problem that such hung requests on the system ranges can prevent the node from becoming live in the first place, which causes hard-to-diagnose problems.