liveness: improve disk probes during node liveness updates #81786
Description
When NodeLiveness updates the liveness record (e.g. during heartbeats), it first does a noop sync write to all disks. This ensures that a node with a stalled disk will fail to maintain liveness and lose its leases.
However, this sync write can block indefinitely and does not respect the caller's context, so the caller may stall rather than time out. This in turn can lead to stalls higher up in the stack, in particular with lease acquisitions that do a synchronous heartbeat.
We need to change this process so that the sync write is done in a separate goroutine, which lets the caller return as soon as its context is done. The write operation itself will not (cannot) respect the context, and may thus leak a goroutine. However, concurrent sync writes will coalesce onto an in-flight write.
Additionally, the sync writes should happen in parallel across all disks, since we can now trivially do so. This may be advantageous on nodes with many stores, to avoid spurious heartbeat failures under load.
Jira issue: CRDB-16078