liveness: improve disk probes during node liveness updates 

When NodeLiveness updates the liveness record (e.g. during heartbeats), it first does a noop sync write to all disks. This ensures that a node with a stalled disk will fail to maintain liveness and lose its leases.

However, this sync write can block indefinitely and does not respect the caller's context, so the caller may stall rather than time out. This in turn can lead to stalls higher up in the stack, in particular with lease acquisitions that do a synchronous heartbeat.

We need to change this so that the sync write is done in a separate goroutine, which lets the caller return as soon as its context is cancelled or times out. The write operation itself will not (and cannot) respect the context, and may thus leak a goroutine. However, concurrent sync writes will coalesce onto an in-flight write, so at most one goroutine is stuck on a stalled write at any time.

Additionally, the sync writes should happen in parallel across all disks, since we can now trivially do so. This may be advantageous on nodes with many stores, to avoid spurious heartbeat failures under load.

Jira issue: CRDB-16078

liveness: improve disk probes during node liveness updates #81786
