liveness: improve disk probes during node liveness updates  #81786

@lunevalex

Description

When NodeLiveness updates the liveness record (e.g. during heartbeats), it first does a noop sync write to all disks. This ensures that a node with a stalled disk will fail to maintain liveness and lose its leases.

However, this sync write can block indefinitely and does not respect the caller's context, so the caller may stall rather than time out. This in turn can lead to stalls higher up in the stack, in particular with lease acquisitions that do a synchronous heartbeat.

We need to change this process so that the sync write is done in a separate goroutine, which lets the caller return as soon as its context is done. The write operation itself will not (and cannot) respect the context, and may thus leak a goroutine. However, concurrent sync writes will coalesce onto an in-flight write.

Additionally, the sync writes should happen in parallel across all disks, since we can now trivially do so. This may be advantageous on nodes with many stores, to avoid spurious heartbeat failures under load.

Jira issue: CRDB-16078

Metadata

Labels

C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.