Skip to content

liveness: improve disk probes during node liveness updates#81133

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
erikgrinaker:liveness-write-ctx
May 18, 2022
Merged

liveness: improve disk probes during node liveness updates#81133
craig[bot] merged 1 commit intocockroachdb:masterfrom
erikgrinaker:liveness-write-ctx

Conversation

@erikgrinaker
Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker commented May 8, 2022

When NodeLiveness updates the liveness record (e.g. during
heartbeats), it first does a noop sync write to all disks. This ensures
that a node with a stalled disk will fail to maintain liveness and lose
its leases.

However, this sync write could block indefinitely, and would not respect
the caller's context, which could cause the caller to stall rather than
time out. This in turn could lead to stalls higher up in the stack,
in particular with lease acquisitions that do a synchronous heartbeat.

This patch does the sync write in a separate goroutine in order to
respect the caller's context. The write operation itself will not
(can not) respect the context, and may thus leak a goroutine. However,
concurrent sync writes will coalesce onto an in-flight write.

Additionally, this runs the sync writes in parallel across all disks,
since we can now trivially do so. This may be advantageous on nodes with
many stores, to avoid spurious heartbeat failures under load.

Touches #81100.

Release note (bug fix): Disk write probes during node liveness
heartbeats will no longer get stuck on stalled disks, instead returning
an error once the operation times out. Additionally, disk probes now run
in parallel on nodes with multiple stores.

@erikgrinaker erikgrinaker requested a review from andreimatei May 8, 2022 14:08
@erikgrinaker erikgrinaker requested review from a team as code owners May 8, 2022 14:08
@erikgrinaker erikgrinaker self-assigned this May 8, 2022
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the liveness-write-ctx branch 2 times, most recently from 14e0b50 to fbd8937 Compare May 18, 2022 07:43
When `NodeLiveness` updates the liveness record (e.g. during
heartbeats), it first does a noop sync write to all disks. This ensures
that a node with a stalled disk will fail to maintain liveness and lose
its leases.

However, this sync write could block indefinitely, and would not respect
the caller's context, which could cause the caller to stall rather than
time out. This in turn could lead to stalls higher up in the stack,
in particular with lease acquisitions that do a synchronous heartbeat.

This patch does the sync write in a separate goroutine in order to
respect the caller's context. The write operation itself will not
(can not) respect the context, and may thus leak a goroutine. However,
concurrent sync writes will coalesce onto an in-flight write.

Additionally, this runs the sync writes in parallel across all disks,
since we can now trivially do so. This may be advantageous on nodes with
many stores, to avoid spurious heartbeat failures under load.

Release note (bug fix): Disk write probes during node liveness
heartbeats will no longer get stuck on stalled disks, instead returning
an error once the operation times out. Additionally, disk probes now run
in parallel on nodes with multiple stores.
@erikgrinaker
Copy link
Copy Markdown
Contributor Author

CI failures are #81437.

bors r=tbg

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 18, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 18, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 18, 2022

Build succeeded:

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 18, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 6568292 to blathers/backport-release-21.2-81133: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

liveness: improve disk probes during node liveness updates

3 participants