liveness: improve disk probes during node liveness updates by erikgrinaker · Pull Request #81133 · cockroachdb/cockroach

erikgrinaker · 2022-05-08T14:08:29Z

When NodeLiveness updates the liveness record (e.g. during
heartbeats), it first does a noop sync write to all disks. This ensures
that a node with a stalled disk will fail to maintain liveness and lose
its leases.

However, this sync write could block indefinitely, and would not respect
the caller's context, which could cause the caller to stall rather than
time out. This in turn could lead to stalls higher up in the stack,
in particular with lease acquisitions that do a synchronous heartbeat.

This patch does the sync write in a separate goroutine in order to
respect the caller's context. The write operation itself will not
(cannot) respect the context, and may thus leak a goroutine. However,
concurrent sync writes will coalesce onto an in-flight write.

Additionally, this runs the sync writes in parallel across all disks,
since we can now trivially do so. This may be advantageous on nodes with
many stores, to avoid spurious heartbeat failures under load.

Touches #81100.

Release note (bug fix): Disk write probes during node liveness
heartbeats will no longer get stuck on stalled disks, instead returning
an error once the operation times out. Additionally, disk probes now run
in parallel on nodes with multiple stores.

cockroach-teamcity · 2022-05-08T14:08:40Z


erikgrinaker · 2022-05-18T13:40:22Z

CI failures are #81437.

bors r=tbg

craig · 2022-05-18T15:05:20Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2022-05-18T17:54:15Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2022-05-18T20:49:32Z

Build succeeded:

GitHub CI (Cockroach)

blathers-crl · 2022-05-18T20:49:51Z

Encountered an error creating backports. Some common things that can go wrong:

- The backport branch might have already existed.
- There was a merge conflict.
- The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 6568292 to blathers/backport-release-21.2-81133: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

erikgrinaker requested a review from andreimatei May 8, 2022 14:08

erikgrinaker requested review from a team as code owners May 8, 2022 14:08

erikgrinaker self-assigned this May 8, 2022

erikgrinaker added backport-21.2.x labels May 8, 2022

erikgrinaker mentioned this pull request May 9, 2022

kv/liveness: perform sync WAL write to each store in parallel #81147

Closed

tbg approved these changes May 11, 2022

View reviewed changes

erikgrinaker force-pushed the liveness-write-ctx branch 2 times, most recently from 14e0b50 to fbd8937 Compare May 18, 2022 07:43

erikgrinaker force-pushed the liveness-write-ctx branch from fbd8937 to 6568292 Compare May 18, 2022 08:17

craig bot merged commit b948f59 into cockroachdb:master May 18, 2022

blathers-crl bot mentioned this pull request May 18, 2022

release-22.1: liveness: improve disk probes during node liveness updates #81476

Merged

This was referenced May 19, 2022

kvserver: disk stall prevents lease transfer #81100

Closed

release-21.2: liveness: improve disk probes during node liveness updates #81514

Merged

kv/kvserver: TestReplicaDrainLease failed #81511

Closed

nicktrav mentioned this pull request May 23, 2022

record: ensure block.written is accessed atomically cockroachdb/pebble#1719

Closed

lunevalex linked an issue May 24, 2022 that may be closed by this pull request

liveness: improve disk probes during node liveness updates #81786

Closed

nicktrav mentioned this pull request May 26, 2022

db: hold commit pipeline mutex when closing DB cockroachdb/pebble#1728

Closed

erikgrinaker deleted the liveness-write-ctx branch May 28, 2022 17:10

craig[bot] merged 1 commit into cockroachdb:master from erikgrinaker:liveness-write-ctx