kvserver: don't reject raft snapshots on draining nodes #77246
Conversation
Force-pushed d57a903 to c7905fd
Force-pushed c7905fd to d0d0211
Force-pushed b064770 to b3d9f0d
nvb
left a comment
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/store_snapshot.go, line 601 at r1 (raw file):
```go
ctx context.Context, header *kvserverpb.SnapshotRequest_Header, stream incomingSnapshotStream,
) error {
	// Draining nodes will generally not be rebalanced to (see the filtering that
```
Should we unify this with the shouldDeclineSnapshot logic we plan to add in #73720? In other words, should we add the method in the first PR we land and add a new case to the method in the second one?
pkg/kv/kvserver/store_snapshot.go, line 614 at r1 (raw file):
```go
// Ensure that if any new snapshot types are ever added, their behavior on
// draining receivers will need to be explicitly defined.
log.Fatalf(ctx, "unrecognized snapshot type: %v", t)
```
This feels like a decision we could come to regret. It means that it will take at least an entire release to migrate in a new form of snapshot. We can't predict what a new form of snapshot will look like or what policy we'll want to apply to it, but this feels like it can only hurt. Worse, since this is guarded by IsDraining we might not even catch this until it crashes a customer's cluster.
Given that rejecting rebalance snapshots on a draining node is an optimization but not doing so for raft snapshots is a bug, I think we should allow the snapshot through unless you see a reason why it would be better to reject.
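The policy being suggested here can be sketched as a small decision function. This is an illustrative sketch, not the actual kvserver code: `SnapshotType` and `shouldAcceptSnapshot` are hypothetical stand-ins for the `kvserverpb` identifiers, showing a draining receiver that declines only rebalance snapshots and defaults to accepting anything else (including future snapshot types) instead of crashing.

```go
package main

import "fmt"

// SnapshotType mirrors the kinds of snapshots discussed above; the names
// here are illustrative, not the exact kvserverpb identifiers.
type SnapshotType int

const (
	SnapshotTypeRaft SnapshotType = iota
	SnapshotTypeRebalance
)

// shouldAcceptSnapshot is a hypothetical sketch of the policy suggested in
// the review: a draining receiver declines rebalance snapshots (an
// optimization) but always lets Raft snapshots through (rejecting those
// would be a bug), and unknown future types default to "accept" rather
// than to a log.Fatalf that could crash a customer's cluster.
func shouldAcceptSnapshot(draining bool, t SnapshotType) bool {
	if !draining {
		return true
	}
	// Only the rebalance case is declined; everything else, including
	// snapshot types added in a future release, is allowed through.
	return t != SnapshotTypeRebalance
}

func main() {
	fmt.Println(shouldAcceptSnapshot(true, SnapshotTypeRaft))      // true: draining nodes still accept Raft snapshots
	fmt.Println(shouldAcceptSnapshot(true, SnapshotTypeRebalance)) // false: rebalance snapshots are declined
}
```

The "default to accept" branch is the point of the comment above: an unknown type is treated like any non-rebalance snapshot, so a new snapshot form does not need a release-long migration before it can reach draining nodes.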
Force-pushed b3d9f0d to 858cefc
aayushshah15
left a comment
TFTR
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
pkg/kv/kvserver/store_snapshot.go, line 601 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we unify this with the `shouldDeclineSnapshot` logic we plan to add in #73720? In other words, should we add the method in the first PR we land and add a new case to the method in the second one?
Yep, I'll rebase #73720 over master and just do it in that PR. This is because we're definitely backporting this change, but I'm not sure we're as emphatic about backporting #73720.
pkg/kv/kvserver/store_snapshot.go, line 614 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This feels like a decision we could come to regret. It means that it will take at least an entire release to migrate in a new form of snapshot. We can't predict what a new form of snapshot will look like or what policy we'll want to apply to it, but this feels like it can only hurt. Worse, since this is guarded by `IsDraining`, we might not even catch this until it crashes a customer's cluster.

Given that rejecting rebalance snapshots on a draining node is an optimization but not doing so for raft snapshots is a bug, I think we should allow the snapshot through unless you see a reason why it would be better to reject.
I hadn't considered the ease of introducing a new snapshot type and what you're saying makes sense. Done.
bors r-

Canceled.
Force-pushed e802064 to 9952b3f
aayushshah15
left a comment
I've made a small change to this patch. I'm now keying off the snapshot's Priority instead of its Type. The difference is that we now also allow rebalancing snapshots that are sent for recovery purposes, in addition to Raft snapshots. See my TODO. I'll wait until you take another look at this, @nvanbenschoten.
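The priority-based check described above can be sketched as follows. This is a hedged illustration, not the patch itself: `SnapshotPriority`, `PriorityRecovery`, and `acceptOnDrainingNode` are stand-in names for the `kvserverpb.SnapshotRequest_Priority` split the comment refers to.

```go
package main

import "fmt"

// SnapshotPriority stands in for kvserverpb.SnapshotRequest_Priority; the
// two values below mirror the recovery/rebalance split described above.
type SnapshotPriority int

const (
	PriorityRecovery SnapshotPriority = iota
	PriorityRebalance
)

// acceptOnDrainingNode sketches why keying off priority is broader than
// keying off type: any recovery-priority snapshot is admitted, which
// covers Raft snapshots as well as rebalancing snapshots sent for
// recovery purposes, while purely rebalancing traffic is still declined.
func acceptOnDrainingNode(p SnapshotPriority) bool {
	return p == PriorityRecovery
}

func main() {
	fmt.Println(acceptOnDrainingNode(PriorityRecovery))  // true
	fmt.Println(acceptOnDrainingNode(PriorityRebalance)) // false
}
```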
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
Force-pushed 9952b3f to 40fb915
Previously, draining nodes were incorrectly rejecting all snapshots -- including Raft snapshots. This meant that the replicas on those draining nodes that needed Raft snapshots to catch up would never be able to do so. This could've led to tacit unavailability where, even in cases where all the replicas are live, if a majority is on draining nodes, the range would be stalled.

Discovered in https://github.com/cockroachlabs/support/issues/1459

Release justification: bug fix

Release note (bug fix): Previously, draining nodes in a cluster without shutting them down could stall foreground traffic in the cluster. This patch fixes this bug.
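The stall described in the commit message is a simple quorum-counting failure, which can be sketched with a hypothetical 3-replica range (the numbers here are illustrative, not from the linked support issue):

```go
package main

import "fmt"

// quorum returns how many replicas must be able to participate for a
// range to make progress: a standard majority quorum.
func quorum(replicas int) int { return replicas/2 + 1 }

func main() {
	// Hypothetical 3-replica range: two replicas sit on draining nodes
	// and have fallen behind, waiting on Raft snapshots that the draining
	// nodes incorrectly reject. All three replicas are live, yet only one
	// can participate, so the range stalls.
	const replicas = 3
	const caughtUp = 1
	fmt.Println("quorum needed:", quorum(replicas)) // 2
	fmt.Println("range stalled:", caughtUp < quorum(replicas))
}
```

This is why the bug is "tacit" unavailability: liveness checks see every node as up, but the range cannot assemble a majority of caught-up replicas until the draining nodes accept Raft snapshots again.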
Force-pushed 40fb915 to f9a8602
nvb
left a comment
Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
bors r+

Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating backport branch refs/heads/blathers/backport-release-21.1-77246: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []
Backport to branch 21.1.x failed. See errors above.

error creating backport branch refs/heads/blathers/backport-release-21.2-77246: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []
Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.