kvserver: don't reject raft snapshots on draining nodes #77246
Conversation
Force-pushed d57a903 to c7905fd
Force-pushed c7905fd to d0d0211
Force-pushed b064770 to b3d9f0d
nvb
left a comment
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/store_snapshot.go, line 601 at r1 (raw file):
```go
ctx context.Context, header *kvserverpb.SnapshotRequest_Header, stream incomingSnapshotStream,
) error {
	// Draining nodes will generally not be rebalanced to (see the filtering that
```
Should we unify this with the shouldDeclineSnapshot logic we plan to add in #73720? In other words, should we add the method in the first PR we land and add a new case to the method in the second one?
pkg/kv/kvserver/store_snapshot.go, line 614 at r1 (raw file):
```go
// Ensure that if any new snapshot types are ever added, their behavior on
// draining receivers will need to be explicitly defined.
log.Fatalf(ctx, "unrecognized snapshot type: %v", t)
```
This feels like a decision we could come to regret. It means that it will take at least an entire release to migrate in a new form of snapshot. We can't predict what a new form of snapshot will look like or what policy we'll want to apply to it, but this feels like it can only hurt. Worse, since this is guarded by IsDraining we might not even catch this until it crashes a customer's cluster.
Given that rejecting rebalance snapshots on a draining node is an optimization but not doing so for raft snapshots is a bug, I think we should allow the snapshot through unless you see a reason why it would be better to reject.
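The policy being suggested here can be sketched as a small decision function. This is an illustrative sketch, not the actual kvserver code: `SnapshotType` and `shouldAcceptSnapshot` are hypothetical stand-ins for the `kvserverpb` identifiers, showing a draining receiver that declines only rebalance snapshots and defaults to accepting anything else (including future snapshot types) instead of crashing.

```go
package main

import "fmt"

// SnapshotType mirrors the kinds of snapshots discussed above; the names
// here are illustrative, not the exact kvserverpb identifiers.
type SnapshotType int

const (
	SnapshotTypeRaft SnapshotType = iota
	SnapshotTypeRebalance
)

// shouldAcceptSnapshot is a hypothetical sketch of the policy suggested in
// the review: a draining receiver declines rebalance snapshots (an
// optimization) but always lets Raft snapshots through (rejecting those
// would be a bug), and unknown future types default to "accept" rather
// than to a log.Fatalf that could crash a customer's cluster.
func shouldAcceptSnapshot(draining bool, t SnapshotType) bool {
	if !draining {
		return true
	}
	// Only the rebalance case is declined; everything else, including
	// snapshot types added in a future release, is allowed through.
	return t != SnapshotTypeRebalance
}

func main() {
	fmt.Println(shouldAcceptSnapshot(true, SnapshotTypeRaft))      // true: draining nodes still accept Raft snapshots
	fmt.Println(shouldAcceptSnapshot(true, SnapshotTypeRebalance)) // false: rebalance snapshots are declined
}
```

The "default to accept" branch is the point of the comment above: an unknown type is treated like any non-rebalance snapshot, so a new snapshot form does not need a release-long migration before it can reach draining nodes.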
Force-pushed b3d9f0d to 858cefc
aayushshah15
left a comment
TFTR
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
pkg/kv/kvserver/store_snapshot.go, line 601 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we unify this with the `shouldDeclineSnapshot` logic we plan to add in #73720? In other words, should we add the method in the first PR we land and add a new case to the method in the second one?
Yep, I'll rebase #73720 over master and just do it in that PR. This is because we're definitely backporting this change, but I'm not sure we're as emphatic about backporting #73720.
pkg/kv/kvserver/store_snapshot.go, line 614 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This feels like a decision we could come to regret. It means that it will take at least an entire release to migrate in a new form of snapshot. We can't predict what a new form of snapshot will look like or what policy we'll want to apply to it, but this feels like it can only hurt. Worse, since this is guarded by `IsDraining`, we might not even catch this until it crashes a customer's cluster.

Given that rejecting rebalance snapshots on a draining node is an optimization but not doing so for raft snapshots is a bug, I think we should allow the snapshot through unless you see a reason why it would be better to reject.
I hadn't considered the ease of introducing a new snapshot type and what you're saying makes sense. Done.
bors r-

Canceled.
Force-pushed e802064 to 9952b3f
aayushshah15
left a comment
I've made a small change to this patch. I'm now keying off the snapshot's Priority instead of its Type. The difference is that we now also allow rebalancing snapshots that are sent for recovery purposes, in addition to Raft snapshots. See my TODO. I'll wait until you take another look at this, @nvanbenschoten.
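The priority-based check described above can be sketched as follows. This is a hedged illustration, not the patch itself: `SnapshotPriority`, `PriorityRecovery`, and `acceptOnDrainingNode` are stand-in names for the `kvserverpb.SnapshotRequest_Priority` split the comment refers to.

```go
package main

import "fmt"

// SnapshotPriority stands in for kvserverpb.SnapshotRequest_Priority; the
// two values below mirror the recovery/rebalance split described above.
type SnapshotPriority int

const (
	PriorityRecovery SnapshotPriority = iota
	PriorityRebalance
)

// acceptOnDrainingNode sketches why keying off priority is broader than
// keying off type: any recovery-priority snapshot is admitted, which
// covers Raft snapshots as well as rebalancing snapshots sent for
// recovery purposes, while purely rebalancing traffic is still declined.
func acceptOnDrainingNode(p SnapshotPriority) bool {
	return p == PriorityRecovery
}

func main() {
	fmt.Println(acceptOnDrainingNode(PriorityRecovery))  // true
	fmt.Println(acceptOnDrainingNode(PriorityRebalance)) // false
}
```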
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
Force-pushed 9952b3f to 40fb915
Previously, draining nodes were incorrectly rejecting all snapshots -- including Raft snapshots. This meant that the replicas on those draining nodes that needed Raft snapshots to catch up would never be able to do so. This could've led to tacit unavailability where, even in cases where all the replicas are live, if a majority is on draining nodes, the range would be stalled.

Discovered in https://github.com/cockroachlabs/support/issues/1459

Release justification: bug fix

Release note (bug fix): Previously, draining nodes in a cluster without shutting them down could stall foreground traffic in the cluster. This patch fixes this bug.
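The stall described in the commit message is a simple quorum-counting failure, which can be sketched with a hypothetical 3-replica range (the numbers here are illustrative, not from the linked support issue):

```go
package main

import "fmt"

// quorum returns how many replicas must be able to participate for a
// range to make progress: a standard majority quorum.
func quorum(replicas int) int { return replicas/2 + 1 }

func main() {
	// Hypothetical 3-replica range: two replicas sit on draining nodes
	// and have fallen behind, waiting on Raft snapshots that the draining
	// nodes incorrectly reject. All three replicas are live, yet only one
	// can participate, so the range stalls.
	const replicas = 3
	const caughtUp = 1
	fmt.Println("quorum needed:", quorum(replicas)) // 2
	fmt.Println("range stalled:", caughtUp < quorum(replicas))
}
```

This is why the bug is "tacit" unavailability: liveness checks see every node as up, but the range cannot assemble a majority of caught-up replicas until the draining nodes accept Raft snapshots again.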
Force-pushed 40fb915 to f9a8602
nvb
left a comment
Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
bors r+

Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating backport branch refs/heads/blathers/backport-release-21.1-77246: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []
Backport to branch 21.1.x failed. See errors above.

error creating backport branch refs/heads/blathers/backport-release-21.2-77246: POST https://api.github.com/repos/cockroachlabs/cockroach/git/refs: 403 Resource not accessible by integration []
Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.