kvserver: decline rebalance snapshots on receivers with poor LSM #73720
aayushshah15 wants to merge 3 commits into cockroachdb:master
Conversation
@sumeerbhola, I noticed that we already have somewhat analogous cluster settings for admission control. However, I am not sure that we want to tie the admission control behavior and this rebalancing behavior to the same set of settings. Do you have any thoughts on that?
Force-pushed 84925cd to 98116f3
sumeerbhola
left a comment
I have a few questions on the approach:
- How much work has been done on the source of the snapshot before it receives a decline? It looks like the Pebble snapshot has already been taken, yes? Has much data been read using the iterator?
- If every node in the cluster is considered unhealthy and a node (or nodes) is being shut down, does this mean we could have a loss of quorum? That is, would it be better for the allocator to decide to rebalance to the least unhealthy node, instead of the target unilaterally deciding to reject?
Regarding the kv.snapshot_decline.read_amp_threshold of 1000, at a high-level I think admission control in these cases is a blunt hammer due to the very narrow scope of actions it can take (queue reordering, and delaying). So higher level logic that has more context should kick in before admission control thresholds are reached. That would apply to rebalancing and to #57247.
One possibility would be to make this 1000 read-amp cluster setting be a stop-gap until the allocator chooses the least unhealthy node if all are unhealthy (say with an unhealthy read amp threshold of 15), since I can see why lowering the current setting with this unilateral reject behavior could be risky.
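The alternative sketched above (the allocator falling back to the least unhealthy candidate instead of stalling) could look something like the following. This is an illustrative sketch only; `storeHealth`, `pickRebalanceTarget`, and the threshold of 15 are stand-ins from this discussion, not CockroachDB allocator code.

```go
package main

import (
	"fmt"
	"sort"
)

// storeHealth is a hypothetical summary of a candidate store's LSM health,
// as it might be advertised via gossip.
type storeHealth struct {
	storeID int
	readAmp int
}

// unhealthyReadAmpThreshold is the lower, "unhealthy" bar floated in review,
// distinct from the 1000 hard-decline threshold.
const unhealthyReadAmpThreshold = 15

// pickRebalanceTarget prefers the candidate with the lowest read
// amplification; when every candidate is unhealthy, it still returns the
// least unhealthy one rather than stalling rebalancing entirely.
func pickRebalanceTarget(candidates []storeHealth) (storeHealth, bool) {
	if len(candidates) == 0 {
		return storeHealth{}, false
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].readAmp < candidates[j].readAmp
	})
	return candidates[0], true
}

func main() {
	// All candidates exceed the unhealthy threshold; store 2 is still chosen.
	target, ok := pickRebalanceTarget([]storeHealth{
		{storeID: 1, readAmp: 40},
		{storeID: 2, readAmp: 22},
		{storeID: 3, readAmp: 90},
	})
	fmt.Println(ok, target.storeID, target.readAmp > unhealthyReadAmpThreshold)
}
```

The open question from the comment above remains how such a health signal would rank against the allocator's existing signals.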
Reviewed 1 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
I think this is going to need a version gate, since older nodes won't understand this response and will error out.
Thanks for taking a look!
We only start reading once the receiver store accepts it by acquiring a "reservation" for this snapshot (adhering to a concurrency limit on the receiver side); see pkg/kv/kvserver/store_snapshot.go, lines 943 to 954 at 98116f3.
We wouldn't have a loss of quorum, no. If every node in the system was unhealthy, we would effectively stall all replica rebalancing in the cluster until at least one node isn't.
This approach would require every store to advertise its LSM health (probably via gossip) to every other store. The ambiguous bit here is where this LSM health attribute would rank relative to the other signals the allocator uses to pick and rank candidate stores for replica rebalancing. We could explore that approach, but it wouldn't be backportable.
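The reservation mechanism mentioned a couple of comments up (the sender only starts reading data once the receiver grants a reservation, subject to a receiver-side concurrency limit) can be sketched as a simple semaphore. The names here are illustrative, not the actual `store_snapshot.go` types.

```go
package main

import "fmt"

// reservations models a receiver-side cap on in-flight snapshot applications.
// maxInFlight is an assumed knob; the real store uses its own limits.
type reservations struct {
	sem chan struct{}
}

func newReservations(maxInFlight int) *reservations {
	return &reservations{sem: make(chan struct{}, maxInFlight)}
}

// tryReserve attempts to take a slot without blocking. On success it returns
// a release function and true; the sender would only begin iterating and
// streaming snapshot data after this succeeds.
func (r *reservations) tryReserve() (func(), bool) {
	select {
	case r.sem <- struct{}{}:
		return func() { <-r.sem }, true
	default:
		return nil, false
	}
}

func main() {
	r := newReservations(1)
	release, ok := r.tryReserve()
	fmt.Println(ok) // first reservation is granted
	_, ok2 := r.tryReserve()
	fmt.Println(ok2) // second is refused while the first is held
	release()
	_, ok3 := r.tryReserve()
	fmt.Println(ok3) // capacity is available again after release
}
```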
I did consider that, but I don't think we need it. Older stores that do not understand this response will still throttle the target store, albeit only for 5 seconds. So I don't think the version gate gets us anything. Furthermore, we wouldn't be able to backport this patch if we had a version gate.
erikgrinaker
left a comment
If every node in the cluster is considered unhealthy and a node(s) is being shutdown, does this mean we could have a loss of quorum? That is, would it be better for the allocator to decide to rebalance to the least unhealthy node, instead of the target unilaterally deciding to reject.
We wouldn't have a loss of quorum, no. If every node in the system was unhealthy, we would effectively stall all replica rebalancing in the cluster until at least one node isn't.
Yeah, I think the point to note here is that this only applies to snapshots with priority REBALANCE, not RECOVERY. So if we lose a node while all nodes are unhealthy, we'll still upreplicate the range.
I did consider that, but I don't think we need it. Older stores that do not understand this response will still throttle the target store, albeit only for 5 seconds.
Yeah -- mixed-version clusters will get log messages about invalid responses, though. But maybe that's fine.
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/store_pool.go, line 59 at r1 (raw file):
```go
var DeclinedSnapshotTimeout = settings.RegisterDurationSetting(
	"server.declined_snapshot_timeout",
	"the amount of time to consider the store throttled for up-replication after a declined rebalance snapshot",
```
This only applies to rebalancing, not up-replication of under-replicated ranges, right? I wouldn't use the term "up-replication" here then.
pkg/kv/kvserver/store_snapshot.go, line 712 at r1 (raw file):
```go
return stream.Send(&SnapshotResponse{
	Status:  SnapshotResponse_DECLINED,
	Message: "snapshot decline due to poor LSM health",
```
nit: "declined"
Force-pushed 98116f3 to 6382d4a
aayushshah15
left a comment
Thanks for taking a look, Erik.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/store_pool.go, line 59 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This only applies to rebalancing, not up-replication of under-replicated ranges, right? I wouldn't use the term "up-replication" here then.
done.
pkg/kv/kvserver/store_snapshot.go, line 712 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: "declined"
done.
sumeerbhola
left a comment
(for this backportable approach, and answers to my previous questions)
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
Bump: we saw #73714 in a recent escalation. Are we planning to get this in, and backport it?
@nvanbenschoten pointed out that this approach is less than desirable, since we do this snapshot handshake after creating uninitialized learner replicas. This means that if the snapshot is rejected, those learners have to be rolled back, which is best-effort. So the concern here was that we're risking potentially leaking an uninitialized replica, which would later have to be cleaned up by the replicaGC queue. As an alternative, we've been considering #74254. In that escalation, was it just that we suspected #73714 was making things worse, or did we notice something more direct?
It was not the direct cause of the badness, nor did it make the problem terribly worse. We observed a few nodes with high read amplification and misshapen LSMs; we had to drain their leases away to give them breathing room for compactions and reduce foreground impact, and then compact the nodes offline. In the midst of all this, we observed that replicas were occasionally being rebalanced onto the malperforming nodes.
@aayushshah15 I think these are complementary. Gossiping more info about node health will help avoid trying to rebalance to unhealthy nodes. But we can't expect gossip to be instantaneous or perfect (even with centralized monitoring, there would be some delay), so your PR here is valuable. I suggest completing this work and merging it.
nvanbenschoten
left a comment
I agree that we would benefit from both approaches. However, as Aayush pointed out, there is risk in landing this first and pushing a release where this is the only change we have. We have seen in recent cluster outages that an excess of uninitialized replicas can be destabilizing. The concern with this change on its own is that a rejected snapshot leaks an uninitialized replica (a small amount of on-disk and in-memory metadata, but with a non-negligible cost, as seen in https://github.com/cockroachlabs/support/issues/1340), and we do nothing to eventually stem the tide of incoming snapshot requests. So we could spin on rebalance attempts indefinitely and amass a very large number of uninitialized replicas.
That said, there are mitigating factors. For one, we have recently made changes that reduce the cost of uninitialized replicas. This PR as implemented does also try to push back on senders, throttling an individual sender for 60 seconds after it observes one of these rejections. So that places an upper bound on the accumulation of uninitialized replicas of (num_nodes - 1) replicas per minute. I'd consider bumping server.declined_snapshot_timeout even higher, maybe to 5 minutes.
Between those two factors and the fact that this all only kicks in after read amp 1000 (i.e. the node is already in a very bad state), I think this change will do more good than bad. So I agree that we should proceed with it.
Reviewed 3 of 5 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @erikgrinaker)
pkg/kv/kvserver/raft.proto, line 210 at r2 (raw file):
```proto
APPLIED = 2;
ERROR = 3;
DECLINED = 5;
```
A lot of this looks like a revert of 2a1857f. Should we revert that revert and then adjust it with some of the policy changes you've made here? That might help avoid some of the compatibility concerns, because that hasn't made it into a release yet. It also comes with some additional testing which would be valuable here.
pkg/kv/kvserver/store_snapshot.go, line 610 at r2 (raw file):
```go
if s.shouldDeclineSnapshot(header) {
	return declineSnapshot(stream)
}
```
Do we have any testing that hits this code path?
pkg/kv/kvserver/store_snapshot.go, line 715 at r2 (raw file):
```go
&SnapshotResponse{
	Status:  SnapshotResponse_DECLINED,
	Message: "snapshot declined due to read amplification higher than kv.snapshot_decline.read_amp_threshold",
```
There could be other reasons to decline snapshots in the future. For instance, we may decline due to CPU saturation. So we shouldn't be coupling the snapshot status with the reason for the snapshot status. Consider returning this message from shouldDeclineSnapshot so we can add other policies later on.
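The decoupling suggested here (the decision function returning the reason alongside the verdict, so new decline policies can be added without new response statuses) could look roughly like this. The types, field names, and thresholds are simplified stand-ins for the real `kvserver` code, and the RECOVERY carve-out reflects the behavior discussed earlier in this thread.

```go
package main

import "fmt"

// snapshotHeader is a simplified stand-in for the real snapshot request
// header; only the priority matters for this sketch.
type snapshotHeader struct {
	priority string // "REBALANCE" or "RECOVERY"
}

// store is a stand-in carrying the receiver's current LSM health and the
// configured decline threshold.
type store struct {
	readAmp          int
	readAmpThreshold int
}

// shouldDeclineSnapshot returns the decision together with a human-readable
// reason, so future policies (e.g. CPU saturation) can reuse the single
// DECLINED status with their own message.
func (s *store) shouldDeclineSnapshot(h snapshotHeader) (bool, string) {
	if h.priority != "REBALANCE" {
		// Recovery snapshots are never declined; declining them could
		// block up-replication of under-replicated ranges.
		return false, ""
	}
	if s.readAmp > s.readAmpThreshold {
		return true, fmt.Sprintf(
			"read amplification %d exceeds threshold %d", s.readAmp, s.readAmpThreshold)
	}
	return false, ""
}

func main() {
	s := &store{readAmp: 1200, readAmpThreshold: 1000}
	decline, reason := s.shouldDeclineSnapshot(snapshotHeader{priority: "REBALANCE"})
	fmt.Println(decline, reason)
	decline, _ = s.shouldDeclineSnapshot(snapshotHeader{priority: "RECOVERY"})
	fmt.Println(decline)
}
```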
pkg/kv/kvserver/store_snapshot.go, line 965 at r2 (raw file):
```go
	to, snap, resp.Message)
case SnapshotResponse_DECLINED:
	// The snapshot was declined by the receiver due to poor LSM health.
```
nit: Same point about coupling the status and the reason.
Force-pushed 6382d4a to 288a9a8
Previously, draining nodes were incorrectly rejecting all snapshots -- including Raft snapshots. This meant that replicas on those draining nodes that needed Raft snapshots to catch up would never be able to do so. This could have led to tacit unavailability where, even with all replicas live, a range with a majority of its replicas on draining nodes would be stalled.

Discovered in https://github.com/cockroachlabs/support/issues/1459

Release justification: bug fix
Release note (bug fix): Previously, draining nodes in a cluster without shutting them down could stall foreground traffic in the cluster. This patch fixes this bug.
This reverts commit 2a1857f.

Release note: None
Release justification:
Force-pushed f29ae35 to 824f057
Force-pushed 6ff44a2 to 5866344
aayushshah15
left a comment
Sorry for letting this PR sit for so long. It's RFAL now.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/store_snapshot.go, line 610 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we have any testing that hits this code path?
Added.
pkg/kv/kvserver/store_snapshot.go, line 715 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
There could be other reasons to decline snapshots in the future. For instance, we may decline due to CPU saturation. So we shouldn't be coupling the snapshot status with the reason for the snapshot status. Consider returning this message from `shouldDeclineSnapshot` so we can add other policies later on.
Done.
pkg/kv/kvserver/store_snapshot.go, line 965 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: Same point about coupling the status and the reason.
Done.
pkg/kv/kvserver/raft.proto, line 210 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
A lot of this looks like a revert of 2a1857f. Should we revert that revert and then adjust it with some of the policy changes you've made here? That might help avoid some of the compatibility concerns, because that hasn't made it into a release yet. It also comes with some additional testing which would be valuable here.
Done. Note that 2a1857f removed a check to reject snapshots on stores with full disks. That seemed unintentional to me, so I've reintroduced it.
Force-pushed 6579ed2 to f582cf0
This commit introduces two new cluster settings:

```
kv.snapshot_decline.read_amp_threshold
server.declined_snapshot_timeout
```

With this commit, stores with read amplification higher than `kv.snapshot_decline.read_amp_threshold` will decline all `REBALANCE` snapshots. Upon receiving a `DECLINED` response, the senders of these snapshots will consider the receivers throttled for `server.declined_snapshot_timeout`. This means that stores with poor LSM health will not be considered valid candidates for replica rebalancing.

Fixes cockroachdb#73714
Related to cockroachdb#62168

Release note: None
Release justification: This patch adds a tunable guardrail that could prevent or mitigate cluster instability.
Force-pushed f582cf0 to 27631b6
First commit from #77246.
This patch introduces two new cluster settings: `kv.snapshot_decline.read_amp_threshold` and `server.declined_snapshot_timeout`.

With this commit, stores with read amplification higher than `kv.snapshot_decline.read_amp_threshold` will decline all `REBALANCE` snapshots. Upon receiving a `DECLINED` response, the senders of these snapshots will consider these receivers throttled for `server.declined_snapshot_timeout`. This means that stores with poor LSM health will not be considered valid candidates for replica rebalancing.
Fixes #73714
Related to #62168
Release justification: This patch adds a tunable guardrail that could prevent
or mitigate cluster instability.
Release note: None
cc @cockroachdb/storage