
kvserver: decline rebalance snapshots on receivers with poor LSM #73720

Open
aayushshah15 wants to merge 3 commits into cockroachdb:master from
aayushshah15:20211211_declineSnapshotsBasedOnL0FileCount

Conversation

@aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Dec 12, 2021

First commit from #77246.

This patch introduces two new cluster settings:

kv.snapshot_decline.read_amp_threshold

server.declined_snapshot_timeout

With this commit, stores with a read amplification level higher than
kv.snapshot_decline.read_amp_threshold will decline all REBALANCE
snapshots. Upon receiving a DECLINED response, the senders of these snapshots
will consider these receivers throttled for server.declined_snapshot_timeout.

This means that stores with poor LSM health will not be considered as valid
candidates for replica rebalancing.

Fixes #73714
Related to #62168

Release justification: This patch adds a tunable guardrail that could prevent
or mitigate cluster instability.

Release note: None

cc @cockroachdb/storage

@aayushshah15 aayushshah15 requested a review from a team as a code owner December 12, 2021 03:53
@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15
Contributor Author

@sumeerbhola, I noticed that we already have somewhat analogous cluster settings for admission control. However, I am not sure that we want to tie the admission control behavior and this rebalancing behavior onto the same set of settings. Do you have any thoughts on that?

@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch from 84925cd to 98116f3 Compare December 12, 2021 04:16
Collaborator

@sumeerbhola sumeerbhola left a comment


I have a few questions on the approach:

  • How much work has been done on the source of the snapshot before it receives a decline? It looks like the Pebble snapshot has been taken, yes? Has much data been read using the iterator?
  • If every node in the cluster is considered unhealthy and a node (or nodes) is being shut down, does this mean we could have a loss of quorum? That is, would it be better for the allocator to decide to rebalance to the least unhealthy node, instead of the target unilaterally deciding to reject?

Regarding the kv.snapshot_decline.read_amp_threshold of 1000, at a high-level I think admission control in these cases is a blunt hammer due to the very narrow scope of actions it can take (queue reordering, and delaying). So higher level logic that has more context should kick in before admission control thresholds are reached. That would apply to rebalancing and to #57247.

One possibility would be to make this 1000 read-amp cluster setting be a stop-gap until the allocator chooses the least unhealthy node if all are unhealthy (say with an unhealthy read amp threshold of 15), since I can see why lowering the current setting with this unilateral reject behavior could be risky.

Reviewed 1 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@erikgrinaker
Contributor

I think this is going to need a version gate, since older nodes won't understand this response and error out.

@aayushshah15
Contributor Author

Thanks for taking a look!

How much work has been done on the source of the snapshot before it receives a decline? It looks like the Pebble snapshot has been taken, yes? Has much data been read using the iterator?

We only start reading once the receiver store accepts it by acquiring a "reservation" for this snapshot (adhering to a concurrency limit on the receiver side):

```go
if err := stream.Send(&SnapshotRequest{Header: &header}); err != nil {
	return err
}
// Wait until we get a response from the server. The recipient may queue us
// (only a limited number of snapshots are allowed concurrently) or flat-out
// reject the snapshot. After the initial message exchange, we'll go and send
// the actual snapshot (if not rejected).
resp, err := stream.Recv()
if err != nil {
	storePool.throttle(throttleFailed, err.Error(), to.StoreID)
	return err
}
```

If every node in the cluster is considered unhealthy and a node (or nodes) is being shut down, does this mean we could have a loss of quorum? That is, would it be better for the allocator to decide to rebalance to the least unhealthy node, instead of the target unilaterally deciding to reject?

We wouldn't have a loss of quorum, no. If every node in the system was unhealthy, we would effectively stall all replica rebalancing in the cluster until at least one node isn't.

One possibility would be to make this 1000 read-amp cluster setting be a stop-gap until the allocator chooses the least unhealthy node if all are unhealthy (say with an unhealthy read amp threshold of 15), since I can see why lowering the current setting with this unilateral reject behavior could be risky.

This approach would require every store to advertise its LSM health (probably via gossip) to every other store. The ambiguous bit here is about where this LSM health attribute would rank relative to the other signals that the allocator uses to pick and rank candidate stores for replica rebalancing. We could explore that approach, but it wouldn't be backportable.

@aayushshah15
Contributor Author

I think this is going to need a version gate, since older nodes won't understand this response and error out.

I did consider that, but I don't think we need it. Older stores that do not understand this response will still throttle the target store, albeit only for 5 seconds, so I don't think the version gate gets us anything. Furthermore, we wouldn't be able to backport this patch if it had a version gate.

Contributor

@erikgrinaker erikgrinaker left a comment


If every node in the cluster is considered unhealthy and a node (or nodes) is being shut down, does this mean we could have a loss of quorum? That is, would it be better for the allocator to decide to rebalance to the least unhealthy node, instead of the target unilaterally deciding to reject?

We wouldn't have a loss of quorum, no. If every node in the system was unhealthy, we would effectively stall all replica rebalancing in the cluster until at least one node isn't.

Yeah, I think the point to note here is that this only applies to snapshots with priority REBALANCE, not RECOVERY. So if we lose a node while all nodes are unhealthy, we'll still upreplicate the range.

I did consider that, but I don't think we need it. Older stores that do not understand this response will still throttle the target store, albeit only for 5 seconds.

Yeah -- mixed-version clusters will get log messages about invalid responses, though. Maybe that's fine.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/store_pool.go, line 59 at r1 (raw file):

```go
var DeclinedSnapshotTimeout = settings.RegisterDurationSetting(
	"server.declined_snapshot_timeout",
	"the amount of time to consider the store throttled for up-replication after a declined rebalance snapshot",
```

This only applies to rebalancing, not up-replication of under-replicated ranges, right? I wouldn't use the term "up-replication" here then.


pkg/kv/kvserver/store_snapshot.go, line 712 at r1 (raw file):

```go
	return stream.Send(&SnapshotResponse{
		Status:  SnapshotResponse_DECLINED,
		Message: "snapshot decline due to poor LSM health",
```

nit: "declined"

@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch from 98116f3 to 6382d4a Compare December 13, 2021 20:15
Contributor Author

@aayushshah15 aayushshah15 left a comment


Thanks for taking a look, Erik.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvserver/store_pool.go, line 59 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This only applies to rebalancing, not up-replication of under-replicated ranges, right? I wouldn't use the term "up-replication" here then.

done.


pkg/kv/kvserver/store_snapshot.go, line 712 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: "declined"

done.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm: (for this backportable approach, and answers to my previous questions)

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)

@irfansharif
Contributor

Bump, we saw #73714 in a recent escalation. Are we planning to get this in? + backport?

@aayushshah15
Contributor Author

Are we planning to get this in? + backport?

@nvanbenschoten pointed out that this approach is less than desirable since we do this snapshot handshake after creating uninitialized learner replicas. This means that if the snapshot is rejected, those learners have to be rolled back, which is best-effort. So the concern was that we risk leaking uninitialized replicas, which later have to be cleaned up by the replicaGC queue.

As an alternative, we've been considering #74254.

In that escalation, was it just that we suspected #73714 to be making things worse or did we notice something more direct?

@irfansharif
Contributor

In that escalation, was it just that we suspected #73714 to be making things worse or did we notice something more direct?

It was not the direct cause of the badness, nor did it terribly worsen the problem. We observed a few nodes with high read amplification and misshapen LSMs; we had to drain their leases away to give them breathing room for compactions and to reduce foreground impact, and then compact the nodes offline. In the midst of all this, we just observed that replicas were occasionally being rebalanced onto poorly performing nodes.

@shralex
Contributor

shralex commented Jan 8, 2022

@aayushshah15 I think these are complementary. Gossiping more info about node health will help avoid trying to rebalance to unhealthy nodes. But we can't expect gossip to be instantaneous or perfect (even if we had centralized monitoring, there would be some delay), so your PR here is valuable. I suggest completing this work and merging it.

Contributor

@nvb nvb left a comment


I agree that we would benefit from both approaches. However, as Aayush pointed out, there is risk in landing this first and pushing a release where this is the only change we have. We have seen in recent cluster outages that an excess of uninitialized replicas can be destabilizing. The concern with this change on its own is that a rejected snapshot leaks an uninitialized replica (a small amount of on-disk and in-mem metadata, but non-negligible cost in https://github.com/cockroachlabs/support/issues/1340) and we do nothing to eventually stem the tide of incoming snapshot requests. So we could spin on rebalance attempts indefinitely and amass a very large number of uninitialized replicas.

That said, there are mitigating factors. For one, we have recently made changes that reduce the cost of uninitialized replicas. This PR as implemented also tries to push back on senders by throttling an individual sender for 60 seconds after observing one of these rejections. So that places an upper bound on the accumulation of uninitialized replicas at (num_nodes-1) repls / min. I'd consider bumping server.declined_snapshot.timeout even higher, maybe to 5 minutes.

Between those two factors and the fact that this all only kicks in after read amp 1000 (i.e. the node is already in a very bad state), I think this change will do more good than bad. So I agree that we should proceed with it.

Reviewed 3 of 5 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @erikgrinaker)


pkg/kv/kvserver/raft.proto, line 210 at r2 (raw file):

```proto
    APPLIED = 2;
    ERROR = 3;
    DECLINED = 5;
```

A lot of this looks like a revert of 2a1857f. Should we revert that revert and then adjust it with some of the policy changes you've made here? That might help avoid some of the compatibility concerns, because that hasn't made it into a release yet. It also comes with some additional testing which would be valuable here.


pkg/kv/kvserver/store_snapshot.go, line 610 at r2 (raw file):

```go
	if s.shouldDeclineSnapshot(header) {
		return declineSnapshot(stream)
```

Do we have any testing that hits this code path?


pkg/kv/kvserver/store_snapshot.go, line 715 at r2 (raw file):

```go
		&SnapshotResponse{
			Status:  SnapshotResponse_DECLINED,
			Message: "snapshot declined due to read amplification higher than kv.snapshot_decline.read_amp_threshold",
```

There could be other reasons to decline snapshots in the future. For instance, we may decline due to CPU saturation. So we shouldn't be coupling the snapshot status with the reason for the snapshot status. Consider returning this message from shouldDeclineSnapshot so we can add other policies later on.


pkg/kv/kvserver/store_snapshot.go, line 965 at r2 (raw file):

			to, snap, resp.Message)
	case SnapshotResponse_DECLINED:
		// The snapshot was declined by the receiver due to poor LSM health.

nit: Same point about coupling the status and the reason.

@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch from 6382d4a to 288a9a8 Compare March 4, 2022 00:59
Previously, draining nodes were incorrectly rejecting all snapshots --
including Raft snapshots. This meant that the replicas on those draining nodes
that needed Raft snapshots to catch up would never be able to do so. This
could've led to tacit unavailability where, even in cases where all the
replicas are live, if a majority is on draining nodes, the range would be
stalled.

Discovered in https://github.com/cockroachlabs/support/issues/1459

Release justification: bug fix

Release note (bug fix): Previously, draining nodes in a cluster without
shutting them down could stall foreground traffic in the cluster. This patch
fixes this bug.
This reverts commit 2a1857f.

Release note: None

Release justification:
@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch 4 times, most recently from f29ae35 to 824f057 Compare March 7, 2022 20:10
@aayushshah15 aayushshah15 requested a review from a team as a code owner March 7, 2022 20:10
@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch 2 times, most recently from 6ff44a2 to 5866344 Compare March 7, 2022 21:00
Contributor Author

@aayushshah15 aayushshah15 left a comment


Sorry for letting this PR sit for so long. It's RFAL now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/store_snapshot.go, line 610 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we have any testing that hits this code path?

Added.


pkg/kv/kvserver/store_snapshot.go, line 715 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

There could be other reasons to decline snapshots in the future. For instance, we may decline due to CPU saturation. So we shouldn't be coupling the snapshot status with the reason for the snapshot status. Consider returning this message from shouldDeclineSnapshot so we can add other policies later on.

Done.


pkg/kv/kvserver/store_snapshot.go, line 965 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Same point about coupling the status and the reason.

Done.


pkg/kv/kvserver/raft.proto, line 210 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

A lot of this looks like a revert of 2a1857f. Should we revert that revert and then adjust it with some of the policy changes you've made here? That might help avoid some of the compatibility concerns, because that hasn't made it into a release yet. It also comes with some additional testing which would be valuable here.

Done. Note that 2a1857f removed a check to reject snapshots on stores with full disks. That seemed unintentional to me, so I've reintroduced it.

@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 7, 2022
@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch 6 times, most recently from 6579ed2 to f582cf0 Compare March 8, 2022 00:26
This commit introduces two new cluster settings:
```
kv.snapshot_decline.read_amp_threshold

server.declined_snapshot.timeout
```

With this commit, stores with a read amplification level higher than
`kv.snapshot_decline.read_amp_threshold` will decline all `REBALANCE`
snapshots. Upon receiving a `DECLINED` response, the senders of these snapshots
will consider these receivers `throttled` for
`server.declined_snapshot.timeout`.

This means that stores with poor LSM health will not be considered as valid
candidates for replica rebalancing.

Fixes cockroachdb#73714
Related to cockroachdb#62168

Release note: None

Release justification: This patch adds a tunable guardrail that could prevent
or mitigate cluster instability
@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch from f582cf0 to 27631b6 Compare March 8, 2022 00:29
@erikgrinaker
Contributor

I think we should try to get this and/or #77491 in, since we keep seeing instances of high read amp that could plausibly be explained by unrestrained snapshot ingestion. Since this change differentiates between snapshot types, which was a concern in #77491, maybe we should prioritize this?


Labels

T-kv KV Team


Development

Successfully merging this pull request may close these issues.

kvserver: store with high read amplification should not be a target of rebalancing

7 participants