
kvserver: decline rebalance snapshots on receivers with poor LSM #73720

Open
aayushshah15 wants to merge 3 commits into cockroachdb:master from
aayushshah15:20211211_declineSnapshotsBasedOnL0FileCount

Conversation

@aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Dec 12, 2021

First commit from #77246.

This patch introduces two new cluster settings:

kv.snapshot_decline.read_amp_threshold

server.declined_snapshot_timeout

With this commit, stores with a read amplification level higher than
kv.snapshot_decline.read_amp_threshold will decline all REBALANCE
snapshots. Upon receiving a DECLINED response, the senders of these snapshots
will consider these receivers throttled for server.declined_snapshot_timeout.

This means that stores with poor LSM health will not be considered as valid
candidates for replica rebalancing.

Fixes #73714
Related to #62168

Release justification: This patch adds a tunable guardrail that could prevent
or mitigate cluster instability.

Release note: None

cc @cockroachdb/storage

@aayushshah15 aayushshah15 requested a review from a team as a code owner December 12, 2021 03:53
@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15
Contributor Author

@sumeerbhola, I noticed that we already have somewhat analogous cluster settings for admission control. However, I am not sure that we want to tie the admission control behavior and this rebalancing behavior onto the same set of settings. Do you have any thoughts on that?

@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch from 84925cd to 98116f3 Compare December 12, 2021 04:16
Collaborator

@sumeerbhola sumeerbhola left a comment


I have a few questions on the approach:

  • How much work has been done on the source of the snapshot before it receives a decline? It looks like the Pebble snapshot has been taken, yes? Has much data been read using the iterator?
  • If every node in the cluster is considered unhealthy and a node (or nodes) is being shut down, does this mean we could have a loss of quorum? That is, would it be better for the allocator to decide to rebalance to the least unhealthy node, instead of the target unilaterally deciding to reject?

Regarding the kv.snapshot_decline.read_amp_threshold of 1000, at a high-level I think admission control in these cases is a blunt hammer due to the very narrow scope of actions it can take (queue reordering, and delaying). So higher level logic that has more context should kick in before admission control thresholds are reached. That would apply to rebalancing and to #57247.

One possibility would be to make this 1000 read-amp cluster setting be a stop-gap until the allocator chooses the least unhealthy node if all are unhealthy (say with an unhealthy read amp threshold of 15), since I can see why lowering the current setting with this unilateral reject behavior could be risky.

Reviewed 1 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@erikgrinaker
Contributor

I think this is going to need a version gate, since older nodes won't understand this response and error out.

@aayushshah15
Contributor Author

Thanks for taking a look!

How much work has been done on the source of the snapshot before it receives a decline? It looks like the Pebble snapshot has been taken, yes? Has much data been read using the iterator?

We only start reading once the receiver store accepts it by acquiring a "reservation" for this snapshot (adhering to a concurrency limit on the receiver side):

```go
if err := stream.Send(&SnapshotRequest{Header: &header}); err != nil {
	return err
}
// Wait until we get a response from the server. The recipient may queue us
// (only a limited number of snapshots are allowed concurrently) or flat-out
// reject the snapshot. After the initial message exchange, we'll go and send
// the actual snapshot (if not rejected).
resp, err := stream.Recv()
if err != nil {
	storePool.throttle(throttleFailed, err.Error(), to.StoreID)
	return err
}
```

If every node in the cluster is considered unhealthy and a node (or nodes) is being shut down, does this mean we could have a loss of quorum? That is, would it be better for the allocator to decide to rebalance to the least unhealthy node, instead of the target unilaterally deciding to reject?

We wouldn't have a loss of quorum, no. If every node in the system was unhealthy, we would effectively stall all replica rebalancing in the cluster until at least one node isn't.

One possibility would be to make this 1000 read-amp cluster setting be a stop-gap until the allocator chooses the least unhealthy node if all are unhealthy (say with an unhealthy read amp threshold of 15), since I can see why lowering the current setting with this unilateral reject behavior could be risky.

This approach would require every store to advertise its LSM health (probably via gossip) to every other store. The ambiguous bit here is about where this LSM health attribute would rank relative to the other signals that the allocator uses to pick and rank candidate stores for replica rebalancing. We could explore that approach, but it wouldn't be backportable.

@aayushshah15
Contributor Author

I think this is going to need a version gate, since older nodes won't understand this response and error out.

I did consider that, but I don't think we need it. Older stores that do not understand this response will still throttle the target store, albeit only for 5 seconds, so I don't think the version gate gets us anything. Furthermore, we wouldn't be able to backport this patch if it had a version gate.

Contributor

@erikgrinaker erikgrinaker left a comment


If every node in the cluster is considered unhealthy and a node (or nodes) is being shut down, does this mean we could have a loss of quorum? That is, would it be better for the allocator to decide to rebalance to the least unhealthy node, instead of the target unilaterally deciding to reject?

We wouldn't have a loss of quorum, no. If every node in the system was unhealthy, we would effectively stall all replica rebalancing in the cluster until at least one node isn't.

Yeah, I think the point to note here is that this only applies to snapshots with priority REBALANCE, not RECOVERY. So if we lose a node while all nodes are unhealthy, we'll still upreplicate the range.

I did consider that, but I don't think we need it. Older stores that do not understand this response will still throttle the target store, albeit only for 5 seconds.

Yeah -- mixed-version clusters will get log messages about invalid responses, though. Maybe that's fine.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/store_pool.go, line 59 at r1 (raw file):

```go
var DeclinedSnapshotTimeout = settings.RegisterDurationSetting(
	"server.declined_snapshot_timeout",
	"the amount of time to consider the store throttled for up-replication after a declined rebalance snapshot",
```

This only applies to rebalancing, not up-replication of under-replicated ranges, right? I wouldn't use the term "up-replication" here then.


pkg/kv/kvserver/store_snapshot.go, line 712 at r1 (raw file):

```go
	return stream.Send(&SnapshotResponse{
		Status:  SnapshotResponse_DECLINED,
		Message: "snapshot decline due to poor LSM health",
```

nit: "declined"

@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch from 98116f3 to 6382d4a Compare December 13, 2021 20:15
Contributor Author

@aayushshah15 aayushshah15 left a comment


Thanks for taking a look, Erik.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvserver/store_pool.go, line 59 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This only applies to rebalancing, not up-replication of under-replicated ranges, right? I wouldn't use the term "up-replication" here then.

done.


pkg/kv/kvserver/store_snapshot.go, line 712 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: "declined"

done.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm: (for this backportable approach, and answers to my previous questions)

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)

@irfansharif
Contributor

Bump, we saw #73714 in a recent escalation. Are we planning to get this in? + backport?

@aayushshah15
Contributor Author

Are we planning to get this in? + backport?

@nvanbenschoten pointed out that this approach is less than desirable since we do this snapshot handshake after creating uninitialized learner replicas. This means that if the snapshot is rejected, those learners have to be rolled back, which is best-effort. So the concern was that we risk leaking uninitialized replicas, which later have to be cleaned up by the replicaGC queue.

As an alternative, we've been considering #74254.

In that escalation, was it just that we suspected #73714 to be making things worse or did we notice something more direct?

@irfansharif
Contributor

In that escalation, was it just that we suspected #73714 to be making things worse or did we notice something more direct?

It was not the direct cause of the badness, nor did it terribly worsen the problem. We observed a few nodes with high read amplification and misshapen LSMs; we had to drain their leases away to give them breathing room for compactions and to reduce foreground impact, and then compact the nodes offline. In the midst of all this, we just observed that replicas were occasionally being rebalanced onto poorly performing nodes.

@shralex
Contributor

shralex commented Jan 8, 2022

@aayushshah15 I think these are complementary. Gossiping more info about node health will help avoid trying to rebalance to unhealthy nodes. But we can't expect gossip to be instantaneous or perfect (even if we had centralized monitoring, there would be some delay), so your PR here is valuable. I suggest completing this work and merging it.

Contributor

@nvb nvb left a comment


I agree that we would benefit from both approaches. However, as Aayush pointed out, there is risk in landing this first and pushing a release where this is the only change we have. We have seen in recent cluster outages that an excess of uninitialized replicas can be destabilizing. The concern with this change on its own is that a rejected snapshot leaks an uninitialized replica (a small amount of on-disk and in-mem metadata, but non-negligible cost in https://github.com/cockroachlabs/support/issues/1340) and we do nothing to eventually stem the tide of incoming snapshot requests. So we could spin on rebalance attempts indefinitely and amass a very large number of uninitialized replicas.

That said, there are mitigating factors. For one, we have recently made changes that reduce the cost of uninitialized replicas. This PR as implemented also tries to push back on senders by throttling an individual sender for 60 seconds after observing one of these rejections. So that places an upper bound on the accumulation of uninitialized replicas at (num_nodes-1) repls / min. I'd consider bumping server.declined_snapshot.timeout even higher, maybe to 5 minutes.

Between those two factors and the fact that this all only kicks in after read amp 1000 (i.e. the node is already in a very bad state), I think this change will do more good than bad. So I agree that we should proceed with it.

Reviewed 3 of 5 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @erikgrinaker)


pkg/kv/kvserver/raft.proto, line 210 at r2 (raw file):

```proto
    APPLIED = 2;
    ERROR = 3;
    DECLINED = 5;
```

A lot of this looks like a revert of 2a1857f. Should we revert that revert and then adjust it with some of the policy changes you've made here? That might help avoid some of the compatibility concerns, because that hasn't made it into a release yet. It also comes with some additional testing which would be valuable here.


pkg/kv/kvserver/store_snapshot.go, line 610 at r2 (raw file):

```go
	if s.shouldDeclineSnapshot(header) {
		return declineSnapshot(stream)
```

Do we have any testing that hits this code path?


pkg/kv/kvserver/store_snapshot.go, line 715 at r2 (raw file):

```go
		&SnapshotResponse{
			Status:  SnapshotResponse_DECLINED,
			Message: "snapshot declined due to read amplification higher than kv.snapshot_decline.read_amp_threshold",
```

There could be other reasons to decline snapshots in the future. For instance, we may decline due to CPU saturation. So we shouldn't be coupling the snapshot status with the reason for the snapshot status. Consider returning this message from shouldDeclineSnapshot so we can add other policies later on.


pkg/kv/kvserver/store_snapshot.go, line 965 at r2 (raw file):

			to, snap, resp.Message)
	case SnapshotResponse_DECLINED:
		// The snapshot was declined by the receiver due to poor LSM health.

nit: Same point about coupling the status and the reason.

@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch from 6382d4a to 288a9a8 Compare March 4, 2022 00:59
Previously, draining nodes were incorrectly rejecting all snapshots --
including Raft snapshots. This meant that the replicas on those draining nodes
that needed Raft snapshots to catch up would never be able to do so. This
could've led to tacit unavailability where, even in cases where all the
replicas are live, if a majority is on draining nodes, the range would be
stalled.

Discovered in https://github.com/cockroachlabs/support/issues/1459

Release justification: bug fix

Release note (bug fix): Previously, draining nodes in a cluster without
shutting them down could stall foreground traffic in the cluster. This patch
fixes this bug.
This reverts commit 2a1857f.

Release note: None

Release justification:
@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch 4 times, most recently from f29ae35 to 824f057 Compare March 7, 2022 20:10
@aayushshah15 aayushshah15 requested a review from a team as a code owner March 7, 2022 20:10
@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch 2 times, most recently from 6ff44a2 to 5866344 Compare March 7, 2022 21:00
Contributor Author

@aayushshah15 aayushshah15 left a comment


Sorry for letting this PR sit for so long. It's RFAL now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/store_snapshot.go, line 610 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we have any testing that hits this code path?

Added.


pkg/kv/kvserver/store_snapshot.go, line 715 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

There could be other reasons to decline snapshots in the future. For instance, we may decline due to CPU saturation. So we shouldn't be coupling the snapshot status with the reason for the snapshot status. Consider returning this message from shouldDeclineSnapshot so we can add other policies later on.

Done.


pkg/kv/kvserver/store_snapshot.go, line 965 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Same point about coupling the status and the reason.

Done.


pkg/kv/kvserver/raft.proto, line 210 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

A lot of this looks like a revert of 2a1857f. Should we revert that revert and then adjust it with some of the policy changes you've made here? That might help avoid some of the compatibility concerns, because that hasn't made it into a release yet. It also comes with some additional testing which would be valuable here.

Done. Note that 2a1857f removed a check to reject snapshots on stores with full disks. That seemed unintentional to me, so I've reintroduced it.

@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 7, 2022
@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch 6 times, most recently from 6579ed2 to f582cf0 Compare March 8, 2022 00:26
This commit introduces two new cluster settings:
```
kv.snapshot_decline.read_amp_threshold

server.declined_snapshot.timeout
```

With this commit, stores with a read amplification level higher than
`kv.snapshot_decline.read_amp_threshold` will decline all `REBALANCE`
snapshots. Upon receiving a `DECLINED` response, the senders of these snapshots
will consider these receivers `throttled` for
`server.declined_snapshot.timeout`.

This means that stores with poor LSM health will not be considered as valid
candidates for replica rebalancing.

Fixes cockroachdb#73714
Related to cockroachdb#62168

Release note: None

Release justification: This patch adds a tunable guardrail that could prevent
or mitigate cluster instability
@aayushshah15 aayushshah15 force-pushed the 20211211_declineSnapshotsBasedOnL0FileCount branch from f582cf0 to 27631b6 Compare March 8, 2022 00:29
@erikgrinaker
Contributor

I think we should try to get this and/or #77491 in, since we keep seeing instances of high read amp that could plausibly be explained by unrestrained snapshot ingestion. Since this change differentiates between snapshot types, which was a concern in #77491, maybe we should prioritize this?


Labels

T-kv KV Team


Development

Successfully merging this pull request may close these issues.

kvserver: store with high read amplification should not be a target of rebalancing

7 participants