kv: prioritize snapshots to leaseholders, then voters, then learners #80817
nvb wants to merge 1 commit into cockroachdb:master
Conversation
kvoli
left a comment
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
This commit adds prioritization to the `raftSnapshotQueue` so that it sends Raft snapshots to the replicas that are most in need of snapshots first. The prioritization order is to first send snapshots to leaseholder replicas, then voter replicas, then learner replicas.

Prioritizing snapshots to leaseholders is an important availability improvement. The leaseholder may be so far behind on its log that it does not even realize that it holds the lease. In such cases, the range is unavailable for reads and writes until the leaseholder receives its snapshot, so send one ASAP.

Beyond leaseholders, we prioritize snapshots to voter replicas because a voter in need of a snapshot cannot vote for new proposals, so it may be needed to achieve quorum on its range for write availability.

There's room to go further here. For instance, we could make a determination about the relative importance of a given voter snapshot based on whether its range is unavailable or not. This commit does not try to do something like this.

Release note (bug fix): Raft snapshots to replicas are now prioritized based on their relative importance for availability. Snapshots to leaseholders are given the highest priority, followed by snapshots to voter replicas, followed by snapshots to learner replicas.
Force-pushed from b231e66 to 3cf661b
Do we have to worry about getting back (in much rarer cases) the problems that we solved (a long time ago) when we made the queue FIFO (as opposed to LIFO) at equal priorities, i.e. this code?

cockroach/pkg/kv/kvserver/queue.go Lines 159 to 163 in bbbfb1a

The examples are somewhat contrived, but basically snapshots can be necessary to break dependency edges between ranges:

- r1/1 r1/2 r1/3 splits off r2/1 r2/2 r2/3.
- r2/3 is slow to apply the snapshot. In fact, it hasn't even received the log entry. But somehow the lease got transferred to it. So r2 is unavailable until r1/3 applies the split trigger.
- Follower r1/2 is down. r1/1 has the split applied. Follower r1/3 doesn't have the split yet and needs a snapshot.

In this situation, prioritizing the leaseholder snapshot over the follower snapshot is a deadlock, since it will fail deterministically until the voter snapshot for r1/3 has been issued.

Sure, this example is contrived, but are there others? Ideally we'd have some robust way of trying to send a snapshot to the leaseholder once, and if that fails, not prioritize it in the same way again to avoid starvation.
This is a good point. I was concerned that any loss of fairness (i.e. prioritization) could lead to starvation, but could not construct a scenario.
I'd like to understand this case further. Ideally, there would not be dependency edges from one range/replica on the availability of another. Why does the snapshot to r2/3 fail deterministically? Does this have to do with the
A snapshot for r2/3 overlaps with the (pre-split) replica r1/3, which is not replicaGC'able (it is still in the descriptor, just needs to catch up).
Ah, that's right. The logic lives in

I see why we have this logic for the case where the pre-split replica can catch up on its log. However, it seems like a meaningful limitation with no benefit for cases where the pre-split replica needs a snapshot from the LHS of the split. In these cases, we can't avoid two (or more, for multiple splits) snapshots, and it seems unfortunate to have dependencies between snapshots for different ranges. Doing so prevents us from introducing prioritization between snapshots. I'm guessing it also hurts the time to recover a behind node, because now some snapshots need to wait for others and there will be blocking/thrashing until the right sender finds the right snapshot to unblock everyone. I'm surprised we haven't seen this in the

One option to avoid this issue would be to detect that a pre-split replica needs a snapshot and remove it while allowing the RHS snapshot through. Abstractly, this sounds good. In practice, it runs into the problem that a follower replica doesn't know when it needs a snapshot; only the leader knows. It also runs into the problem that we'd be removing a replica while it is still in the descriptor, which isn't a fundamental limitation, but is also not something we do today.

Another option to avoid this form of priority inversion and risk of starvation would be to build and use the dependency graph directly to make prioritization decisions. Concretely, we could pass back the range ID of the overlapping range from

```go
type snapshotDeps struct {
	deps              map[roachpb.RangeID]struct{}
	maxPriorityOfDeps float64
}

type snapshotDepGraph struct {
	deps    map[roachpb.RangeID]snapshotDeps
	prereqs map[roachpb.RangeID]map[roachpb.RangeID]struct{}
}
```

When receiving an error from

This would help for dependencies whose leader was on the same node. While I think this may actually be the common case, because we're dealing with recently split ranges, we still need to have an answer for leaders on different nodes. I don't see any good options that don't require new communication channels. We could have

Another option is that we pass the responsibility to escalate the priority of snapshots to the LHS replica in

I'm also now seeing that this is a known issue:

cockroach/pkg/kv/kvserver/raft_snapshot_queue.go Lines 144 to 162 in 40d8803

One final thought is that the retry story for failed Raft snapshots is naive (or I'm naive and don't understand it). I think it is driven by Raft ticks, which fire every 200ms. Every 5 of those we heartbeat (
Dealing with these dependency graphs sounds complex. Do we understand why a replica in need of a snapshot got the lease in the first place?
We did determine the reason why we were repeatedly transferring the lease to a replica that was in need of a snapshot. It was due to a bug that was recently fixed on master. See the later comments in https://github.com/cockroachlabs/support/issues/1569.

However, even if we're confident we've solved this one case, I still think we should do something here, as this keeps coming up in different forms. There are a number of reasons why we might want to prioritize one snapshot over another:

In each of these cases, prioritizing snapshots to the "right" replicas can mean the difference between an outage that lasts seconds and one that lasts hours. For instance, we have seen multiple instances in the past month where an important snapshot got stuck behind a wave of decommissioning snapshots, causing a much longer outage than necessary.

Arguably some of these cases shouldn't be possible. For instance, I'd like for us to be able to make stronger guarantees around a leaseholder never needing a snapshot. To make such a guarantee, we'll need to rethink the lease transfer protocol — the current protocol is non-cooperative and inherently racy. Even if we do think we have certain guarantees, I think it's important that we be able to recover from mistakes as quickly as possible, given the severity of not doing so.

I do agree that the dependency graphs sound complex and I'd prefer a simpler solution. I think my preferred solution would be to eliminate all inter-range dependencies so that the recovery of one range never depends on that of another. That's not currently possible, given the way we handle overlapping snapshots. All of this needs more thought.
That strikes me as the right goal. A long time back (5 years?) we talked about having a prefix so that ranges are guaranteed to have their own keyspace (i.e. conceptually replicas wouldn't "share storage"). In such a world, it would be perfectly legitimate to send a snapshot for the RHS if the pre-split LHS is still in place. Since the KV server routes requests not by key range but by target replica desc, I think everything would just work (there will be some unanticipated problems, of course...). But getting the separate keyspace in place would be a pretty large undertaking.

We could be introducing rocksdb-style column families to pebble, and using a column family per replica. When this was last discussed with @cockroachdb/storage, they weren't exactly excited about it 😆 but I wonder, now that a few years have passed, if we can maybe identify other upshots to this approach. For example, separating the column families could play sort of the role of (I think) guards in PebblesDB, i.e. it can give pebble a hint about boundaries that it shouldn't compact across. And perhaps this would dovetail nicely with the disaggregated storage RFC, since the distributed storage would naturally end up keeping per-replica snapshots, which should be desirable. We could of course also add a userspace (

I'm curious if we have any ideas other than those two. (Having a separate pebble instance per Replica is guaranteed to be a non-starter due to the difficulty of all of these pebble instances sharing resources; column families are effectively the solution for that.)

We could also work around the dependencies on an ad-hoc basis. For example, if a snapshot for the RHS arrives, we could "truncate" the pre-split LHS in place with an override. That way, it would still "be" the old range, but we wouldn't allow any access to the keyspace that now belongs to the RHS. But this opens many cans of worms (what if a command was in raft and executes on the LHS that needs to write to the RHS? we'd need to make the RHS portions no-ops, and then how to do the stats, etc.) - it's almost guaranteed to not be the solution we'd ever embark on, but I wanted to mention it.
For a more sane ad-hoc solution to the problem discussed here: if an RHS snapshot is attempted, instead of opaquely saying "has overlap, no can do", we could return more precise information, namely that the LHS replica is still around and hasn't caught up. The sender could then prioritize the overlapping LHS replica (of which it hopefully still has a copy) for a snapshot.

This cannot cause starvation, since the left-hand side replica is initialized (it overlaps our keyspace, after all). An initialized range is not waiting for its left-hand side, so this range should be able to receive a snapshot just fine (if it needs one), and that snapshot will unblock the RHS. In fact, the snapshot might end up catching the LHS up across dozens of splits (we see these split cascades, where a range is split 100x but one follower doesn't apply these splits, leading to all 100 RHS replicas on the slow node being stuck in limbo). So unless I'm missing other ways in which starvation could result from this (something something merges), this seems like a reasonable change.
RocksDB column families create a separate LSM tree per column family, all of which share the same WAL. So you get atomicity across mutations to different column families (due to the shared WAL), but can configure different LSM heuristics per column family. I'm pretty sure this isn't what you're after here, as having O(num-replicas) LSM trees per store would be bad for a variety of reasons (smaller sstables, more compactions, etc). |
Yeah, you're right that this is not what we want. We'd need to be able to mix overlapping data within an SST, so it starts sounding a lot like pushing the prefix into pebble (where it can maybe be used for explicit decision making, but not sure that buys much). |