kv: prioritize snapshots to leaseholders, then voters, then learners #80817

Open
nvb wants to merge 1 commit into cockroachdb:master from nvb:nvanbenschoten/preferLeaseholderSnapshot

Conversation

@nvb
Contributor

@nvb nvb commented Apr 30, 2022

This commit adds prioritization to the raftSnapshotQueue so that it sends
Raft snapshots to the replicas that are most in need of snapshots first. The
prioritization order is to first send snapshots to leaseholder replicas, then
voter replicas, then learner replicas.

Prioritizing snapshots to leaseholders is an important availability improvement.
The leaseholder may be so far behind on its log that it does not even realize
that it holds the lease. In such cases, the range is unavailable for reads and
writes until the leaseholder receives its snapshot, so send one ASAP.

Beyond leaseholders, we prioritize snapshots to voter replicas because a voter
in need of a snapshot cannot vote for new proposals, yet it may be needed to
achieve quorum on its range for write availability.

There's room to go further here. For instance, we could make a determination
about the relative importance of a given voter snapshot based on whether its
range is unavailable or not. This commit does not try to do something like this.

Release note (bug fix): Raft snapshots to replicas are now prioritized based on
their relative importance for availability. Snapshots to leaseholders are given
the highest priority, followed by snapshots to voter replicas, followed by
snapshots to learner replicas.
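The three-tier ordering described in the commit message can be sketched as a small priority function. This is a minimal illustration with hypothetical names, not the PR's actual code (which lives in the raftSnapshotQueue):

```go
package main

import "fmt"

// replicaKind enumerates the three tiers named in the commit message.
// Names and values here are illustrative, not CockroachDB identifiers.
type replicaKind int

const (
	learner replicaKind = iota
	voter
	leaseholder
)

// snapshotPriority returns a larger value for replicas whose snapshots
// should be sent sooner: leaseholders first, then voters, then learners.
func snapshotPriority(k replicaKind) int {
	switch k {
	case leaseholder:
		return 2
	case voter:
		return 1
	default:
		return 0
	}
}

func main() {
	fmt.Println(snapshotPriority(leaseholder) > snapshotPriority(voter)) // true
	fmt.Println(snapshotPriority(voter) > snapshotPriority(learner))     // true
}
```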

@nvb nvb requested review from erikgrinaker and kvoli April 30, 2022 01:33
@nvb nvb requested a review from a team as a code owner April 30, 2022 01:33
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@kvoli kvoli left a comment


Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@nvb nvb force-pushed the nvanbenschoten/preferLeaseholderSnapshot branch from b231e66 to 3cf661b Compare May 2, 2022 03:50
@tbg
Member

tbg commented May 2, 2022

Do we have to worry about getting (in much rarer cases) the problems back that we solved (a long time ago) when we made the queue FIFO (as opposed to LIFO) at equal priorities, i.e. this code?

    if a.priority == b.priority {
        // When priorities are equal, we want the lower sequence number to show
        // up first (FIFO).
        return a.seq < b.seq
    }
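For context, the comparator quoted above can be read as a full ordering that sorts by priority but stays FIFO among equal priorities. A self-contained sketch (assuming, as is conventional, that higher priority values are processed first; the field names are illustrative):

```go
package main

import "fmt"

// item is a queued snapshot request in this sketch.
type item struct {
	priority float64
	seq      int // monotonically increasing enqueue sequence number
}

// less reports whether a should be processed before b: higher priority
// first, and FIFO (lower seq first) when priorities tie.
func less(a, b item) bool {
	if a.priority == b.priority {
		// When priorities are equal, the lower sequence number goes first (FIFO).
		return a.seq < b.seq
	}
	return a.priority > b.priority
}

func main() {
	fmt.Println(less(item{1, 5}, item{1, 9})) // true: FIFO at equal priority
	fmt.Println(less(item{2, 9}, item{1, 5})) // true: higher priority wins despite later seq
}
```

The starvation concern is visible here: a steady stream of high-priority items can indefinitely delay lower-priority ones, which strict FIFO avoided.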

The examples are somewhat contrived but basically snapshots can be necessary to break dependency edges between ranges.

r1/1 r1/2 r1/3 splits off r2/1 r2/2 r2/3

r2/3 is slow to apply the snapshot. In fact it hasn't even received the log entry. But somehow the lease got transferred to it. So r2 is unavailable until r1/3 applies the split trigger.

Follower r1/2 is down. r1/1 has the split applied. Follower r1/3 doesn't have the split yet and needs a snapshot.

In this situation, prioritizing the leaseholder snapshot over the follower snapshot is a deadlock, since it will fail deterministically, until the voter-snapshot for r1/3 has been issued.

Sure, this example is contrived, but are there others? Ideally we'd have some robust way of trying to send a snapshot to the leaseholder once, and if that fails, not prioritize it in the same way again to avoid starvation.

@nvb
Contributor Author

nvb commented May 2, 2022

This is a good point. I was concerned that any loss of fairness (i.e. prioritization) could lead to starvation, but could not construct a scenario.

In this situation, prioritizing the leaseholder snapshot over the follower snapshot is a deadlock, since it will fail deterministically, until the voter-snapshot for r1/3 has been issued.

I'd like to understand this case further. Ideally, there would not be dependency edges from one range/replica on the availability of another. Why does the snapshot to r2/3 fail deterministically? Does this have to do with the splitDelayHelper?

@tbg
Member

tbg commented May 6, 2022

A snapshot for r2/3 overlaps with the (pre-split) replica r1/3, which is not replicaGC'able (it is still in the descriptor, just needs to catch up).

@nvb
Contributor Author

nvb commented May 6, 2022

Ah, that's right. The logic lives in Store.checkSnapshotOverlapLocked.

I see why we have this logic for the case where the pre-split replica can catch up on its log. However, it seems like a meaningful limitation with no benefit for cases where the pre-split replica needs a snapshot from the LHS of the split. In these cases, we can't avoid two (or more, for multiple splits) snapshots, and it seems unfortunate to have dependencies between snapshots for different ranges. Doing so prevents us from introducing prioritization between snapshots. I'm guessing it also hurts the time to recover a behind node because now some snapshots need to wait for others and there will be blocking/thrashing until the right sender finds the right snapshot to unblock everyone.

I'm surprised we haven't seen this in the splits/largerange/size=32GiB,nodes=6 roachtest. Or maybe we have and I'm just not seeing it in the handful of times that you have debugged the test.

One option to avoid this issue would be to detect that a pre-split replica needs a snapshot and remove it while allowing the RHS snapshot though. Abstractly, this sounds good. In practice, it runs into the problem that a follower replica doesn't know when it needs a snapshot, only the leader knows. It also runs into the problem that we'd be removing a replica while it is still in the descriptor, which isn't a fundamental limitation, but is also not something we do today.

Another option to avoid this form of priority inversion and risk of starvation would be to build and use the dependency graph directly to make prioritization decisions. Concretely, we could pass back the range ID of the overlapping range from checkSnapshotOverlapLocked to the sender's raftSnapshotQueue. After doing so, the raftSnapshotQueue could construct a graph that might look something like:

type snapshotDeps struct {
    deps              map[roachpb.RangeID]struct{}
    maxPriorityOfDeps float64
}

type snapshotDepGraph struct {
    deps    map[roachpb.RangeID]snapshotDeps
    prereqs map[roachpb.RangeID]map[roachpb.RangeID]struct{}
}

When receiving an error from checkSnapshotOverlapLocked, the raftSnapshotQueue could update this graph to include the current range's dependency. It could also enqueue the dependency and use the graph to assign a priority to the dependency's snapshot of max(priority(repl), maxPriorityOfDeps + 1).
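The bookkeeping described above might look like the following self-contained sketch. The method name, signature, and escalation rule are hypothetical; RangeID stands in for roachpb.RangeID:

```go
package main

import "fmt"

// RangeID stands in for roachpb.RangeID in this sketch.
type RangeID int

// snapshotDeps tracks which ranges' snapshots are waiting on a given range,
// mirroring the structs sketched in the comment above.
type snapshotDeps struct {
	deps              map[RangeID]struct{} // ranges whose snapshots wait on this one
	maxPriorityOfDeps float64
}

type snapshotDepGraph struct {
	deps    map[RangeID]snapshotDeps
	prereqs map[RangeID]map[RangeID]struct{} // reverse index: blocked range -> its prereqs
}

// addDependency records that blocked's snapshot failed because prereq's stale
// replica overlaps it, and returns the priority to enqueue prereq's snapshot
// with: max(prereqBasePriority, maxPriorityOfDeps+1). Hypothetical bookkeeping,
// not actual CockroachDB code.
func (g *snapshotDepGraph) addDependency(blocked, prereq RangeID, blockedPriority, prereqBasePriority float64) float64 {
	d := g.deps[prereq]
	if d.deps == nil {
		d.deps = map[RangeID]struct{}{}
	}
	d.deps[blocked] = struct{}{}
	if blockedPriority > d.maxPriorityOfDeps {
		d.maxPriorityOfDeps = blockedPriority
	}
	g.deps[prereq] = d
	if g.prereqs[blocked] == nil {
		g.prereqs[blocked] = map[RangeID]struct{}{}
	}
	g.prereqs[blocked][prereq] = struct{}{}
	// Escalate the prerequisite's snapshot above every snapshot that waits on it.
	if esc := d.maxPriorityOfDeps + 1; esc > prereqBasePriority {
		return esc
	}
	return prereqBasePriority
}

func main() {
	g := &snapshotDepGraph{
		deps:    map[RangeID]snapshotDeps{},
		prereqs: map[RangeID]map[RangeID]struct{}{},
	}
	// r2's snapshot (priority 3) is blocked by the stale pre-split replica of r1.
	fmt.Println(g.addDependency(2, 1, 3.0, 1.0)) // prints 4
}
```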

This would help for dependencies whose leader was on the same node. While I think this may actually be the common case because we're dealing with recently split ranges, we still need to have an answer for leaders on different nodes. I don't see any good options that don't require new communication channels. We could have raftSnapshotQueues communicate.

Another option is that we pass the responsibility to escalate the priority of snapshots to the LHS replica in checkSnapshotOverlapLocked. That replica could be told that "this snapshot with priority P is being delayed because of you, learn about the split in your near future with priority P+1". It could then communicate that information to its leader, who could feed this into the raftSnapshotQueue. In the case of transitive dependencies, this could get passed up the chain of splits. A benefit of this approach is that lifetimes in this priority scheme become clear (remember higher priority until snapshot succeeds), and we may be able to use the existing heartbeat channel to pass information.

I'm also now seeing that this is a known issue:

// NB: if the snapshot fails because of an overlapping replica on the
// recipient which is also waiting for a snapshot, the "smart" thing is to
// send that other snapshot with higher priority. The problem is that the
// leader for the overlapping range may not be this node. This happens
// typically during splits and merges when overly aggressive log truncations
// occur.
//
// For splits, the overlapping range will be a replica of the pre-split
// range that needs a snapshot to catch it up across the split trigger.
//
// For merges, the overlapping replicas belong to ranges since subsumed by
// this range. In particular, there can be many of them if merges apply in
// rapid succession. The leftmost replica is the most important one to catch
// up, as it will absorb all of the overlapping replicas when caught up past
// all of the merges.
//
// We're currently not handling this and instead rely on the quota pool to
// make sure that log truncations won't require snapshots for healthy
// followers.

One final thought is that the retry story for failed Raft snapshots is naive (or I'm naive and don't understand it). I think it is driven by Raft ticks, which fire every 200ms. Every 5 of those we send heartbeats (defaultRaftHeartbeatIntervalTicks). In response to a heartbeat, we may send a MsgSnap, which is enqueued into the raftSnapshotQueue. I wonder whether there is a role for the snapshot queue's purgatory for snapshots that we know are going to fail.

@erikgrinaker
Contributor

Dealing with these dependency graphs sounds complex. Do we understand why a replica in need of a snapshot got the lease in the first place?

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 17, 2022

We did determine the reason why we were repeatedly transferring the lease to a replica that was in need of a snapshot. It was due to a bug that was recently fixed on master. See the later comments in https://github.com/cockroachlabs/support/issues/1569.

However, even if we're confident we've solved this one case, I still think we should do something here, as this keeps coming up in different forms. There are a number of reasons why we might want to prioritize one snapshot over another:

  • leaseholder vs. not leaseholder
  • system range vs. data range
  • causing unavailability vs. not
  • voter vs. learner
  • raft snapshot vs. upreplication snapshot

In each of these cases, prioritizing snapshots to the "right" replicas can mean the difference between an outage that lasts seconds and one that lasts hours. For instance, we have seen multiple instances in the past month where an important snapshot got stuck behind a wave of decommissioning snapshots, causing a much longer outage than necessary.
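One way to combine the dimensions listed above into a single comparable score is to weight them so that a higher-ranked dimension always dominates the lower-ranked ones, bitmask-style. A sketch under that assumption; the field names, weights, and their relative order are illustrative choices, not CockroachDB's:

```go
package main

import "fmt"

// snapshotRequest captures the prioritization dimensions from the list above.
// Field names are hypothetical.
type snapshotRequest struct {
	isLeaseholder         bool
	isSystemRange         bool
	causingUnavailability bool
	isVoter               bool
	isRaftSnapshot        bool // vs. an upreplication snapshot
}

// score folds the booleans into one integer, ordered by decreasing weight so
// a higher-order dimension always outweighs every combination of lower ones.
func score(r snapshotRequest) int {
	s := 0
	for i, b := range []bool{
		r.isLeaseholder, r.isSystemRange, r.causingUnavailability, r.isVoter, r.isRaftSnapshot,
	} {
		if b {
			s |= 1 << (4 - i)
		}
	}
	return s
}

func main() {
	lease := snapshotRequest{isLeaseholder: true}
	decom := snapshotRequest{isVoter: true} // e.g. an upreplication snapshot for decommissioning
	fmt.Println(score(lease) > score(decom)) // true: the leaseholder snapshot jumps the queue
}
```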

Arguably some of these cases shouldn't be possible. For instance, I'd like for us to be able to make stronger guarantees around a leaseholder never needing a snapshot. To make such a guarantee, we'll need to rethink the lease transfer protocol — the current protocol is non-cooperative and inherently racy. Even if we do think we have certain guarantees, I think it's important that we be able to recover from mistakes as quickly as possible, given the severity of not doing so.

I do agree that the dependency graphs sound complex and I'd prefer a simpler solution. I think my preferred solution would be to eliminate all inter-range dependencies so that the recovery of one range never depends on that of another. That's not currently possible, given the way we handle overlapping snapshots.

All of this needs more thought.

@tbg
Member

tbg commented May 17, 2022

I think my preferred solution would be to eliminate all inter-range dependencies so that the recovery of one range never depends on that of another.

That strikes me as the right goal. A long time back (5 years?) we talked about having a prefix so that ranges are guaranteed to have their own keyspace (i.e. conceptually replicas wouldn't "share storage"). In such a world, it would be perfectly legitimate to send a snapshot for the RHS if the pre-split LHS is still in place. Since the KV server routes requests not by key range but by target replica descriptor, I think everything would just work (there will be some unanticipated problems, of course). But getting the separate keyspace in place would be a pretty large undertaking.

We could be introducing rocksdb-style column families to pebble, and using a column family per replica. When this was last discussed with @cockroachdb/storage , they weren't exactly excited about it 😆 but I wonder if now that a few years have passed, if we can maybe identify other upshots to this approach. For example, separating the column families could play sort of the role of (I think) guards in PebblesDB, i.e. it can give pebble a hint about boundaries that it shouldn't compact across. And perhaps this would dovetail nicely with the disaggregated storage RFC, since the distributed storage would naturally end up keeping per-replica snapshots, which should be desirable.

We could of course also add a userspace (kvserver) prefix, say the <rangeID,replicaID> pair, which is simpler in terms of technology but a lot more leaky.

I'm curious if we have any ideas other than those two. (Having a separate pebble instance per Replica is guaranteed to be a non-starter due to the difficulty of all of these pebble instances sharing resources; column families are effectively the solution for that).

We could also work around the dependencies on an ad-hoc basis. For example, if a snapshot for the RHS arrives, we could "truncate" the pre-split LHS in-place with an override. That way, it would still "be" the old range, but we wouldn't allow any access to the keyspace that now belongs to the RHS. But this opens many cans of worms (what if a command is in raft and executes on the LHS that needs to write to the RHS? we'd need to make the RHS portions no-ops, and then how do we do the stats? etc.). It's almost guaranteed to not be the solution we'd ever embark on, but I wanted to mention it.

@tbg
Member

tbg commented May 17, 2022

For a more sane ad-hoc solution to the problem discussed here, if a RHS snapshot is attempted, instead of opaquely saying "has overlap, no can do" we could also return more precise information: the LHS replica is still around and hasn't caught up. The sender could then prioritize the overlapping LHS replica (of which it hopefully still has a copy) for a snapshot. This cannot cause starvation, since the left-hand side replica is initialized (it overlaps our keyspace after all). An initialized range is not waiting for its left-hand side, so this range should be able to receive a snapshot just fine (if it needs one), and that snapshot will unblock the RHS. In fact, the snapshot might end up catching the LHS up across dozens of splits (we see these split cascades, where a range is split 100x but one follower doesn't apply these splits, leading to all 100 RHS replicas on the slow node to be stuck in limbo). So unless I'm missing other ways in which starvation could result from this (something something merges), this seems like a reasonable change.

@petermattis
Collaborator

We could be introducing rocksdb-style column families to pebble, and using a column family per replica. When this was last discussed with @cockroachdb/storage , they weren't exactly excited about it 😆 but I wonder if now that a few years have passed, if we can maybe identify other upshots to this approach.

RocksDB column families create a separate LSM tree per column family, all of which share the same WAL. So you get atomicity across mutations to different column families (due to the shared WAL), but can configure different LSM heuristics per column family. I'm pretty sure this isn't what you're after here, as having O(num-replicas) LSM trees per store would be bad for a variety of reasons (smaller sstables, more compactions, etc).

@tbg
Member

tbg commented May 17, 2022

RocksDB column families create a separate LSM tree per column family, all of which share the same WAL. So you get atomicity across mutations to different column families (due to the shared WAL), but can configure different LSM heuristics per column family. I'm pretty sure this isn't what you're after here, as having O(num-replicas) LSM trees per store would be bad for a variety of reasons (smaller sstables, more compactions, etc).

Yeah, you're right that this is not what we want. We'd need to be able to mix overlapping data within an SST, so it starts sounding a lot like pushing the prefix into pebble (where it can maybe be used for explicit decision making, but not sure that buys much).
