kv: prioritize snapshots to leaseholders, then voters, then learners #80817

Open
nvb wants to merge 1 commit into cockroachdb:master from nvb:nvanbenschoten/preferLeaseholderSnapshot

Conversation

@nvb
Contributor

@nvb nvb commented Apr 30, 2022

This commit adds prioritization to the raftSnapshotQueue so that it sends
Raft snapshots to the replicas that are most in need of snapshots first. The
prioritization order is to first send snapshots to leaseholder replicas, then
voter replicas, then learner replicas.

Prioritizing snapshots to leaseholders is an important availability improvement.
The leaseholder may be so far behind on its log that it does not even realize
that it holds the lease. In such cases, the range is unavailable for reads and
writes until the leaseholder receives its snapshot, so send one ASAP.

Beyond leaseholders, we prioritize snapshots to voter replicas because a voter
in need of a snapshot cannot vote for new proposals, yet it may be needed to
achieve quorum on its range for write availability.

There's room to go further here. For instance, we could make a determination
about the relative importance of a given voter snapshot based on whether its
range is unavailable or not. This commit does not try to do something like this.

Release note (bug fix): Raft snapshots to replicas are now prioritized based on
their relative importance for availability. Snapshots to leaseholders are given
the highest priority, followed by snapshots to voter replicas, followed by
snapshots to learner replicas.
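The three-tier ordering described in the commit message can be sketched as a small priority function. This is a minimal illustration with hypothetical names, not the PR's actual code (which lives in the raftSnapshotQueue):

```go
package main

import "fmt"

// replicaKind enumerates the three tiers named in the commit message.
// Names and values here are illustrative, not CockroachDB identifiers.
type replicaKind int

const (
	learner replicaKind = iota
	voter
	leaseholder
)

// snapshotPriority returns a larger value for replicas whose snapshots
// should be sent sooner: leaseholders first, then voters, then learners.
func snapshotPriority(k replicaKind) int {
	switch k {
	case leaseholder:
		return 2
	case voter:
		return 1
	default:
		return 0
	}
}

func main() {
	fmt.Println(snapshotPriority(leaseholder) > snapshotPriority(voter)) // true
	fmt.Println(snapshotPriority(voter) > snapshotPriority(learner))     // true
}
```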

@nvb nvb requested review from erikgrinaker and kvoli April 30, 2022 01:33
@nvb nvb requested a review from a team as a code owner April 30, 2022 01:33
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@kvoli kvoli left a comment


Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@nvb nvb force-pushed the nvanbenschoten/preferLeaseholderSnapshot branch from b231e66 to 3cf661b Compare May 2, 2022 03:50
@tbg
Member

tbg commented May 2, 2022

Do we have to worry about getting (in much rarer cases) the problems back that we solved (a long time ago) when we made the queue FIFO (as opposed to LIFO) at equal priorities, i.e. this code?

    if a.priority == b.priority {
        // When priorities are equal, we want the lower sequence number to show
        // up first (FIFO).
        return a.seq < b.seq
    }
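For context, the comparator quoted above can be read as a full ordering that sorts by priority but stays FIFO among equal priorities. A self-contained sketch (assuming, as is conventional, that higher priority values are processed first; the field names are illustrative):

```go
package main

import "fmt"

// item is a queued snapshot request in this sketch.
type item struct {
	priority float64
	seq      int // monotonically increasing enqueue sequence number
}

// less reports whether a should be processed before b: higher priority
// first, and FIFO (lower seq first) when priorities tie.
func less(a, b item) bool {
	if a.priority == b.priority {
		// When priorities are equal, the lower sequence number goes first (FIFO).
		return a.seq < b.seq
	}
	return a.priority > b.priority
}

func main() {
	fmt.Println(less(item{1, 5}, item{1, 9})) // true: FIFO at equal priority
	fmt.Println(less(item{2, 9}, item{1, 5})) // true: higher priority wins despite later seq
}
```

The starvation concern is visible here: a steady stream of high-priority items can indefinitely delay lower-priority ones, which strict FIFO avoided.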

The examples are somewhat contrived but basically snapshots can be necessary to break dependency edges between ranges.

r1/1 r1/2 r1/3 splits off r2/1 r2/2 r2/3

r2/3 is slow to apply the snapshot. In fact it hasn't even received the log entry. But somehow the lease got transferred to it. So r2 is unavailable until r1/3 applies the split trigger.

Follower r1/2 is down. r1/1 has the split applied. Follower r1/3 doesn't have the split yet and needs a snapshot.

In this situation, prioritizing the leaseholder snapshot over the follower snapshot is a deadlock, since it will fail deterministically, until the voter-snapshot for r1/3 has been issued.

Sure, this example is contrived, but are there others? Ideally we'd have some robust way of trying to send a snapshot to the leaseholder once, and if that fails, not prioritize it in the same way again to avoid starvation.

@nvb
Contributor Author

nvb commented May 2, 2022

This is a good point. I was concerned that any loss of fairness (i.e. prioritization) could lead to starvation, but could not construct a scenario.

In this situation, prioritizing the leaseholder snapshot over the follower snapshot is a deadlock, since it will fail deterministically, until the voter-snapshot for r1/3 has been issued.

I'd like to understand this case further. Ideally, there would not be dependency edges from one range/replica on the availability of another. Why does the snapshot to r2/3 fail deterministically? Does this have to do with the splitDelayHelper?

@tbg
Member

tbg commented May 6, 2022

A snapshot for r2/3 overlaps with the (pre-split) replica r1/3, which is not replicaGC'able (it is still in the descriptor, just needs to catch up).

@nvb
Contributor Author

nvb commented May 6, 2022

Ah, that's right. The logic lives in Store.checkSnapshotOverlapLocked.

I see why we have this logic for the case where the pre-split replica can catch up on its log. However, it seems like a meaningful limitation with no benefit for cases where the pre-split replica needs a snapshot from the LHS of the split. In these cases, we can't avoid two (or more, for multiple splits) snapshots, and it seems unfortunate to have dependencies between snapshots for different ranges. Doing so prevents us from introducing prioritization between snapshots. I'm guessing it also hurts the time to recover a behind node because now some snapshots need to wait for others and there will be blocking/thrashing until the right sender finds the right snapshot to unblock everyone.

I'm surprised we haven't seen this in the splits/largerange/size=32GiB,nodes=6 roachtest. Or maybe we have and I'm just not seeing it in the handful of times that you have debugged the test.

One option to avoid this issue would be to detect that a pre-split replica needs a snapshot and remove it while allowing the RHS snapshot though. Abstractly, this sounds good. In practice, it runs into the problem that a follower replica doesn't know when it needs a snapshot, only the leader knows. It also runs into the problem that we'd be removing a replica while it is still in the descriptor, which isn't a fundamental limitation, but is also not something we do today.

Another option to avoid this form of priority inversion and risk of starvation would be to build and use the dependency graph directly to make prioritization decisions. Concretely, we could pass back the range ID of the overlapping range from checkSnapshotOverlapLocked to the sender's raftSnapshotQueue. After doing so, the raftSnapshotQueue could construct a graph that might look something like:

type snapshotDeps struct {
    deps              map[roachpb.RangeID]struct{}
    maxPriorityOfDeps float64
}

type snapshotDepGraph struct {
    deps    map[roachpb.RangeID]snapshotDeps
    prereqs map[roachpb.RangeID]map[roachpb.RangeID]struct{}
}

When receiving an error from checkSnapshotOverlapLocked, the raftSnapshotQueue could update this graph to include the current range's dependency. It could also enqueue the dependency and use the graph to assign a priority to the dependency's snapshot of max(priority(repl), maxPriorityOfDeps + 1).
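The bookkeeping described above might look like the following self-contained sketch. The method name, signature, and escalation rule are hypothetical; RangeID stands in for roachpb.RangeID:

```go
package main

import "fmt"

// RangeID stands in for roachpb.RangeID in this sketch.
type RangeID int

// snapshotDeps tracks which ranges' snapshots are waiting on a given range,
// mirroring the structs sketched in the comment above.
type snapshotDeps struct {
	deps              map[RangeID]struct{} // ranges whose snapshots wait on this one
	maxPriorityOfDeps float64
}

type snapshotDepGraph struct {
	deps    map[RangeID]snapshotDeps
	prereqs map[RangeID]map[RangeID]struct{} // reverse index: blocked range -> its prereqs
}

// addDependency records that blocked's snapshot failed because prereq's stale
// replica overlaps it, and returns the priority to enqueue prereq's snapshot
// with: max(prereqBasePriority, maxPriorityOfDeps+1). Hypothetical bookkeeping,
// not actual CockroachDB code.
func (g *snapshotDepGraph) addDependency(blocked, prereq RangeID, blockedPriority, prereqBasePriority float64) float64 {
	d := g.deps[prereq]
	if d.deps == nil {
		d.deps = map[RangeID]struct{}{}
	}
	d.deps[blocked] = struct{}{}
	if blockedPriority > d.maxPriorityOfDeps {
		d.maxPriorityOfDeps = blockedPriority
	}
	g.deps[prereq] = d
	if g.prereqs[blocked] == nil {
		g.prereqs[blocked] = map[RangeID]struct{}{}
	}
	g.prereqs[blocked][prereq] = struct{}{}
	// Escalate the prerequisite's snapshot above every snapshot that waits on it.
	if esc := d.maxPriorityOfDeps + 1; esc > prereqBasePriority {
		return esc
	}
	return prereqBasePriority
}

func main() {
	g := &snapshotDepGraph{
		deps:    map[RangeID]snapshotDeps{},
		prereqs: map[RangeID]map[RangeID]struct{}{},
	}
	// r2's snapshot (priority 3) is blocked by the stale pre-split replica of r1.
	fmt.Println(g.addDependency(2, 1, 3.0, 1.0)) // prints 4
}
```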

This would help for dependencies whose leader was on the same node. While I think this may actually be the common case because we're dealing with recently split ranges, we still need to have an answer for leaders on different nodes. I don't see any good options that don't require new communication channels. We could have raftSnapshotQueues communicate.

Another option is that we pass the responsibility to escalate the priority of snapshots to the LHS replica in checkSnapshotOverlapLocked. That replica could be told that "this snapshot with priority P is being delayed because of you, learn about the split in your near future with priority P+1". It could then communicate that information to its leader, who could feed this into the raftSnapshotQueue. In the case of transitive dependencies, this could get passed up the chain of splits. A benefit of this approach is that lifetimes in this priority scheme become clear (remember higher priority until snapshot succeeds), and we may be able to use the existing heartbeat channel to pass information.

I'm also now seeing that this is a known issue:

// NB: if the snapshot fails because of an overlapping replica on the
// recipient which is also waiting for a snapshot, the "smart" thing is to
// send that other snapshot with higher priority. The problem is that the
// leader for the overlapping range may not be this node. This happens
// typically during splits and merges when overly aggressive log truncations
// occur.
//
// For splits, the overlapping range will be a replica of the pre-split
// range that needs a snapshot to catch it up across the split trigger.
//
// For merges, the overlapping replicas belong to ranges since subsumed by
// this range. In particular, there can be many of them if merges apply in
// rapid succession. The leftmost replica is the most important one to catch
// up, as it will absorb all of the overlapping replicas when caught up past
// all of the merges.
//
// We're currently not handling this and instead rely on the quota pool to
// make sure that log truncations won't require snapshots for healthy
// followers.

One final thought is that the retry story for failed Raft snapshots is naive (or I'm naive and don't understand it). I think it is driven by Raft ticks, which fire every 200ms. Every 5 of those we send heartbeats (defaultRaftHeartbeatIntervalTicks). In response to a heartbeat, we may send a MsgSnap, which is enqueued into the raftSnapshotQueue. I wonder whether there is a role for the snapshot queue's purgatory for snapshots that we know are going to fail.

@erikgrinaker
Contributor

Dealing with these dependency graphs sounds complex. Do we understand why a replica in need of a snapshot got the lease in the first place?

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 17, 2022

We did determine the reason why we were repeatedly transferring the lease to a replica that was in need of a snapshot. It was due to a bug that was recently fixed on master. See the later comments in https://github.com/cockroachlabs/support/issues/1569.

However, even if we're confident we've solved this one case, I still think we should do something here, as this keeps coming up in different forms. There are a number of reasons why we might want to prioritize one snapshot over another:

  • leaseholder vs. not leaseholder
  • system range vs. data range
  • causing unavailability vs. not
  • voter vs. learner
  • raft snapshot vs. upreplication snapshot

In each of these cases, prioritizing snapshots to the "right" replicas can mean the difference between an outage that lasts seconds and one that lasts hours. For instance, we have seen multiple instances in the past month where an important snapshot got stuck behind a wave of decommissioning snapshots, causing a much longer outage than necessary.
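One way to combine the dimensions listed above into a single comparable score is to weight them so that a higher-ranked dimension always dominates the lower-ranked ones, bitmask-style. A sketch under that assumption; the field names, weights, and their relative order are illustrative choices, not CockroachDB's:

```go
package main

import "fmt"

// snapshotRequest captures the prioritization dimensions from the list above.
// Field names are hypothetical.
type snapshotRequest struct {
	isLeaseholder         bool
	isSystemRange         bool
	causingUnavailability bool
	isVoter               bool
	isRaftSnapshot        bool // vs. an upreplication snapshot
}

// score folds the booleans into one integer, ordered by decreasing weight so
// a higher-order dimension always outweighs every combination of lower ones.
func score(r snapshotRequest) int {
	s := 0
	for i, b := range []bool{
		r.isLeaseholder, r.isSystemRange, r.causingUnavailability, r.isVoter, r.isRaftSnapshot,
	} {
		if b {
			s |= 1 << (4 - i)
		}
	}
	return s
}

func main() {
	lease := snapshotRequest{isLeaseholder: true}
	decom := snapshotRequest{isVoter: true} // e.g. an upreplication snapshot for decommissioning
	fmt.Println(score(lease) > score(decom)) // true: the leaseholder snapshot jumps the queue
}
```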

Arguably some of these cases shouldn't be possible. For instance, I'd like for us to be able to make stronger guarantees around a leaseholder never needing a snapshot. To make such a guarantee, we'll need to rethink the lease transfer protocol — the current protocol is non-cooperative and inherently racy. Even if we do think we have certain guarantees, I think it's important that we be able to recover from mistakes as quickly as possible, given the severity of not doing so.

I do agree that the dependency graphs sound complex and I'd prefer a simpler solution. I think my preferred solution would be to eliminate all inter-range dependencies so that the recovery of one range never depends on that of another. That's not currently possible, given the way we handle overlapping snapshots.

All of this needs more thought.

@tbg
Member

tbg commented May 17, 2022

I think my preferred solution would be to eliminate all inter-range dependencies so that the recovery of one range never depends on that of another.

That strikes me as the right goal. A long time back (5 years?) we talked about having a prefix so that ranges are guaranteed to have their own keyspace (i.e. conceptually replicas wouldn't "share storage"). In such a world, it would be perfectly legitimate to send a snapshot for the RHS if the pre-split LHS is still in place. Since the KV server routes requests not by key range but by target replica descriptor, I think everything would just work (there will be some unanticipated problems, of course). But getting the separate keyspace in place would be a pretty large undertaking.

We could be introducing rocksdb-style column families to pebble, and using a column family per replica. When this was last discussed with @cockroachdb/storage , they weren't exactly excited about it 😆 but I wonder if now that a few years have passed, if we can maybe identify other upshots to this approach. For example, separating the column families could play sort of the role of (I think) guards in PebblesDB, i.e. it can give pebble a hint about boundaries that it shouldn't compact across. And perhaps this would dovetail nicely with the disaggregated storage RFC, since the distributed storage would naturally end up keeping per-replica snapshots, which should be desirable.

We could of course also add a userspace (kvserver) prefix, say the <rangeID,replicaID> pair, which is simpler in terms of technology but a lot more leaky.

I'm curious if we have any ideas other than those two. (Having a separate pebble instance per Replica is guaranteed to be a non-starter due to the difficulty of all of these pebble instances sharing resources; column families are effectively the solution for that).

We could also work around the dependencies on an ad-hoc basis. For example, if a snapshot for the RHS arrives, we could "truncate" the pre-split LHS in-place with an override. That way, it would still "be" the old range, but we wouldn't allow any access to the keyspace that now belongs to the RHS. But this opens many cans of worms (what if a command is in raft and executes on the LHS that needs to write to the RHS? we'd need to make the RHS portions no-ops, and then how do we do the stats? etc.). It's almost guaranteed to not be the solution we'd ever embark on, but I wanted to mention it.

@tbg
Member

tbg commented May 17, 2022

For a more sane ad-hoc solution to the problem discussed here, if a RHS snapshot is attempted, instead of opaquely saying "has overlap, no can do" we could also return more precise information: the LHS replica is still around and hasn't caught up. The sender could then prioritize the overlapping LHS replica (of which it hopefully still has a copy) for a snapshot. This cannot cause starvation, since the left-hand side replica is initialized (it overlaps our keyspace after all). An initialized range is not waiting for its left-hand side, so this range should be able to receive a snapshot just fine (if it needs one), and that snapshot will unblock the RHS. In fact, the snapshot might end up catching the LHS up across dozens of splits (we see these split cascades, where a range is split 100x but one follower doesn't apply these splits, leading to all 100 RHS replicas on the slow node to be stuck in limbo). So unless I'm missing other ways in which starvation could result from this (something something merges), this seems like a reasonable change.

@petermattis
Collaborator

We could be introducing rocksdb-style column families to pebble, and using a column family per replica. When this was last discussed with @cockroachdb/storage , they weren't exactly excited about it 😆 but I wonder if now that a few years have passed, if we can maybe identify other upshots to this approach.

RocksDB column families create a separate LSM tree per column family, all of which share the same WAL. So you get atomicity across mutations to different column families (due to the shared WAL), but can configure different LSM heuristics per column family. I'm pretty sure this isn't what you're after here, as having O(num-replicas) LSM trees per store would be bad for a variety of reasons (smaller sstables, more compactions, etc).

@tbg
Member

tbg commented May 17, 2022

RocksDB column families create a separate LSM tree per column family, all of which share the same WAL. So you get atomicity across mutations to different column families (due to the shared WAL), but can configure different LSM heuristics per column family. I'm pretty sure this isn't what you're after here, as having O(num-replicas) LSM trees per store would be bad for a variety of reasons (smaller sstables, more compactions, etc).

Yeah, you're right that this is not what we want. We'd need to be able to mix overlapping data within an SST, so it starts sounding a lot like pushing the prefix into pebble (where it can maybe be used for explicit decision making, but not sure that buys much).
