
kv: enable delegate snapshots#83991

Merged
craig[bot] merged 1 commit into cockroachdb:master from andrewbaptist:follower_snapshots
Feb 3, 2023

Conversation

@andrewbaptist

@andrewbaptist andrewbaptist commented Jul 7, 2022

kvserver: delegate snapshots to followers

Fixes: #42491

This commit allows a snapshot to be sent by a follower instead of the
leader of a range. The follower(s) are chosen based on locality to the
final recipient of the snapshot. If the follower is not able to
quickly send the snapshot, the attempt is aborted and the leader sends
the snapshot instead.

By choosing a delegate rather than sending the snapshot directly, WAN
traffic can be minimized. Additionally, the snapshot will likely be
delivered faster.

There are two settings that control this feature. The first,
`kv.snapshot_delegation.num_follower`, controls how many followers
the leader will attempt to delegate the snapshot through. If set to 0,
snapshot delegation is disabled. The second,
`kv.snapshot_delegation_queue.enabled`, controls whether delegated
snapshots will queue on the delegate or fail immediately. This is
useful to prevent a delegation request from waiting a long time
before it is sent.
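The interaction of the two settings can be sketched as follows. This is an illustrative Go sketch of the control flow only, not CockroachDB's actual API; the function and parameter names (`trySendSnapshot`, `numFollowers`, `send`) are hypothetical.

```go
package main

import "fmt"

// trySendSnapshot sketches the delegation flow described above: attempt
// up to numFollowers delegates in order, and fall back to sending from
// the leader if delegation is disabled (numFollowers == 0) or every
// delegate fails. Returns the node that sent the snapshot.
func trySendSnapshot(numFollowers int, delegates []string, send func(string) error) string {
	if numFollowers > 0 {
		for i, d := range delegates {
			if i >= numFollowers {
				break // only try the configured number of delegates
			}
			if err := send(d); err == nil {
				return d // a delegate sent the snapshot
			}
		}
	}
	// Delegation disabled or all delegates failed: the leader sends.
	if err := send("leader"); err == nil {
		return "leader"
	}
	return ""
}

func main() {
	send := func(target string) error {
		if target == "slow-delegate" {
			return fmt.Errorf("delegate %s timed out", target)
		}
		return nil
	}
	// Only one delegate is tried; it fails, so the leader sends.
	fmt.Println(trySendSnapshot(1, []string{"slow-delegate", "near-delegate"}, send)) // prints: leader
}
```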

Before the snapshot is sent from the follower, checks are done to
verify that the delegate is able to send a snapshot that will be valid
for the recipient. If not, the request is rerouted to the leader.

Release note (performance improvement): Adds delegated snapshots which can reduce WAN traffic for snapshot movement.

@cockroach-teamcity
Member

This change is Reviewable

@andrewbaptist
Author

This is a restore of the initial PR from amygao9@7c525d9; this is still a WIP.

@andrewbaptist andrewbaptist force-pushed the follower_snapshots branch 9 times, most recently from 94241ca to 1a6329c Compare July 12, 2022 17:43
@andrewbaptist andrewbaptist changed the title Initial commit kv: complete delegate snapshots Jul 12, 2022
@andrewbaptist andrewbaptist force-pushed the follower_snapshots branch 5 times, most recently from 2e76245 to 89ad698 Compare July 25, 2022 19:22
@andrewbaptist andrewbaptist changed the title kv: complete delegate snapshots kv: enable delegate snapshots Jul 26, 2022
@andrewbaptist andrewbaptist force-pushed the follower_snapshots branch 8 times, most recently from 432c493 to f8d5d89 Compare August 1, 2022 18:25
@andrewbaptist andrewbaptist force-pushed the follower_snapshots branch 2 times, most recently from 7e6fa1d to 5878898 Compare September 28, 2022 13:05
@andrewbaptist andrewbaptist force-pushed the follower_snapshots branch 2 times, most recently from ccd450b to e02c7ba Compare November 29, 2022 20:14
@andrewbaptist andrewbaptist marked this pull request as ready for review November 29, 2022 20:45
@andrewbaptist andrewbaptist requested a review from a team as a code owner November 29, 2022 20:45
Contributor

@nvb nvb left a comment


Reviewed 5 of 19 files at r4, 17 of 17 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @andrewbaptist)


pkg/cmd/roachtest/tests/decommissionbench.go line 90 at r5 (raw file):

	timeout time.Duration

	// An override for the decommission node to make it choose a predictable node.

nit: For readers that are new to this code, an "instead of a random node" at the end would be helpful.


pkg/kv/kvserver/replica_command.go line 2536 at r5 (raw file):

	coordinator, err := r.GetReplicaDescriptor()
	if err != nil {
		// If there is no local replica descriptor, return an empty list.

Should we return this error instead of requiring callers to handle an empty list?


pkg/kv/kvserver/replica_command.go line 2579 at r5 (raw file):

	// Get the localities of the candidate replicas, including the original sender.
	localities := storePool.GetLocalitiesPerReplica(candidates)
	recipientLocality := storePool.GetLocalitiesPerReplica([]roachpb.ReplicaDescriptor{recipient})[recipient]

nit: consider making the function variadic.


pkg/kv/kvserver/replica_command.go line 2583 at r5 (raw file):

	// Construct a map from replica to its diversity score compared to the
	// recipient. Also track the best score we see.
	replicaDistance := make(map[roachpb.ReplicaDescriptor]float64)

nit: presize the capacity to len(localities)


pkg/kv/kvserver/replica_command.go line 2611 at r5 (raw file):

Shuffle the replicas to prevent always choosing them in the same order.

Won't the rng with the same seed lead to the replicas always being shuffled in the same order? Is the randomness coming from the map iteration above?


pkg/kv/kvserver/replica_command.go line 2817 at r5 (raw file):

	// Check whether we should queue on the delegate.
	delegateRequest.QueueOnDelegateLen = MaxQueueOnDelegateLimit.Get(&r.ClusterSettings().SV)

nit: why is this pulled outside of the struct initialization?


pkg/kv/kvserver/replica_command.go line 2845 at r5 (raw file):

		if retErr == nil {
			return
		}

If we're looping and retrying on a different sender blindly without checking for a specific type of error, then it's probably best to at least log the errors that we're seeing here when looping. I'm sure we'll unintentionally delegate to multiple senders on an error that we weren't expecting at least once while stabilizing this work. Better to at least have some evidence of what the error is when that happens.


pkg/kv/kvserver/replica_command.go line 2857 at r5 (raw file):

// is validated twice, once before "queueing" and once after. This reduces the
// chance of false positives (snapshots which are sent but can't be used),
// however it is impossible to completely eliminate them.

"impossible to completely eliminate them" is too strong of wording, and it also doesn't help the reader understand the interactions here. Consider expanding on this point as a way to illustrate why we don't make stronger guarantees. What would we have to synchronize with to guarantee that we will never hit a false positive? Freeze all range splits, merge, rebalances, leadership changes, and log truncations?


pkg/kv/kvserver/replica_command.go line 2861 at r5 (raw file):

	ctx context.Context, req *kvserverpb.DelegateSnapshotRequest,
) error {
	// If the generation has changed, this snapshot will be useless, so don't

Why does a descriptor generation change necessarily cause the snapshot to be useless?


pkg/kv/kvserver/replica_command.go line 2863 at r5 (raw file):

	// If the generation has changed, this snapshot will be useless, so don't
	// attempt to send it.
	if r.Desc().Generation != req.DescriptorGeneration {

nit: pull desc := r.Desc() out so that there's only a single instance.


pkg/kv/kvserver/replica_command.go line 2875 at r5 (raw file):

	// Check that the snapshot we generated has a descriptor that includes the
	// recipient. If it doesn't, the recipient will reject it, so it's better to

I'm surprised that we need this check if we're already checking that the descriptor generation is the same on the delegate and the leaseholder. Is that the case in the examples you discuss in this comment?


pkg/kv/kvserver/replica_command.go line 2895 at r5 (raw file):

	// Check the raft applied state index and term to determine if this replica
	// is not too far behind the leaseholder that it needs a snapshot.

Play this comment out one more step. If the delegate is too far behind that is also needs a snapshot, then any snapshot it sends will be useless.


pkg/kv/kvserver/replica_command.go line 2909 at r5 (raw file):

	// Delegate has a different term than the coordinator. This typically means
	// the lease has been transferred, and we should not process this request.

Can you explain why we shouldn't process the request in this case?


pkg/kv/kvserver/replica_command.go line 2925 at r5 (raw file):

	}

	// Sender replica's will be rejected if the sender replica's raft applied index is lower than

"Sender replica's snapshot"?


pkg/kv/kvserver/replica_command.go line 2927 at r5 (raw file):

	// Sender replica's will be rejected if the sender replica's raft applied index is lower than
	// or equal to the truncated state on the leaseholder, as this replica's snapshot will be wasted.
	// Note that its possible that we can enforce strictly lesser than if etcd does not require

s/its/it is/


pkg/kv/kvserver/replica_command.go line 2950 at r5 (raw file):

		settings.SystemOnly,
		"kv.snapshot_delegation.num_follower",
		"the number of delegates to try when sending snapshots",

"when sending snapshots, before falling back to sending from the leaseholder"


pkg/kv/kvserver/replica_command.go line 3011 at r5 (raw file):

	// obtain the send semaphore. We check after to make sure the snapshot request
	// still makes sense (e.g the range hasn't split under us). The check is
	// lightweight, so better to be safe than sorry.

This makes it sound like the check before the reservation is best-effort and the check after is properly synchronized and makes a stronger guarantee. But that's not correct, is it? Consider being explicit about this in these two comments, like you are in the comment on validateSnapshotDelegationRequest ("however it is impossible to completely eliminate them").


pkg/kv/kvserver/replica_command.go line 3034 at r5 (raw file):

	// TODO: Review this to make sure that there isn't any inconsistency since we
	// may be on a delegate. The GetSnapshot call will add a LogTruncation
	// constraint, but it isn't clear that this will help.

This is a good call. We want the log truncation constraint to be created on the leaseholder, not on the delegate.


pkg/kv/kvserver/store_snapshot.go line 696 at r5 (raw file):

	}
	// If the queue is too long on a delegate, don't attempt to wait before sending.
	if req.QueueOnDelegateLen >= 0 {

Do we want to push this queue depth limit into the MultiQueue itself to avoid raciness? Currently, this limit can be exceeded by multiple concurrent calls to reserveSendSnapshot.


pkg/kv/kvserver/testing_knobs.go line 427 at r5 (raw file):

	// send snapshot semaphore.
	AfterSendSnapshotThrottle func()
	// SelectDelegateSnapshotSender returns a replica to send delegated snapshots.

"returns a replica"

It returns a slice, so this contract isn't quite right. Want to explain what it means for multiple replicas to be returned?


pkg/kv/kvserver/allocator/storepool/store_pool.go line 1208 at r5 (raw file):

func (sp *StorePool) GetLocalitiesPerReplica(
	replicas []roachpb.ReplicaDescriptor,
) map[roachpb.ReplicaDescriptor]roachpb.Locality {

Would it be more natural for this to map from ReplicaID to Locality? The ReplicaDescriptor struct contains a number of fields, some of which (e.g. Type) are mutable over the lifetime of a replica.


pkg/kv/kvserver/kvserverpb/raft.proto line 274 at r5 (raw file):

  // The priority of the snapshot.
  // TODO(abaptist): Remove this field for v23.1.

Can we remove them now?


pkg/kv/kvserver/kvserverpb/raft.proto line 286 at r5 (raw file):

  // The truncated state of the Raft log on the coordinator replica.
  uint64 truncated_index = 8;

If we're changing this, we should evaluate whether we want to send the "truncated index" or the "first index" (truncated index + 1). I believe it is more common around the Raft layer to refer to the log's first index, but I could be wrong about that. Doing so would allow you to use the existing Replica.GetFirstIndex method, instead of introducing the new Replica.GetRaftTruncatedState method.
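The relation between the two fields is that the log's first index is the truncated index plus one, so the validity check on the delegate reduces to a single comparison. A minimal sketch, with hypothetical names (`canDelegateSend`), of the check discussed above:

```go
package main

import "fmt"

// canDelegateSend sketches the validity check: a delegate's snapshot is
// only useful if its raft applied index is at or past the log's first
// index on the coordinator, i.e. strictly above the truncated index
// (firstIndex = truncatedIndex + 1). Otherwise the recipient would
// still be missing entries between the snapshot and the leader's log.
func canDelegateSend(delegateApplied, coordinatorFirstIndex uint64) bool {
	return delegateApplied >= coordinatorFirstIndex
}

func main() {
	// Applied index equal to the first index is still usable; one below
	// the first index (i.e. equal to the truncated index) is not.
	fmt.Println(canDelegateSend(10, 10), canDelegateSend(9, 10)) // prints: true false
}
```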

@andrewbaptist andrewbaptist requested a review from a team as a code owner December 29, 2022 17:44
@andrewbaptist andrewbaptist requested review from a team, herkolategan and smg260 and removed request for a team December 29, 2022 17:44
Author

@andrewbaptist andrewbaptist left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andrewbaptist, @herkolategan, @nvanbenschoten, and @smg260)


pkg/cmd/roachtest/tests/decommissionbench.go line 90 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: For readers that are new to this code, an "instead of a random node" at the end would be helpful.

Done


pkg/kv/kvserver/replica_command.go line 2536 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we return this error instead of requiring callers to handle an empty list?

Done


pkg/kv/kvserver/replica_command.go line 2579 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider making the function variadic.

Done


pkg/kv/kvserver/replica_command.go line 2583 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: presize the capacity to len(localities)

Done


pkg/kv/kvserver/replica_command.go line 2611 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Shuffle the replicas to prevent always choosing them in the same order.

Won't the rng with the same seed lead to the replicas always being shuffled in the same order? Is the randomness coming from the map iteration above?

For any given replica, they will shuffle in the same way, but if there are multiple different replicas that are being transmitted they will shuffle differently. This is typically only called once per replica, because if it fails then it falls back to the leaseholder.
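The per-replica behavior described above can be sketched as follows. This is a toy illustration (not the actual CockroachDB code): score candidates by locality distance to the recipient, keep only the ties at the best (lowest) distance, and shuffle them with a per-range seed, so a single range shuffles deterministically while different ranges pick different delegates. The names (`selectDelegates`, `distance`) are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// selectDelegates keeps the candidates whose distance to the recipient
// ties for the minimum, then shuffles them seeded by the range ID and
// truncates to at most max delegates.
func selectDelegates(distance map[string]float64, rangeID int64, max int) []string {
	best, first := 0.0, true
	for _, d := range distance {
		if first || d < best {
			best, first = d, false
		}
	}
	var tied []string
	for id, d := range distance {
		if d == best {
			tied = append(tied, id)
		}
	}
	sort.Strings(tied) // fixed order before the seeded shuffle
	r := rand.New(rand.NewSource(rangeID))
	r.Shuffle(len(tied), func(i, j int) { tied[i], tied[j] = tied[j], tied[i] })
	if len(tied) > max {
		tied = tied[:max]
	}
	return tied
}

func main() {
	// s1 and s2 tie for the closest locality; s3 is farther and is
	// never chosen. Which of s1/s2 wins depends on the range ID.
	d := map[string]float64{"s1": 0.25, "s2": 0.25, "s3": 1.0}
	fmt.Println(selectDelegates(d, 42, 1))
}
```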


pkg/kv/kvserver/replica_command.go line 2817 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: why is this pulled outside of the struct initialization?

Done


pkg/kv/kvserver/replica_command.go line 2845 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we're looping and retrying on a different sender blindly without checking for a specific type of error, then it's probably best to at least log the errors that we're seeing here when looping. I'm sure we'll unintentionally delegate to multiple senders on an error that we weren't expecting at least once while stabilizing this work. Better to at least have some evidence of what the error is when that happens.

Done, good point!


pkg/kv/kvserver/replica_command.go line 2857 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"impossible to completely eliminate them" is too strong of wording, and it also doesn't help the reader understand the interactions here. Consider expanding on this point as a way to illustrate why we don't make stronger guarantees. What would we have to synchronize with to guarantee that we will never hit a false positive? Freeze all range splits, merge, rebalances, leadership changes, and log truncations?

Done


pkg/kv/kvserver/replica_command.go line 2863 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: pull desc := r.Desc() out so that there's only a single instance.

Done


pkg/kv/kvserver/replica_command.go line 2875 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm surprised that we need this check if we're already checking that the descriptor generation is the same on the delegate and the leaseholder. Is that the case in the examples you discuss in this comment?

You are right about this being redundant. I don't see how this could happen either if the descriptor generation has changed. I could change this to an assertion, or remove the check.


pkg/kv/kvserver/replica_command.go line 2895 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Play this comment out one more step. If the delegate is too far behind that is also needs a snapshot, then any snapshot it sends will be useless.

Done, added a comment.


pkg/kv/kvserver/replica_command.go line 2925 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"Sender replica's snapshot"?

Done


pkg/kv/kvserver/replica_command.go line 2927 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/its/it is/

Done


pkg/kv/kvserver/replica_command.go line 2950 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"when sending snapshots, before falling back to sending from the leaseholder"

Done


pkg/kv/kvserver/replica_command.go line 3011 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This makes it sound like the check before the reservation is best-effort and the check after is properly synchronized and makes a stronger guarantee. But that's not correct, is it? Consider being explicit about this in these two comments, like you are in the comment on validateSnapshotDelegationRequest ("however it is impossible to completely eliminate them").

Done, good point; I cleaned up the comment on this.


pkg/kv/kvserver/replica_command.go line 3034 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a good call. We want the log truncation constraint to be created on the leaseholder, not on the delegate.

I'm not positive about the best way to handle this. If this is called on the leaseholder, then it will correctly have the constraint in place, but if it is called on a delegate then there isn't really a way to guarantee this. I could change to make the constraint occur much earlier (on the leaseholder before it sends the delegation request) but that risks holding the constraint for a longer window. I don't think it would be hard to make that change, so maybe that is the way to go here.


pkg/kv/kvserver/store_snapshot.go line 696 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we want to push this queue depth limit into the MultiQueue itself to avoid raciness? Currently, this limit can be exceeded by multiple concurrent calls to reserveSendSnapshot.

Done, probably wouldn't have been an issue, but since this is the only use of the MultiQueue, simpler to do this.
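The race being closed here is that a check-then-enqueue done outside the queue's lock lets concurrent callers all pass the depth check before any of them enqueues. A minimal sketch (not the actual MultiQueue implementation; names like `boundedQueue` are hypothetical) of enforcing the limit inside the queue:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// boundedQueue enforces its maximum depth under its own lock, so
// concurrent callers cannot race past the limit the way an external
// check-then-enqueue could.
type boundedQueue struct {
	mu    sync.Mutex
	depth int
	limit int
}

var errQueueFull = errors.New("delegate queue full")

func (q *boundedQueue) tryEnqueue() error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.depth >= q.limit {
		return errQueueFull // reject immediately rather than wait
	}
	q.depth++
	return nil
}

func (q *boundedQueue) dequeue() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.depth--
}

func main() {
	q := &boundedQueue{limit: 1}
	fmt.Println(q.tryEnqueue(), q.tryEnqueue()) // prints: <nil> delegate queue full
	q.dequeue()
	fmt.Println(q.tryEnqueue()) // prints: <nil>
}
```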


pkg/kv/kvserver/testing_knobs.go line 427 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"returns a replica"

It returns a slice, so this contract isn't quite right. Want to explain what it means for multiple replicas to be returned?

Done


pkg/kv/kvserver/allocator/storepool/store_pool.go line 1208 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would it be more natural for this to map from ReplicaID to Locality? The ReplicaDescriptor struct contains a number of fields, some of which (e.g. Type) are mutable over the lifetime of a replica.

Done, it requires an extra loop to convert back to descriptors, but probably the right thing to do as it makes the equality safer.


pkg/kv/kvserver/kvserverpb/raft.proto line 274 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we remove them now?

Unfortunately, this is a bigger change than I had planned for this patch. I was going to do this as a follow-on PR. I chatted with @AlexTalks a little about this and there are UI changes related to it. I think this can still get done in 23.1, but I didn't want to tie it to this. If we later remove these fields, it will not matter as the only thing they are still used for is reporting stats right now.


pkg/kv/kvserver/kvserverpb/raft.proto line 286 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we're changing this, we should evaluate whether we want to send the "truncated index" or the "first index" (truncated index + 1). I believe it is more common around the Raft layer to refer to the log's first index, but I could be wrong about that. Doing so would allow you to use the existing Replica.GetFirstIndex method, instead of introducing the new Replica.GetRaftTruncatedState method.

Done, this makes more sense now.

Contributor

@nvb nvb left a comment


Reviewed 20 of 21 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andrewbaptist, @herkolategan, and @smg260)


pkg/kv/kvserver/replica_command.go line 2861 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why does a descriptor generation change necessarily cause the snapshot to be useless?

I'm still curious about this. Are there descriptor generation changes that don't necessarily mean this, but we're ok with false positives?


pkg/kv/kvserver/replica_command.go line 3034 at r5 (raw file):

but that risks holding the constraint for a longer window

The leaseholder also doesn't know which index to bound log truncation up to. It may have applied further than the delegated sender, and it doesn't want to truncate the log any higher than the index that the delegated snapshot is being sent.

I think this is why the RPC protocol was streaming before. We had envisioned a handshake where the leaseholder would ask the delegate for its applied index, install a constraint at that index (rechecking that it hadn't truncated above this already), then allow the delegated snapshot to proceed.


pkg/kv/kvserver/replica_command.go line 2591 at r6 (raw file):

	// regardless of score.
	var tiedReplicas []roachpb.ReplicaID
	for desc, score := range replicaDistance {

s/desc/replID/


pkg/kv/kvserver/replica_command.go line 2611 at r6 (raw file):

	// Only keep the top numFollowers replicas.
	if len(tiedReplicas) > numFollowers {
		tiedReplicas = tiedReplicas[0:numFollowers]

small nit: the 0 is unnecessary.


pkg/kv/kvserver/replica_command.go line 2619 at r6 (raw file):

	for n, replicaId := range tiedReplicas {
		found := false
		for _, desc := range rangeDesc.Replicas().Descriptors() {

Can we use ReplicaSet.GetReplicaDescriptorByID here?

Author

@andrewbaptist andrewbaptist left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @herkolategan, @nvanbenschoten, and @smg260)


pkg/kv/kvserver/replica_command.go line 2861 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm still curious about this. Are there descriptor generation changes that don't necessarily mean this, but we're ok with false positives?

Just to record what we discussed: I added a comment about it. This check is probably a little overly strict, which means some delegated snapshots may be rejected unnecessarily. This is not really a problem and should be pretty rare. We could likely remove this check in the future.


pkg/kv/kvserver/replica_command.go line 3034 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

but that risks holding the constraint for a longer window

The leaseholder also doesn't know which index to bound log truncation up to. It may have applied further than the delegated sender, and it doesn't want to truncate the log any higher than the index that the delegated snapshot is being sent.

I think this is why the RPC protocol was streaming before. We had envisioned a handshake where the leaseholder would ask the delegate for its applied index, install a constraint at that index (rechecking that it hadn't truncated above this already), then allow the delegated snapshot to proceed.

After discussing options on this, I added the log truncation constraint up front on the coordinator, so there is a risk of holding it longer than strictly required. There is a path to change this if necessary in the future, but it is unlikely to be a concern for most systems. https://cockroachlabs.atlassian.net/wiki/spaces/~6268113f52310b0068ffd245/pages/2869854482/Delegate+snapshot+overview
There are a few ways to solve this, but I don't think the streaming RPC is correct, since the data required is disjoint for the different parts of the flow. There is a risk with the latest change that a delegated request will fail because it won't have the latest applied index from the leaseholder. If that occurs often, it would be worth changing to retry on the delegate, since it will "eventually" catch up in most cases.


pkg/kv/kvserver/replica_command.go line 2591 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/desc/replID/

Done


pkg/kv/kvserver/replica_command.go line 2611 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

small nit: the 0 is unnecessary.

Done


pkg/kv/kvserver/replica_command.go line 2619 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we use ReplicaSet.GetReplicaDescriptorByID here?

Done

Contributor

@nvb nvb left a comment


Reviewed 18 of 20 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andrewbaptist, @herkolategan, and @smg260)


pkg/kv/kvserver/replica_command.go line 2821 at r7 (raw file):

		Term:                 status.Term,
		DelegatedSender:      sender,
		FirstIndex:           appliedIndex,

Could you add a comment here about why you are setting FirstIndex to the value of appliedIndex on the leaseholder? It's subtle and could benefit from a discussion about the consequences and plans for future improvement.


pkg/kv/kvserver/replica_command.go line 2880 at r7 (raw file):

	// If the generation has changed, this snapshot may be useless, so don't
	// attempt to send it.
	//NB: This is an overly strict check. If other delegates are added to this

nit: missing a space after // on each line.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andrewbaptist, @herkolategan, @nvanbenschoten, and @smg260)


-- commits line 29 at r8:
nit: consider adding more words to the release note for this awesome work so that the average user can understand it

Author

@andrewbaptist andrewbaptist left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @herkolategan, @nvanbenschoten, and @smg260)


pkg/kv/kvserver/replica_command.go line 2821 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a comment here about why you are setting FirstIndex to the value of appliedIndex on the leaseholder? It's subtle and could benefit from a discussion about the consequences and plans for future improvement.

Done


pkg/kv/kvserver/replica_command.go line 2880 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: missing a space after // on each line.

Done

Contributor

@nvb nvb left a comment


Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andrewbaptist, @herkolategan, and @smg260)


pkg/kv/kvserver/replica_command.go line 2821 at r7 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Done

Did you miss a git push?

Author

@andrewbaptist andrewbaptist left a comment


Reviewed 2 of 4 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andreimatei, @herkolategan, @nvanbenschoten, and @smg260)


-- commits line 29 at r8:

Previously, andreimatei (Andrei Matei) wrote…

nit: consider adding more words to the release note for this awesome work so that the average user can understand it

Thanks, I added some more to the release notes. I will likely publish a short blog on this as well when 23.1 comes out!


pkg/kv/kvserver/replica_command.go line 2821 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you miss a git push?

I'm not sure what happened - I rewrote and pushed again.

Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 1 of 1 files at r10, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks, @andreimatei, @andrewbaptist, @herkolategan, and @smg260)

Fixes: cockroachdb#42491

This commit allows a snapshot to be sent by a follower instead of the
leader of a range. The follower(s) are chosen based on locality to the
final recipient of the snapshot.  If the follower is not able to
quickly send the snapshot, the attempt is aborted and the leader sends
the snapshot instead.

By choosing a delegate rather than sending the snapshot directly, WAN
traffic can be minimized. Additionally, the snapshot will likely be
delivered faster.

There are two settings that control this feature. The first,
`kv.snapshot_delegation.num_follower`, controls how many followers
the leader will attempt to delegate the snapshot through. If set to 0,
snapshot delegation is disabled. The second,
`kv.snapshot_delegation_queue.enabled`, controls whether delegated
snapshots will queue on the delegate or fail immediately. This is
useful to prevent a delegation request from waiting a long time
before it is sent.

Before the snapshot is sent from the follower, checks are done to
verify that the delegate is able to send a snapshot that will be valid
for the recipient. If not, the request is rerouted to the leader.

Release note (performance improvement): Adds delegated snapshots which
can reduce WAN traffic for snapshot movement. If there is another
replica for this range with a closer locality than the delegate, the
leaseholder will attempt to have that delegate send the snapshot. This
is particularly useful in the case of a decommission of a node where
most snapshots are transferred to another replica in the same locality.
@andrewbaptist
Author

bors r=nvanbenschoten

@craig
Contributor

craig bot commented Feb 3, 2023

Build succeeded:

@shralex
Contributor

shralex commented Feb 3, 2023

congrats!!



Development

Successfully merging this pull request may close these issues.

Send Raft snapshots between follower replicas
