Skip to content

kvserver: allow non-voting replicas to be created via ChangeReplicas#53715

Merged
craig[bot] merged 4 commits intocockroachdb:masterfrom
aayushshah15:introduce_persistent_learners
Nov 6, 2020
Merged

kvserver: allow non-voting replicas to be created via ChangeReplicas#53715
craig[bot] merged 4 commits intocockroachdb:masterfrom
aayushshah15:introduce_persistent_learners

Conversation

@aayushshah15
Copy link
Copy Markdown
Contributor

@aayushshah15 aayushshah15 commented Aug 31, 2020

This PR adds the pieces necessary to allow creation (and removal) of
non-voting replicas via AdminChangeReplicas requests.

Please see individual commits.

Informs #51943

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@aayushshah15
Copy link
Copy Markdown
Contributor Author

aayushshah15 commented Aug 31, 2020

I've renamed a bunch of things and thats mainly what makes this PR so big. I
hope that the fact that the renames are done in separate commits will make it
less of a burden to review.

Note that, without being able to relocate persistent learners, range merges
will not work as-is, since they require that the replica sets of the concerned
ranges are equal. I'm working on adding a commit to this PR that teaches
AdminRelocateRange relocate learners (which will enable range merges),
along with a lot more testing.

The meat of the change is in the last commit.

@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch 2 times, most recently from 42e9085 to 2988482 Compare August 31, 2020 22:39
@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch from 2988482 to 5cca07a Compare August 31, 2020 22:51
@aayushshah15 aayushshah15 requested a review from nvb September 10, 2020 17:32
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that stands out to me is the potential for confusion between LEARNER_EPHEMERAL and LEARNER_PERSISTENT. This is kind of a bikeshed, but we should come to a firm decision on what we want to call these "long-lived learners" before introducing them. In the past, we've also referred to the concept as "non-voting replicas" and "read-only replicas". These have precedent in other databases, like in MongoDB and in Spanner. Meanwhile, "learner replicas" don't have much precedent, as far as I'm aware. I also actually think either name is better than "learner" as a user-facing component, as they're both more descriptive. "Learner" doesn't really tell me anything. "Non-voting" indicates that the replica will not impact write latencies. "Read-only" indicates that the replica can never be a leaseholder or accept writes.

An alternative to what we have here would be to keep the LEARNER ReplicaType synonymous with "ephemeral learner" and then introduce a NON_VOTER ReplicaType that takes the place of LEARNER_PERSISTENT. Both of these replica types would map to a "learner" at the Raft level, but that would be hidden from users and mostly just an implementation detail of how we interact with etcd/raft, similar to how all other ReplicaType's map to a "voter" and the Raft level. So the concept of a "learner replica" of any sort would be hidden from users.

We would then introduce "non-voting replicas" to users as new functionality in v21.1, alongside a few opinionated topology patterns that use non-voting replicas.

@bdarnell might have opinions about this.

Reviewed 34 of 34 files at r1, 4 of 4 files at r2, 30 of 30 files at r3, 10 of 10 files at r4, 20 of 20 files at r5, 14 of 14 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/raft_snapshot_queue.go, line 125 at r6 (raw file):

		//
		// Persistent learners, unlike ephemeral learners, rely on the raft snapshot
		// queue to upreplicate.

Explain why this is in this comment or point to an explanation is one exists elsewhere.


pkg/kv/kvserver/raft_snapshot_queue.go, line 126 at r6 (raw file):

SnapshotRequest_LEARNER

Are these snapshot types only used to determine which metric to increment? If so, we might want to introduce a new form of snapshot, or more carefully distinguish between snapshots used to initially catch up LEARNER_PERSISTENT replicas (more similar to SnapshotRequest_LEARNER) and those used to catch them up if they ever fall behind (more similar to SnapshotRequest_RAFT).


pkg/kv/kvserver/raft_snapshot_queue.go, line 127 at r6 (raw file):

		// queue to upreplicate.
		snapType = SnapshotRequest_LEARNER
		if index := repl.getAndGCSnapshotLogTruncationConstraints(timeutil.Now(),

nit: the formatting of this is strange. Maybe:

if index := repl.getAndGCSnapshotLogTruncationConstraints(
    timeutil.Now(), repDesc.StoreID,
); isEphemeralLearner && index > 0 {

pkg/kv/kvserver/replica_application_state_machine.go, line 550 at r3 (raw file):

	// transitioned to VOTER_OUTGOING.

	// In 19.1 and before we used DeprecatedUpdatedReplicas instead of providing

Can we remove this? Isn't this below Raft? We don't currently have a way of ensuring that we aren't applying entries proposed to Raft in earlier versions and we need to keep Raft application deterministic.


pkg/kv/kvserver/replica_command.go, line 899 at r1 (raw file):

//    If no replicas are being added, this first step is elided.
//
//    If persistent replicas are being added, then this step is all we need. The

"persistent learner replicas"?


pkg/kv/kvserver/replica_command.go, line 1225 at r1 (raw file):

}

// addEphemeralLearner adds learners to the given replication targets.

This should stay plural, right?


pkg/kv/kvserver/replica_command.go, line 1574 at r4 (raw file):

	var crt *roachpb.ChangeReplicasTrigger
	crt = &roachpb.ChangeReplicasTrigger{

crt := ...


pkg/kv/kvserver/replica_command.go, line 1163 at r6 (raw file):

	}
	// First make sure that the changes don't self-overlap (i.e. we're not adding
	// a replica twice, or removing and immediately re-adding it).

Should this say "or adding and immediately removing it"? We specifically allow the inverse, don't we?


pkg/kv/kvserver/replica_test.go, line 6457 at r6 (raw file):

	t.Run(roachpb.ADD_VOTER.String(), func(t *testing.T) {
		test(t, roachpb.ADD_VOTER)

nit: stray line

Also, this test may be better written as a table-driven test.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1102 at r4 (raw file):

	pd.Local.GossipFirstRange = rec.IsFirstRange()

	var desc roachpb.RangeDescriptor

Can we just delete these three lines?


pkg/kv/kvserver/kvserverpb/log.proto, line 19 at r3 (raw file):

import "google/protobuf/timestamp.proto";

// TODO DURING REVIEW: Do we want to track persistent learners separately here?

I think we should.


pkg/roachpb/data.go, line 1498 at r4 (raw file):

// Replicas returns all of the replicas present in the descriptor after this
// trigger applies.
func (crt ChangeReplicasTrigger) Replicas() []ReplicaDescriptor {

Are any of these methods called below Raft? If so, we need to maintain backwards compatibility.


pkg/roachpb/data.go, line 1631 at r6 (raw file):

			// We're adding a persistent learner. Like the case above, we're guaranteed that
			// this learner is not a voter. Promotions of persistent learners to voters and
			// demotions vice-versa are not currently supported.

What will it take to support this? What happens after this PR if we try to make such a change?


pkg/roachpb/metadata.proto, line 95 at r2 (raw file):

  // take into account) votes of (peers they consider) LEARNER_PERSISTENTs for
  // leadership nor do their acknowledged log entries get taken into account for
  // determining the committed index.

I think it would be worth mentioning the quota pool and at least hinting at how persistent learners impact proposal backpressure.


pkg/roachpb/metadata_replicas.go, line 155 at r6 (raw file):

// ones that already have a snapshot) are expected to very soon be voters, so we
// treat them as such. However, it means a slow learner can slow down regular
// traffic, which is possibly counterintuitive.

But it's also what we want in most cases, right? We'd rather have a slow learner cause a range to be backpressured slightly instead of having a slow learner repeatedly request snapshots.


pkg/roachpb/metadata_replicas.go, line 156 at r6 (raw file):

// treat them as such. However, it means a slow learner can slow down regular
// traffic, which is possibly counterintuitive.

nit: break in the comment.


pkg/testutils/testcluster/testcluster.go, line 509 at r6 (raw file):

}

func (tc *TestCluster) AddPersistentLearners(

nit: needs a comment. Same below.

@bdarnell
Copy link
Copy Markdown
Contributor

An alternative to what we have here would be to keep the LEARNER ReplicaType synonymous with "ephemeral learner" and then introduce a NON_VOTER ReplicaType that takes the place of LEARNER_PERSISTENT.

I like this naming. I prefer to call the kind of replica introduced here "non-voting" instead of "read-only" because I could imagine more kinds of read-only or read-optimized replicas in the future (for example the consistent read replicas of #39758, or maybe using replica types to represent leader/lease preferences).

@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch from 5cca07a to 6d5f056 Compare October 15, 2020 02:33
@aayushshah15 aayushshah15 changed the title kvserver: allow persistent learners to be created via ChangeReplicas kvserver: allow non-voting replicas to be created via ChangeReplicas Oct 15, 2020
@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch 3 times, most recently from 932a8e0 to 494ac6c Compare October 19, 2020 22:40
@aayushshah15 aayushshah15 requested a review from a team as a code owner October 19, 2020 22:40
@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch 2 times, most recently from bf5e11f to 913c404 Compare October 20, 2020 14:16
Copy link
Copy Markdown
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/raft_snapshot_queue.go, line 125 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Explain why this is in this comment or point to an explanation is one exists elsewhere.

Done, let me know if you think the comment it points to is lacking.


pkg/kv/kvserver/raft_snapshot_queue.go, line 126 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
SnapshotRequest_LEARNER

Are these snapshot types only used to determine which metric to increment? If so, we might want to introduce a new form of snapshot, or more carefully distinguish between snapshots used to initially catch up LEARNER_PERSISTENT replicas (more similar to SnapshotRequest_LEARNER) and those used to catch them up if they ever fall behind (more similar to SnapshotRequest_RAFT).

I added one separate type of snapshot for the non-voter replicas. Since we're relying on the snapshot queue to send it the snapshots, its not obvious to me how to disambiguate between the first snapshot sent for upreplication and the rest. I could pass a new constant in addRaftLearners when I queue it up into the snapshot queue, but that seems kludgy. Is there a way for me to detect, on the sender side, if the replica receiving the snapshot has any data?


pkg/kv/kvserver/raft_snapshot_queue.go, line 127 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: the formatting of this is strange. Maybe:

if index := repl.getAndGCSnapshotLogTruncationConstraints(
    timeutil.Now(), repDesc.StoreID,
); isEphemeralLearner && index > 0 {

Fixed.


pkg/kv/kvserver/replica_application_state_machine.go, line 550 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we remove this? Isn't this below Raft? We don't currently have a way of ensuring that we aren't applying entries proposed to Raft in earlier versions and we need to keep Raft application deterministic.

Reverted


pkg/kv/kvserver/replica_command.go, line 899 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"persistent learner replicas"?

Fixed


pkg/kv/kvserver/replica_command.go, line 1225 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This should stay plural, right?

Fixed


pkg/kv/kvserver/replica_command.go, line 1574 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

crt := ...

N/A anymore since I reverted removal of the deprecated fields


pkg/kv/kvserver/replica_command.go, line 1163 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this say "or adding and immediately removing it"? We specifically allow the inverse, don't we?

We only seem to allow the "adding and removing" case and not the inverse. I starred at the code for a bit but I don't understand why.. Is this perhaps a vestige of the pre-joint-consensus times? As long as we're going through joint consensus and the StoreIDs involved in the transition are distinct, shouldn't it be okay either way?


pkg/kv/kvserver/replica_test.go, line 6457 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: stray line

Also, this test may be better written as a table-driven test.

Done.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1102 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we just delete these three lines?

We should be able to remove all the migration logic from 19.1 to 19.2, except for stuff that's downstream of raft proposals, right? I've removed the logic that handled 19.1 proposals from here.


pkg/kv/kvserver/kvserverpb/log.proto, line 19 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we should.

Done.


pkg/roachpb/data.go, line 1498 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Are any of these methods called below Raft? If so, we need to maintain backwards compatibility.

Reverted the commit.


pkg/roachpb/metadata.proto, line 95 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it would be worth mentioning the quota pool and at least hinting at how persistent learners impact proposal backpressure.

Done


pkg/roachpb/metadata_replicas.go, line 155 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

But it's also what we want in most cases, right? We'd rather have a slow learner cause a range to be backpressured slightly instead of having a slow learner repeatedly request snapshots.

Didn't mean to take the blame for this mega-comment. But we do want to keep this disclaimer around, right?


pkg/roachpb/metadata_replicas.go, line 156 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: break in the comment.

N/A


pkg/testutils/testcluster/testcluster.go, line 509 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: needs a comment. Same below.

Done.

@aayushshah15
Copy link
Copy Markdown
Contributor Author

aayushshah15 commented Oct 20, 2020

Is there a way for me to detect, on the sender side, if the replica receiving the snapshot has any data?

Relatedly, I'm not sure (even though the mega-comment above Learners() in metadata_replicas.go claims it) if the quota-pool ignores new replicas (that haven't received a LEARNER snapshot yet) when deciding to throttle incoming proposals. I stared at the code for a while but couldn't see anything ensuring this. Briefly spoke to Tobi about it and he wasn't sure either.

This seems especially problematic for these non-voting replicas since they rely on the snapshot queue for upreplication but it doesn't seem great for the current case of short-lived learners either. @nvanbenschoten is this something we want to handle in this PR, or can I leave a TODO and deal with it in a new PR?

EDIT: Sorry for the noise, this case is handled by the proposalQuotaBaseIndex, I'll add some comments making this more clear.

@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch 2 times, most recently from cb1d07a to 55eca68 Compare October 20, 2020 18:53
Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Good stuff.
I don't know anything about anything, but I think that the snapshot type enum is probably not the greatest way to do whatever it is that it tries to do - so I'd rather get rid of it instead of adding to it :P

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/raft.proto, line 141 at r10 (raw file):

  enum Type {
    VOTER_RAFT = 0;

please put comments on these. In particular mention that LEARNER_UPREPLICATE implies something about some queue delivering the snapshot?

I'm also hoping that we don't need a new type - particularly since I don't really know what these types are for in the first place.


pkg/kv/kvserver/raft_snapshot_queue.go, line 126 at r6 (raw file):

Is there a way for me to detect, on the sender side, if the replica receiving the snapshot has any data?

Doesn't repl.RaftStatus().Progress have what you need? I think Match == 0?

I find this snapshot types pretty confusing so I think it'd be great if we can get rid of them, or at least not add one more. In fact, coming back after looking at more code, I don't understand anything about them. I assumed they're used for sender-side metrics, but they seem to be used for metrics on the receiver. If so, is that sane? Doesn't the receiver know enough about what its state is to not require any collaboration from the sender? Isn't it trivial for the receiver to do what Nathan is asking for above - disambiguate between an initial snapshot and a non-initial one?
As far as receiver-side metrics go, I think we should have a counter of non-initial snapshots applied. I don't think I care about any metrics besides that.


pkg/kv/kvserver/raft_snapshot_queue.go, line 113 at r10 (raw file):

	snapType := SnapshotRequest_VOTER_RAFT

	// A learner replica is either getting a snapshot of type

is this comment wrapped?


pkg/kv/kvserver/raft_snapshot_queue.go, line 114 at r10 (raw file):

	// A learner replica is either getting a snapshot of type
	// SnapshotRequest_LEARNER_UPREPLICATE by the node that's adding it or it's been orphaned

I would s/SnapshotRequest_LEARNER_UPREPLICATE/LEARNER_UPREPLICATE


pkg/kv/kvserver/raft_snapshot_queue.go, line 116 at r10 (raw file):

	// SnapshotRequest_LEARNER_UPREPLICATE by the node that's adding it or it's been orphaned
	// and it's about to be cleaned up by the replicate queue. Either way, no
	// point in also sending it a snapshot of type RAFT.

s/RAFT/VOTER_RAFT ?


pkg/kv/kvserver/raft_snapshot_queue.go, line 120 at r10 (raw file):

	isLearnerOrNonVoter := isLearner || repDesc.GetType() == roachpb.NON_VOTER
	if isLearnerOrNonVoter {
		if fn := repl.store.cfg.TestingKnobs.ReplicaSkipLearnerSnapshot; fn != nil && fn() {

Do we want this knob to apply to non-voters? It's only used by one test, so I don't think so.


pkg/kv/kvserver/raft_snapshot_queue.go, line 123 at r10 (raw file):

			return nil
		}
		// An learner replica is either getting a snapshot of type

Is this repeating the comment from above? If so, I think this is indeed the better place for it.


pkg/kv/kvserver/replica_command.go, line 1019 at r10 (raw file):

			iChgs := []internalReplicationChange{{target: rem, typ: internalChangeTypeRemoveNonVoter}}
			var err error
			desc, err = execChangeReplicasTxn(

nit: I think this call fits on one line instead of 3


pkg/kv/kvserver/replica_command.go, line 1261 at r10 (raw file):

}

// addRaftLearners adds etcd/raft learners to the given replication targets.

why did you change this name? The previous one seemed better to me


pkg/kv/kvserver/replica_command.go, line 1289 at r10 (raw file):

			// If we added a non-voter, queue it up into the raft snapshot queue so
			// that it receives its first LEARNER_UPREPLICATE snapshot relatively soon.
			r.store.raftSnapshotQueue.AddAsync(ctx, r, raftSnapshotPriority)

this comment makes it sound like the type LEARNER_UPREPLICATE is a big deal, but in reality it's just some crap we plumb through for some counters, right? If so, I wouldn't refer to it too much.


pkg/kv/kvserver/replica_command.go, line 1577 at r10 (raw file):

					return nil, errors.Errorf("expected target to be a NON_VOTER; got %s", prevTyp)
				}
				rDesc, _ = updatedDesc.RemoveReplica(chg.target.NodeID, chg.target.StoreID)

how come we don't care about the ok ret val? Should we be asserting it to be true? (same in the existing code above)


pkg/kv/kvserver/replica_command.go, line 1734 at r10 (raw file):

		// Log replica change into range event log.
		for _, tup := range []struct {

nit: this structure seems weird to me. As long as you're taking the blame, consider extracting an anonymous function and calling it twice (instead of this loop).


pkg/kv/kvserver/replica_gc_queue.go, line 65 at r8 (raw file):

)

// TODO(aayush): Do we want to track gc queue removals separately for voters and long-lived learners?

If you ask me, I don't think so.
If it wasn't a real question, a question TODO seems weird.


pkg/kv/kvserver/replica_raft.go, line 480 at r10 (raw file):

		// We must update the proposal quota even if we don't have a ready. Consider the
		// case when our quota is of size 1 and two out of three replicas have committed
		// one log entry while the third is lagging behind. When the third replica

spurious diff?


pkg/roachpb/data.go, line 1633 at r10 (raw file):

			// demotions vice-versa are not currently supported.
			//
			// TODO(aayush, andrei): If a draining leaseholder doesn't see any other voters

this seems to me like a very random place to put such a high-level comment. If you want it, consider putting it around lease transfers - e.g. in findTargetAndTransferLease.


pkg/roachpb/data.go, line 1634 at r10 (raw file):

			//
			// TODO(aayush, andrei): If a draining leaseholder doesn't see any other voters
			// in its locality, but sees a learner, rather than allowing the lease to the

s/the lease to the/the lease to be


pkg/roachpb/metadata.proto, line 86 at r7 (raw file):

  // account) votes of (peers they consider) LEARNERs for leadership nor do
  // their acknowledged log entries get taken into account for determining the
  // committed index. learners in CockroachDB are a short-term transient state:

s/learners/Learners


pkg/roachpb/metadata.proto, line 90 at r7 (raw file):

  // VOTER_DEMOTING being removed.
  LEARNER = 1;
  // NON_VOTER indicates a replica that applies committed entries, does not

applied committed entries, but does not ...


pkg/roachpb/metadata.proto, line 95 at r7 (raw file):

  // their acknowledged log entries get taken into account for determining the
  // committed index.
  // Under the hood, it is based on an etcd Learner, like the LEARNER replica

wanna spell out the differences from LEARNER more?


pkg/roachpb/metadata.proto, line 96 at r10 (raw file):

  // committed index.
  // Under the hood, it is based on an etcd Learner, like the LEARNER replica
  // type defined above. A possibly counter-intuitive behavior is that while

Nit: I'd strike A possibly counter-intuitive behavior is that... Too many words for too little. It's not even that unintuitive, which you have to say "possibly" :P

Who doesn't count towards the quota? Learners? If so, consider adding a note to them above saying that. But if that's not true, consider putting a similar comment about how they do count.


pkg/roachpb/metadata_replicas.go, line 211 at r10 (raw file):

// the learner has received a snapshot, it's considered by the quota pool that
// prevents the raft leader from getting too far ahead of the followers. This is
// because, currently, we only support ephemeral learners and they (especially

why are you saying "currently" and "only support"? Seems like we're getting to the end-state - we have learners (which are ephemeral) and non-voters (persistent)


pkg/roachpb/metadata_replicas.go, line 224 at r10 (raw file):

// different from learners in that the learners are a temporary internal
// intermediate state used to make atomic replication changes less disruptive to
// the system, whereas non-voting replicas, even though they are both internally

nit: too many twists and turns in this phrase; it doesn't read well. Consider moving "even though they are both internally etcd learners in the raft group" to a separate sentence at the end.

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: outside of the snapshot type enum. I had some questions about that as well.

Reviewed 54 of 67 files at r7, 28 of 35 files at r8, 17 of 20 files at r9, 27 of 27 files at r10.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/metrics.go, line 447 at r10 (raw file):

	}
	metaRangeSnapshotsRaftAppliedForVoter = metric.Metadata{
		Name:        "range.snapshots.raft-applied-for-voter",

Should these just be:

  • range.snapshots.applied-voter
  • range.snapshots.applied-non-voter
  • range.snapshots.applied-learner

I don't quite see the pattern in what you have here. Are you trying to indicate that some of these are "raft" snapshots and some aren't? Is that useful to indicate in the metric?


pkg/kv/kvserver/raft.proto, line 141 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please put comments on these. In particular mention that LEARNER_UPREPLICATE implies something about some queue delivering the snapshot?

I'm also hoping that we don't need a new type - particularly since I don't really know what these types are for in the first place.

Agreed. I would take a look into whether this change is needed.


pkg/kv/kvserver/replica_command.go, line 1150 at r10 (raw file):

) error {
	isAddition := func(chg roachpb.ReplicaChangeType) bool {
		if chg == roachpb.ADD_VOTER || chg == roachpb.ADD_NON_VOTER {

two nits:

First, what do you think of making these IsAddition() and IsRemoval() methods?

Second, we might want to make these checks exhaustive to avoid bugs if/when a new ReplicaChangeType is added. So for instance, use a switch-case and assert that nothing falls through.


pkg/kv/kvserver/replica_command.go, line 1261 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why did you change this name? The previous one seemed better to me

I think the idea is to differentiate this from adding a "learner" replica at the ReplicaType level.


pkg/kv/kvserver/replica_command.go, line 1286 at r10 (raw file):

			return nil, err
		}
		if typ == internalChangeTypeAddNonVoter {

Do you think this belongs here, or after the internalChangeTypeAddNonVoter caller in changeReplicasImpl?


pkg/kv/kvserver/replica_command.go, line 1288 at r10 (raw file):

		if typ == internalChangeTypeAddNonVoter {
			// If we added a non-voter, queue it up into the raft snapshot queue so
			// that it receives its first LEARNER_UPREPLICATE snapshot relatively soon.

You have a good comment in metadata_replica.go that mentions why we do this for non-voters but not learners. It's because we need to synchronously wait for upreplication for learners because upreplication is part of a larger progression of changes in Replica.atomicReplicationChange. This is not true for non-voting replica addition. Wherever this logic ends up, we should mention this explanation.


pkg/kv/kvserver/replica_command.go, line 1289 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this comment makes it sound like the type LEARNER_UPREPLICATE is a big deal, but in reality it's just some crap we plumb through for some counters, right? If so, I wouldn't refer to it too much.

Agreed.


pkg/kv/kvserver/replica_command.go, line 1486 at r10 (raw file):

	// voter with them), see:
	// https://github.com/cockroachdb/cockroach/pull/40268
	internalChangeTypeRemoveLearnerOrVoter

I'm surprised that Learner and Voter are the same type, but Non-Voter is separate. Why is that? Does it need to be?

Maybe we just keep this as internalChangeTypeRemove and use it for all removals? Or split out all three types?


pkg/kv/kvserver/replica_gc_queue.go, line 65 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If you ask me, I don't think so.
If it wasn't a real question, a question TODO seems weird.

Agreed. I don't think the difference between learners and voters in the replica GC queue is particularly interesting.


pkg/kv/kvserver/replica_test.go, line 6437 at r10 (raw file):

	defer log.Scope(t).Close(t)

	test := func(t *testing.T, changeType roachpb.ReplicaChangeType) {

nit: any reason to extract this instead of inlining it in the t.Run function?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1102 at r4 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

We should be able to remove all the migration logic from 19.1 to 19.2, except for stuff that's downstream of raft proposals, right? I've removed the logic that handled 19.1 proposals from here.

Yeah, this seems good. We probably don't even need the desc variable. No-one should be mutating ChangeReplicasTrigger.Desc.


pkg/roachpb/metadata.proto, line 95 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

wanna spell out the differences from LEARNER more?

+1


pkg/roachpb/metadata_replicas.go, line 228 at r10 (raw file):

// explicitly chosen to be placed in specific localities via zone configs.
//
// Key differences between how we treat (ephemeral) learners and non-voting replicas:

and (persistent) non-voting replicas

@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch from 55eca68 to ec1d40c Compare October 29, 2020 14:57
@aayushshah15 aayushshah15 requested a review from a team October 29, 2020 14:57
@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch 4 times, most recently from 1a7e12f to 31ded4a Compare October 29, 2020 16:51
Copy link
Copy Markdown
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/metrics.go, line 447 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should these just be:

  • range.snapshots.applied-voter
  • range.snapshots.applied-non-voter
  • range.snapshots.applied-learner

I don't quite see the pattern in what you have here. Are you trying to indicate that some of these are "raft" snapshots and some aren't? Is that useful to indicate in the metric?

Done.


pkg/kv/kvserver/raft.proto, line 141 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Agreed. I would take a look into whether this change is needed.

We seem to use the phrase "raft snapshot" in a bunch of places to indicate "snapshot sent via the snapshot queue", which I don't think any of us really like.

I changed the SnapshotRequest_Type enum to indicate "snapshot sent via snapshot queue" or "initial learner snapshot", since that's really the piece of info that's only available on the sender side. On the receiver side, I'm checking if the replica is marked as uninitialized and handling the metrics using that. Do you all like this?


pkg/kv/kvserver/raft_snapshot_queue.go, line 113 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is this comment wrapped?

I removed the whole comment, it didnt add anything and was confusing.


pkg/kv/kvserver/raft_snapshot_queue.go, line 114 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I would s/SnapshotRequest_LEARNER_UPREPLICATE/LEARNER_UPREPLICATE

Fixed


pkg/kv/kvserver/raft_snapshot_queue.go, line 116 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/RAFT/VOTER_RAFT ?

N/A


pkg/kv/kvserver/raft_snapshot_queue.go, line 120 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Do we want this knob to apply to non-voters? It's only used by one test, so I don't think so.

Fixed


pkg/kv/kvserver/raft_snapshot_queue.go, line 123 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Is this repeating the comment from above? If so, I think this is indeed the better place for it.

Fixed


pkg/kv/kvserver/replica_command.go, line 1019 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: I think this call fits on one line instead of 3

Fixed


pkg/kv/kvserver/replica_command.go, line 1150 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

two nits:

First, what do you think of making these IsAddition() and IsRemoval() methods?

Second, we might want to make these checks exhaustive to avoid bugs if/when a new ReplicaChangeType is added. So for instance, use a switch-case and assert that nothing falls through.

Done.


pkg/kv/kvserver/replica_command.go, line 1286 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think this belongs here, or after the internalChangeTypeAddNonVoter caller in changeReplicasImpl?

Placed it outside.


pkg/kv/kvserver/replica_command.go, line 1288 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You have a good comment in metadata_replica.go that mentions why we do this for non-voters but not learners. It's because we need to synchronously wait for upreplication for learners because upreplication is part of a larger progression of changes in Replica.atomicReplicationChange. This is not true for non-voting replica addition. Wherever this logic ends up, we should mention this explanation.

Done


pkg/kv/kvserver/replica_command.go, line 1289 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Agreed.

Fixed.


pkg/kv/kvserver/replica_command.go, line 1486 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm surprised that Learner and Voter are the same type, but Non-Voter is separate. Why is that? Does it need to be?

Maybe we just keep this as internalChangeTypeRemove and use it for all removals? Or split out all three types?

Using internalChangeTypeRemove for all removals now, we could revisit if we decide to do what #40268 tried to do.


pkg/kv/kvserver/replica_command.go, line 1577 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

how come we don't care about the ok ret val? Should we be asserting it to be true? (same in the existing code above)

I think it's because we've just called GetReplicaDescriptor right above it.


pkg/kv/kvserver/replica_command.go, line 1734 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: this structure seems weird to me. As long as you're taking the blame, consider extracting an anonymous function and calling it twice (instead of this loop).

Extracted into separate non-anonymous function since it seemed cleaner.


pkg/kv/kvserver/replica_gc_queue.go, line 65 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Agreed. I don't think the difference between learners and voters in the replica GC queue is particularly interesting.

👍


pkg/kv/kvserver/replica_raft.go, line 480 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

spurious diff?

The comment is just really badly wrapped, reverted anyway.


pkg/kv/kvserver/replica_test.go, line 6437 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: any reason to extract this instead of inlining it in the t.Run function?

Nope, inlined.


pkg/roachpb/data.go, line 1631 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What will it take to support this? What happens after this PR if we try to make such a change?

It should be a simple ConfChange, the way it currently is for LEARNERs.


pkg/roachpb/data.go, line 1633 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this seems to me like a very random place to put such a high-level comment. If you want it, consider putting it around lease transfers - e.g. in findTargetAndTransferLease.

Moved it over TransferLeaseTarget


pkg/roachpb/data.go, line 1634 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/the lease to the/the lease to be

Done.


pkg/roachpb/metadata.proto, line 86 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/learners/Learners

Done


pkg/roachpb/metadata.proto, line 90 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

applied committed entries, but does not ...

Done


pkg/roachpb/metadata.proto, line 95 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

+1

Done, let me know if this still lacking.


pkg/roachpb/metadata.proto, line 96 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Nit: I'd strike A possibly counter-intuitive behavior is that... Too many words for too little. It's not even that unintuitive, which you have to say "possibly" :P

Who doesn't count towards the quota? Learners? If so, consider adding a note to them above saying that. But if that's not true, consider putting a similar comment about how they do count.

Fixed.


pkg/roachpb/metadata_replicas.go, line 211 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why are you saying "currently" and "only support"? Seems like we're getting to the end-state - we have learners (which are ephemeral) and non-voters (persistent)

Fixed.


pkg/roachpb/metadata_replicas.go, line 224 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: too many twists and turns in this phrase; it doesn't read well. Consider moving "even though they are both internally etcd learners in the raft group" to a separate sentence at the end.

Fixed


pkg/roachpb/metadata_replicas.go, line 228 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

and (persistent) non-voting replicas

Done.

Copy link
Copy Markdown
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/raft_snapshot_queue.go, line 126 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Is there a way for me to detect, on the sender side, if the replica receiving the snapshot has any data?

Doesn't repl.RaftStatus().Progress have what you need? I think Match == 0?

I find this snapshot types pretty confusing so I think it'd be great if we can get rid of them, or at least not add one more. In fact, coming back after looking at more code, I don't understand anything about them. I assumed they're used for sender-side metrics, but they seem to be used for metrics on the receiver. If so, is that sane? Doesn't the receiver know enough about what its state is to not require any collaboration from the sender? Isn't it trivial for the receiver to do what Nathan is asking for above - disambiguate between an initial snapshot and a non-initial one?
As far as receiver-side metrics go, I think we should have a counter of non-initial snapshots applied. I don't think I care about any metrics besides that.

See my comment about this above.

Copy link
Copy Markdown
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ready for another look!

Side note: This PR is getting to be a huge pain to rebase. If there are only nits left, I'd prefer to clean it up in a new PR, if that's okay.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)

@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch 3 times, most recently from bea3432 to fb3c8ef Compare October 30, 2020 00:03
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: except for the new IsRangeInitialized commit. See my comment inline.

Andrei should also give this one more look.

Reviewed 16 of 61 files at r11, 15 of 34 files at r12, 14 of 20 files at r13, 17 of 20 files at r14, 28 of 28 files at r15.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/replica_init.go, line 211 at r15 (raw file):

// another node. It is false when a range has been created in response
// to an incoming message but we are waiting for our initial snapshot.
func (r *Replica) IsRangeInitialized() bool {

(like we discussed on Slack) I don't think this is the right change. A "range" is never uninitialized – only a "replica" ever is. So this should either be IsReplicaInitialized or left as is (because this is a method on Replica) with an update to the comment to avoid ambiguity.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1102 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, this seems good. We probably don't even need the desc variable. No-one should be mutating ChangeReplicasTrigger.Desc.

Should we inline this?

Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

+1'ed two of Nathan's nits

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/raft.proto, line 141 at r10 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

We seem to use the phrase "raft snapshot" in a bunch of places to indicate "snapshot sent via the snapshot queue", which I don't think any of us really like.

I changed the SnapshotRequest_Type enum to indicate "snapshot sent via snapshot queue" or "initial learner snapshot", since that's really the piece of info that's only available on the sender side. On the receiver side, I'm checking if the replica is marked as uninitialized and handling the metrics using that. Do you all like this?

I guess so... I don't really know how this is used. Please put some comments both about what the values mean, and a hint to what they're used for.


pkg/kv/kvserver/raft_snapshot_queue.go, line 126 at r6 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

See my comment about this above.

which comment?


pkg/kv/kvserver/replica_command.go, line 1254 at r15 (raw file):

func addRaftLearners(
	ctx context.Context,
	r *Replica,

Between a store and a replica, I think it was better when this was taking a store, since that's that it appears to be need.

But it appears to me that a store is not actually needed. As far as I can see, we're only using it to get some testing knob and a cluster setting. If you want to change something, I'd look into passing those directly (maybe in a new struct), and removing the store dependency.


pkg/kv/kvserver/replica_init.go, line 211 at r15 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

(like we discussed on Slack) I don't think this is the right change. A "range" is never uninitialized – only a "replica" ever is. So this should either be IsReplicaInitialized or left as is (because this is a method on Replica) with an update to the comment to avoid ambiguity.

+1

What's the thinking here?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1102 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we inline this?

+1, I don't think this variable helps anybody

@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch from fb3c8ef to adece46 Compare November 5, 2020 19:43
Copy link
Copy Markdown
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/raft.proto, line 141 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I guess so... I don't really know how this is used. Please put some comments both about what the values mean, and a hint to what they're used for.

Done.


pkg/kv/kvserver/raft_snapshot_queue.go, line 126 at r6 (raw file):

We seem to use the phrase "raft snapshot" in a bunch of places to indicate "snapshot sent via the snapshot queue", which I don't think any of us really like.
I changed the SnapshotRequest_Type enum to indicate "snapshot sent via snapshot queue" or "initial learner snapshot", since that's really the piece of info that's only available on the sender side. On the receiver side, I'm checking if the replica is marked as uninitialized and handling the metrics using that. Do you all like this?


pkg/kv/kvserver/replica_command.go, line 1254 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Between a store and a replica, I think it was better when this was taking a store, since that's that it appears to be need.

But it appears to me that a store is not actually needed. As far as I can see, we're only using it to get some testing knob and a cluster setting. If you want to change something, I'd look into passing those directly (maybe in a new struct), and removing the store dependency.

Are you referring to addRaftLearners or atomicReplicationChange? Looks like you were referring to the latter, which does seem to need the store dependency.


pkg/kv/kvserver/replica_init.go, line 211 at r15 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

+1

What's the thinking here?

I misunderstood the flow around the current IsInitialized method on the replica and the range. Reverted that commit.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1102 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

+1, I don't think this variable helps anybody

fixed.

@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch from adece46 to d80bc71 Compare November 5, 2020 20:32
@aayushshah15
Copy link
Copy Markdown
Contributor Author

TFTR!

Going to optimistically merge it now. If there are other nits I'll fix em in a new PR.

bors r=nvanbenschoten,andreimatei

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 5, 2020

Build failed:

This commit introduces the protobuf definitions, and the accompanying
generated code, for the new `NON_VOTER` Replica type.

Release note: None
This commit renames the `ReplicaChangeType` enums values from
`ADD_REPLICA` and `REMOVE_REPLICA` to `ADD_VOTER` and `REMOVE_VOTER`
respectively. Likewise, it also renames `internalChangeType` names to
be more specific.

This paves the way for new additions to this enum for types pertaining
to `NON_VOTER` replicas.

Release note: None
Likewise for the `RemoveReplicas` methods.

Release note: None
@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch from d80bc71 to f480988 Compare November 5, 2020 22:02
This commit adds the pieces necessary to allow creation (and removal) of
non voting replicas via an `AdminChangeReplicas` request.

Note that, unlike (ephemeral) learners, non voting replicas rely on the
raft snapshot queue for their initial upreplication. In order to speed
this up, we separately queue up the range leaseholder into the raft
snapshot queue after executing the learner's `ConfChange`. Additionally,
the replicate queue does not clean these non-voters up, like it does
ephemeral learners.

Release note: None
@aayushshah15 aayushshah15 force-pushed the introduce_persistent_learners branch from f480988 to 2988709 Compare November 6, 2020 15:45
@aayushshah15
Copy link
Copy Markdown
Contributor Author

fixed CI

bors r=nvanbenschoten,andreimatei

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 6, 2020

Build succeeded:

@craig craig bot merged commit 1526a1c into cockroachdb:master Nov 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants