
storage: be more resilient to learner snap conflicts #40435

Merged: craig[bot] merged 1 commit into cockroachdb:master from danhhz:learner_snap on Sep 4, 2019

Conversation

@danhhz (Contributor) commented Sep 3, 2019

The replica addition code first adds the new replica as a raft learner, then
hands it a snapshot, then promotes it to a voter. For various unfortunate
reasons described in the code, we have to allow the raft snapshot queue to
also send snapshots to learners. A recent etcd change made the race
dramatically more likely, exposing that this code has always been brittle
to the raft snapshot queue winning the race and starting the snapshot first.

After this commit, learner replica addition grabs a (best effort) lock
before the conf change txn to add the learner is started. This prevents
the race when the raft leader (and thus the raft snapshot queue for that
range) is on the same node.

Closes #40207

Release note: None
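The best-effort lock described above can be sketched roughly as follows. All names here (`learnerLocks`, `shouldSkip`, integer store IDs) are illustrative assumptions for this sketch, not CockroachDB's actual API; the real code uses `lockLearnerSnapshot` in pkg/storage/replica_command.go.

```go
package main

import (
	"fmt"
	"sync"
)

// learnerLocks sketches the best-effort lock described above: before the
// conf change txn that adds the learner starts, replica addition registers
// the target store IDs so the raft snapshot queue on the same node can
// skip them. All names here are illustrative, not CockroachDB's API.
type learnerLocks struct {
	mu     sync.Mutex
	locked map[int]bool // store ID -> learner snapshot in flight
}

func newLearnerLocks() *learnerLocks {
	return &learnerLocks{locked: make(map[int]bool)}
}

// lock marks the given stores and returns a release func. It is best
// effort: a caller that loses the race simply proceeds and may see a
// retryable snapshot conflict instead of blocking forever.
func (l *learnerLocks) lock(storeIDs []int) (release func()) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, id := range storeIDs {
		l.locked[id] = true
	}
	return func() {
		l.mu.Lock()
		defer l.mu.Unlock()
		for _, id := range storeIDs {
			delete(l.locked, id)
		}
	}
}

// shouldSkip is what the raft snapshot queue would consult before sending.
func (l *learnerLocks) shouldSkip(storeID int) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.locked[storeID]
}

func main() {
	locks := newLearnerLocks()
	release := locks.lock([]int{2})
	fmt.Println(locks.shouldSkip(2)) // true while the snapshot is in flight
	release()
	fmt.Println(locks.shouldSkip(2)) // false after release
}
```

Because the lock only lives on one node, it can only prevent the race when the raft leader (and thus the snapshot queue for the range) is colocated with the node running the replica addition, which matches the scope claimed in the commit message.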

@danhhz danhhz requested a review from tbg September 3, 2019 18:56
@cockroach-teamcity (Member) commented:

This change is Reviewable

@danhhz (Contributor, Author) commented Sep 3, 2019

How much do we care about the race when the replica addition is being run from a different node than the raft snapshot queue? If we do, I've spent so long trying to do this gracefully, hitting issue after issue, that my opinion at this point in the release cycle is that we should simply sniff the error in a retry loop.

@tbg (Member) left a comment

> How much do we care about the race when the replica addition is being run from a different node than the raft snapshot queue? If we do, I've spent so long trying to do this gracefully, hitting issue after issue, that my opinion at this point in the release cycle is that we should simply sniff the error in a retry loop.

What you have here is good. Solving this race in all generality is not worth it at this point.

Reviewed 12 of 12 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/storage/replica_command.go, line 960 at r1 (raw file):

		_ = r.atomicReplicationChange
		releaseSnapshotLockFn := r.lockLearnerSnapshot(ctx, adds)
		defer releaseSnapshotLockFn()

Can you release this once addLearnerReplicas returns? I don't want to end up in a situation in which we send the snap, but then the learner for whatever reason needs another snap after it has converted to voter (in atomicReplicationChange) below, but we're blocking the queue and deadlocking things.

Oh, I'm seeing that the lock has a timeout. Anyway, it'd be cleaner to release this sooner. You can just move the unlock before the if err != nil.


pkg/storage/replica_raftstorage.go, line 391 at r1 (raw file):

	r.mu.Lock()
	appliedIndex := r.mu.state.RaftAppliedIndex
	// cleared when OutgoingSnapshot closes

nit: // Cleared when OutgoingSnapshot closes.

@tbg (Member) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz)

@danhhz (Contributor, Author) left a comment

TFTR!

The test failure was a result of a String method changing in the raft bump, so I updated the expected values in the test.

bors r=tbg

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)


pkg/storage/replica_command.go, line 960 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you release this once addLearnerReplicas returns? I don't want to end up in a situation in which we send the snap, but then the learner for whatever reason needs another snap after it has converted to voter (in atomicReplicationChange) below, but we're blocking the queue and deadlocking things.

Oh, I'm seeing that the lock has a timeout. Anyway, it'd be cleaner to release this sooner. You can just move the unlock before the if err != nil.

As discussed offline, I added more commentary instead of moving the unlock.


pkg/storage/replica_raftstorage.go, line 391 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: // Cleared when OutgoingSnapshot closes.

Done

craig bot pushed a commit that referenced this pull request Sep 4, 2019
40435: storage: be more resilient to learner snap conflicts r=tbg a=danhhz

(commit message as above)

40454: kv: improve unit testing around txnSpanRefresher r=nvanbenschoten a=nvanbenschoten

This was decently tested through its inclusion in the TxnCoordSender,
but since I'm planning on changing some behavior to address #36431, I
figured we should first build up a test suite in the same style that
we have for other transaction interceptors.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig (Contributor) craig bot commented Sep 4, 2019

Build succeeded

@craig craig bot merged commit 20ba7bc into cockroachdb:master Sep 4, 2019
@danhhz danhhz deleted the learner_snap branch September 5, 2019 14:54


Development

Successfully merging this pull request may close these issues.

storage: learner snap can fail when interacting with raft snap

3 participants