storage: be more resilient to learner snap conflicts #40435
craig[bot] merged 1 commit into cockroachdb:master
Conversation
tbg left a comment
How much do we care about the race when the replica addition is being run from a different node than the raft snapshot queue? If we do: I've spent so long trying to do this somewhat gracefully, hitting issue after issue, that my opinion at this point in the release is we should simply sniff the error in a retry loop.
What you have here is good. Solving this race in all generality is not worth it at this point.
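For context, a minimal sketch of what the error-sniffing retry loop floated above could look like. The `sendLearnerSnapshot` helper and the conflict error text are hypothetical stand-ins, not the PR's actual code:

```go
package learner

import (
	"context"
	"strings"
	"time"
)

// addLearnerWithRetry sketches the "sniff the error in a retry loop"
// alternative: if the raft snapshot queue won the race, detect that
// specific error and retry rather than failing the replica addition.
func addLearnerWithRetry(
	ctx context.Context, sendLearnerSnapshot func(context.Context) error,
) error {
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = sendLearnerSnapshot(ctx); err == nil {
			return nil
		}
		// Sniff for the specific conflict caused by losing the race;
		// any other error is returned as-is. The error text here is an
		// assumption for illustration.
		if !strings.Contains(err.Error(), "snapshot already in flight") {
			return err
		}
		// Back off before retrying, respecting cancellation.
		select {
		case <-time.After(100 * time.Millisecond << uint(attempt)):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```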
Reviewed 12 of 12 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)
pkg/storage/replica_command.go, line 960 at r1 (raw file):
_ = r.atomicReplicationChange
releaseSnapshotLockFn := r.lockLearnerSnapshot(ctx, adds)
defer releaseSnapshotLockFn()
Can you release this once `addLearnerReplicas` returns? I don't want to end up in a situation in which we send the snap, but then the learner for whatever reason needs another snap after it has converted to a voter (in `atomicReplicationChange`) below, while we're blocking the queue and deadlocking things.
Oh, I'm seeing that the lock has a timeout. Anyway, it'd be cleaner to release this sooner. You can just move the unlock before the `if err != nil`.
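For illustration, a sketch of that reordering under assumed signatures (`lockLearnerSnapshot` and `addLearnerReplicas` mirror the names in the diff, but their shapes here are guesses):

```go
package learner

import "context"

// addLearners sketches the suggested reordering: release the
// learner-snapshot lock as soon as the learner addition returns,
// rather than deferring it past the later voter promotion.
func addLearners(
	ctx context.Context,
	lockLearnerSnapshot func(context.Context) (unlock func()),
	addLearnerReplicas func(context.Context) error,
) error {
	unlock := lockLearnerSnapshot(ctx)
	err := addLearnerReplicas(ctx)
	// Unlock before the error check: neither a failed addition nor the
	// later voter promotion should hold the lock while the raft
	// snapshot queue might legitimately need to send another snapshot.
	unlock()
	return err
}
```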
pkg/storage/replica_raftstorage.go, line 391 at r1 (raw file):
r.mu.Lock()
appliedIndex := r.mu.state.RaftAppliedIndex // cleared when OutgoingSnapshot closes
nit: // Cleared when OutgoingSnapshot closes.
tbg left a comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz)
danhhz left a comment
TFTR!
The test failure was a result of a `String` method changing in the raft bump, so I updated the expected values in the test.
bors r=tbg
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
pkg/storage/replica_command.go, line 960 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you release this once `addLearnerReplicas` returns? I don't want to end up in a situation in which we send the snap, but then the learner for whatever reason needs another snap after it has converted to a voter (in `atomicReplicationChange`) below, while we're blocking the queue and deadlocking things. Oh, I'm seeing that the lock has a timeout. Anyway, it'd be cleaner to release this sooner. You can just move the unlock before the `if err != nil`.
As discussed offline, I added more commentary instead of moving the unlock.
pkg/storage/replica_raftstorage.go, line 391 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: // Cleared when OutgoingSnapshot closes.
Done
40435: storage: be more resilient to learner snap conflicts r=tbg a=danhhz

The replica addition code first adds it as a raft learner, then hands it a snapshot, then promotes it to a voter. For various unfortunate reasons described in the code, we have to allow the raft snapshot queue to _also_ send snapshots to learners. A recent etcd change exposed that this code has always been brittle to the raft snapshot queue winning the race and starting the snapshot first by making the race dramatically more likely.

After this commit, learner replica addition grabs a (best effort) lock before the conf change txn to add the learner is started. This prevents the race when the raft leader (and thus the raft snapshot queue for that range) is on the same node.

Closes #40207

Release note: None

40454: kv: improve unit testing around txnSpanRefresher r=nvanbenschoten a=nvanbenschoten

This was decently tested through its inclusion in the TxnCoordSender, but since I'm planning on changing some behavior to address #36431, I figured we should first build up a test suite in the same style that we have for other transaction interceptors.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded
The replica addition code first adds it as a raft learner, then hands it
a snapshot, then promotes it to a voter. For various unfortunate reasons
described in the code, we have to allow the raft snapshot queue to
also send snapshots to learners. A recent etcd change exposed that
this code has always been brittle to the raft snapshot queue winning the
race and starting the snapshot first by making the race dramatically
more likely.
After this commit, learner replica addition grabs a (best effort) lock
before the conf change txn to add the learner is started. This prevents
the race when the raft leader (and thus the raft snapshot queue for that
range) is on the same node.
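To make the mechanism concrete, here is a hedged sketch of what such a best-effort lock could look like. Every name and type here (`learnerSnapGuard`, integer replica IDs, `inFlight`) is a stand-in for illustration, not the PR's actual code:

```go
package learner

import "sync"

// learnerSnapGuard sketches the best-effort lock described above:
// replica addition registers the learners it is about to snapshot, and
// the raft snapshot queue skips replicas that are already registered.
type learnerSnapGuard struct {
	mu      sync.Mutex
	pending map[int]struct{} // replica IDs with an in-flight learner snapshot
}

func newLearnerSnapGuard() *learnerSnapGuard {
	return &learnerSnapGuard{pending: make(map[int]struct{})}
}

// lock registers the replica IDs and returns a function that
// unregisters them. It never blocks: it is "best effort" in that it
// only advertises intent so the raft snapshot queue can back off.
func (g *learnerSnapGuard) lock(replicaIDs []int) (unlock func()) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for _, id := range replicaIDs {
		g.pending[id] = struct{}{}
	}
	return func() {
		g.mu.Lock()
		defer g.mu.Unlock()
		for _, id := range replicaIDs {
			delete(g.pending, id)
		}
	}
}

// inFlight is what the raft snapshot queue would consult before sending
// its own snapshot to a learner.
func (g *learnerSnapGuard) inFlight(replicaID int) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	_, ok := g.pending[replicaID]
	return ok
}
```

Because the real lock also carries a timeout (as noted in review), correctness never depends on it: a stuck or expired registration degrades to the old racy behavior rather than wedging the snapshot queue.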
Closes #40207
Release note: None