storage: check for RaftTombstone again once synchronized with Store #41050
craig[bot] merged 1 commit into cockroachdb:master
Conversation
tbg
left a comment
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/store.go, line 4003 at r1 (raw file):
} // No replica currently exists, so we'll try to create one. Before creating
This is really just a non-authoritative check, right? Shouldn't things just work if you remove it, and would it perhaps be better to do that? (Perhaps not at this point in the cycle, but morally speaking).
pkg/storage/store.go, line 4059 at r1 (raw file):
} // An uninitiazlied replica should have an empty HardState.Commit at
(old) typo (side discussion: should we stop pointing out harmless typos in code? It's just wasting everyone's time.)
ajwerner
left a comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/store.go, line 4003 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is really just a non-authoritative check, right? Shouldn't things just work if you remove it, and would it perhaps be better to do that? (Perhaps not at this point in the cycle, but morally speaking).
It's probably a good optimization. Imagine we just got removed from a range but there's an old follower which has queued a bunch of raft messages to send to us. Do we really want to create a new replica object, lock the store, put it in the map, then remove it for every one of those messages? Maybe add a comment that this is an optimization to avoid creating replicas we know we'll have to destroy.
pkg/storage/store.go, line 4059 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
(old) typo (side discussion: should we stop pointing out harmless typos in code? It's just wasting everyone's time.)
I'm cool with your pointing out the typos. Maybe with the reinforcement I'll get better at spelling.
Found by ajwerner's `pkg/storage` stressing.
`TestReplicaGCRace` was occasionally failing with the crash:
```
F190924 21:08:14.747519 710 vendor/go.etcd.io/etcd/raft/log.go:203 [s3,r1/3:{-}] tocommit(20) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
goroutine 710 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000432301, 0xc000432300, 0x0, 0xc00030ae48)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x63cdc60, 0xc000000004, 0x5dedf9a, 0x22, 0xcb, 0xc00135e1c0, 0x6a)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x93a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x40609e0, 0xc000d05b60, 0x4, 0x3, 0x382af56, 0x5d, 0xc000c81140, 0x2, 0x2)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2cc
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x40609e0, 0xc000d05b60, 0x2, 0xc000000004, 0x382af56, 0x5d, 0xc000c81140, 0x2, 0x2)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.FatalfDepth(...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:199
github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc000429f60, 0x382af56, 0x5d, 0xc000c81140, 0x2, 0x2)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:102 +0xf1
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0005fb7a0, 0x14)
/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/log.go:203 +0x131
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc0013cb680, 0x8, 0x3, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/raft.go:1396 +0x54
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.stepFollower(0xc0013cb680, 0x8, 0x3, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/raft.go:1341 +0x480
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.(*raft).Step(0xc0013cb680, 0x8, 0x3, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/raft.go:984 +0x1113
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.(*RawNode).Step(0xc000d64ac0, 0x8, 0x3, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/rawnode.go:114 +0xcc
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).stepRaftGroup.func1(0xc000d64ac0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:396 +0x108
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked.func1(...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1362
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc000588800, 0x38d2900, 0xc000c15778, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1363 +0x79
```
This was because a call to `tryGetOrCreateReplica` was racing with a snapshot.
The call to `tryGetOrCreateReplica` would check the raft tombstone and then
stall. The snapshot would be processed and the test would quickly remove the
replica, leaving a raft tombstone. The call to `tryGetOrCreateReplica` would
then continue without ever looking for the raft tombstone again. It would then
find an unexpected HardState and fatal.
This can be made much more frequent with the patch:
```
diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index 99fa192..0f6e47e 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -4014,6 +4014,7 @@ func (s *Store) tryGetOrCreateReplica(
return nil, false, &roachpb.RaftGroupDeletedError{}
}
}
+ time.Sleep(100 * time.Millisecond)
// Create a new replica and lock it for raft processing.
repl := newReplica(rangeID, s)
```
The fix is to check the raft tombstone again after inserting into the
Store's Range map. This double-checked locking ensures that we synchronize
with any goroutine that wrote a tombstone and then removed an old replica
from the Range map.
Release justification: Fixes a crash due to a race between setting and
retrieving a Range's RaftTombstone.
Release note: None
nvb
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
pkg/storage/store.go, line 4003 at r1 (raw file):
Previously, ajwerner wrote…
It's probably a good optimization. Imagine we just got removed from a range but there's an old follower which has queued a bunch of raft messages to send to us. Do we really want to create a new replica object, lock the store, put it in the map, then remove it for every one of those messages? Maybe add a comment that this is an optimization to avoid creating replicas we know we'll have to destroy.
Yeah, exactly that. Added a note.
pkg/storage/store.go, line 4059 at r1 (raw file):
Previously, ajwerner wrote…
I'm cool with your pointing out the typos. Maybe with the reinforcement I'll get better at spelling.
Done. I'm cool with it too.
bors r+
41050: storage: check for RaftTombstone again once synchronized with Store r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded