
storage: check for RaftTombstone again once synchronized with Store#41050

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/tombstoneRace
Sep 25, 2019

Conversation

@nvb
Contributor

@nvb nvb commented Sep 24, 2019

Found by @ajwerner's `pkg/storage` stressing.

`TestReplicaGCRace` was occasionally failing with the crash:

```
F190924 21:08:14.747519 710 vendor/go.etcd.io/etcd/raft/log.go:203  [s3,r1/3:{-}] tocommit(20) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
goroutine 710 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000432301, 0xc000432300, 0x0, 0xc00030ae48)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x63cdc60, 0xc000000004, 0x5dedf9a, 0x22, 0xcb, 0xc00135e1c0, 0x6a)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x93a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x40609e0, 0xc000d05b60, 0x4, 0x3, 0x382af56, 0x5d, 0xc000c81140, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2cc
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x40609e0, 0xc000d05b60, 0x2, 0xc000000004, 0x382af56, 0x5d, 0xc000c81140, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.FatalfDepth(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:199
github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc000429f60, 0x382af56, 0x5d, 0xc000c81140, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:102 +0xf1
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0005fb7a0, 0x14)
	/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/log.go:203 +0x131
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc0013cb680, 0x8, 0x3, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/raft.go:1396 +0x54
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.stepFollower(0xc0013cb680, 0x8, 0x3, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/raft.go:1341 +0x480
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.(*raft).Step(0xc0013cb680, 0x8, 0x3, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/raft.go:984 +0x1113
github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft.(*RawNode).Step(0xc000d64ac0, 0x8, 0x3, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/rawnode.go:114 +0xcc
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).stepRaftGroup.func1(0xc000d64ac0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:396 +0x108
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked.func1(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1362
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc000588800, 0x38d2900, 0xc000c15778, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1363 +0x79
```

This was because a call to `tryGetOrCreateReplica` was racing with a snapshot.
The call to `tryGetOrCreateReplica` would check the raft tombstone and then
stall. The snapshot would be processed and the test would quickly remove the
replica, leaving a raft tombstone. The call to `tryGetOrCreateReplica` would
then continue without ever looking for the raft tombstone again. It would then
find an unexpected HardState and fatal.

This can be made much more frequent with the patch:

```
diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index 99fa192..0f6e47e 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -4014,6 +4014,7 @@ func (s *Store) tryGetOrCreateReplica(
                        return nil, false, &roachpb.RaftGroupDeletedError{}
                }
        }
+       time.Sleep(100 * time.Millisecond)

        // Create a new replica and lock it for raft processing.
        repl := newReplica(rangeID, s)
```

The fix for this is to check the raft tombstone again after inserting into the
Store's Range map. This double-checked locking ensures that we synchronize with
any goroutine that wrote a tombstone and then removed an old replica from the
Range map.

Release justification: Fixes a crash due to a race between setting and
retrieving a Range's RaftTombstone.

Release note: None

@nvb nvb requested a review from ajwerner September 24, 2019 22:13
@cockroach-teamcity
Member

This change is Reviewable

Member

@tbg tbg left a comment


Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/storage/store.go, line 4003 at r1 (raw file):

	}

	// No replica currently exists, so we'll try to create one. Before creating

This is really just a non-authoritative check, right? Shouldn't things just work if you remove it, and would it perhaps be better to do that? (Perhaps not at this point in the cycle, but morally speaking).


pkg/storage/store.go, line 4059 at r1 (raw file):

		}

		// An uninitiazlied replica should have an empty HardState.Commit at

(old) typo (side discussion: should we stop pointing out harmless typos in code? It's just wasting everyone's time.)

Contributor

@ajwerner ajwerner left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/store.go, line 4003 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is really just a non-authoritative check, right? Shouldn't things just work if you remove it, and would it perhaps be better to do that? (Perhaps not at this point in the cycle, but morally speaking).

It's probably a good optimization. Imagine we just got removed from a range but there's an old follower which has queued a bunch of raft messages to send to us. Do we really want to create a new replica object, lock the store, put it in the map, then remove it for every one of those messages? Maybe add a comment that this is an optimization to avoid creating replicas we know we'll have to destroy.
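
The cost asymmetry behind this point can be made concrete with a toy counter (all names here are hypothetical, not the actual Store API): with the early tombstone check, a burst of stale raft messages is rejected before any replica object is created; without it, every message pays for the allocation and the store-map insert/remove round trip before being rejected.

```go
package main

import "fmt"

// stubStore models only what matters for the comparison: which ranges
// are tombstoned, and how many replica allocations we paid for.
type stubStore struct {
	tombstoned  map[int]bool
	allocations int
}

// withoutEarlyCheck models dropping the first tombstone check: each
// stale message allocates a replica (and would insert/remove it from
// the map) before the authoritative check rejects it.
func (s *stubStore) withoutEarlyCheck(rangeID int) bool {
	s.allocations++ // newReplica + map churn, per message
	return !s.tombstoned[rangeID]
}

// withEarlyCheck models the optimization: a cheap tombstone lookup
// rejects the message before any allocation happens.
func (s *stubStore) withEarlyCheck(rangeID int) bool {
	if s.tombstoned[rangeID] {
		return false
	}
	return s.withoutEarlyCheck(rangeID)
}

func main() {
	s := &stubStore{tombstoned: map[int]bool{7: true}}
	for i := 0; i < 100; i++ { // 100 queued messages from a stale follower
		s.withEarlyCheck(7)
	}
	fmt.Println("allocations with early check:", s.allocations) // 0
}
```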


pkg/storage/store.go, line 4059 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

(old) typo (side discussion: should we stop pointing out harmless typos in code? It's just wasting everyone's time.)

I'm cool with your pointing out the typos. Maybe with the reinforcement I'll get better at spelling.

@nvb nvb force-pushed the nvanbenschoten/tombstoneRace branch from 605d768 to 369fe76 Compare September 25, 2019 15:17
Contributor Author

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)


pkg/storage/store.go, line 4003 at r1 (raw file):

Previously, ajwerner wrote…

It's probably a good optimization. Imagine we just got removed from a range but there's an old follower which has queued a bunch of raft messages to send to us. Do we really want to create a new replica object, lock the store, put it in the map, then remove it for every one of those messages? Maybe add a comment that this is an optimization to avoid creating replicas we know we'll have to destroy.

Yeah, exactly that. Added a note.


pkg/storage/store.go, line 4059 at r1 (raw file):

Previously, ajwerner wrote…

I'm cool with your pointing out the typos. Maybe with the reinforcement I'll get better at spelling.

Done. I'm cool with it too.

@nvb
Contributor Author

nvb commented Sep 25, 2019

bors r+

craig bot pushed a commit that referenced this pull request Sep 25, 2019
41050: storage: check for RaftTombstone again once synchronized with Store r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Contributor

craig bot commented Sep 25, 2019

Build succeeded

@craig craig bot merged commit 369fe76 into cockroachdb:master Sep 25, 2019
@nvb nvb deleted the nvanbenschoten/tombstoneRace branch October 17, 2019 20:42