storage: avoid (one) fatal error from splitPostApply by tbg · Pull Request #39796 · cockroachdb/cockroach

tbg · 2019-08-21T15:42:06Z

This is the next band-aid on top of #39658 and #39571.
The descriptor lookup I added sometimes fails because replicas can
process a split trigger in which they're not a member of the range:

F190821 15:14:28.241623 312191 storage/store.go:2172
[n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
found in right hand side of split

I saw this randomly in make test PKG=./pkg/ccl/partitionccl.

@danhhz could you take a stab at replicating this in the test you added
recently? It's probably too ambitious but it'd be nice to verify this is
actually fixed (I didn't). I haven't seen this fail on master (though I also
didn't look very hard) so maybe this is rare (i.e. we have time), but then I
just got it doing nothing related in particular.

Release note: None

cockroach-teamcity · 2019-08-21T15:42:12Z

This change is

danhhz · 2019-08-22T19:49:40Z

@danhhz could you take a stab at replicating this in the test you added
recently? It's probably too ambitious but it'd be nice to verify this is
actually fixed (I didn't). I haven't seen this fail on master (though I also
didn't look very hard) so maybe this is rare (i.e. we have time), but then I
just got it doing nothing related in particular.

I wouldn't count on me getting a chance to update the test anytime soon, unfortunately. I can't say I particularly understand the subtitles involved in this bit of code yet, so if there's no hurry, I'm inclined to skip the sciencedog review and wait for someone that does understand them to look at it.

tbg · 2019-09-04T16:30:41Z

Slightly different crash seen by @knz in #40470

I190904 14:33:33.261425 180 server/status/runtime.go:498  [n1] runtime stats: 4.7 GiB RSS, 407 goroutines, 2.6 GiB/718 MiB/3.4 GiB GO alloc/idle/total, 1.1 GiB/1.3 GiB CGO alloc/total, 210805.4 CGO/sec, 381.5/16.5 %(u/s)time, 0.0 %gc (3x),
 223 MiB/85 MiB (r/w)net
I190904 14:33:33.964062 11702 storage/replica_raft.go:291  [n1,s1,r261/1:/Table/56{-/1}] proposing REMOVE_REPLICA[(n4,s4):3]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):5] next=6
W190904 14:33:34.137913 11847 storage/replica_raft.go:105  [n1,s1,r323/1:/Table/60/1/2{633/2/…-766/4/…}] context canceled before proposing: 1 HeartbeatTxn
I190904 14:33:34.489669 12200 storage/replica_command.go:1521  [n1,replicate,s1,r247/1:/{Table/61/3-Max}] change replicas (add [] remove [(n3,s3):2]): existing descriptor r247:/{Table/61/3-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, (n4,s4):5,
next=6, gen=20]
I190904 14:33:34.669612 11610 storage/replica_raftstorage.go:793  [n1,s1,r313/4:{-}] applying LEARNER snapshot [id=faf20096 index=15]
I190904 14:33:34.944105 11610 storage/replica_raftstorage.go:814  [n1,s1,r313/4:/Table/58/1/{2525/9…-3750/1…}] applied LEARNER snapshot [total=274ms ingestion=4@217ms id=faf20096 index=15]
I190904 14:33:35.053208 12088 storage/split_queue.go:149  [n1,split,s1,r307/1:/Table/54/1/125{3/119…-5/522…}] split saw concurrent descriptor modification; maybe retrying
W190904 14:33:35.202730 12090 storage/replica_raft.go:105  [n1,s1,r304/1:/Table/53/1/250{3/7/-…-5/3/-…}] context canceled before proposing: 1 HeartbeatTxn
I190904 14:33:35.232699 12190 storage/replica_command.go:395  [n1,split,s1,r307/1:/Table/54/1/125{3/119…-5/522…}] initiating a split of this range at key /Table/54/1/1254/11837 [r329] (77 MiB above threshold size 64 MiB)
F190904 14:33:35.276986 138 storage/store.go:2148  [n1,s1,r313/4:/Table/58/1/{2525/9…-3750/1…}] split trigger found right-hand side with tombstone {NextReplicaID:5}: [n1,s1,r316/?:{-}]
goroutine 138 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000448301, 0xc000448360, 0x0, 0x7c662a)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x7c04a40, 0xc000000004, 0x73d2595, 0x10, 0x864, 0xc0006e43c0, 0x89)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x93e
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4e5e460, 0xc03ccc99e0, 0x4, 0x2, 0x4563081, 0x3a, 0xc003394730, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2cc
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4e5e460, 0xc03ccc99e0, 0x1, 0xc000000004, 0x4563081, 0x3a, 0xc003394730, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:180
github.com/cockroachdb/cockroach/pkg/storage.splitPostApply(0x4e5e460, 0xc03ccc99e0, 0x0, 0x15c142ce810fb293, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2148 +0xb3e

tbg · 2019-09-06T09:36:01Z

I added a test that reproduces this. @ajwerner, mind reviewing this now that you're dealing with similar problems?

Without the preceding commit, this fatals as seen in cockroachdb#39796: > F190821 15:14:28.241623 312191 storage/store.go:2172 > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not > found in right hand side of split Note that cockroachdb#40459 will break these tests since it will not allow the replicaID to change without an intermittent GC cycle, but we now rely on catching up via the raft log across a removal, split, and then re-add. At that point, the only way in which a split trigger could execute while in non-member status is if a replica got removed but would still receive a later split trigger. This could happen as follows: The log is `[..., remove, split]` a) the follower receives `split` b) it learns that `split` is committed so that c) it will apply both `remove` and then `split`. a) seems possible; if the removed follower is trailing a bit, the leader may append both entries at the same time before it (the leader) applies the replication change that removes the follower from the leader's config. b) seems possible like above, at least once we do async command application, if said append carries a commit index encompassing the append itself. c) will then naturally happen in today's code. There's no way we can reasonably enact this scenario in our testing at this point, though. The suggestion for cockroachdb#40459 is to essentially revert this commit, so that we at least retain the coverage related to the RHS of this test. Release note (bug fix): Avoid crashes with message "replica descriptor of local store not found".

ajwerner

Reviewed 1 of 1 files at r1, 5 of 5 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)

pkg/storage/store.go, line 2171 at r2 (raw file):

			// not in the exact scenario above) but it sounds like it could be
			// in general. So if it's zero, we leave it at zero and if it's not
			// zero we wind it back to one. The code below achieves this.

Can you note that this replica id of 1 is just used for in-memory Replica.mu.replicaID field and won't actually make it to raft? I was puzzled for a second by how it could be safe to just assume that you have replica id 1.

It seems from the test that 1 is a problematic value as it may overlap with an existing replica (see your test failure). What about using something else inside like math.MaxInt32?

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)

pkg/storage/store.go, line 2171 at r2 (raw file):

Previously, ajwerner wrote…

Can you note that this replica id of 1 is just used for in-memory Replica.mu.replicaID field and won't actually make it to raft? I was puzzled for a second by how it could be safe to just assume that you have replica id 1.

It seems from the test that 1 is a problematic value as it may overlap with an existing replica (see your test failure). What about using something else inside like math.MaxInt32?

Actually -1 seems better than MaxInt32

This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range: > F190821 15:14:28.241623 312191 storage/store.go:2172 > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not > found in right hand side of split I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`. There is one more bug here that is not addressed by this change, namely the fact that there might be a tombstone for the right hand side. This continues to be tracked in cockroachdb#40470. See also cockroachdb#40367 (comment) for more work we need to do to establish sanity. Release note: None

Without the preceding commit, this fatals as seen in cockroachdb#39796: > F190821 15:14:28.241623 312191 storage/store.go:2172 > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not > found in right hand side of split Note that cockroachdb#40459 will break these tests since it will not allow the replicaID to change without an intermittent GC cycle, but we now rely on catching up via the raft log across a removal, split, and then re-add. At that point, the only way in which a split trigger could execute while in non-member status is if a replica got removed but would still receive a later split trigger. This could happen as follows: The log is `[..., remove, split]` a) the follower receives `split` b) it learns that `split` is committed so that c) it will apply both `remove` and then `split`. a) seems possible; if the removed follower is trailing a bit, the leader may append both entries at the same time before it (the leader) applies the replication change that removes the follower from the leader's config. b) seems possible like above, at least once we do async command application, if said append carries a commit index encompassing the append itself. c) will then naturally happen in today's code. There's no way we can reasonably enact this scenario in our testing at this point, though. The suggestion for cockroachdb#40459 is to essentially revert this commit, so that we at least retain the coverage related to the RHS of this test. Release note (bug fix): Avoid crashes with message "replica descriptor of local store not found".

tbg

@ajwerner could you give this another look and bors it for me? You should be able to push to the branch in case you're finding it necessary to amend

ajwerner

that explanation is clearer. I look forward to adding the assertion back.

bors r+

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @tbg)

craig · 2019-09-06T17:35:15Z

Build failed

GitHub CI (Cockroach)

ajwerner · 2019-09-06T17:45:10Z

flaked on #39321

bors r+

@danhhz

39796: storage: avoid (one) fatal error from splitPostApply r=ajwerner a=tbg This is the next band-aid on top of #39658 and #39571. The descriptor lookup I added sometimes fails because replicas can process a split trigger in which they're not a member of the range: > F190821 15:14:28.241623 312191 storage/store.go:2172 > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not > found in right hand side of split I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`. @danhhz could you take a stab at replicating this in the test you added recently? It's probably too ambitious but it'd be nice to verify this is actually fixed (I didn't). I haven't seen this fail on master (though I also didn't look very hard) so maybe this is rare (i.e. we have time), but then I just got it doing nothing related in particular. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

craig · 2019-09-06T18:18:18Z

Build succeeded

GitHub CI (Cockroach)

We've seen instability recently due to invariants being violated as replicas catch up across periods of being removed and re-added to a range. Due to learner replicas and their rollback behavior this is now a relatively common case. Rather than handle all of these various scenarios this PR prevents them from occuring by actively removing replicas when we determine that they must have been removed. Here's a high level overview of the change: * Once a Replica object has a non-zero Replica.mu.replicaID it will not change. * In this commit however, if a node crashes it may forget that it learned about a replica ID. * If a raft message or snapshot addressed to a higher replica ID is received the current replica will be removed completely. * If a replica sees a ChangeReplicasTrigger which removes it then it completely removes itself while applying that command. * Replica.mu.destroyStatus is used to meaningfully signify the removal state of a Replica. Replicas about to be synchronously removed are in destroyReasonRemovalPending. This hopefully gives us some new invariants: * There is only ever at most 1 Replica which IsAlive() for a range on a Store at a time. * Once a Replica has a non-zero ReplicaID is never changes. * This applies only to the in-memory object, not the store itself. * Once a Replica applies a command as a part of the range descriptor it will never apply another command as a different Replica ID or outside of the Range. * Corrolary: a Replica created as a learner will only ever apply commands while that replica is in the range. The change also introduces some new complexity. Namely we now allow removal of uninitialized replicas, including their hard state. This allows us to catch up across a split even when we know the RHS must have been removed. Fixes cockroachdb#40367. Issue cockroachdb#38772 (comment) manifests itself as the RHS not being found for a merge. This happens because the Replica is processing commands to catch itself up while it is not in the range. This is no longer possible. Fixes cockroachdb#40257. Issue cockroachdb#40257 is another case of a replica processing commands while it is not in the range. Fixes cockroachdb#40470. Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal prior to a LHS processing a split. This case is now handled properly and is tested. Release justification: This commit is safe for 19.2 because it fixes release blockers. Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes: cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS" cockroachdb#39796 "replica descriptor of local store not found in right hand side of split" cockroachdb#40470 "split trigger found right-hand side with tombstone" cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"

tbg requested a review from danhhz August 21, 2019 15:42

This was referenced Aug 27, 2019

secret core release blockers tbg/cockroach#163

Closed

roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #39005

Closed

andy-kimball mentioned this pull request Sep 4, 2019

19.2 release blockers list #40447

Closed

53 tasks

This was referenced Sep 4, 2019

storage: prepare for kv.atomic_replication_changes=true #40370

Merged

storage: splitPostApply can see tombstone for RHS #40470

Closed

tbg force-pushed the fix/split-trigger-non-member branch from 28d5d6e to 704162a Compare September 6, 2019 09:34

tbg changed the title ~~storage: avoid fatal error from splitPostApply~~ storage: avoid (one) fatal error from splitPostApply Sep 6, 2019

tbg force-pushed the fix/split-trigger-non-member branch 2 times, most recently from daafca6 to 68d6486 Compare September 6, 2019 12:15

ajwerner reviewed Sep 6, 2019

View reviewed changes

ajwerner approved these changes Sep 6, 2019

View reviewed changes

tbg added 2 commits September 6, 2019 17:27

tbg force-pushed the fix/split-trigger-non-member branch from 68d6486 to 4cdf0f9 Compare September 6, 2019 15:27

tbg commented Sep 6, 2019

View reviewed changes

ajwerner reviewed Sep 6, 2019

View reviewed changes

craig bot merged commit 4cdf0f9 into cockroachdb:master Sep 6, 2019

This was referenced Sep 10, 2019

ccl/partitionccl: TestInitialPartitioning failed under stress #40531

Closed

roachtest: tpccbench/nodes=12/cpu=16 failed #40535

Closed

nvb mentioned this pull request Sep 17, 2019

roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #40561

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: avoid (one) fatal error from splitPostApply#39796

storage: avoid (one) fatal error from splitPostApply#39796
craig[bot] merged 2 commits intocockroachdb:masterfrom
tbg:fix/split-trigger-non-member

tbg commented Aug 21, 2019

Uh oh!

cockroach-teamcity commented Aug 21, 2019

Uh oh!

danhhz commented Aug 22, 2019

Uh oh!

tbg commented Sep 4, 2019

Uh oh!

tbg commented Sep 6, 2019

Uh oh!

ajwerner left a comment

Uh oh!

ajwerner left a comment

Uh oh!

tbg left a comment

Uh oh!

ajwerner left a comment

Uh oh!

craig bot commented Sep 6, 2019

Uh oh!

ajwerner commented Sep 6, 2019

Uh oh!

craig bot commented Sep 6, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tbg commented Aug 21, 2019

Uh oh!

cockroach-teamcity commented Aug 21, 2019

Uh oh!

danhhz commented Aug 22, 2019

Uh oh!

tbg commented Sep 4, 2019

Uh oh!

tbg commented Sep 6, 2019

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

craig bot commented Sep 6, 2019

Build failed

Uh oh!

ajwerner commented Sep 6, 2019

Uh oh!

craig bot commented Sep 6, 2019

Build succeeded

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants