Skip to content

storage: remove the possibility of nil Replica.raftGroup#4204

Merged
bdarnell merged 1 commit intocockroachdb:masterfrom
bdarnell:raft-group-nil
Feb 9, 2016
Merged

storage: remove the possibility of nil Replica.raftGroup#4204
bdarnell merged 1 commit intocockroachdb:masterfrom
bdarnell:raft-group-nil

Conversation

@bdarnell
Copy link
Copy Markdown
Contributor

@bdarnell bdarnell commented Feb 7, 2016

Fixes #4101
Fixes #3896

Review on Reviewable

@tbg
Copy link
Copy Markdown
Member

tbg commented Feb 7, 2016

The failure mode is a removal happening, and the local data not being destroyed. Isn't the only mechanism by which a nil raftGroup comes into existence in NewStore which in turn creates a new replica with a descriptor that doesn't contain the replica?
Why doesn't that instead lead to the replica not even being created/added to the Store in the first place? This sounds like what you were getting at in your comment on the test, so when is raftGroup intentionally nil?

@bdarnell
Copy link
Copy Markdown
Contributor Author

bdarnell commented Feb 8, 2016

OK, I've updated the commit to ensure that Replica.raftGroup is never nil instead of adding checks for it. This was a little tricky because there is a second way in which nil raftGroups are created. When Replicas are created in response to raft messages, they have a nil raftGroup and are then (non-atomically) updated with their replica ID. I fixed this by adding a replicaID argument to NewReplica.

@tamird tamird changed the title storage: Add nil checks for Replica.raftGroup storage: remove the possibility of nil Replica.raftGroup Feb 8, 2016
@tamird
Copy link
Copy Markdown
Contributor

tamird commented Feb 8, 2016

Reviewed 4 of 4 files at r1, 7 of 8 files at r2.
Review status: 6 of 7 files reviewed at latest revision, 5 unresolved discussions.


storage/client_raft_test.go, line 1460 [r2] (raw file):
Since the store is already running, is this a race? I feel like this should happen even before the replica is removed.


storage/replica.go, line 182 [r2] (raw file):
OT: what's this wrapping at?


storage/replica.go, line 244 [r2] (raw file):
return r.setReplicaIDLocked(replicaID)?


storage/replica_test.go, line 252 [r2] (raw file):
is this saying it's weird because it doesn't call Store.start and does assign IDs or because it neither calls Store.Start nor assigns IDs?


storage/store.go, line 1678 [r2] (raw file):
drive-by, but can this be return r, r.setReplicaID(replicaID)?


Comments from the review on Reviewable.io

Removed replicas are now destroyed eagerly on store startup.

Fixes cockroachdb#4101
Fixes cockroachdb#3896
@bdarnell
Copy link
Copy Markdown
Contributor Author

bdarnell commented Feb 9, 2016

Review status: 6 of 7 files reviewed at latest revision, 5 unresolved discussions.


storage/client_raft_test.go, line 1460 [r2] (raw file):
It's slightly racy in that the queue could have run by now (although it's using atomic operations internally so it's not a race in the race detector sense). It's also not really necessary: the test works just as well whether the GC queue is enabled or not. It's just an extra line of defense to make it unlikely that the replica was created and then removed via another path.


storage/replica.go, line 182 [r2] (raw file):
80 (same as most of the other comments in the file). The "wrap comments at 100" guideline doesn't seem to have taken hold.


storage/replica_test.go, line 252 [r2] (raw file):
The latter. Changed to "doesn't call Store.Start which would assign IDs"


storage/store.go, line 1678 [r2] (raw file):
Could be, but it would diverge from the usual pattern in which all other results are zero-valued on error.


Comments from the review on Reviewable.io

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Feb 9, 2016

Reviewed 3 of 3 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions.


storage/client_raft_test.go, line 1460 [r2] (raw file):
That's even worse - that means that this could cause this test to fail to detect the failure condition it's testing for. Why not disable the GC queue at the outset?


storage/replica.go, line 234 [r3] (raw file):
this could be in an else if below, but either way is fine


Comments from the review on Reviewable.io

@bdarnell
Copy link
Copy Markdown
Contributor Author

bdarnell commented Feb 9, 2016

Review status: all files reviewed at latest revision, 2 unresolved discussions.


storage/client_raft_test.go, line 1460 [r2] (raw file):
Because "the outset" is not accessible; we'd have to do a larger refactoring to be able to call DisableReplicaGCQueue in between NewStore and Start. I don't think it's worth that - it's too hypothetical of a concern. I'd rather just remove this line than make it any less racy.


Comments from the review on Reviewable.io

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Feb 9, 2016

LGTM


Review status: all files reviewed at latest revision, 2 unresolved discussions.


storage/client_raft_test.go, line 1460 [r2] (raw file):
Heh, sorry, I didn't notice the mtc.stores[0].DisableReplicaGCQueue at the top of this test; this already does what I wanted.


Comments from the review on Reviewable.io

@bdarnell
Copy link
Copy Markdown
Contributor Author

bdarnell commented Feb 9, 2016

Review status: all files reviewed at latest revision, 1 unresolved discussion.


storage/client_raft_test.go, line 1460 [r2] (raw file):
Yeah, that's the important one, and that one doesn't race.


Comments from the review on Reviewable.io

@tbg
Copy link
Copy Markdown
Member

tbg commented Feb 9, 2016

Nice, LGTM.


Reviewed 4 of 8 files at r2, 3 of 3 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion.


Comments from the review on Reviewable.io

bdarnell added a commit that referenced this pull request Feb 9, 2016
storage: remove the possibility of nil Replica.raftGroup
@bdarnell bdarnell merged commit f05d190 into cockroachdb:master Feb 9, 2016
@bdarnell bdarnell deleted the raft-group-nil branch February 9, 2016 00:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants