Skip to content

storage: return error from split/merge lock acquisition#19448

Merged
tbg merged 2 commits intocockroachdb:masterfrom
tbg:return-nil
Oct 23, 2017
Merged

storage: return error from split/merge lock acquisition#19448
tbg merged 2 commits intocockroachdb:masterfrom
tbg:return-nil

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented Oct 23, 2017

See #19172.

@tbg tbg requested a review from a team October 23, 2017 15:40
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb
Copy link
Copy Markdown
Contributor

nvb commented Oct 23, 2017

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/store.go, line 3878 at r1 (raw file):

			//
			// Inspired by https://github.com/cockroachdb/cockroach/pull/19404.
			runtime.Gosched()

Gosched can hurt performance, especially in the tail latencies, so I wouldn't say there's no harm. That said, this shouldn't be very common so I doubt it will have an impact. Still, I'm not sure it's worth the risk if there's no reason to add this.


Comments from Reviewable

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Oct 23, 2017

Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/store.go, line 3878 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Gosched can hurt performance, especially in the tail latencies, so I wouldn't say there's no harm. That said, this shouldn't be very common so I doubt it will have an impact. Still, I'm not sure it's worth the risk if there's no reason to add this.

Note that the Gosched is on the err = errRetry path, so I doubt it matters for performance. And, with a loop like this, I can't prove that there's no reason to add this, and unless we can't, I think we should generally add it to avoid this kind of freak deadlock.


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 4536 at r2 (raw file):

	rightRng, created, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 0, nil)
	if err != nil {
		log.Fatalf(ctx, "while acquiring split lock: %s", err)

Shouldn't you return the error to the caller in order to force error the Raft command? The code in Store.tryGetOrCreateReplica has enough error paths that I can't convince myself none of them will occur.


pkg/storage/store.go, line 3878 at r1 (raw file):

			//
			// Inspired by https://github.com/cockroachdb/cockroach/pull/19404.
			runtime.Gosched()

Was this addition motivated by a real concern. Store.tryGetOrCreateReplica performs multiple lock operations which will definitely provide stop-the-world safe points for the GC.


Comments from Reviewable

tbg added 2 commits October 23, 2017 18:12
Also report the error to sentry because we suspect it to have caused
the bug referenced below in versions of CockroachDB not running with
this commit.

See cockroachdb#19172.
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Oct 23, 2017

Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 4536 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't you return the error to the caller in order to force error the Raft command? The code in Store.tryGetOrCreateReplica has enough error paths that I can't convince myself none of them will occur.

Good point. I definitely want to learn about the errors out there in the wild though, so I added reporting to sentry. PTAL!


pkg/storage/store.go, line 3878 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Was this addition motivated by a real concern. Store.tryGetOrCreateReplica performs multiple lock operations which will definitely provide stop-the-world safe points for the GC.

No, just paranoia given the recent deadlock in TestRefreshPendingCmds. You're right that the abundance of lock acquisition makes this too paranoid. Removed the commit.


Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

:lgtm:


Reviewed 2 of 4 files at r3.
Review status: 2 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

@nvb
Copy link
Copy Markdown
Contributor

nvb commented Oct 23, 2017

LGTM, but the PR title is incorrect now.

@petermattis
Copy link
Copy Markdown
Collaborator

:lgtm:


Review status: 2 of 3 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


Comments from Reviewable

@tbg tbg changed the title storage: fatal on error in acquireSplitLock storage: return error from split/merge lock acquisition Oct 23, 2017
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Oct 23, 2017

TFTRs, everyone! @nvanbenschoten changed the title.

@nvb
Copy link
Copy Markdown
Contributor

nvb commented Jul 30, 2019

I stumbled upon this PR while cleaning up #38954 and am now confused about why setting forcedErr to the result of maybeAcquireSplitMergeLock is correct. We don't expect to ever see an error here, but if we do, I don't think we can silently let it prevent one of the replicas from applying the split/merge. That's the kind of thing that causes replica inconsistencies. I think the only option we have is to fatal.

nvb added a commit to nvb/cockroach that referenced this pull request Jul 30, 2019
If this operation fails then we can't simply return an error and
not apply the command. We need to be deterministic at this point.
The only option we have is to fatal.

See cockroachdb#19448 (comment).

The commit also removes the crash reporting added here, which doesn't
appear to have ever fired after two years.

Release note: None
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jul 30, 2019

Not sure what we were thinking there, Fatal looks like the only option.

craig Bot pushed a commit that referenced this pull request Jul 30, 2019
39158: storage: clean up below-Raft Split and Merge locking r=bdarnell a=nvanbenschoten

This PR contains three related cleanups that I stumbled into while cleaning up raft entry application in preparation for #38954.

The first cleanup is that we avoid improperly handling errors that may be returned when acquiring the split/merge locks. If this operation fails then we can't simply return an error and not apply the command. We need to be deterministic at this point. The only option we have is to fatal. See #19448 (comment).

The second cleanup is that we then remove stale code that was attempting to recover from failed split and merge application. This error handling only made sense in a pre-proposer evaluated KV world.

The third commit addresses a TODO to assert that the RHS range during a split is not initialized. The TODO looks like it went in before proposer evaluated KV, which protects us against both replays and reproposals.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants