Skip to content

kv: reject lease transfers during in-progress range merge#60905

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/rangeMergeLeaseTransfer
Mar 2, 2021
Merged

kv: reject lease transfers during in-progress range merge#60905
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/rangeMergeLeaseTransfer

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Feb 22, 2021

Found with kvnemesis's new ability to issue lease transfers.

This commit prevents TransferLease requests from waiting on in-progress
merges, instead rejecting them in handleMergeInProgressError. This is
necessary because the merge may need to acquire a range lease in order
to complete if it still needs to perform its Subsume request, which it
likely will if this lease transfer revoked the leaseholder's existing
range lease. Any concurrent lease acquisition attempt will be blocked on
this lease transfer because a replica only performs a single lease
operation at a time, so we reject to prevent a deadlock.

The commit adds a new TestStoreRangeMergeRHSLeaseTransfers test which
reliably hit this deadlock before the issue was fixed.

Release note (bug fix): Fixes a rare deadlock where a series of lease
transfers concurrent with a Range merge could block each other from
ever completing.

Release justification: bug fix

@nvb nvb requested a review from tbg February 22, 2021 00:22
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/rangeMergeLeaseTransfer branch from c5c3de2 to d283ec5 Compare February 22, 2021 07:32
Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvnemesis/validator.go, line 351 at r1 (raw file):

			// A lease transfer that races with a replica removal may find that
			// the store it was targeting is no longer part of the range.
		} else if resultIsError(t.Result, `cannot transfer lease while merge in progress`) {

Can you use errors.Is instead? I know we're relying on string matching all over the place in this file, but there really shouldn't be a reason now that *roachpb.Error carries any structure with it. For example, for ChangeReplicas above we've already completely phase it out (though I believe there's an open flake).


pkg/kv/kvserver/client_merge_test.go, line 1555 at r1 (raw file):

	testingRequestFilter := func(_ context.Context, ba roachpb.BatchRequest) *roachpb.Error {
		if ba.IsSingleSubsumeRequest() {
			subsumeReceived <- struct{}{}

I know we gave extra buffer, but why not make this non-blocking?


pkg/kv/kvserver/client_merge_test.go, line 1590 at r1 (raw file):

	// Launch the merge.
	mergeErr := make(chan error)
	go func() {

Just for hygiene, can you use the stopper here?


pkg/kv/kvserver/client_merge_test.go, line 1601 at r1 (raw file):

	// Transfer the lease to store 0. Even though the Subsume request has not
	// yet evaluted, the new leaseholder will notice the deletion intent on its

evaluated

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvb nvb force-pushed the nvanbenschoten/rangeMergeLeaseTransfer branch from d283ec5 to 2cd951a Compare February 25, 2021 06:16
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvnemesis/validator.go, line 351 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you use errors.Is instead? I know we're relying on string matching all over the place in this file, but there really shouldn't be a reason now that *roachpb.Error carries any structure with it. For example, for ChangeReplicas above we've already completely phase it out (though I believe there's an open flake).

I'm 💯 on this, though I'd rather do it all in one pass and clean things up like you did with kvserver.IsRetriableReplicationChangeError. You ok if I open an issue to address this when I push more on kvnemesis in the coming weeks?


pkg/kv/kvserver/client_merge_test.go, line 1555 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I know we gave extra buffer, but why not make this non-blocking?

Done.


pkg/kv/kvserver/client_merge_test.go, line 1590 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just for hygiene, can you use the stopper here?

Is this something we want to start pushing for more in tests, now that #59647 has landed? I'm mostly supportive of it, except that RunAsyncTask returns an error that needs to be checked, so it's more heavyweight than a raw goroutine.

So I guess the options are:

  1. keep promoting the use of raw goroutines in tests, go func() { ... }()
  2. start switching over to require.NoError(tc.Stopper().RunAsyncTask(ctx, "some-task-name", func(ctx context.Context) { ... }))
  3. start switching over to _ = tc.Stopper().RunAsyncTask(ctx, "some-task-name", func(ctx context.Context) { ... })

What do you think?


pkg/kv/kvserver/client_merge_test.go, line 1601 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

evaluated

Done.

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvnemesis/validator.go, line 351 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm 💯 on this, though I'd rather do it all in one pass and clean things up like you did with kvserver.IsRetriableReplicationChangeError. You ok if I open an issue to address this when I push more on kvnemesis in the coming weeks?

Not at all!


pkg/kv/kvserver/client_merge_test.go, line 1590 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this something we want to start pushing for more in tests, now that #59647 has landed? I'm mostly supportive of it, except that RunAsyncTask returns an error that needs to be checked, so it's more heavyweight than a raw goroutine.

So I guess the options are:

  1. keep promoting the use of raw goroutines in tests, go func() { ... }()
  2. start switching over to require.NoError(tc.Stopper().RunAsyncTask(ctx, "some-task-name", func(ctx context.Context) { ... }))
  3. start switching over to _ = tc.Stopper().RunAsyncTask(ctx, "some-task-name", func(ctx context.Context) { ... })

What do you think?

2 and 3 are both fine, I mostly just use 2 because it's cheap enough and we don't typically start tons of goroutines in tests.
This is also a nit, I think until we lint against raw goroutines it's not that productive to try to enforce a convention for test usage in reviews.

Found with kvnemesis's new ability to issue lease transfers.

This commit prevents TransferLease requests from waiting on in-progress
merges, instead rejecting them in `handleMergeInProgressError`. This is
necessary because the merge may need to acquire a range lease in order
to complete if it still needs to perform its Subsume request, which it
likely will if this lease transfer revoked the leaseholder's existing
range lease. Any concurrent lease acquisition attempt will be blocked on
this lease transfer because a replica only performs a single lease
operation at a time, so we reject to prevent a deadlock.

The commit adds a new `TestStoreRangeMergeRHSLeaseTransfers` test which
reliably hit this deadlock before the issue was fixed.

Release note (bug fix): Fixes a rare deadlock where a series of lease
transfers concurrent with a Range merge could block each other from
ever completing.

Release justification: bug fix
@nvb nvb force-pushed the nvanbenschoten/rangeMergeLeaseTransfer branch from 2cd951a to 9fbbb1a Compare March 2, 2021 05:56
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvnemesis/validator.go, line 351 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Not at all!

Opened #61319.


pkg/kv/kvserver/client_merge_test.go, line 1590 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

2 and 3 are both fine, I mostly just use 2 because it's cheap enough and we don't typically start tons of goroutines in tests.
This is also a nit, I think until we lint against raw goroutines it's not that productive to try to enforce a convention for test usage in reviews.

Done.

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Mar 2, 2021

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 2, 2021

Build succeeded:

@craig craig bot merged commit f8b7708 into cockroachdb:master Mar 2, 2021
@nvb nvb deleted the nvanbenschoten/rangeMergeLeaseTransfer branch March 3, 2021 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants