kv: reject lease transfers during in-progress range merge#60905
kv: reject lease transfers during in-progress range merge#60905craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
c5c3de2 to
d283ec5
Compare
tbg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvnemesis/validator.go, line 351 at r1 (raw file):
// A lease transfer that races with a replica removal may find that // the store it was targeting is no longer part of the range. } else if resultIsError(t.Result, `cannot transfer lease while merge in progress`) {
Can you use errors.Is instead? I know we're relying on string matching all over the place in this file, but there really shouldn't be a reason now that *roachpb.Error carries any structure with it. For example, for ChangeReplicas above we've already completely phase it out (though I believe there's an open flake).
pkg/kv/kvserver/client_merge_test.go, line 1555 at r1 (raw file):
testingRequestFilter := func(_ context.Context, ba roachpb.BatchRequest) *roachpb.Error { if ba.IsSingleSubsumeRequest() { subsumeReceived <- struct{}{}
I know we gave extra buffer, but why not make this non-blocking?
pkg/kv/kvserver/client_merge_test.go, line 1590 at r1 (raw file):
// Launch the merge. mergeErr := make(chan error) go func() {
Just for hygiene, can you use the stopper here?
pkg/kv/kvserver/client_merge_test.go, line 1601 at r1 (raw file):
// Transfer the lease to store 0. Even though the Subsume request has not // yet evaluted, the new leaseholder will notice the deletion intent on its
evaluated
tbg
left a comment
There was a problem hiding this comment.
Reviewed 4 of 4 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
d283ec5 to
2cd951a
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvnemesis/validator.go, line 351 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you use
errors.Isinstead? I know we're relying on string matching all over the place in this file, but there really shouldn't be a reason now that*roachpb.Errorcarries any structure with it. For example, forChangeReplicasabove we've already completely phase it out (though I believe there's an open flake).
I'm 💯 on this, though I'd rather do it all in one pass and clean things up like you did with kvserver.IsRetriableReplicationChangeError. You ok if I open an issue to address this when I push more on kvnemesis in the coming weeks?
pkg/kv/kvserver/client_merge_test.go, line 1555 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I know we gave extra buffer, but why not make this non-blocking?
Done.
pkg/kv/kvserver/client_merge_test.go, line 1590 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Just for hygiene, can you use the stopper here?
Is this something we want to start pushing for more in tests, now that #59647 has landed? I'm mostly supportive of it, except that RunAsyncTask returns an error that needs to be checked, so it's more heavyweight than a raw goroutine.
So I guess the options are:
- keep promoting the use of raw goroutines in tests,
go func() { ... }() - start switching over to
require.NoError(tc.Stopper().RunAsyncTask(ctx, "some-task-name", func(ctx context.Context) { ... })) - start switching over to
_ = tc.Stopper().RunAsyncTask(ctx, "some-task-name", func(ctx context.Context) { ... })
What do you think?
pkg/kv/kvserver/client_merge_test.go, line 1601 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
evaluated
Done.
tbg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvnemesis/validator.go, line 351 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm 💯 on this, though I'd rather do it all in one pass and clean things up like you did with
kvserver.IsRetriableReplicationChangeError. You ok if I open an issue to address this when I push more on kvnemesis in the coming weeks?
Not at all!
pkg/kv/kvserver/client_merge_test.go, line 1590 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this something we want to start pushing for more in tests, now that #59647 has landed? I'm mostly supportive of it, except that
RunAsyncTaskreturns an error that needs to be checked, so it's more heavyweight than a raw goroutine.So I guess the options are:
- keep promoting the use of raw goroutines in tests,
go func() { ... }()- start switching over to
require.NoError(tc.Stopper().RunAsyncTask(ctx, "some-task-name", func(ctx context.Context) { ... }))- start switching over to
_ = tc.Stopper().RunAsyncTask(ctx, "some-task-name", func(ctx context.Context) { ... })What do you think?
2 and 3 are both fine, I mostly just use 2 because it's cheap enough and we don't typically start tons of goroutines in tests.
This is also a nit, I think until we lint against raw goroutines it's not that productive to try to enforce a convention for test usage in reviews.
Found with kvnemesis's new ability to issue lease transfers. This commit prevents TransferLease requests from waiting on in-progress merges, instead rejecting them in `handleMergeInProgressError`. This is necessary because the merge may need to acquire a range lease in order to complete if it still needs to perform its Subsume request, which it likely will if this lease transfer revoked the leaseholder's existing range lease. Any concurrent lease acquisition attempt will be blocked on this lease transfer because a replica only performs a single lease operation at a time, so we reject to prevent a deadlock. The commit adds a new `TestStoreRangeMergeRHSLeaseTransfers` test which reliably hit this deadlock before the issue was fixed. Release note (bug fix): Fixes a rare deadlock where a series of lease transfers concurrent with a Range merge could block each other from ever completing. Release justification: bug fix
2cd951a to
9fbbb1a
Compare
nvb
left a comment
There was a problem hiding this comment.
TFTR!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvnemesis/validator.go, line 351 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Not at all!
Opened #61319.
pkg/kv/kvserver/client_merge_test.go, line 1590 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
2 and 3 are both fine, I mostly just use 2 because it's cheap enough and we don't typically start tons of goroutines in tests.
This is also a nit, I think until we lint against raw goroutines it's not that productive to try to enforce a convention for test usage in reviews.
Done.
|
bors r+ |
|
Build succeeded: |
Found with kvnemesis's new ability to issue lease transfers.
This commit prevents TransferLease requests from waiting on in-progress
merges, instead rejecting them in
handleMergeInProgressError. This isnecessary because the merge may need to acquire a range lease in order
to complete if it still needs to perform its Subsume request, which it
likely will if this lease transfer revoked the leaseholder's existing
range lease. Any concurrent lease acquisition attempt will be blocked on
this lease transfer because a replica only performs a single lease
operation at a time, so we reject to prevent a deadlock.
The commit adds a new
TestStoreRangeMergeRHSLeaseTransferstest whichreliably hit this deadlock before the issue was fixed.
Release note (bug fix): Fixes a rare deadlock where a series of lease
transfers concurrent with a Range merge could block each other from
ever completing.
Release justification: bug fix