kv,storage: disable pipelining in merge transaction by benesch · Pull Request #27829 · cockroachdb/cockroach

benesch · 2018-07-22T02:57:48Z

It turns out that pipelining is fundamentally incompatible with merge
transactions. The extra QueryIntent requests generated by pipelining can
get routed to a RHS that is currently blocked for merging and wedge the
merge. Consider the following sequence of events:

The merge transaction begins and lays down intents on the meta2 and
local copy of the RHS range descriptor using asynchronous commits.
The RHS loses its lease.
The merge transaction sends a GetSnapshotForMergeRequest. The
pipeliner injects two QueryIntent requests, one to the meta2 range
descriptor and one to the local range descriptor on the RHS, as
non-transactional requests always flush the pipeline.
The QueryIntent request hits the RHS and triggers a lease
acquisition. The lease acquisition notices that the local range
descriptor has a deletion intent, infers that a merge is in
progress, and blocks all traffic, except for GetSnapshotForMerge
requests.

GetSnapshotForMerge requests are allowed to proceed even when the range
is blocked for merging but QueryIntent requests are not. We can't, in
general, allow QueryIntent requests through, as we might return stale
data if the merge has already committed and the LHS is processing
new writes. We could, I think, allow QueryIntents for the local range
descriptor to proceed, but that's a very-specific corner case and hard
to reason about.

Just disable pipelining in the merge transaction to sidestep the issue.

One other option: we could allow the RHS to serve reads up until the
commit timestamp of the merge transaction. That would allow the
GetSnapshotForMerge and QueryIntent requests from the merge transaction
through, as well as any other read traffic executing beneath the merge
transaction's timestamp. The downside is that the new leaseholder would
have to wait out the full clock offset after the merge txn committed to
serve any writes. (Just like the lease stasis period.) It was decided
that this option was too disruptive and complex to pursue for the time
being.

Fix #27820.

cockroach-teamcity · 2018-07-22T02:57:54Z

This change is

nvb · 2018-07-23T11:43:44Z

This approach seems fine, although it is unfortunate how much plumbing is required just to pass the disable flag all the way down. When I was thinking about this before I also considered a "meta request" that the txnPipeliner could *ahem* intercept, but since there was no precedent for that I figured it would be poor style.

One other option: we could allow the RHS to serve reads up until the
commit timestamp of the merge transaction. That would allow the
GetSnapshotForMerge and QueryIntent requests from the merge transaction
through, as well as any other read traffic executing beneath the merge
transaction's timestamp. The downside is that the new leaseholder would
have to wait out the full clock offset after the merge txn committed to
serve any writes. (Just like the lease stasis period.) That seems
unnecessarily disruptive though.

Is this true? Wouldn't the same reasoning that allows for lease transfers to avoid waiting out the max offset also apply here?

benesch · 2018-07-23T17:57:42Z

Is this true? Wouldn't the same reasoning that allows for lease transfers to avoid waiting out the max offset also apply here?

It seems that on lease transfers the old leaseholder promises not to serve any more reads, even if those reads are beneath the timestamp of the new lease:

cockroach/pkg/storage/replica_range_lease.go

Lines 542 to 547 in 19ed9a3

    
           // start timestamp of the new (transferred) lease. More subtly, the 
        
           // replica can't even serve reads or propose commands with timestamps 
        
           // lower than the start of the new lease because it could lead to read 
        
           // your own write violations (see comments on the stasis period in 
        
           // IsLeaseValid). We could, in principle, serve reads more than the 
        
           // maximum clock offset in the past.

@bdarnell @tschottdorf could one of you weigh in here? The "comments on the stasis period in IsLeaseValid" that supposedly explain all this don't exist, or at least I can't find them.

bdarnell

I think those comments have migrated to (*Replica).leaseStatus. They don't address transfers directly but there's an explanation of the whole "lease stasis" concept. In effect, a lease transfer retroactively creates a stasis period for the timestamps prior to the transfer. However, we haven't implemented this exactly; we just use the blunter solution of disallowing all reads from a leaseholder that has transferred its lease away.

The same would apply to merges: the RHS could continue to serve reads with timestamps less than mergeTimestamp - maxOffset, but not at mergeTimestamp itself.

This all due to clock uncertainty and the fact that a read right before the end of a lease might need to return a ReadWithinUncertaintyIntervalError. There might be a subtle/clever solution involving communication between the LHS and RHS leaseholders that gives us certainty about the clock. Or maybe QueryIntent can be a special case that is allowed during stasis because it uses a timestamp at which a previous write was known to have occurred.

Reviewable status: complete! 0 of 0 LGTMs obtained

benesch · 2018-07-24T21:14:39Z

Per discussion in #core, the disable-pipelining knob is the way to go. This is ready for review!

nvb

Reviewed 8 of 8 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)

pkg/kv/txn_coord_sender_test.go, line 1729 at r1 (raw file):

}

func TestTxnCoordSenderPipelining(t *testing.T) {

Add a comment here explaining what this is testing.

pkg/storage/replica_command.go, line 479 at r1 (raw file):

		}

		// Intents have been placed, so the merge is now in its critical phase. Get

This can be reworked now! The comment is a bit stale, the TODO can be removed, and there's no need to even send the GetSnapshotForMergeRequest through the txn (if you want to switch that back).

tbg · 2018-07-25T09:17:52Z

pkg/kv/txn_coord_sender.go

 	tc.mu.onFinishFn = onFinishFn
 }

+// DisablePipelining is part of the client.TxnSender interface.


Are there concerns with data races here when the txn is used concurrently? Maybe just something to comment on.

Oh, looks like this is protected by a Mutex in the parent. Nevermind, I think.

And it'll error out when not the first thing in the txn. Ok, we're probably good... :-)

It turns out that pipelining is fundamentally incompatible with merge transactions. The extra QueryIntent requests generated by pipelining can get routed to a RHS that is currently blocked for merging and wedge the merge. Consider the following sequence of events: 1. The merge transaction begins and lays down intents on the meta2 and local copy of the RHS range descriptor using asynchronous commits. 2. The RHS loses its lease. 3. The merge transaction sends a GetSnapshotForMergeRequest. The pipeliner injects two QueryIntent requests, one to the meta2 range descriptor and one to the local range descriptor on the RHS, as non-transactional requests always flush the pipeline. 4. The QueryIntent request hits the RHS and triggers a lease acquisition. The lease acquisition notices that the local range descriptor has a deletion intent, infers that a merge is in progress, and blocks all traffic, except for GetSnapshotForMerge requests. GetSnapshotForMerge requests are allowed to proceed even when the range is blocked for merging but QueryIntent requests are not. We can't, in general, allow QueryIntent requests through, as we might return stale data if the merge has already committed and the LHS is processing new writes. We could, I think, allow QueryIntents for the local range descriptor to proceed, but that's a very-specific corner case and hard to reason about. Just disable pipelining in the merge transaction to sidestep the issue. One other option: we could allow the RHS to serve reads up until the commit timestamp of the merge transaction. That would allow the GetSnapshotForMerge and QueryIntent requests from the merge transaction through, as well as any other read traffic executing beneath the merge transaction's timestamp. The downside is that the new leaseholder would have to wait out the full clock offset after the merge txn committed to serve any writes. (Just like the lease stasis period.) It was decided that this option was too disruptive and complex to pursue for the time being. Fix cockroachdb#27820.

benesch

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)

pkg/kv/txn_coord_sender.go, line 499 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

And it'll error out when not the first thing in the txn. Ok, we're probably good... :-)

Ack.

pkg/kv/txn_coord_sender_test.go, line 1729 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a comment here explaining what this is testing.

Done.

pkg/storage/replica_command.go, line 479 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This can be reworked now! The comment is a bit stale, the TODO can be removed, and there's no need to even send the GetSnapshotForMergeRequest through the txn (if you want to switch that back).

Done.

benesch · 2018-07-25T16:40:14Z

bors r=nvanbenschoten

craig · 2018-07-25T17:14:57Z

Build failed

GitHub CI (Cockroach)

benesch · 2018-07-25T18:15:08Z

Flake.

bors r=nvanbenschoten

27829: kv,storage: disable pipelining in merge transaction r=nvanbenschoten a=benesch It turns out that pipelining is fundamentally incompatible with merge transactions. The extra QueryIntent requests generated by pipelining can get routed to a RHS that is currently blocked for merging and wedge the merge. Consider the following sequence of events: 1. The merge transaction begins and lays down intents on the meta2 and local copy of the RHS range descriptor using asynchronous commits. 2. The RHS loses its lease. 3. The merge transaction sends a GetSnapshotForMergeRequest. The pipeliner injects two QueryIntent requests, one to the meta2 range descriptor and one to the local range descriptor on the RHS, as non-transactional requests always flush the pipeline. 4. The QueryIntent request hits the RHS and triggers a lease acquisition. The lease acquisition notices that the local range descriptor has a deletion intent, infers that a merge is in progress, and blocks all traffic, except for GetSnapshotForMerge requests. GetSnapshotForMerge requests are allowed to proceed even when the range is blocked for merging but QueryIntent requests are not. We can't, in general, allow QueryIntent requests through, as we might return stale data if the merge has already committed and the LHS is processing new writes. We could, I think, allow QueryIntents for the local range descriptor to proceed, but that's a very-specific corner case and hard to reason about. Just disable pipelining in the merge transaction to sidestep the issue. One other option: we could allow the RHS to serve reads up until the commit timestamp of the merge transaction. That would allow the GetSnapshotForMerge and QueryIntent requests from the merge transaction through, as well as any other read traffic executing beneath the merge transaction's timestamp. The downside is that the new leaseholder would have to wait out the full clock offset after the merge txn committed to serve any writes. (Just like the lease stasis period.) It was decided that this option was too disruptive and complex to pursue for the time being. Fix #27820. 27920: stopper: Ensure that Closers and Ctxs are always cleaned up r=benesch a=nvanbenschoten Before this change, the Stopper made no guarantee that `Closers` registered by the `AddCloser` method or `Contexts` tied to the `Stopper` with `WithCancelOn{Quiesce,Stop}` were ever properly cleaned up. In cases where these methods were called concurrently with the Stopper stopping, it was possible for the resources to leak. This change fixes this by handling cases where these methods are called concurrently with or after the Stopper has stopped. This allows them to provide stronger external guarantees. The PR also correctly synchronizes access to the `sCancels` map to fix #27908. Release note: None Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

craig · 2018-07-25T18:50:23Z

Build succeeded

GitHub CI (Cockroach)

benesch added the do-not-merge bors won't merge a PR with this label. label Jul 22, 2018

benesch requested review from a team, bdarnell, nvb and tbg July 22, 2018 02:57

bdarnell reviewed Jul 23, 2018

View reviewed changes

benesch force-pushed the merge-no-pipe branch from ed00d1b to 3b88b6d Compare July 24, 2018 21:07

benesch changed the title ~~[wip] kv,storage: disable pipelining in merge transaction~~ kv,storage: disable pipelining in merge transaction Jul 24, 2018

benesch removed the do-not-merge bors won't merge a PR with this label. label Jul 24, 2018

nvb approved these changes Jul 25, 2018

View reviewed changes

tbg reviewed Jul 25, 2018

View reviewed changes

benesch force-pushed the merge-no-pipe branch from 3b88b6d to 57d7a09 Compare July 25, 2018 15:38

benesch force-pushed the merge-no-pipe branch from 57d7a09 to 408d45a Compare July 25, 2018 15:39

benesch commented Jul 25, 2018

View reviewed changes

craig bot merged commit 408d45a into cockroachdb:master Jul 25, 2018

benesch deleted the merge-no-pipe branch July 27, 2018 17:48

Conversation

benesch commented Jul 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Jul 22, 2018

Uh oh!

nvb commented Jul 23, 2018

Uh oh!

benesch commented Jul 23, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bdarnell left a comment

Choose a reason for hiding this comment

Uh oh!

benesch commented Jul 24, 2018

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

tbg Jul 25, 2018

Choose a reason for hiding this comment

Uh oh!

tbg Jul 25, 2018

Choose a reason for hiding this comment

Uh oh!

tbg Jul 25, 2018

Choose a reason for hiding this comment

Uh oh!

benesch left a comment

Choose a reason for hiding this comment

Uh oh!

benesch commented Jul 25, 2018

Uh oh!

craig bot commented Jul 25, 2018

Build failed

Uh oh!

benesch commented Jul 25, 2018

Uh oh!

craig bot commented Jul 25, 2018

Build succeeded

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

benesch commented Jul 22, 2018 •

edited

Loading

benesch commented Jul 23, 2018 •

edited

Loading