Skip to content

kv,storage: disable pipelining in merge transaction#27829

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
benesch:merge-no-pipe
Jul 25, 2018
Merged

kv,storage: disable pipelining in merge transaction#27829
craig[bot] merged 1 commit intocockroachdb:masterfrom
benesch:merge-no-pipe

Conversation

@benesch
Copy link
Copy Markdown
Contributor

@benesch benesch commented Jul 22, 2018

It turns out that pipelining is fundamentally incompatible with merge
transactions. The extra QueryIntent requests generated by pipelining can
get routed to a RHS that is currently blocked for merging and wedge the
merge. Consider the following sequence of events:

  1. The merge transaction begins and lays down intents on the meta2 and
    local copy of the RHS range descriptor using asynchronous commits.

  2. The RHS loses its lease.

  3. The merge transaction sends a GetSnapshotForMergeRequest. The
    pipeliner injects two QueryIntent requests, one to the meta2 range
    descriptor and one to the local range descriptor on the RHS, as
    non-transactional requests always flush the pipeline.

  4. The QueryIntent request hits the RHS and triggers a lease
    acquisition. The lease acquisition notices that the local range
    descriptor has a deletion intent, infers that a merge is in
    progress, and blocks all traffic, except for GetSnapshotForMerge
    requests.

GetSnapshotForMerge requests are allowed to proceed even when the range
is blocked for merging but QueryIntent requests are not. We can't, in
general, allow QueryIntent requests through, as we might return stale
data if the merge has already committed and the LHS is processing
new writes. We could, I think, allow QueryIntents for the local range
descriptor to proceed, but that's a very-specific corner case and hard
to reason about.

Just disable pipelining in the merge transaction to sidestep the issue.

One other option: we could allow the RHS to serve reads up until the
commit timestamp of the merge transaction. That would allow the
GetSnapshotForMerge and QueryIntent requests from the merge transaction
through, as well as any other read traffic executing beneath the merge
transaction's timestamp. The downside is that the new leaseholder would
have to wait out the full clock offset after the merge txn committed to
serve any writes. (Just like the lease stasis period.) It was decided
that this option was too disruptive and complex to pursue for the time
being.

Fix #27820.

@benesch benesch added the do-not-merge bors won't merge a PR with this label. label Jul 22, 2018
@benesch benesch requested review from a team, bdarnell, nvb and tbg July 22, 2018 02:57
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb
Copy link
Copy Markdown
Contributor

nvb commented Jul 23, 2018

This approach seems fine, although it is unfortunate how much plumbing is required just to pass the disable flag all the way down. When I was thinking about this before I also considered a "meta request" that the txnPipeliner could *ahem* intercept, but since there was no precedent for that I figured it would be poor style.

One other option: we could allow the RHS to serve reads up until the
commit timestamp of the merge transaction. That would allow the
GetSnapshotForMerge and QueryIntent requests from the merge transaction
through, as well as any other read traffic executing beneath the merge
transaction's timestamp. The downside is that the new leaseholder would
have to wait out the full clock offset after the merge txn committed to
serve any writes. (Just like the lease stasis period.) That seems
unnecessarily disruptive though.

Is this true? Wouldn't the same reasoning that allows for lease transfers to avoid waiting out the max offset also apply here?

@benesch
Copy link
Copy Markdown
Contributor Author

benesch commented Jul 23, 2018

Is this true? Wouldn't the same reasoning that allows for lease transfers to avoid waiting out the max offset also apply here?

It seems that on lease transfers the old leaseholder promises not to serve any more reads, even if those reads are beneath the timestamp of the new lease:

// start timestamp of the new (transferred) lease. More subtly, the
// replica can't even serve reads or propose commands with timestamps
// lower than the start of the new lease because it could lead to read
// your own write violations (see comments on the stasis period in
// IsLeaseValid). We could, in principle, serve reads more than the
// maximum clock offset in the past.

@bdarnell @tschottdorf could one of you weigh in here? The "comments on the stasis period in IsLeaseValid" that supposedly explain all this don't exist, or at least I can't find them.

Copy link
Copy Markdown
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those comments have migrated to (*Replica).leaseStatus. They don't address transfers directly but there's an explanation of the whole "lease stasis" concept. In effect, a lease transfer retroactively creates a stasis period for the timestamps prior to the transfer. However, we haven't implemented this exactly; we just use the blunter solution of disallowing all reads from a leaseholder that has transferred its lease away.

The same would apply to merges: the RHS could continue to serve reads with timestamps less than mergeTimestamp - maxOffset, but not at mergeTimestamp itself.

This all due to clock uncertainty and the fact that a read right before the end of a lease might need to return a ReadWithinUncertaintyIntervalError. There might be a subtle/clever solution involving communication between the LHS and RHS leaseholders that gives us certainty about the clock. Or maybe QueryIntent can be a special case that is allowed during stasis because it uses a timestamp at which a previous write was known to have occurred.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@benesch benesch changed the title [wip] kv,storage: disable pipelining in merge transaction kv,storage: disable pipelining in merge transaction Jul 24, 2018
@benesch benesch removed the do-not-merge bors won't merge a PR with this label. label Jul 24, 2018
@benesch
Copy link
Copy Markdown
Contributor Author

benesch commented Jul 24, 2018

Per discussion in #core, the disable-pipelining knob is the way to go. This is ready for review!

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 8 of 8 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/kv/txn_coord_sender_test.go, line 1729 at r1 (raw file):

}

func TestTxnCoordSenderPipelining(t *testing.T) {

Add a comment here explaining what this is testing.


pkg/storage/replica_command.go, line 479 at r1 (raw file):

		}

		// Intents have been placed, so the merge is now in its critical phase. Get

This can be reworked now! The comment is a bit stale, the TODO can be removed, and there's no need to even send the GetSnapshotForMergeRequest through the txn (if you want to switch that back).

tc.mu.onFinishFn = onFinishFn
}

// DisablePipelining is part of the client.TxnSender interface.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there concerns with data races here when the txn is used concurrently? Maybe just something to comment on.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, looks like this is protected by a Mutex in the parent. Nevermind, I think.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it'll error out when not the first thing in the txn. Ok, we're probably good... :-)

It turns out that pipelining is fundamentally incompatible with merge
transactions. The extra QueryIntent requests generated by pipelining can
get routed to a RHS that is currently blocked for merging and wedge the
merge. Consider the following sequence of events:

  1. The merge transaction begins and lays down intents on the meta2 and
     local copy of the RHS range descriptor using asynchronous commits.

  2. The RHS loses its lease.

  3. The merge transaction sends a GetSnapshotForMergeRequest. The
     pipeliner injects two QueryIntent requests, one to the meta2 range
     descriptor and one to the local range descriptor on the RHS, as
     non-transactional requests always flush the pipeline.

  4. The QueryIntent request hits the RHS and triggers a lease
     acquisition. The lease acquisition notices that the local range
     descriptor has a deletion intent, infers that a merge is in
     progress, and blocks all traffic, except for GetSnapshotForMerge
     requests.

GetSnapshotForMerge requests are allowed to proceed even when the range
is blocked for merging but QueryIntent requests are not. We can't, in
general, allow QueryIntent requests through, as we might return stale
data if the merge has already committed and the LHS is processing
new writes. We could, I think, allow QueryIntents for the local range
descriptor to proceed, but that's a very-specific corner case and hard
to reason about.

Just disable pipelining in the merge transaction to sidestep the issue.

One other option: we could allow the RHS to serve reads up until the
commit timestamp of the merge transaction. That would allow the
GetSnapshotForMerge and QueryIntent requests from the merge transaction
through, as well as any other read traffic executing beneath the merge
transaction's timestamp. The downside is that the new leaseholder would
have to wait out the full clock offset after the merge txn committed to
serve any writes. (Just like the lease stasis period.) It was decided
that this option was too disruptive and complex to pursue for the time
being.

Fix cockroachdb#27820.
Copy link
Copy Markdown
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/kv/txn_coord_sender.go, line 499 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

And it'll error out when not the first thing in the txn. Ok, we're probably good... :-)

Ack.


pkg/kv/txn_coord_sender_test.go, line 1729 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a comment here explaining what this is testing.

Done.


pkg/storage/replica_command.go, line 479 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This can be reworked now! The comment is a bit stale, the TODO can be removed, and there's no need to even send the GetSnapshotForMergeRequest through the txn (if you want to switch that back).

Done.

@benesch
Copy link
Copy Markdown
Contributor Author

benesch commented Jul 25, 2018

bors r=nvanbenschoten

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 25, 2018

Build failed

@benesch
Copy link
Copy Markdown
Contributor Author

benesch commented Jul 25, 2018

Flake.

bors r=nvanbenschoten

craig bot pushed a commit that referenced this pull request Jul 25, 2018
27829: kv,storage: disable pipelining in merge transaction r=nvanbenschoten a=benesch

It turns out that pipelining is fundamentally incompatible with merge
transactions. The extra QueryIntent requests generated by pipelining can
get routed to a RHS that is currently blocked for merging and wedge the
merge. Consider the following sequence of events:

  1. The merge transaction begins and lays down intents on the meta2 and
     local copy of the RHS range descriptor using asynchronous commits.

  2. The RHS loses its lease.

  3. The merge transaction sends a GetSnapshotForMergeRequest. The
     pipeliner injects two QueryIntent requests, one to the meta2 range
     descriptor and one to the local range descriptor on the RHS, as
     non-transactional requests always flush the pipeline.

  4. The QueryIntent request hits the RHS and triggers a lease
     acquisition. The lease acquisition notices that the local range
     descriptor has a deletion intent, infers that a merge is in
     progress, and blocks all traffic, except for GetSnapshotForMerge
     requests.

GetSnapshotForMerge requests are allowed to proceed even when the range
is blocked for merging but QueryIntent requests are not. We can't, in
general, allow QueryIntent requests through, as we might return stale
data if the merge has already committed and the LHS is processing
new writes. We could, I think, allow QueryIntents for the local range
descriptor to proceed, but that's a very-specific corner case and hard
to reason about.

Just disable pipelining in the merge transaction to sidestep the issue.

One other option: we could allow the RHS to serve reads up until the
commit timestamp of the merge transaction. That would allow the
GetSnapshotForMerge and QueryIntent requests from the merge transaction
through, as well as any other read traffic executing beneath the merge
transaction's timestamp. The downside is that the new leaseholder would
have to wait out the full clock offset after the merge txn committed to
serve any writes. (Just like the lease stasis period.) It was decided
that this option was too disruptive and complex to pursue for the time
being.

Fix #27820.

27920: stopper: Ensure that Closers and Ctxs are always cleaned up r=benesch a=nvanbenschoten

Before this change, the Stopper made no guarantee that `Closers`
registered by the `AddCloser` method or `Contexts` tied to
the `Stopper` with `WithCancelOn{Quiesce,Stop}` were ever properly
cleaned up. In cases where these methods were called concurrently
with the Stopper stopping, it was possible for the resources to
leak. This change fixes this by handling cases where these methods
are called concurrently with or after the Stopper has stopped. This
allows them to provide stronger external guarantees.

The PR also correctly synchronizes access to the `sCancels` map to
fix #27908.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 25, 2018

Build succeeded

@craig craig bot merged commit 408d45a into cockroachdb:master Jul 25, 2018
@benesch benesch deleted the merge-no-pipe branch July 27, 2018 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants