Skip to content

kv: fix issues around failed 1PC txns#26497

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:txn-tracking
Jun 7, 2018
Merged

kv: fix issues around failed 1PC txns#26497
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:txn-tracking

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

The recent #25541 changed the way "tracking" (the heartbeat loop and
intent collection) is initiated for transactions. It aimed to simplify
things and put the burden on the client to decide when a txn needs
tracking. This introduced a problem - the client.Txn was not initiating
tracking when sending 1PC batches. However, tracking is needed for these
transactions too: even though usually they'll succeed and so the
TxnCoordSender state can be quickly destroyed, when they fail their
intents and heartbeat loop need to be kept around just like for any
other txn.

This patch backtracks on the previous move to make it the client's
responsibility to initiate tracking (it didn't stand the test of time):
the client.Txn is no longer in charge of calling tcs.StartTracking().
Instead, the TCS does whatever needs to be done when it sees an
EndTransaction.

I also took the opportunity to spruce up comments on the TxnCoordSender.

Release note: None

@andreimatei andreimatei requested a review from a team June 7, 2018 00:40
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

r1 is #26479


Review status: 0 of 8 files reviewed at latest revision, all discussions resolved, some commit checks pending.


Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

bdarnell commented Jun 7, 2018

:lgtm:

How did you determine that this fix was needed? I assume you were seeing cases where intents were being left pending; should we have a roachtest that exhibits this more clearly?


Reviewed 7 of 7 files at r2.
Review status: 7 of 8 files reviewed at latest revision, all discussions resolved, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 635 at r2 (raw file):

		tc.mu.meta.Txn = br.Txn.Clone()
		_, hasBT := ba.GetArg(roachpb.BeginTransaction)
		onePC := br.Txn.Status == roachpb.COMMITTED && hasBT

It seems like a lot of plumbing to get the onePC flag from here to updateStats. Could we just increment the 1PC stat here? (or maybe replace the metric completely with one tracked at the replica so it can be accurate)


pkg/kv/txn_coord_sender_test.go, line 1975 at r2 (raw file):

		Txn: txn.Proto(),
	}
	// ba := roachpb.BatchRequest{}

Remove these lines?


Comments from Reviewable

@nvb
Copy link
Copy Markdown
Contributor

nvb commented Jun 7, 2018

Reviewed 3 of 7 files at r2.
Review status: 7 of 8 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/internal/client/sender.go, line 111 at r2 (raw file):

// TxnSenders with GetMeta or AugmentMeta panicing with unimplemented. This is
// a helper mechanism to facilitate testing.
type TxnSenderAdapter struct {

Change this back to TxnSenderFunc if you're removing StartTrackingWrapped.


pkg/kv/txn_coord_sender.go, line 104 at r2 (raw file):

// the client.Txn is expected to create a new TxnCoordSender instance
// transparently for the higher-level client).
// - Ensures atomic execution for non-transactional (write) batches by transparently

We need to do this even for scans that span ranges.


Comments from Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

Review status: 7 of 8 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 104 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We need to do this even for scans that span ranges.

do we? If the txn is read-only, isn't a consistent timestamp the only thing we need across ranges?


Comments from Reviewable

The recent cockroachdb#25541 changed the way "tracking" (the heartbeat loop and
intent collection) is initiated for transactions. It aimed to simplify
things and put the burden on the client to decide when a txn needs
tracking. This introduced a problem - the client.Txn was not initiating
tracking when sending 1PC batches. However, tracking is needed for these
transactions too: even though usually they'll succeed and so the
TxnCoordSender state can be quickly destroyed, when they fail their
intents and heartbeat loop need to be kept around just like for any
other txn.

This patch backtracks on the previous move to make it the client's
responsibility to initiate tracking (it didn't stand the test of time):
the client.Txn is no longer in charge of calling tcs.StartTracking().
Instead, the TCS does whatever needs to be done when it sees an
EndTransaction.

I also took the opportunity to spruce up comments on the TxnCoordSender.

Release note: None
@andreimatei andreimatei requested review from a team June 7, 2018 19:10
@andreimatei
Copy link
Copy Markdown
Contributor Author

As we discussed, badness had not been seen in the wild. I think the test I've added is sufficient.


Review status: 2 of 8 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/internal/client/sender.go, line 111 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Change this back to TxnSenderFunc if you're removing StartTrackingWrapped.

Done.


pkg/kv/txn_coord_sender.go, line 635 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It seems like a lot of plumbing to get the onePC flag from here to updateStats. Could we just increment the 1PC stat here? (or maybe replace the metric completely with one tracked at the replica so it can be accurate)

Done.


pkg/kv/txn_coord_sender_test.go, line 1975 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove these lines?

Done.


Comments from Reviewable

@nvb
Copy link
Copy Markdown
Contributor

nvb commented Jun 7, 2018

:lgtm:


Review status: 2 of 8 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

bors r+


Review status: 2 of 8 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

craig bot pushed a commit that referenced this pull request Jun 7, 2018
26497: kv: fix issues around failed 1PC txns r=andreimatei a=andreimatei

The recent #25541 changed the way "tracking" (the heartbeat loop and
intent collection) is initiated for transactions. It aimed to simplify
things and put the burden on the client to decide when a txn needs
tracking. This introduced a problem - the client.Txn was not initiating
tracking when sending 1PC batches. However, tracking is needed for these
transactions too: even though usually they'll succeed and so the
TxnCoordSender state can be quickly destroyed, when they fail their
intents and heartbeat loop need to be kept around just like for any
other txn.

This patch backtracks on the previous move to make it the client's
responsibility to initiate tracking (it didn't stand the test of time):
the client.Txn is no longer in charge of calling tcs.StartTracking().
Instead, the TCS does whatever needs to be done when it sees an
EndTransaction.

I also took the opportunity to spruce up comments on the TxnCoordSender.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 7, 2018

Build succeeded

@craig craig bot merged commit 22554db into cockroachdb:master Jun 7, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Jun 11, 2018
I've introduced a bug in cockroachdb#26497 - if there's concurrent writing requests
going though a single client.Txn, it can be that some of the intents
are not tracked correctly. Before this patch, intents started being
tracked when the TxnCoordSender sees a BeginTransaction. Even through
the client.Txn attaches an EndTransaction to the first writing requests
that it sees, a 2nd request might pass through and get to the
TxnCoordSender before that BeginTxn. In fact, it might even get a
response before the TxnCoordSender sees the BeginTxn, in which case we
wouldn't have recorded its intents.
This patch makes the TxnCoordSender track intents from all writing
requests. The BeginTxn doesn't matter for tracking intents any more. It
continues to matter for starting the heartbeat loop. We could
alternatively move to starting the hb loop when seeing the first write,
but I didn't do it.

Fixes cockroachdb#26538

Release note: None
@andreimatei andreimatei deleted the txn-tracking branch June 11, 2018 21:38
craig bot pushed a commit that referenced this pull request Jun 12, 2018
26606: kv: fix intent tracking r=andreimatei a=andreimatei

I've introduced a bug in #26497 - if there's concurrent writing requests
going though a single client.Txn, it can be that some of the intents
are not tracked correctly. Before this patch, intents started being
tracked when the TxnCoordSender sees a BeginTransaction. Even through
the client.Txn attaches an EndTransaction to the first writing requests
that it sees, a 2nd request might pass through and get to the
TxnCoordSender before that BeginTxn. In fact, it might even get a
response before the TxnCoordSender sees the BeginTxn, in which case we
wouldn't have recorded its intents.
This patch makes the TxnCoordSender track intents from all writing
requests. The BeginTxn doesn't matter for tracking intents any more. It
continues to matter for starting the heartbeat loop. We could
alternatively move to starting the hb loop when seeing the first write,
but I didn't do it.

Fixes #26538

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants