kv: stop concerning the TxnCoordSender with abandoned txns by andreimatei · Pull Request #23055 · cockroachdb/cockroach

andreimatei · 2018-02-25T01:59:24Z

Assorted cleanups of the client.Txn interface and the TxnCoordSender. See individual commits. The main point is the last one, which reads:

kv: stop concerning the TxnCoordSender with abandoned txns …
Before this patch, the TxnCoordSender had a role in cleaning up
abandoned transactions, where abandoned was defined in a funky way:

1) for txns with a cancelable context, a txn was abandoned if the ctx
   had been canceled.
2) for the other txns, the txn was considered abandoned if it ran for 10
   seconds.

SQL only triggers 1).
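
As a rough illustration of the two rules above, here is a minimal Go sketch. The helper `isAbandoned` and the constant `abandonTimeout` are hypothetical names made up for this example; this is not the TxnCoordSender's actual code, and treating exactly 10 seconds as "abandoned" is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// abandonTimeout mirrors the 10 second figure from the description.
// The name is invented for this sketch.
const abandonTimeout = 10 * time.Second

// isAbandoned restates the two rules: a txn with a cancelable ctx is
// abandoned once that ctx is canceled; any other txn is abandoned once
// it has run for abandonTimeout.
func isAbandoned(ctx context.Context, start, now time.Time) bool {
	if ctx.Done() != nil {
		// Rule 1: cancelable context.
		return ctx.Err() != nil
	}
	// Rule 2: non-cancelable context (e.g. context.Background()).
	return now.Sub(start) >= abandonTimeout
}

// demo evaluates three representative cases.
func demo() (canceled, liveCancelable, oldBackground bool) {
	start := time.Unix(0, 0)
	later := start.Add(11 * time.Second)

	cctx, cancel := context.WithCancel(context.Background())
	// Live cancelable ctx: not abandoned, even after 11 seconds.
	liveCancelable = isAbandoned(cctx, start, later)
	cancel()
	// Canceled ctx: abandoned immediately.
	canceled = isAbandoned(cctx, start, start)
	// Non-cancelable ctx: abandoned after the timeout.
	oldBackground = isAbandoned(context.Background(), start, later)
	return
}

func main() {
	fmt.Println(demo())
}
```

Note how the second rule never applies to a txn whose context can be canceled, which is consistent with SQL only ever triggering the first rule.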

The 10 second timeout was a relicv of the past, from a time where the
coordinator was possibly removed from the client. No more, they have
been collocated for years.
In fact, the whole idea of the TCS detecting abandoned txns is a relicv
of that past: all the txn users are already always cleaning up after
themselves.

The pressing motivation for this change is that there's an issue with
checking the context: there is currently no way to pass a ctx with the
same lifetime as the txn to the TxnCoordSender (as it is only a Sender).
And so the TCS was capturing the ctx of the first Send() op as the ctx
of the txn: this was a major hack that's a big problem - we can't have
per-statement contexts, for example. This also restricts the possible
implementations of statement cancelation: we have been forced to cancel
the whole transaction, which is not ideal.
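
To make the problem with the captured ctx concrete, here is a toy Go sketch. `capturingSender` and `looksAbandoned` are hypothetical stand-ins invented for illustration, not the real TxnCoordSender; the sketch only shows why tying the txn's lifetime to the first Send()'s ctx rules out per-statement contexts.

```go
package main

import (
	"context"
	"fmt"
)

// capturingSender is a toy stand-in that remembers the ctx of the first
// Send() and treats it as the ctx of the whole txn.
type capturingSender struct {
	txnCtx context.Context
}

// Send records the ctx of the first operation only.
func (s *capturingSender) Send(ctx context.Context, op string) {
	if s.txnCtx == nil {
		s.txnCtx = ctx
	}
}

// looksAbandoned applies the "ctx canceled means abandoned" rule to the
// captured ctx.
func (s *capturingSender) looksAbandoned() bool {
	return s.txnCtx != nil && s.txnCtx.Err() != nil
}

// demo runs two statements, each with its own ctx, in one txn.
func demo() bool {
	s := &capturingSender{}

	stmt1, cancel1 := context.WithCancel(context.Background())
	s.Send(stmt1, "statement 1")
	cancel1() // statement 1 is done; its ctx goes away

	stmt2, cancel2 := context.WithCancel(context.Background())
	defer cancel2()
	s.Send(stmt2, "statement 2")

	// The txn is still in use, yet it looks abandoned.
	return s.looksAbandoned()
}

func main() {
	fmt.Println("txn looks abandoned:", demo())
}
```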

This patch removes TCS' involvement in cleanup by relying on the client
to always send an EndTransaction. SQL was already doing that, as was
everybody else. The TCS is simplified.
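
The client-side contract this relies on can be sketched as follows. `fakeTxn` and `runTxn` are hypothetical names for this example and do not reflect the real client.Txn API; the point is only that every code path ends the txn with either a commit or a rollback.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeTxn records which kind of EndTransaction was sent, if any.
type fakeTxn struct {
	ended string
}

func (t *fakeTxn) Commit() error   { t.ended = "commit"; return nil }
func (t *fakeTxn) Rollback() error { t.ended = "rollback"; return nil }

// runTxn runs fn and always finishes the txn: commit on success,
// best-effort rollback on error.
func runTxn(t *fakeTxn, fn func(*fakeTxn) error) error {
	if err := fn(t); err != nil {
		_ = t.Rollback() // best effort; the original error wins
		return err
	}
	return t.Commit()
}

// demo runs one successful and one failing closure.
func demo() (string, string) {
	ok, bad := &fakeTxn{}, &fakeTxn{}
	_ = runTxn(ok, func(*fakeTxn) error { return nil })
	_ = runTxn(bad, func(*fakeTxn) error { return errors.New("boom") })
	return ok.ended, bad.ended
}

func main() {
	fmt.Println(demo())
}
```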

The GCQueue maintains its role in detecting abandoned transactions - it
continues to do so with a one hour timeout. This detection is needed, of
course, since the client is not collocated with the txn's range.

This patch removes the txn.Abandons metric which was maintained by the
TCS. There is a similar metric maintained by the GCQueue. The TCS one
was incorrect anyway, as it was being incremented on ctx cancelation.

Release note: None

andreimatei · 2018-02-25T02:01:30Z

Spencer, I can find another reviewer if you don't want to do it.
cc @jordanlewis to note that the nasty ctx capture in TxnCS is going away.

There's a commented out test for which I have to figure out what to do: its spirit might be worth preserving.

bdarnell · 2018-02-25T21:51:09Z

for 2.1 (but I wouldn't cherry-pick it to 2.0, and so we might want to wait to merge this to minimize complications from any other bug-fixing that may go on in this area in the next few weeks)

Reviewed 1 of 1 files at r2, 1 of 1 files at r4, 2 of 2 files at r5, 1 of 2 files at r6, 2 of 2 files at r7, 4 of 4 files at r8.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

pkg/kv/txn_coord_sender_test.go, line 1368 at r8 (raw file):

// !!!
// // TestTxnReadAfterAbandon checks the fix for the condition in issue #4787:

This seems fine to delete to me - how would this test make sense in the absence of abandonment?

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file):

        <Metric name="cr.node.txn.commits1PC" title="Fast-path Committed" nonNegativeRate />
        <Metric name="cr.node.txn.aborts" title="Aborted" nonNegativeRate />
        // !!!

I'd just remove the next line.

Comments from Reviewable

andreimatei · 2018-02-25T22:04:35Z

Yeah, I definitely won't cherry-pick this; it's the kind of change that can have some unintended consequences. For example, I already know one that I need to figure out a story for: we sometimes cancel high-level contexts in situations that still result in sending rollback requests (for example when the network connection drops). Except the rollbacks never get sent because the context is canceled (and so gRPC or someone else along the way refuses to send anything). I believe things used to work to some extent because the heartbeat loop would have detected the cancelation and sent an EndTransaction using a different, non-canceled context specially built for the purpose. So we used to both call something like txn.CleanupOnError(), which surprisingly would have been a no-op, and, independently, have the heartbeat loop send an EndTransaction that actually did something. I need to figure out what the right thing to do is to make sure that we can send rollbacks even with a canceled ctx.

Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

pkg/kv/txn_coord_sender_test.go, line 1368 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This seems fine to delete to me - how would this test make sense in the absence of abandonment?

well, what I believe makes sense to test is that sending through a txn whose heartbeat had previously failed (async) is somehow rejected

couchand · 2018-03-05T21:47:18Z

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx

        <Metric name="cr.node.txn.commits" title="Committed" nonNegativeRate />
        <Metric name="cr.node.txn.commits1PC" title="Fast-path Committed" nonNegativeRate />
        <Metric name="cr.node.txn.aborts" title="Aborted" nonNegativeRate />
+        // !!!


andreimatei · 2018-05-02T18:54:34Z

note to self: plan is to check ctx status in txn.rollback() and send a (possibly second) EndTxn on a different ctx.
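
A minimal Go sketch of that plan, under stated assumptions: `sendEndTxn` is a hypothetical stand-in for the RPC that refuses to send on a canceled ctx, the 3 second timeout is arbitrary, and the retry is done synchronously for simplicity (the code under review does it in a goroutine).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendEndTxn stands in for the rollback RPC. Like a real transport, it
// refuses to send anything on a canceled ctx.
func sendEndTxn(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	return nil
}

// rollback tries the caller's ctx first. If that ctx is canceled, it
// sends a second EndTxn on a fresh ctx that is not tied to the caller.
// It returns the number of attempts and the final error.
func rollback(ctx context.Context) (attempts int, err error) {
	attempts++
	err = sendEndTxn(ctx)
	if err == nil || ctx.Err() == nil {
		return attempts, err
	}
	// The caller's ctx is canceled: retry on a fresh, bounded ctx.
	freshCtx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	attempts++
	return attempts, sendEndTxn(freshCtx)
}

// demo rolls back using an already-canceled ctx.
func demo() (int, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	n, err := rollback(ctx)
	return n, err == nil
}

func main() {
	n, ok := demo()
	fmt.Printf("attempts=%d succeeded=%t\n", n, ok)
}
```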

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.

andreimatei · 2018-05-05T01:31:40Z

Ben, please take another look.
I've made txn.Rollback() work with a canceled context.

There are 2 things left. First, I've made TestHeartbeatCallbackForDecommissioning flaky. I hope @tschottdorf can tell me what's up.
Second, I've added a TestCleanupWithCanceledContext to check the txn.Rollback() behavior. Except it passes with the old code too. I think it only fails if the txn record is on a remote node - so I need an RPC. And so I need a split in the keys so I can do my txn on a remote range. Any suggestion for the easiest/best way to write that test would be most welcome.

Review status: 0 of 21 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd just remove the next line.

Done.

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file):

Previously, couchand (Andrew Couch) wrote…

???

I was asking about just removing this, but not in so many words.

bdarnell · 2018-05-06T22:38:11Z

Reviewed 1 of 1 files at r10, 1 of 2 files at r13, 1 of 1 files at r15, 18 of 18 files at r16.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

pkg/internal/client/txn.go, line 106 at r16 (raw file):

// distributed transactions (leaf).
//
// If the transactions is used to send any operations, CommitOrCleanup() or

s/transactions/transaction/

pkg/internal/client/txn.go, line 599 at r16 (raw file):

}

func (txn *Txn) rollback(ctx context.Context) *roachpb.Error {

I assume this copy of the method is going away?

pkg/internal/client/txn.go, line 631 at r16 (raw file):

	ctx = txn.db.AnnotateCtx(context.Background())
	go func() {

I think this should be a stopper task.

pkg/internal/client/txn.go, line 687 at r16 (raw file):

// TxnExecOptions controls how Exec() runs a transaction and the corresponding
// closure.
type TxnExecOptions struct {

I'm glad to see this simplification.

pkg/internal/client/txn.go, line 875 at r16 (raw file):

		sender = txn.mu.sender
		if txn.mu.Proto.Status != roachpb.PENDING || txn.mu.finalized {
			onlyRolback := lastIndex == 0 && haveEndTxn && !endTxnRequest.Commit

s/Rolback/Rollback/

pkg/kv/txn_coord_sender.go, line 960 at r16 (raw file):

			return
		case <-tc.stopper.ShouldQuiesce():
			// TODO(andrei): should we tryAsyncAbort() here?

Probably not. This is the point at which new tasks can't be created so I'm not sure if it would even work.
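
For readers unfamiliar with the stopper semantics discussed in the two comments above (running the async rollback as a stopper task, and no new tasks once quiescing starts), here is a toy Go sketch. `toyStopper`, `runAsync` and `Quiesce` are simplified, hypothetical stand-ins and do not match the signatures of CockroachDB's real stopper.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errQuiescing = errors.New("stopper is quiescing")

// toyStopper tracks in-flight tasks and refuses new ones once quiescing.
type toyStopper struct {
	mu        sync.Mutex
	quiescing bool
	wg        sync.WaitGroup
}

// runAsync starts f in a goroutine unless the stopper is quiescing.
func (s *toyStopper) runAsync(f func()) error {
	s.mu.Lock()
	if s.quiescing {
		s.mu.Unlock()
		return errQuiescing
	}
	s.wg.Add(1)
	s.mu.Unlock()
	go func() {
		defer s.wg.Done()
		f()
	}()
	return nil
}

// Quiesce rejects future tasks and waits for the running ones.
func (s *toyStopper) Quiesce() {
	s.mu.Lock()
	s.quiescing = true
	s.mu.Unlock()
	s.wg.Wait()
}

// demo runs one task before quiescing and attempts one after.
func demo() (ranBefore, rejectedAfter bool) {
	s := &toyStopper{}
	done := false
	_ = s.runAsync(func() { done = true })
	s.Quiesce() // waits for the task, so reading done below is safe
	err := s.runAsync(func() {})
	return done, err == errQuiescing
}

func main() {
	fmt.Println(demo())
}
```

Compared with a bare `go func()`, a task registered this way is waited for at shutdown, and an attempt to start one after quiescing fails explicitly instead of racing with shutdown.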

andreimatei · 2018-05-17T00:42:29Z

first 3 commits are #25541

I've added a commit getting rid of the TCS async aborts.
PTAL, a bunch of stuff changed in the kv: stop concerning the TxnCoordSender with abandoned txns commit too.
Thanks!
cc @tschottdorf

Note to self: DNM until #25586 is elucidated.

Review status: 0 of 39 files reviewed at latest revision, 7 unresolved discussions.

pkg/internal/client/txn.go, line 106 at r16 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/transactions/transaction/

Done.

pkg/internal/client/txn.go, line 599 at r16 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I assume this copy of the method is going away?

yes, gone

pkg/internal/client/txn.go, line 631 at r16 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think this should be a stopper task.

will do

pkg/internal/client/txn.go, line 875 at r16 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/Rolback/Rollback/

Done.

nvb · 2018-06-07T19:40:58Z

although s/relicv/relic/g

Reviewed 1 of 23 files at r27, 5 of 5 files at r28, 1 of 1 files at r29, 1 of 1 files at r30, 1 of 1 files at r31, 2 of 2 files at r32, 2 of 2 files at r33, 1 of 1 files at r34, 18 of 19 files at r35, 1 of 4 files at r36, 5 of 5 files at r37, 1 of 1 files at r38.
Review status: 20 of 23 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.

pkg/internal/client/client_test.go, line 1112 at r35 (raw file):

	// Do a Get using a different ctx (not the canceled one), and check that it
	// didn't take too long - take that as proof that it was not blocked on
	// intents.

The transaction would take a long time if the intents were blocked because .... (something about that transaction abandoned timeout)

pkg/internal/client/txn.go, line 664 at r35 (raw file):

	ctx = txn.db.AnnotateCtx(context.Background())
	// !!! This needs to be a stopper task; plumb the stopper to the DB.

You're not merging this, right?

EDIT: squash the commit that fixes this.

pkg/internal/client/txn.go, line 934 at r35 (raw file):

		sender = txn.mu.sender
		if txn.mu.Proto.Status != roachpb.PENDING || txn.mu.finalized {
			onlyRollback := lastIndex == 0 && haveEndTxn && !endTxnRequest.Commit

What's the point of allowing this rollback through when txn.mu.Proto.Status != roachpb.PENDING?

pkg/sql/conn_executor_prepare.go, line 225 at r34 (raw file):

		return nil, err
	}
	if err := txn.CommitOrCleanup(ctx); err != nil {

Should we commit this or roll it back? I agree that needing a transaction during stmt preparing is strange. It's even stranger that the txn would commit. If we're relying on this commit then that seems like a problem.

Copying from cockroachdb#26524

What's going on here is that this test relies on a rollback (kv-level) working with a canceled context. It never really worked (the first rollback attempt failed), however certain canceled contexts were also stopping the heartbeat loop as a byproduct, and that was doing its own cleanup. The recent cockroachdb#26479 changed things, however - there would be no more stopping of the heartbeat loop after a failed rollback attempt. This is a legit enough regression from cockroachdb#26479, but it's being fixed in the imminent cockroachdb#23055 which makes the rollbacks with canceled contexts work properly.

Touches cockroachdb#26524

Release note: None

andreimatei · 2018-06-07T21:09:54Z

Review status: 20 of 23 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.

pkg/internal/client/client_test.go, line 1112 at r35 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The transaction would take a long time if the intents were blocked because .... (something about that transaction abandoned timeout)

added more words

pkg/internal/client/txn.go, line 664 at r35 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You're not merging this, right?

EDIT: squash the commit that fixes this.

Done.

pkg/internal/client/txn.go, line 934 at r35 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What's the point of allowing this rollback through when txn.mu.Proto.Status != roachpb.PENDING?

well, if the status is Aborted, there might still be intents to clean up.
You might ask about allowing this when txn.mu.finalized is set and then I don't think I'd have an answer. But I'd leave this as is...

pkg/sql/conn_executor_prepare.go, line 225 at r34 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we commit this or roll it back? I agree that needing a transaction during stmt preparing is strange. It's even stranger that the txn would commit. If we're relying on this commit then that seems like a problem.

I agree that it's all crap, but I don't think that rolling back would be any better. If anything, trying to commit and getting an error might be an indication that the planning that was done... cannot be relied upon

The weird txn.Exec() method can now be private, as its only user is db.Txn(). This is good; that method never made sense for other users: it commits but doesn't clean up, it does retries but doesn't initialize the txn (so what happens if the transaction has been used before txn.Exec()?). Users should just use db.Txn() or, if that's not enough for them, should use the txn directly and commit it / roll it back. Release note: None

Before this patch, the TxnCoordSender had a role in cleaning up abandoned transactions, where abandoned was defined in a funky way:

1) for txns with a cancelable context, a txn was abandoned if the ctx had been canceled.
2) for the other txns, the txn was considered abandoned if it ran for 10 seconds.

SQL only triggers 1).

The 10 second timeout was a relic of the past, from a time where the coordinator was possibly removed from the client. No more, they have been collocated for years. In fact, the whole idea of the TCS detecting abandoned txns is a relic of that past: all the txn users are already always cleaning up after themselves.

The pressing motivation for this change is that there's an issue with checking the context: there is currently no way to pass a ctx with the same lifetime as the txn to the TxnCoordSender (as it is only a Sender). And so the TCS was capturing the ctx of the first Send() op as the ctx of the txn: this was a major hack that's a big problem - we can't have per-statement contexts, for example. This also restricts the possible implementations of statement cancelation: we have been forced to cancel the whole transaction, which is not ideal.

This patch removes TCS' involvement in cleanup by relying on the client to always send an EndTransaction. SQL was already doing that, as was everybody else. The TCS is simplified.

The GCQueue maintains its role in detecting abandoned transactions - it continues to do so with a one hour timeout. This detection is needed, of course, since the client is not collocated with the txn's range.

This patch removes the txn.Abandons metric which was maintained by the TCS. There is a similar metric maintained by the GCQueue. The TCS one was incorrect anyway, as it was being incremented on ctx cancelation.

Fixes cockroachdb#26524

Release note: None

Before this patch, the heartbeat loop in the TxnCoordSender would abort the transaction if any heartbeat failed. This behavior was not called for - generally, just because a heartbeat RPC failed doesn't mean the txn cannot continue. This patch makes the heartbeat loop swallow the errors - except if the transaction is found to have been aborted, in which case the heartbeat loop is stopped.

These async aborts were also ugly because they introduced more asynchrony into the TCS, below the client.Txn. With this patch, the only situation where the TCS cleans something up from underneath the client.Txn is when the regular request path (not the heartbeats) gets a TransactionAbortedError, in which case the client.Txn will start using a new transaction internally and not use the old TCS any more (it will create a new one).

Release note: None
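
The heartbeat behavior described in this commit message can be sketched in Go as follows. This is a simplified illustration, not the TCS code: `heartbeatResult` and `runHeartbeats` are hypothetical names, and the real loop is timer-driven rather than fed from a scripted slice.

```go
package main

import (
	"errors"
	"fmt"
)

type txnStatus int

const (
	pending txnStatus = iota
	aborted
)

// heartbeatResult is the scripted outcome of one heartbeat RPC.
type heartbeatResult struct {
	status txnStatus
	err    error
}

// runHeartbeats processes heartbeat outcomes in order. RPC errors are
// swallowed; the loop stops only when the txn is found to be aborted.
// It returns how many heartbeats were sent and whether an abort was seen.
func runHeartbeats(results []heartbeatResult) (sent int, sawAbort bool) {
	for _, r := range results {
		sent++
		if r.err != nil {
			// A failed heartbeat RPC does not mean the txn cannot continue.
			continue
		}
		if r.status == aborted {
			return sent, true
		}
	}
	return sent, false
}

// demo feeds a sequence with one RPC failure followed by an abort.
func demo() (int, bool) {
	return runHeartbeats([]heartbeatResult{
		{status: pending},
		{err: errors.New("rpc failed")},
		{status: pending},
		{status: aborted},
		{status: pending}, // never reached: the loop has stopped
	})
}

func main() {
	sent, sawAbort := demo()
	fmt.Printf("sent=%d sawAbort=%t\n", sent, sawAbort)
}
```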

andreimatei · 2018-06-11T21:37:18Z

bors r+

Review status: complete! 0 of 0 LGTMs obtained (and 2 stale)

23055: kv: stop concerning the TxnCoordSender with abandoned txns r=andreimatei a=andreimatei

Assorted cleanups of the client.Txn interface and the TxnCoordSender. See individual commits.

26598: roachtest: Make election test work on release-2.0 r=a-robinson a=a-robinson

The test was assuming the existence of a default database, which isn't present in 2.0.

Fixes #26562

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>

craig · 2018-06-11T21:59:22Z

Build succeeded

GitHub CI (Cockroach)

tbg · 2018-06-12T17:31:39Z

Nice, glad to see this land!

andreimatei assigned spencerkimball Feb 25, 2018

andreimatei requested review from a team February 25, 2018 01:59

couchand reviewed Mar 5, 2018

This was referenced Mar 17, 2018

*: Tie heartbeat goroutines to associated processing #23980

Closed

kv: push abort txns after heartbeat failure instead of EndTxn #24048

Closed

andreimatei mentioned this pull request Apr 11, 2018

tracing: allow specifying parent traceid/spanid when starting a trace #19403

Open

andreimatei force-pushed the txn-ctx branch from ebf7177 to a1ffb00 Compare May 4, 2018 18:44

andreimatei requested a review from a team May 4, 2018 18:44

andreimatei force-pushed the txn-ctx branch from a1ffb00 to b413daf Compare May 5, 2018 01:29

andreimatei force-pushed the txn-ctx branch from 69690ef to 2cf074e Compare May 17, 2018 00:10

andreimatei requested a review from a team May 17, 2018 00:10

andreimatei force-pushed the txn-ctx branch 2 times, most recently from 6fe54dd to d953565 Compare May 17, 2018 00:39

andreimatei force-pushed the txn-ctx branch 3 times, most recently from 2aa2e1a to 9382960 Compare May 18, 2018 21:19

andreimatei force-pushed the txn-ctx branch from 9382960 to 29f9280 Compare June 1, 2018 21:54

andreimatei requested a review from a team June 1, 2018 21:54

andreimatei mentioned this pull request Jun 7, 2018

kv: decompose TxnCoordSender into a stack of txnReqInterceptors #26496

Merged

andreimatei force-pushed the txn-ctx branch from 1e8be01 to 641789b Compare June 7, 2018 19:20

andreimatei mentioned this pull request Jun 7, 2018

teamcity: failed tests on master: testrace/TestSessionFinishRollsBackTxn #26524

Closed

andreimatei mentioned this pull request Jun 7, 2018

sql: skip flaky TestSessionFinishRollsBackTxn #26530

Closed

andreimatei force-pushed the txn-ctx branch from eeefcd5 to 14f6670 Compare June 7, 2018 21:09

andreimatei force-pushed the txn-ctx branch 2 times, most recently from e384cc3 to f7f6c97 Compare June 8, 2018 06:06

andreimatei mentioned this pull request Jun 8, 2018

sql: Fix a root span leak in the InternalExecutor #26552

Merged

andreimatei force-pushed the txn-ctx branch from f7f6c97 to d8cf8a8 Compare June 11, 2018 20:30

andreimatei added 11 commits June 11, 2018 17:35

internal/client: remove dead options

a9dfd98

The txn.Exec() interface was living with some configurability vestiges of a time when the Executor used it and needed fine control over it. That's no longer the case, so txn.Exec() can always attempt to commit the txn. Release note: None

backupccl: plug txn leak

8546492

We were not cleaning up a txn on errors. Release note: None

client/txn: delete obsolete test

5fff8ba

A test about initialization of txn timestamps hasn't made sense since we made the txn ctor always init the timestamp. Release note: None

internal/client: simplify test

cc6aa7b

A test checking that a retriable error doesn't cause the wrong txn to be retried doesn't actually need to create a second txn; it can more easily simulate the desired error. Release note: None

internal/client, storage: migrate tests away from txn.Exec()

8d61ce9

Two tests were using the awkward txn.Exec() interface with the NoRetries execution option. This was silly; they were not getting anything from the interface. Migrated away. Release note: None

sql: prevent txn leak in connExecutor

4aa3d4f

And the similar one in the old Executor. We were leaking txns used for preparing statements. I hope it was always benign as hopefully these txns are not used to perform any operations... Release note: None

client: Make async rollback a stopper task

b38728d

Release note: None

client: remove a now-useless context fork in db.Txn()

f3e9650

Release note: None

andreimatei force-pushed the txn-ctx branch from d8cf8a8 to f3e9650 Compare June 11, 2018 21:36

craig bot merged commit f3e9650 into cockroachdb:master Jun 11, 2018

andreimatei deleted the txn-ctx branch June 12, 2018 16:59

7 participants