Skip to content

kv: stop concerning the TxnCoordSender with abandoned txns#23055

Merged
craig[bot] merged 11 commits intocockroachdb:masterfrom
andreimatei:txn-ctx
Jun 11, 2018
Merged

kv: stop concerning the TxnCoordSender with abandoned txns#23055
craig[bot] merged 11 commits intocockroachdb:masterfrom
andreimatei:txn-ctx

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

Assorted cleanups of the client.Txn interface and the TxnCoordSender. See individual commits. The main point is the last one, which reads:

kv: stop concerning the TxnCoordSender with abandoned txns …
Before this patch, the TxnCoordSender had a role in cleaning up
abandoned transactions, where abandoned was defined in a funky way:

  1. for txns with a cancelable context, a txn was abandoned if the ctx
    had been canceled.
  2. for the other txns, the txn was considered abandoned if it ran for 10
    seconds.

SQL only triggers 1).

The 10 second timeout was a relicv of the past, from a time where the
coordinator was possibly removed from the client. No more, they have
been collocated for years.
In fact, the whole idea of the TCS detecting abandoned txns is a relicv
of that past: all the txn users are already always cleaning up after
themselves.

The pressing motivation for this change is that there's an issue with
checking the context: there is currently no way to pass a ctx with the
same lifetime as the txn to the TxnCoordSender (as it is only a Sender).
And so the TCS was capturing the ctx of the first Send() op as the ctx
of the txn: this was a major hack that's a big problem - we can't have
per-statement contexts, for example. This also restricts the possible
implementations of statement cancelation: we have been forced to cancel
the whole transaction, which is not ideal.

This patch removes TCS' involvement in cleanup by relying on the client
to always send an EndTransaction. SQL was already doing that, as was
everybody else. The TCS is simplified.

The GCQueue maintains its role in detecting abandoned transactions - it
coninues to do so with an one hour timeout. This detection is needed, of
course, since the client is not collocated with the txn's range.

This patch removes the txn.Abandons metric which was maintained by the
TCS. There is a similar metric maintained by the GCQueue. The TCS one
was incorrect anyway, as it was being incremented on ctx cancelation.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

Spencer, I can find another reviewer if you don't want to do it.
cc @jordanlewis to note that the nasty ctx capture in TxnCS is going away.

There's a commented out test for which I have to figure out what to do: it's spirit might be worth preserving.

@bdarnell
Copy link
Copy Markdown
Contributor

:lgtm: for 2.1 (but I wouldn't cherry-pick it to 2.0, and so we might want to wait to merge this to minimize complications from any other bug-fixing that may go on in this area in the next few weeks)


Reviewed 1 of 1 files at r2, 1 of 1 files at r4, 2 of 2 files at r5, 1 of 2 files at r6, 2 of 2 files at r7, 4 of 4 files at r8.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/kv/txn_coord_sender_test.go, line 1368 at r8 (raw file):

// !!!
// // TestTxnReadAfterAbandon checks the fix for the condition in issue #4787:

This seems fine to delete to me - how would this test make sense in the absence of abandonment?


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file):

        <Metric name="cr.node.txn.commits1PC" title="Fast-path Committed" nonNegativeRate />
        <Metric name="cr.node.txn.aborts" title="Aborted" nonNegativeRate />
        // !!!

I'd just remove the next line.


Comments from Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

Yeah, I definitely won't cherry-pick this, it's the kind of change that can have some unintended consequences. For example I already know one that I need to figure out a story for - we sometimes cancel high-level contexts that still result in sending rollback requests (for example when the network connection drops). Except the rollbacks never get sent because the context is canceled (and so gRPC or someone else along the way refuses to send anything). I believe things used to work to some extent because the heartbeat loop would have detected the cancelation and send an EndTransaction using a different, not canceled, context specially built for the purpose. So we used to both call something like txn.CleanupOnError() - which would have surprisingly been a no-op, and independently the heartbeat loop would send an EndTransaction that actually does something. I need to figure out what the right thing to do is to make sure that we can send rollbacks even with a canceled ctx.


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/kv/txn_coord_sender_test.go, line 1368 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This seems fine to delete to me - how would this test make sense in the absence of abandonment?

well, what I believe makes sense to test is that sending through a txn whose heartbeat had previously failed (async) is somehow rejected


Comments from Reviewable

<Metric name="cr.node.txn.commits" title="Committed" nonNegativeRate />
<Metric name="cr.node.txn.commits1PC" title="Fast-path Committed" nonNegativeRate />
<Metric name="cr.node.txn.aborts" title="Aborted" nonNegativeRate />
// !!!
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

@andreimatei
Copy link
Copy Markdown
Contributor Author

note to self: plan is to check ctx status in txn.rollback() and send a (possibly second) EndTxn on a different ctx.


Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


Comments from Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

Ben, please take another look.
I've made txn.Rollback() work with a canceled context.

There's 2 things left - I've made TestHeartbeatCallbackForDecommissioning flaky. I hope @tschottdorf can tell me whats up.
And I've added a TestCleanupWithCanceledContext to check the txn.Rollback() behavior. Except it passes with the old code too. I think it only fails if the txn record is on a remote node - so I need an RPC. And so I need a split in the keys so I can do my txn on a remote range. Any suggestion for the easiest/best way to write that test would be most welcomed.


Review status: 0 of 21 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd just remove the next line.

Done.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file):

Previously, couchand (Andrew Couch) wrote…

???

I was asking about just removing this, but not in so many words.


Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

bdarnell commented May 6, 2018

Reviewed 1 of 1 files at r10, 1 of 2 files at r13, 1 of 1 files at r15, 18 of 18 files at r16.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/internal/client/txn.go, line 106 at r16 (raw file):

// distributed transactions (leaf).
//
// If the transactions is used to send any operations, CommitOrCleanup() or

s/transactions/transaction/


pkg/internal/client/txn.go, line 599 at r16 (raw file):

}

func (txn *Txn) rollback(ctx context.Context) *roachpb.Error {

I assume this copy of the method is going away?


pkg/internal/client/txn.go, line 631 at r16 (raw file):

	ctx = txn.db.AnnotateCtx(context.Background())
	go func() {

I think this should be a stopper task.


pkg/internal/client/txn.go, line 687 at r16 (raw file):

// TxnExecOptions controls how Exec() runs a transaction and the corresponding
// closure.
type TxnExecOptions struct {

I'm glad to see this simplification.


pkg/internal/client/txn.go, line 875 at r16 (raw file):

		sender = txn.mu.sender
		if txn.mu.Proto.Status != roachpb.PENDING || txn.mu.finalized {
			onlyRolback := lastIndex == 0 && haveEndTxn && !endTxnRequest.Commit

s/Rolback/Rollback/


pkg/kv/txn_coord_sender.go, line 960 at r16 (raw file):

			return
		case <-tc.stopper.ShouldQuiesce():
			// TODO(andrei): should we tryAsyncAbort() here?

Probably not. This is the point at which new tasks can't be created so I'm not sure if it would even work.


Comments from Reviewable

@andreimatei andreimatei requested a review from a team May 17, 2018 00:10
@andreimatei andreimatei force-pushed the txn-ctx branch 2 times, most recently from 6fe54dd to d953565 Compare May 17, 2018 00:39
@andreimatei
Copy link
Copy Markdown
Contributor Author

first 3 commits are #25541

I've added a commit getting rid of the TCS async aborts.
PTAL, a bunch of stuff changed in the kv: stop concerning the TxnCoordSender with abandoned txns commit too.
Thanks!
cc @tschottdorf

Note to self: DNM until #25586 is elucidated.


Review status: 0 of 39 files reviewed at latest revision, 7 unresolved discussions.


pkg/internal/client/txn.go, line 106 at r16 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/transactions/transaction/

Done.


pkg/internal/client/txn.go, line 599 at r16 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I assume this copy of the method is going away?

yes, gone


pkg/internal/client/txn.go, line 631 at r16 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think this should be a stopper task.

will do


pkg/internal/client/txn.go, line 875 at r16 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/Rolback/Rollback/

Done.


Comments from Reviewable

@andreimatei andreimatei force-pushed the txn-ctx branch 3 times, most recently from 2aa2e1a to 9382960 Compare May 18, 2018 21:19
@andreimatei andreimatei requested a review from a team June 1, 2018 21:54
@nvb
Copy link
Copy Markdown
Contributor

nvb commented Jun 7, 2018

:lgtm: although s/relicv/relic/g


Reviewed 1 of 23 files at r27, 5 of 5 files at r28, 1 of 1 files at r29, 1 of 1 files at r30, 1 of 1 files at r31, 2 of 2 files at r32, 2 of 2 files at r33, 1 of 1 files at r34, 18 of 19 files at r35, 1 of 4 files at r36, 5 of 5 files at r37, 1 of 1 files at r38.
Review status: 20 of 23 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/internal/client/client_test.go, line 1112 at r35 (raw file):

	// Do a Get using a different ctx (not the canceled one), and check that it
	// didn't take too long - take that as proof that it was not blocked on
	// intents.

The transaction would take a long time if the intents were blocked because .... (something about that transaction abandoned timeout)


pkg/internal/client/txn.go, line 664 at r35 (raw file):

	ctx = txn.db.AnnotateCtx(context.Background())
	// !!! This needs to be a stopper task; plumb the stopper to the DB.

You're not merging this, right?

EDIT: squash the commit that fixes this.


pkg/internal/client/txn.go, line 934 at r35 (raw file):

		sender = txn.mu.sender
		if txn.mu.Proto.Status != roachpb.PENDING || txn.mu.finalized {
			onlyRollback := lastIndex == 0 && haveEndTxn && !endTxnRequest.Commit

What's the point of allowing this rollback through when txn.mu.Proto.Status != roachpb.PENDING?


pkg/sql/conn_executor_prepare.go, line 225 at r34 (raw file):

		return nil, err
	}
	if err := txn.CommitOrCleanup(ctx); err != nil {

Should we commit this or roll it back? I agree that needing a transaction during stmt preparing is strange. It's even stranger that the txn would commit. If we're relying on this commit then that seems like a problem.


Comments from Reviewable

andreimatei added a commit to andreimatei/cockroach that referenced this pull request Jun 7, 2018
Copying from cockroachdb#26524
What's going on here is that this test relies on a rollback (kv-level)
working with a canceled context. It never really worked (the first
rollback attempt failed), however certain canceled contexts were also
stopping the heartbeat loop as a byproduct, and that was doing its own
cleanup. The recent cockroachdb#26479 changed things, however - there would be no
more stopping of the heartbeat loop after a failed rollback attempt.
This is a legit enough regression from cockroachdb#26479, but it's being fixed in
the imminent cockroachdb#23055 which makes the rollbacks with canceled contexts
work properly.

Touches cockroachdb#26524

Release note: None
@andreimatei
Copy link
Copy Markdown
Contributor Author

Review status: 20 of 23 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/internal/client/client_test.go, line 1112 at r35 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The transaction would take a long time if the intents were blocked because .... (something about that transaction abandoned timeout)

added more words


pkg/internal/client/txn.go, line 664 at r35 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You're not merging this, right?

EDIT: squash the commit that fixes this.

Done.


pkg/internal/client/txn.go, line 934 at r35 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What's the point of allowing this rollback through when txn.mu.Proto.Status != roachpb.PENDING?

well if the status is Aborted, there might still be intents to cleanup.
You might ask about allowing this when txn.mu.finalized is set and then I don't think I'd have an answer. But I'd leave this as is...


pkg/sql/conn_executor_prepare.go, line 225 at r34 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we commit this or roll it back? I agree that needing a transaction during stmt preparing is strange. It's even stranger that the txn would commit. If we're relying on this commit then that seems like a problem.

I agree that it's all crap, but I don't think that rolling back would be any better. If anything, trying to commit and getting an error might be an indication that the planning that was done... cannot be relied upon


Comments from Reviewable

The txn.Exec() interface was living with some configurability vestiges
of a time when the Executor used it and needed fine control over it.
That's no longer the case, so txn.Exec() can always attempt to commit
the txn.

Release note: None
We were not cleaning up a txn on errors.

Release note: None
A test about initialization of txn timestamps hasn't made sense since we
we made the txn ctor always init the timestamp.

Release note: None
A test checking that a retriable error doesn't cause the wrong txn to be
retried doesn't actually need to create a second txn; it can more
easily simulate the desired error.

Release note: None
Two tests were using the awkward txn.Exec() interface with the NoRetries
execution option. This was silly; they were not getting anything from
the interface. Migrated away.

Release note: None
The weird txn.Exec() method can now be private, as the only used is
db.Txn(). This is good; that method never made sense for other users: it
commits but doesn't clean up, it does retries but doesn't initialize the
txn (so what happens if the transaction has been used before txn.Exec()
?). Users should just use db.Txn() or, if that's not enough for them,
should use the txn directly and commit it / roll it back.

Release note: None
And the similar one in the old Executor. We were leaking txns used for
preparing statements. I hope it was always benign as hopefully these
txns are not used to perform any operations...

Release note: None
Before this patch, the TxnCoordSender had a role in cleaning up
abandoned transactions, where abandoned was defined in a funky way:
1) for txns with a cancelable context, a txn was abandoned if the ctx
had been canceled.
2) for the other txns, the txn was considered abandoned if it ran for 10
seconds.

SQL only triggers 1).

The 10 second timeout was a relic of the past, from a time where the
coordinator was possibly removed from the client. No more, they have
been collocated for years.
In fact, the whole idea of the TCS detecting abandoned txns is a relic
of that past: all the txn users are already always cleaning up after
themselves.

The pressing motivation for this change is that there's an issue with
checking the context: there is currently no way to pass a ctx with the
same lifetime as the txn to the TxnCoordSender (as it is only a Sender).
And so the TCS was capturing the ctx of the first Send() op as the ctx
of the txn: this was a major hack that's a big problem - we can't have
per-statement contexts, for example. This also restricts the possible
implementations of statement cancelation: we have been forced to cancel
the whole transaction, which is not ideal.

This patch removes TCS' involvement in cleanup by relying on the client
to always send an EndTransaction. SQL was already doing that, as was
everybody else. The TCS is simplified.

The GCQueue maintains its role in detecting abandoned transactions - it
coninues to do so with an one hour timeout. This detection is needed, of
course, since the client is not collocated with the txn's range.

This patch removes the txn.Abandons metric which was maintained by the
TCS. There is a similar metric maintained by the GCQueue. The TCS one
was incorrect anyway, as it was being incremented on ctx cancelation.

Fixes cockroachdb#26524

Release note: None
Before this patch, the heartbeat loop in the TxnCoordSender would abort
the transaction if any heartbeat fails. This behavior was not called for
- generally, just because a heartbeat RPC failed doesn't mean the txn
cannot continue. This patch makes the heartbeat swallow the errors -
except if the transaction is found to have been aborted, in which case
the heartbeat loop is stopped.

These async aborts were also ugly because they introduce more asynchrony
into the TCS, below the client.Txn. With this patch, the only situation
where the TCS cleans something up from underneath the client.Txn is when
the regular request path (not the heartbeats) gets a
TransactionAbortedError, in which case the client.Txn will start using a
new transaction internally and not use the old TCS any more (it will
create a new one).

Release note: None
@andreimatei
Copy link
Copy Markdown
Contributor Author

bors r+


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


Comments from Reviewable

craig bot pushed a commit that referenced this pull request Jun 11, 2018
23055: kv: stop concerning the TxnCoordSender with abandoned txns r=andreimatei a=andreimatei

Assorted cleanups of the client.Txn interface and the TxnCoordSender. See individual commits. The main point is the last one, which reads:


kv: stop concerning the TxnCoordSender with abandoned txns  …
Before this patch, the TxnCoordSender had a role in cleaning up
abandoned transactions, where abandoned was defined in a funky way:
1) for txns with a cancelable context, a txn was abandoned if the ctx
had been canceled.
2) for the other txns, the txn was considered abandoned if it ran for 10
seconds.

SQL only triggers 1).

The 10 second timeout was a relicv of the past, from a time where the
coordinator was possibly removed from the client. No more, they have
been collocated for years.
In fact, the whole idea of the TCS detecting abandoned txns is a relicv
of that past: all the txn users are already always cleaning up after
themselves.

The pressing motivation for this change is that there's an issue with
checking the context: there is currently no way to pass a ctx with the
same lifetime as the txn to the TxnCoordSender (as it is only a Sender).
And so the TCS was capturing the ctx of the first Send() op as the ctx
of the txn: this was a major hack that's a big problem - we can't have
per-statement contexts, for example. This also restricts the possible
implementations of statement cancelation: we have been forced to cancel
the whole transaction, which is not ideal.

This patch removes TCS' involvement in cleanup by relying on the client
to always send an EndTransaction. SQL was already doing that, as was
everybody else. The TCS is simplified.

The GCQueue maintains its role in detecting abandoned transactions - it
coninues to do so with an one hour timeout. This detection is needed, of
course, since the client is not collocated with the txn's range.

This patch removes the txn.Abandons metric which was maintained by the
TCS. There is a similar metric maintained by the GCQueue. The TCS one
was incorrect anyway, as it was being incremented on ctx cancelation.

Release note: None

26598: roachtest: Make election test work on release-2.0 r=a-robinson a=a-robinson

The test was assuming the existence of a default database, which
isn't present in 2.0.

Fixes #26562

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 11, 2018

Build succeeded

@craig craig bot merged commit f3e9650 into cockroachdb:master Jun 11, 2018
@andreimatei andreimatei deleted the txn-ctx branch June 12, 2018 16:59
@tbg
Copy link
Copy Markdown
Member

tbg commented Jun 12, 2018

Nice, glad to see this land!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants