kv: stop concerning the TxnCoordSender with abandoned txns#23055
kv: stop concerning the TxnCoordSender with abandoned txns#23055craig[bot] merged 11 commits intocockroachdb:masterfrom
Conversation
|
Spencer, I can find another reviewer if you don't want to do it. There's a commented out test for which I have to figure out what to do: it's spirit might be worth preserving. |
|
Reviewed 1 of 1 files at r2, 1 of 1 files at r4, 2 of 2 files at r5, 1 of 2 files at r6, 2 of 2 files at r7, 4 of 4 files at r8. pkg/kv/txn_coord_sender_test.go, line 1368 at r8 (raw file):
This seems fine to delete to me - how would this test make sense in the absence of abandonment? pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file):
I'd just remove the next line. Comments from Reviewable |
|
Yeah, I definitely won't cherry-pick this, it's the kind of change that can have some unintended consequences. For example I already know one that I need to figure out a story for - we sometimes cancel high-level contexts that still result in sending rollback requests (for example when the network connection drops). Except the rollbacks never get sent because the context is canceled (and so gRPC or someone else along the way refuses to send anything). I believe things used to work to some extent because the heartbeat loop would have detected the cancelation and send an Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed. pkg/kv/txn_coord_sender_test.go, line 1368 at r8 (raw file): Previously, bdarnell (Ben Darnell) wrote…
well, what I believe makes sense to test is that sending through a txn whose heartbeat had previously failed (async) is somehow rejected Comments from Reviewable |
| <Metric name="cr.node.txn.commits" title="Committed" nonNegativeRate /> | ||
| <Metric name="cr.node.txn.commits1PC" title="Fast-path Committed" nonNegativeRate /> | ||
| <Metric name="cr.node.txn.aborts" title="Aborted" nonNegativeRate /> | ||
| // !!! |
|
note to self: plan is to check ctx status in Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed. Comments from Reviewable |
|
Ben, please take another look. There's 2 things left - I've made Review status: 0 of 21 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/distributed.tsx, line 40 at r8 (raw file): Previously, couchand (Andrew Couch) wrote…
I was asking about just removing this, but not in so many words. Comments from Reviewable |
|
Reviewed 1 of 1 files at r10, 1 of 2 files at r13, 1 of 1 files at r15, 18 of 18 files at r16. pkg/internal/client/txn.go, line 106 at r16 (raw file):
s/transactions/transaction/ pkg/internal/client/txn.go, line 599 at r16 (raw file):
I assume this copy of the method is going away? pkg/internal/client/txn.go, line 631 at r16 (raw file):
I think this should be a stopper task. pkg/internal/client/txn.go, line 687 at r16 (raw file):
I'm glad to see this simplification. pkg/internal/client/txn.go, line 875 at r16 (raw file):
s/Rolback/Rollback/ pkg/kv/txn_coord_sender.go, line 960 at r16 (raw file):
Probably not. This is the point at which new tasks can't be created so I'm not sure if it would even work. Comments from Reviewable |
6fe54dd to
d953565
Compare
|
first 3 commits are #25541 I've added a commit getting rid of the TCS async aborts. Note to self: DNM until #25586 is elucidated. Review status: 0 of 39 files reviewed at latest revision, 7 unresolved discussions. pkg/internal/client/txn.go, line 106 at r16 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. pkg/internal/client/txn.go, line 599 at r16 (raw file): Previously, bdarnell (Ben Darnell) wrote…
yes, gone pkg/internal/client/txn.go, line 631 at r16 (raw file): Previously, bdarnell (Ben Darnell) wrote…
will do pkg/internal/client/txn.go, line 875 at r16 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. Comments from Reviewable |
2aa2e1a to
9382960
Compare
|
Reviewed 1 of 23 files at r27, 5 of 5 files at r28, 1 of 1 files at r29, 1 of 1 files at r30, 1 of 1 files at r31, 2 of 2 files at r32, 2 of 2 files at r33, 1 of 1 files at r34, 18 of 19 files at r35, 1 of 4 files at r36, 5 of 5 files at r37, 1 of 1 files at r38. pkg/internal/client/client_test.go, line 1112 at r35 (raw file):
The transaction would take a long time if the intents were blocked because .... (something about that transaction abandoned timeout) pkg/internal/client/txn.go, line 664 at r35 (raw file):
You're not merging this, right? EDIT: squash the commit that fixes this. pkg/internal/client/txn.go, line 934 at r35 (raw file):
What's the point of allowing this rollback through when pkg/sql/conn_executor_prepare.go, line 225 at r34 (raw file):
Should we commit this or roll it back? I agree that needing a transaction during stmt preparing is strange. It's even stranger that the txn would commit. If we're relying on this commit then that seems like a problem. Comments from Reviewable |
Copying from cockroachdb#26524 What's going on here is that this test relies on a rollback (kv-level) working with a canceled context. It never really worked (the first rollback attempt failed), however certain canceled contexts were also stopping the heartbeat loop as a byproduct, and that was doing its own cleanup. The recent cockroachdb#26479 changed things, however - there would be no more stopping of the heartbeat loop after a failed rollback attempt. This is a legit enough regression from cockroachdb#26479, but it's being fixed in the imminent cockroachdb#23055 which makes the rollbacks with canceled contexts work properly. Touches cockroachdb#26524 Release note: None
|
Review status: 20 of 23 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed. pkg/internal/client/client_test.go, line 1112 at r35 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
added more words pkg/internal/client/txn.go, line 664 at r35 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/internal/client/txn.go, line 934 at r35 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
well if the status is pkg/sql/conn_executor_prepare.go, line 225 at r34 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I agree that it's all crap, but I don't think that rolling back would be any better. If anything, trying to commit and getting an error might be an indication that the planning that was done... cannot be relied upon Comments from Reviewable |
e384cc3 to
f7f6c97
Compare
The txn.Exec() interface was living with some configurability vestiges of a time when the Executor used it and needed fine control over it. That's no longer the case, so txn.Exec() can always attempt to commit the txn. Release note: None
We were not cleaning up a txn on errors. Release note: None
A test about initialization of txn timestamps hasn't made sense since we we made the txn ctor always init the timestamp. Release note: None
A test checking that a retriable error doesn't cause the wrong txn to be retried doesn't actually need to create a second txn; it can more easily simulate the desired error. Release note: None
Two tests were using the awkward txn.Exec() interface with the NoRetries execution option. This was silly; they were not getting anything from the interface. Migrated away. Release note: None
The weird txn.Exec() method can now be private, as the only used is db.Txn(). This is good; that method never made sense for other users: it commits but doesn't clean up, it does retries but doesn't initialize the txn (so what happens if the transaction has been used before txn.Exec() ?). Users should just use db.Txn() or, if that's not enough for them, should use the txn directly and commit it / roll it back. Release note: None
And the similar one in the old Executor. We were leaking txns used for preparing statements. I hope it was always benign as hopefully these txns are not used to perform any operations... Release note: None
Before this patch, the TxnCoordSender had a role in cleaning up abandoned transactions, where abandoned was defined in a funky way: 1) for txns with a cancelable context, a txn was abandoned if the ctx had been canceled. 2) for the other txns, the txn was considered abandoned if it ran for 10 seconds. SQL only triggers 1). The 10 second timeout was a relic of the past, from a time where the coordinator was possibly removed from the client. No more, they have been collocated for years. In fact, the whole idea of the TCS detecting abandoned txns is a relic of that past: all the txn users are already always cleaning up after themselves. The pressing motivation for this change is that there's an issue with checking the context: there is currently no way to pass a ctx with the same lifetime as the txn to the TxnCoordSender (as it is only a Sender). And so the TCS was capturing the ctx of the first Send() op as the ctx of the txn: this was a major hack that's a big problem - we can't have per-statement contexts, for example. This also restricts the possible implementations of statement cancelation: we have been forced to cancel the whole transaction, which is not ideal. This patch removes TCS' involvement in cleanup by relying on the client to always send an EndTransaction. SQL was already doing that, as was everybody else. The TCS is simplified. The GCQueue maintains its role in detecting abandoned transactions - it coninues to do so with an one hour timeout. This detection is needed, of course, since the client is not collocated with the txn's range. This patch removes the txn.Abandons metric which was maintained by the TCS. There is a similar metric maintained by the GCQueue. The TCS one was incorrect anyway, as it was being incremented on ctx cancelation. Fixes cockroachdb#26524 Release note: None
Before this patch, the heartbeat loop in the TxnCoordSender would abort the transaction if any heartbeat fails. This behavior was not called for - generally, just because a heartbeat RPC failed doesn't mean the txn cannot continue. This patch makes the heartbeat swallow the errors - except if the transaction is found to have been aborted, in which case the heartbeat loop is stopped. These async aborts were also ugly because they introduce more asynchrony into the TCS, below the client.Txn. With this patch, the only situation where the TCS cleans something up from underneath the client.Txn is when the regular request path (not the heartbeats) gets a TransactionAbortedError, in which case the client.Txn will start using a new transaction internally and not use the old TCS any more (it will create a new one). Release note: None
Release note: None
Release note: None
|
bors r+ Review status: Comments from Reviewable |
23055: kv: stop concerning the TxnCoordSender with abandoned txns r=andreimatei a=andreimatei Assorted cleanups of the client.Txn interface and the TxnCoordSender. See individual commits. The main point is the last one, which reads: kv: stop concerning the TxnCoordSender with abandoned txns … Before this patch, the TxnCoordSender had a role in cleaning up abandoned transactions, where abandoned was defined in a funky way: 1) for txns with a cancelable context, a txn was abandoned if the ctx had been canceled. 2) for the other txns, the txn was considered abandoned if it ran for 10 seconds. SQL only triggers 1). The 10 second timeout was a relicv of the past, from a time where the coordinator was possibly removed from the client. No more, they have been collocated for years. In fact, the whole idea of the TCS detecting abandoned txns is a relicv of that past: all the txn users are already always cleaning up after themselves. The pressing motivation for this change is that there's an issue with checking the context: there is currently no way to pass a ctx with the same lifetime as the txn to the TxnCoordSender (as it is only a Sender). And so the TCS was capturing the ctx of the first Send() op as the ctx of the txn: this was a major hack that's a big problem - we can't have per-statement contexts, for example. This also restricts the possible implementations of statement cancelation: we have been forced to cancel the whole transaction, which is not ideal. This patch removes TCS' involvement in cleanup by relying on the client to always send an EndTransaction. SQL was already doing that, as was everybody else. The TCS is simplified. The GCQueue maintains its role in detecting abandoned transactions - it coninues to do so with an one hour timeout. This detection is needed, of course, since the client is not collocated with the txn's range. This patch removes the txn.Abandons metric which was maintained by the TCS. There is a similar metric maintained by the GCQueue. The TCS one was incorrect anyway, as it was being incremented on ctx cancelation. Release note: None 26598: roachtest: Make election test work on release-2.0 r=a-robinson a=a-robinson The test was assuming the existence of a default database, which isn't present in 2.0. Fixes #26562 Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
Build succeeded |
|
Nice, glad to see this land! |
Assorted cleanups of the client.Txn interface and the TxnCoordSender. See individual commits. The main point is the last one, which reads:
kv: stop concerning the TxnCoordSender with abandoned txns …
Before this patch, the TxnCoordSender had a role in cleaning up
abandoned transactions, where abandoned was defined in a funky way:
had been canceled.
seconds.
SQL only triggers 1).
The 10 second timeout was a relicv of the past, from a time where the
coordinator was possibly removed from the client. No more, they have
been collocated for years.
In fact, the whole idea of the TCS detecting abandoned txns is a relicv
of that past: all the txn users are already always cleaning up after
themselves.
The pressing motivation for this change is that there's an issue with
checking the context: there is currently no way to pass a ctx with the
same lifetime as the txn to the TxnCoordSender (as it is only a Sender).
And so the TCS was capturing the ctx of the first Send() op as the ctx
of the txn: this was a major hack that's a big problem - we can't have
per-statement contexts, for example. This also restricts the possible
implementations of statement cancelation: we have been forced to cancel
the whole transaction, which is not ideal.
This patch removes TCS' involvement in cleanup by relying on the client
to always send an EndTransaction. SQL was already doing that, as was
everybody else. The TCS is simplified.
The GCQueue maintains its role in detecting abandoned transactions - it
coninues to do so with an one hour timeout. This detection is needed, of
course, since the client is not collocated with the txn's range.
This patch removes the txn.Abandons metric which was maintained by the
TCS. There is a similar metric maintained by the GCQueue. The TCS one
was incorrect anyway, as it was being incremented on ctx cancelation.
Release note: None