client/sql: fix concurrent Txn batches in the face of retry errors#17627
client/sql: fix concurrent Txn batches in the face of retry errors#17627nvb merged 4 commits intocockroachdb:masterfrom
Conversation
|
Reviewed 4 of 4 files at r1, 12 of 12 files at r2. pkg/internal/client/sender.go, line 42 at r2 (raw file):
s/idep/idemp/ pkg/internal/client/txn.go, line 1071 at r2 (raw file):
This comment isn't new, but could you explain what's going on here? pkg/internal/client/txn.go, line 1124 at r2 (raw file):
So what happened here was that pkg/sql/executor.go, line 1421 at r2 (raw file):
That sounds like the right thing to do, right? pkg/sql/session.go, line 665 at r2 (raw file):
this sentence is off. pkg/sql/txn_restart_test.go, line 135 at r2 (raw file):
But if it fails, you don't even know the order. Should have a seed and print it. pkg/sql/txn_restart_test.go, line 605 at r2 (raw file):
"leaving around" doesn't seem so bad, but it they were leaving around committable data -- might be worth adding that word. pkg/sql/txn_restart_test.go, line 684 at r2 (raw file):
How quickly does this fail without your fixes in this PR? Comments from Reviewable |
|
Reviewed 2 of 4 files at r1, 12 of 12 files at r2. pkg/internal/client/sender.go, line 33 at r2 (raw file):
The pkg/sql/txn_restart_test.go, line 135 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
This comes for free if you add our randutil.SeedForTests() in main_test.go Comments from Reviewable |
|
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed. pkg/sql/txn_restart_test.go, line 135 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Not really, unless everything that happens before this test is deterministic (at least in its usage of the rand). Comments from Reviewable |
|
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed. pkg/sql/txn_restart_test.go, line 135 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Or if you're running this test in isolation. So it's not too useful for CI runs, but you could still use it for stress runs of this test. But anyway, being explicit in tests that deliberately use randomness is a good idea. Comments from Reviewable |
|
Reviewed 2 of 4 files at r1. pkg/internal/client/sender.go, line 44 at r2 (raw file):
if a txnID is sufficient here instead of the whole proto, I think it'd be cleaner to just take that. pkg/internal/client/txn.go, line 1106 at r2 (raw file):
I'm not sure why we're returning this error here? I think returning the error is OK in the other place (the non-error case in txn.Send), but perhaps pkg/internal/client/txn.go, line 1137 at r2 (raw file):
pkg/internal/client/txn.go, line 1152 at r2 (raw file):
s/is/if pkg/roachpb/data.go, line 862 at r2 (raw file):
I don't know what an "in-place restart" is :) pkg/sql/executor.go, line 1421 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yes, I think that's what we should do, at least morally. But actually, was I confused when I wrote this? Can the Anyway, if we add the synchronization here, I'd also leave a TODO about figuring out how to cancel those statements. pkg/sql/parallel_stmts.go, line 44 at r1 (raw file):
nit: s/error set/error-set (otherwise parsing is hard) pkg/sql/parallel_stmts.go, line 56 at r1 (raw file):
consider giving some hints here about how this set is going to be used, or what the higher level might want to do with it pkg/sql/session.go, line 664 at r2 (raw file):
add "(some of these writes might have been performed at the wrong epoch)". Comments from Reviewable |
909cfc1 to
91a585c
Compare
|
Thanks for the reviews Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. pkg/internal/client/sender.go, line 33 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. #17657. pkg/internal/client/sender.go, line 42 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/internal/client/sender.go, line 44 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/internal/client/txn.go, line 1071 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
It looks like DistSQL doesn't support concurrent transaction batches because it can't handle one of them getting an Abort error. pkg/internal/client/txn.go, line 1106 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
You're right, swallowing the error is the right thing to do. Also changed the name to pkg/internal/client/txn.go, line 1137 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/internal/client/txn.go, line 1152 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/sql/executor.go, line 1421 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Let me know if my comment makes sense. Also, I'm getting rid of pkg/sql/parallel_stmts.go, line 44 at r1 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/sql/parallel_stmts.go, line 56 at r1 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/sql/session.go, line 664 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/sql/session.go, line 665 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/sql/txn_restart_test.go, line 135 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
So you want me to call pkg/sql/txn_restart_test.go, line 605 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/sql/txn_restart_test.go, line 684 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Within 100 iterations, though we are injecting a lot of retry errors in each test. Comments from Reviewable |
|
Reviewed 13 of 13 files at r3, 13 of 13 files at r4. pkg/sql/txn_restart_test.go, line 135 at r2 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, I guess (ideally Comments from Reviewable |
71ddaf0 to
4eed2f5
Compare
|
I realized that the extra hooks into I've changed the test to properly abort the meta record when it injects a |
4eed2f5 to
0d6b6d9
Compare
|
Ping :) This really needs to get into 1.1. |
|
LGTM except the last commit, for which you'll definitely want Andrei's eyes. Nice simplification though! Reviewed 13 of 13 files at r5, 1 of 1 files at r6, 12 of 12 files at r7. Comments from Reviewable |
|
LGTM Reviewed 1 of 1 files at r6, 12 of 12 files at r7. pkg/internal/client/txn.go, line 1096 at r7 (raw file):
Get rid of the temporary variable; the comment does a better job of explaining what's going on than the variable name does. pkg/roachpb/errors.proto, line 333 at r7 (raw file):
(I think of regular TxnAbortedErrors as being asynchronous in some sense, because they're the result of an abort that is asynchronous with respect to the txn being aborted) pkg/internal/client/sender.go, line 33 at r2 (raw file): Close that issue if it's no longer relevant with the latest refactor of this change. Comments from Reviewable |
|
Reviewed 1 of 13 files at r5, 5 of 12 files at r7. pkg/roachpb/errors.proto, line 273 at r7 (raw file):
superfluous move of pkg/roachpb/errors.proto, line 324 at r7 (raw file):
consider clarifying "by the client" a bit - "by the TxnCoordSender"? pkg/roachpb/errors.proto, line 363 at r7 (raw file):
consider putting these 3 lines back where they were pkg/sql/executor.go, line 1366 at r7 (raw file):
add a TODO here to cancel the parallel queries? pkg/sql/executor.go, line 1366 at r7 (raw file):
also comment that a big reason for the sync is that it'd be incorrect to run statements after the txn.Reset below cause the might write at the wrong epoch. pkg/sql/executor.go, line 1405 at r7 (raw file):
I think you want Comments from Reviewable |
Previously we only held onto the first error seen in a parallel batch. Because there's no way to guarantee that the first error seen contains the root cause of the error, we now hold onto and report all errors for each parallel batch.
These two proto messages were not grouped where they should have been in the file.
Fixes cockroachdb#17197. Before this change, retryable errors did not interact well with concurrent transaction batches. This could lead to strange errors, deadlocks, or in especially bad cases, unintended data being committed. This poor handling of retryable errors made parallel statements using RETURNING NOTHING pretty dangerous to use. This change addresses these complications of concurrent batches. It does this by adding extra logic to the `sessions` handling of the `parallelizeQueue`, bumping a transaction's epoch in the case of retryable errors to assure that no unintended data is ever committed. This has the effect of fixing parallel statements in the face of all retryable errors except for ABORT errors. After fixing this issues, this change adds a few fairly comprehensive tests that stress RETURNING NOTHING in the face of retryable errors.
0d6b6d9 to
a235608
Compare
|
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. pkg/internal/client/txn.go, line 1096 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. pkg/roachpb/errors.proto, line 273 at r7 (raw file): Previously, andreimatei (Andrei Matei) wrote…
I move them because the ordering was getting all messed up here. The pattern in this file is that the ErrorDeatil types were all listed up here in order. The mixed up order made it harder to read. Split into a separate commit for explicitness. pkg/roachpb/errors.proto, line 324 at r7 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/roachpb/errors.proto, line 333 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. Changes to pkg/roachpb/errors.proto, line 363 at r7 (raw file): Previously, andreimatei (Andrei Matei) wrote…
These should all be in order, since this is just a giant union. The move happens in its own commit now though, for clarity of the blame. pkg/sql/executor.go, line 1366 at r7 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/sql/executor.go, line 1366 at r7 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/sql/executor.go, line 1405 at r7 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. pkg/internal/client/sender.go, line 33 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. Comments from Reviewable |
|
Isn't this that should be backported to 1.0? |
|
|
👍 Thanks for the instructions! |
Parallel statements were not safe for general use because of their behavior when confronted with retryable errors. This was fixed in cockroachdb#17627, but the fix is too big to cherry-pick back into v1.0.
Fixes #17197.
Before this change, retryable errors did not interact well with concurrent
transaction batches. This could lead to strange errors, deadlocks, or in
especially bad cases, unintended data being committed. This poor handling of
retryable errors made parallel statements using RETURNING NOTHING pretty
dangerous to use.
This change addresses these complications of concurrent batches. It begins by
adding extra logic to the
sessionshandling of theparallelizeQueue, bumpinga transaction's epoch in the case of retryable errors to assure that no
unintended data is ever committed. This has the effect of fixing parallel
statements in the face of all retryable errors except for ABORT errors.
It turns out that abort errors were even more tricky because of the interaction
between
client.TxnandTxnCoordSender. Previously, in cases where aborterrors were seen, the
TxnCoordSenderwould stop heartbeating the transactionand forget about it. However, if a successful response or even a non-abort retry
error for the old transaction ID was found later, the
TxnCoordSenderwouldbegin heartbeating again, believing that this was seeing a new Transaction. This
could quickly result in a deadlock if the next incarnation of the Txn began
waiting on an intent from the previous one. The root problem here is that
TxnCoordSenderhas no memory, so it's up to theclient.Txnto prevent this.In doing so, things are a little tricky because the response handling under lock
is split between
TxnCoordSenderandclient.Txn, so responses can bereordered over the API boundary. This serves as a further example of why
client.TxnandTxnCoordSendershould be merged.After fixing these issues, this change adds a few fairly comprehensive tests
that stress RETURNING NOTHING in the face of retryable errors.