distsqlrun,kv,client: don't create flows in aborted txns#29455
distsqlrun,kv,client: don't create flows in aborted txns#29455craig[bot] merged 6 commits intocockroachdb:masterfrom
Conversation
|
cc @jordanlewis |
nvb
left a comment
There was a problem hiding this comment.
Reviewed 14 of 14 files at r1, 10 of 10 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/internal/client/sender.go, line 109 at r2 (raw file):
// AugmentMeta(). // // If AnyTxnStatus is passed, than this function never returns errors.
s/than/then/
pkg/sql/distsql_running.go, line 138 at r2 (raw file):
// If the plan is not local, we will have to set up leaf txns using the // txnCoordMeta. meta, err := txn.GetStrippedTxnCoordMeta(ctx, client.OnlyPending)
Mind defending why this approach is better than just checking meta.Status here and not complicating every other caller of GetStrippedTxnCoordMeta?
fe61784 to
8dd37cb
Compare
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/internal/client/sender.go, line 109 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/than/then/
Done.
pkg/sql/distsql_running.go, line 138 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mind defending why this approach is better than just checking
meta.Statushere and not complicating every other caller ofGetStrippedTxnCoordMeta?
As we were discussing, it's cause we want the TCS to be in charge of producing the right error.
Check out the API now - I've moved the "stripped" flavor elsewhere, left the txn with two flavors - one produces errors, one doesn't - and now the TxnStatusOpt is just used by the TCS API, not by the Txn (which is more "client" facing).
7f0b70f to
b2b0cfc
Compare
andreimatei
left a comment
There was a problem hiding this comment.
wrote an integration test. PTAL
Reviewable status:
complete! 0 of 0 LGTMs obtained
... to a standard callback. Will be used marginally in the next commit. Release note: None
... with a pair of methods on the TxnCoordMeta proto. This is in preparation of txn.GetTxnCoordMeta() being split into two flavors in the next commit. Release note: None
60d97c9 to
92b6676
Compare
26901bf to
f486866
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewed 11 of 11 files at r3, 6 of 6 files at r4, 17 of 17 files at r5, 1 of 1 files at r6, 2 of 2 files at r7, 15 of 35 files at r8.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/internal/client/txn.go, line 138 at r7 (raw file):
log.Fatalf(context.TODO(), "attempting to create txn with nil db for Transaction: %s", meta.Txn) } if meta.Txn.Status != roachpb.PENDING {
Careful. What about mixed version clusters?
pkg/sql/integration_test.go, line 1 at r5 (raw file):
// Copyright 2018 The Cockroach Authors.
sql/integration_test.go is a useless file name. If this doesn't belong in any other test file then name this something descriptive.
pkg/sql/integration_test.go, line 49 at r5 (raw file):
// in it. We're careful to not use the transaction for anything but running the // plan; planning will be performed outside of the transaction. func TestDistSQLRunningInAbortedTxn(t *testing.T) {
Nice test!
pkg/util/tracing/test_utils.go, line 28 at r5 (raw file):
spMsg := "" for _, l := range recSp.Logs { for _, f := range l.Fields {
Why is this needed? Do we expect callers to match against these fields? If so, should we explain this format in the function comment?
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/internal/client/txn.go, line 138 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Careful. What about mixed version clusters?
but I've also added protection in DistSQLServer, a layer above. So DistSQL can't get here with an aborted txn, and that should be the only case where this happened.
pkg/sql/integration_test.go, line 1 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
sql/integration_test.gois a useless file name. If this doesn't belong in any other test file then name this something descriptive.
It's not useless. It's the most descriptive I can come up with. I'm not gonna name a file just for one test. You can imagine other "integration tests" coming here. There's also precedent in kv/integration_test.go :)
pkg/sql/integration_test.go, line 49 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Nice test!
Done.
pkg/util/tracing/test_utils.go, line 28 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this needed? Do we expect callers to match against these fields? If so, should we explain this format in the function comment?
well the log message itself is one of these fields
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/integration_test.go, line 1 at r5 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
It's not useless. It's the most descriptive I can come up with. I'm not gonna name a file just for one test. You can imagine other "integration tests" coming here. There's also precedent in
kv/integration_test.go:)
well, I guess this didn't technically end up being an "integration test" after all, it's mostly a test for the DistSQLPlanner.Run() method. Let me see about finding another home.
Before this patch, we had a race between a heartbeat find out (async) that a txn has been aborted and a client using the txn to create DistSQL flows. We could end up creating flows in aborted txns, which are going to have a bad time: at the moment, leaf TxnCoordSenders don't react well to what appears to be an async abort: tcs.maybeRejectClientLocked() returns a HandledRetryableError, but the DistSQL infrastructure only expects raw retriable errors - so the HandledRetryableError was ironically losing its "retryable" character and was making it to a sql client with the wrong pgwire code. This patch resolves the situation by atomically checking that the txn is still good (i.e. status==PENDING) before extracting its metadata that is sent to leaves. Fixes cockroachdb#28898 Fixes cockroachdb#29271 Release note (bug fix): Fix an issue where, under severe load, clients were sometimes receiving retryable errors with a non-retryable error code (a client would get an error with the message "HandledRetryableError: ..." but an internal error code instead of the expected retryable error code).
Release note: None
... now that the gateway is not trying to create them any more. Release note: None
Release note: None
f486866 to
adb01ee
Compare
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/integration_test.go, line 1 at r5 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
well, I guess this didn't technically end up being an "integration test" after all, it's mostly a test for the
DistSQLPlanner.Run()method. Let me see about finding another home.
moved
nvb
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/internal/client/txn.go, line 138 at r7 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
but I've also added protection in DistSQLServer, a layer above. So DistSQL can't get here with an aborted txn, and that should be the only case where this happened.
👍 just wanted to make sure you had thought about this.
pkg/sql/distsqlrun/server.go, line 330 at r12 (raw file):
if meta := req.TxnCoordMeta; meta != nil { if !localState.IsLocal { if meta.Txn.Status != roachpb.PENDING {
nit: Maybe leave a comment that this can be removed in 2.2, unless you think we should always keep it.
pkg/sql/integration_test.go, line 1 at r5 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
moved
Much better.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained
pkg/sql/distsqlrun/server.go, line 330 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: Maybe leave a comment that this can be removed in 2.2, unless you think we should always keep it.
I think we should always keep it. It's pretty basic sanity and acts as documentation too.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained
pkg/internal/client/txn.go, line 138 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
👍 just wanted to make sure you had thought about this.
btw, I considered having this return an error instead of crash, but didn't do it as the function is too commonly called and it's not worth introducing the error handling I thought
andreimatei
left a comment
There was a problem hiding this comment.
bors r+
Reviewable status:
complete! 1 of 0 LGTMs obtained
29455: distsqlrun,kv,client: don't create flows in aborted txns r=andreimatei a=andreimatei Before this patch, we had a race between a heartbeat find out (async) that a txn has been aborted and a client using the txn to create DistSQL flows. We could end up creating flows in aborted txns, which are going to have a bad time: at the moment, leaf TxnCoordSenders don't react well to what appears to be an async abort: tcs.maybeRejectClientLocked() returns a HandledRetryableError, but the DistSQL infrastructure only expects raw retriable errors - so the HandledRetryableError was ironically losing its "retryable" character and was making it to a sql client with the wrong pgwire code. This patch resolves the situation by atomically checking that the txn is still good (i.e. status==PENDING) before extracting its metadata that is sent to leaves. Fixes #28898 Fixes #29271 Release note: Fix an issue where, under severe load, clients were sometimes receiving retryable errors with a non-retryable error code (a client would get an error with the message "HandledRetryableError: ..." but an internal error code instead of the expected retryable error code). Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Build succeeded |
Before this patch, we had a race between a heartbeat find out (async)
that a txn has been aborted and a client using the txn to create DistSQL
flows. We could end up creating flows in aborted txns, which are going
to have a bad time: at the moment, leaf TxnCoordSenders don't react well
to what appears to be an async abort: tcs.maybeRejectClientLocked()
returns a HandledRetryableError, but the DistSQL infrastructure only
expects raw retriable errors - so the HandledRetryableError was
ironically losing its "retryable" character and was making it to a
sql client with the wrong pgwire code.
This patch resolves the situation by atomically checking that the txn is
still good (i.e. status==PENDING) before extracting its metadata that is
sent to leaves.
Fixes #28898
Fixes #29271
Release note: Fix an issue where, under severe load, clients were
sometimes receiving retryable errors with a non-retryable error code (a
client would get an error with the message "HandledRetryableError: ..."
but an internal error code instead of the expected retryable error
code).