sql: use a different error code for communication failure#41451
sql: use a different error code for communication failure#41451craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
bdarnell
left a comment
There was a problem hiding this comment.
Reviewed 3 of 3 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @knz)
|
I'm OK with changing the code, but be mindful that all the XX codes are treated as "internal errors" and then:
I suspect this is not what you're looking for? (I assume that distsql errors are transient and probably expected under production chaos) edit: I was mistaken see below |
andreimatei
left a comment
There was a problem hiding this comment.
Hmmm is all this really what we want for the other XX errors that we added? RangeUnavailable, CCLRequired, CCLValidLicenseRequired, TransactionCommittedWithSchemaChangeFailure? It doesn't sound to me like the verbose logging, stack traces and sentry are what we want for any of these...
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz)
|
I just checked the code, I was mistaken. Only XX000 (pgcode.Internal) cause sentry reports and a detailed stack trace to be sent to the client. Then XXUUU also separately get reported to telemetry. What I do know however is most clients will categorize all XX... errors are "irrecoverable" and will not try to engage in shenanigans to recover from it in the session. |
|
I still think you should create a regular code, if not in class 08 (conn exc), then in class 53 (resource exc). |
|
After careful reading of section 24.1 in ISO/IEC 9075-2:2016(E), and related documentation sections in the pg docs and at https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/codes/src/tpc/db2z_sqlstatevalues.html, we figured that:
|
Before this patch, DistSQL would use the the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (cockroachdb#31645). This patch switches to a new, CRDB-specific error code in the "system" class. For consistency, it moves the code for RangeUnavailable to the same class. The class used is class 58, used by both Postgres and DB2 for "system" errors. In Postgres, these errors are external to the database (e.g. IO). In DB2 (and now in CRDB), they're also internal (to the cluster). Fixes cockroachdb#31645 Release note: None
1a3fd7f to
c22e03f
Compare
andreimatei
left a comment
There was a problem hiding this comment.
Long live 58C.
PTAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @knz)
andreimatei
left a comment
There was a problem hiding this comment.
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)
33655: storage: skip TestWedgedReplicaDetection on test short r=andreimatei a=andreimatei Takes 10s. I've opened #33654 asking for an investigation. Release note: None 41391: roachtest: sort clusters in leftover clusters report r=andreimatei a=andreimatei They used to displayed in non-deterministic map iteration order. Release note: None 41401: roachtest: plumb the cluster id prefix from --cluster-id r=andreimatei a=andreimatei The flag was not hooked up to anything. Release justification: N/A Release note: None 41451: sql: use a different error code for communication failure r=andreimatei a=andreimatei Before this patch, DistSQL would use the the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (#31645). This patch switches to a new, CRDB-specific error code in the "internal error" class. Fixes #31645 Release note: None 41494: util/tracing: rename FormatRecordedSpans to Recording.String() r=andreimatei a=andreimatei Elevate the discoverability of FormatRecordedSpans() by making it the stringer for a Recording. Remove the inferior stringer I had previously added. Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Build succeeded |
This was moved from "08006" in cockroachdb#41451, not "XXC03". Release note: None
Fixes cockroachdb/cockroach#36981. Fixes cockroachdb/cockroach#39618. Fixes cockroachdb/cockroach#40552. Fixes cockroachdb/cockroach#41735. cockroachdb/cockroach#41451 switched two forms of errors that can be thrown during chaos events over to a new error code class - 58, internal system errors. This commit updates `pqConnectionError` to consider this error code class as retry-worthy.
43145: storage/diskmap: remove SortedDiskMapIterator.{Key,Value} r=petermattis a=petermattis
Remove `SortedDiskMapIterator.{Key,Value}` as these accessors are
horribly slow due to performing an allocation on every call. Change the
existing uses of these methods to `Unsafe{Key,Value}` adding copying
when necessary. Most of the use cases were easy to fix, though note that
`diskRowIterator.Row()` contains a TODO because the removal of the
allocation there caused many test failures.
The `PebbleMapIteration` benchmarks still show a regression in
comparison to f5009c8. That regression
is entirely due to Pebble's new memtable sizing heuristics which start
with a small memtable size and dynamically grow the size up to the
configured limit. Adding a knob to disable that behavior for the
purposes of a benchmark does not seem worthwhile.
```
name old time/op new time/op delta
RocksDBMapIteration/InputSize4096-16 1.18ms ± 3% 0.81ms ± 0% -31.56% (p=0.000 n=10+8)
RocksDBMapIteration/InputSize16384-16 5.83ms ± 1% 4.14ms ± 3% -29.13% (p=0.000 n=9+10)
RocksDBMapIteration/InputSize65536-16 24.8ms ± 1% 17.7ms ± 3% -28.54% (p=0.000 n=9+10)
RocksDBMapIteration/InputSize262144-16 137ms ± 0% 105ms ± 2% -23.65% (p=0.000 n=9+9)
RocksDBMapIteration/InputSize1048576-16 547ms ± 1% 430ms ± 1% -21.44% (p=0.000 n=8+9)
PebbleMapIteration/InputSize4096-16 594µs ± 1% 323µs ± 2% -45.65% (p=0.000 n=9+9)
PebbleMapIteration/InputSize16384-16 3.29ms ± 1% 2.15ms ± 1% -34.70% (p=0.000 n=10+9)
PebbleMapIteration/InputSize65536-16 16.0ms ± 7% 11.2ms ±11% -30.26% (p=0.000 n=10+10)
PebbleMapIteration/InputSize262144-16 96.7ms ± 3% 76.5ms ± 5% -20.88% (p=0.000 n=10+10)
PebbleMapIteration/InputSize1048576-16 267ms ± 0% 185ms ± 1% -30.60% (p=0.000 n=9+10)
name old alloc/op new alloc/op delta
RocksDBMapIteration/InputSize4096-16 262kB ± 0% 0kB ± 0% -99.97% (p=0.000 n=10+10)
RocksDBMapIteration/InputSize16384-16 1.31MB ± 0% 0.00MB ± 0% -99.99% (p=0.000 n=10+10)
RocksDBMapIteration/InputSize65536-16 5.51MB ± 0% 0.00MB ± 3% -100.00% (p=0.000 n=10+10)
RocksDBMapIteration/InputSize262144-16 22.3MB ± 0% 0.0MB ± 0% -100.00% (p=0.000 n=10+10)
RocksDBMapIteration/InputSize1048576-16 89.4MB ± 0% 0.0MB ± 0% -100.00% (p=0.000 n=10+9)
PebbleMapIteration/InputSize4096-16 263kB ± 0% 0kB ± 0% -99.91% (p=0.000 n=10+10)
PebbleMapIteration/InputSize16384-16 1.31MB ± 0% 0.00MB ± 0% -99.98% (p=0.000 n=10+8)
PebbleMapIteration/InputSize65536-16 5.50MB ± 0% 0.00MB ± 3% -99.99% (p=0.000 n=10+9)
PebbleMapIteration/InputSize262144-16 22.3MB ± 0% 0.0MB ± 0% -99.99% (p=0.000 n=10+7)
PebbleMapIteration/InputSize1048576-16 89.3MB ± 0% 0.0MB ±26% -100.00% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
RocksDBMapIteration/InputSize4096-16 8.20k ± 0% 0.00k ± 0% -99.96% (p=0.000 n=10+10)
RocksDBMapIteration/InputSize16384-16 41.0k ± 0% 0.0k ± 0% -99.99% (p=0.000 n=10+10)
RocksDBMapIteration/InputSize65536-16 172k ± 0% 0k ± 0% -100.00% (p=0.000 n=10+10)
RocksDBMapIteration/InputSize262144-16 696k ± 0% 0k ± 0% -100.00% (p=0.000 n=10+10)
RocksDBMapIteration/InputSize1048576-16 2.79M ± 0% 0.00M ± 0% -100.00% (p=0.000 n=9+9)
PebbleMapIteration/InputSize4096-16 8.20k ± 0% 0.01k ± 0% -99.94% (p=0.000 n=10+10)
PebbleMapIteration/InputSize16384-16 41.0k ± 0% 0.0k ± 0% -99.99% (p=0.000 n=10+10)
PebbleMapIteration/InputSize65536-16 172k ± 0% 0k ± 0% -100.00% (p=0.000 n=10+10)
PebbleMapIteration/InputSize262144-16 696k ± 0% 0k ± 0% -100.00% (p=0.000 n=10+10)
PebbleMapIteration/InputSize1048576-16 2.79M ± 0% 0.00M ± 9% -100.00% (p=0.000 n=10+10)
```
Release note: None
43207: roachprod/gce: use 2 SSDs for c2 machine types r=nvanbenschoten a=nvanbenschoten
This is required to spin up c2 instances, just as it is to spin up
n2 instances.
Release note: None
43214: sql: stop swallowing errors from UncachedPhysicalAccessor.IsValidSchema r=andreimatei a=ajwerner
Before this commit we'd swallow retriable errors during execution. This can
be very problematic as it can lead to lost writes and other general funkiness.
We're opting to not write a test for this specific case as there is ongoing
work to change interfaces to preclude this sort of bug.
Fixes #43067
Fixes #37883 (comment)
Release note: None
43219: pgwire/pgcode: fix DeprecatedInternalConnectionFailure r=nvanbenschoten a=nvanbenschoten
This was moved from "08006" in #41451, not "XXC03".
Release note: None
Co-authored-by: Peter Mattis <petermattis@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Before this patch, DistSQL would use the the Postgres
ConnectionFailure code when a network stream between processors on
different nodes would break. This was the wrong code to use; Postgres
uses this code for trouble with the client connection, not internal
problems. There's evidence that middleware treats this code as a signal
to tear down a connection (#31645). This patch switches to a new,
CRDB-specific error code in the "internal error" class.
Fixes #31645
Release note: None