Skip to content

sql: use a different error code for communication failure#41451

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:sql.conn-failure-code
Oct 18, 2019
Merged

sql: use a different error code for communication failure#41451
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:sql.conn-failure-code

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

Before this patch, DistSQL would use the the Postgres
ConnectionFailure code when a network stream between processors on
different nodes would break. This was the wrong code to use; Postgres
uses this code for trouble with the client connection, not internal
problems. There's evidence that middleware treats this code as a signal
to tear down a connection (#31645). This patch switches to a new,
CRDB-specific error code in the "internal error" class.

Fixes #31645

Release note: None

@andreimatei andreimatei requested review from bdarnell and knz October 8, 2019 21:07
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)

@knz
Copy link
Copy Markdown
Contributor

knz commented Oct 9, 2019

I'm OK with changing the code, but be mindful that all the XX codes are treated as "internal errors" and then:

  • cause a sentry report to be created
  • cause very verbose logging
  • cause the clients to receive a full stack trace in their error details

I suspect this is not what you're looking for? (I assume that distsql errors are transient and probably expected under production chaos)

edit: I was mistaken see below

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm is all this really what we want for the other XX errors that we added? RangeUnavailable, CCLRequired, CCLValidLicenseRequired, TransactionCommittedWithSchemaChangeFailure? It doesn't sound to me like the verbose logging, stack traces and sentry are what we want for any of these...

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)

@knz
Copy link
Copy Markdown
Contributor

knz commented Oct 9, 2019

I just checked the code, I was mistaken. Only XX000 (pgcode.Internal) cause sentry reports and a detailed stack trace to be sent to the client.

Then XXUUU also separately get reported to telemetry.

What I do know however is most clients will categorize all XX... errors are "irrecoverable" and will not try to engage in shenanigans to recover from it in the session.

@knz
Copy link
Copy Markdown
Contributor

knz commented Oct 9, 2019

I still think you should create a regular code, if not in class 08 (conn exc), then in class 53 (resource exc).

@knz
Copy link
Copy Markdown
Contributor

knz commented Oct 9, 2019

After careful reading of section 24.1 in ISO/IEC 9075-2:2016(E), and related documentation sections in the pg docs and at https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/codes/src/tpc/db2z_sqlstatevalues.html, we figured that:

  • both this new error and also the current RangeUnavailable error fall under class 58 - system errors.
  • to ensure that our codes don't overlap with pg, and understanding that pg only uses numeric codes or the letter P after the class prefix, we'll introduce prefix 58C for them.

Before this patch, DistSQL would use the the Postgres
ConnectionFailure code when a network stream between processors on
different nodes would break. This was the wrong code to use; Postgres
uses this code for trouble with the client connection, not internal
problems. There's evidence that middleware treats this code as a signal
to tear down a connection (cockroachdb#31645). This patch switches to a new,
CRDB-specific error code in the "system" class.

For consistency, it moves the code for RangeUnavailable to the same
class.

The class used is class 58, used by both Postgres and DB2 for "system"
errors. In Postgres, these errors are external to the database (e.g.
IO). In DB2 (and now in CRDB), they're also internal (to the cluster).

Fixes cockroachdb#31645

Release note: None
@andreimatei andreimatei force-pushed the sql.conn-failure-code branch from 1a3fd7f to c22e03f Compare October 9, 2019 22:28
Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long live 58C.
PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @knz)

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

craig bot pushed a commit that referenced this pull request Oct 17, 2019
33655: storage: skip TestWedgedReplicaDetection on test short r=andreimatei a=andreimatei

Takes 10s. I've opened #33654 asking for an investigation.

Release note: None

41391: roachtest: sort clusters in leftover clusters report r=andreimatei a=andreimatei

They used to displayed in non-deterministic map iteration order.

Release note: None

41401: roachtest: plumb the cluster id prefix from --cluster-id r=andreimatei a=andreimatei

The flag was not hooked up to anything.

Release justification: N/A

Release note: None

41451: sql: use a different error code for communication failure r=andreimatei a=andreimatei

Before this patch, DistSQL would use the the Postgres
ConnectionFailure code when a network stream between processors on
different nodes would break. This was the wrong code to use; Postgres
uses this code for trouble with the client connection, not internal
problems. There's evidence that middleware treats this code as a signal
to tear down a connection (#31645). This patch switches to a new,
CRDB-specific error code in the "internal error" class.

Fixes #31645

Release note: None

41494: util/tracing: rename FormatRecordedSpans to Recording.String() r=andreimatei a=andreimatei

Elevate the discoverability of FormatRecordedSpans() by making it the
stringer for a Recording. Remove the inferior stringer I had previously
added.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Oct 18, 2019

Build succeeded

@craig craig bot merged commit c22e03f into cockroachdb:master Oct 18, 2019
@andreimatei andreimatei deleted the sql.conn-failure-code branch October 21, 2019 20:19
nvb added a commit to nvb/cockroach that referenced this pull request Dec 17, 2019
This was moved from "08006" in cockroachdb#41451, not "XXC03".

Release note: None
nvb added a commit to nvb/rksql that referenced this pull request Dec 17, 2019
Fixes cockroachdb/cockroach#36981.
Fixes cockroachdb/cockroach#39618.
Fixes cockroachdb/cockroach#40552.
Fixes cockroachdb/cockroach#41735.

cockroachdb/cockroach#41451 switched two forms
of errors that can be thrown during chaos events over to a new error code
class - 58, internal system errors. This commit updates `pqConnectionError`
to consider this error code class as retry-worthy.
craig bot pushed a commit that referenced this pull request Dec 17, 2019
43145: storage/diskmap: remove SortedDiskMapIterator.{Key,Value} r=petermattis a=petermattis

Remove `SortedDiskMapIterator.{Key,Value}` as these accessors are
horribly slow due to performing an allocation on every call. Change the
existing uses of these methods to `Unsafe{Key,Value}` adding copying
when necessary. Most of the use cases were easy to fix, though note that
`diskRowIterator.Row()` contains a TODO because the removal of the
allocation there caused many test failures.

The `PebbleMapIteration` benchmarks still show a regression in
comparison to f5009c8. That regression
is entirely due to Pebble's new memtable sizing heuristics which start
with a small memtable size and dynamically grow the size up to the
configured limit. Adding a knob to disable that behavior for the
purposes of a benchmark does not seem worthwhile.

```
name                                     old time/op    new time/op    delta
RocksDBMapIteration/InputSize4096-16       1.18ms ± 3%    0.81ms ± 0%   -31.56%  (p=0.000 n=10+8)
RocksDBMapIteration/InputSize16384-16      5.83ms ± 1%    4.14ms ± 3%   -29.13%  (p=0.000 n=9+10)
RocksDBMapIteration/InputSize65536-16      24.8ms ± 1%    17.7ms ± 3%   -28.54%  (p=0.000 n=9+10)
RocksDBMapIteration/InputSize262144-16      137ms ± 0%     105ms ± 2%   -23.65%  (p=0.000 n=9+9)
RocksDBMapIteration/InputSize1048576-16     547ms ± 1%     430ms ± 1%   -21.44%  (p=0.000 n=8+9)
PebbleMapIteration/InputSize4096-16         594µs ± 1%     323µs ± 2%   -45.65%  (p=0.000 n=9+9)
PebbleMapIteration/InputSize16384-16       3.29ms ± 1%    2.15ms ± 1%   -34.70%  (p=0.000 n=10+9)
PebbleMapIteration/InputSize65536-16       16.0ms ± 7%    11.2ms ±11%   -30.26%  (p=0.000 n=10+10)
PebbleMapIteration/InputSize262144-16      96.7ms ± 3%    76.5ms ± 5%   -20.88%  (p=0.000 n=10+10)
PebbleMapIteration/InputSize1048576-16      267ms ± 0%     185ms ± 1%   -30.60%  (p=0.000 n=9+10)

name                                     old alloc/op   new alloc/op   delta
RocksDBMapIteration/InputSize4096-16        262kB ± 0%       0kB ± 0%   -99.97%  (p=0.000 n=10+10)
RocksDBMapIteration/InputSize16384-16      1.31MB ± 0%    0.00MB ± 0%   -99.99%  (p=0.000 n=10+10)
RocksDBMapIteration/InputSize65536-16      5.51MB ± 0%    0.00MB ± 3%  -100.00%  (p=0.000 n=10+10)
RocksDBMapIteration/InputSize262144-16     22.3MB ± 0%     0.0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RocksDBMapIteration/InputSize1048576-16    89.4MB ± 0%     0.0MB ± 0%  -100.00%  (p=0.000 n=10+9)
PebbleMapIteration/InputSize4096-16         263kB ± 0%       0kB ± 0%   -99.91%  (p=0.000 n=10+10)
PebbleMapIteration/InputSize16384-16       1.31MB ± 0%    0.00MB ± 0%   -99.98%  (p=0.000 n=10+8)
PebbleMapIteration/InputSize65536-16       5.50MB ± 0%    0.00MB ± 3%   -99.99%  (p=0.000 n=10+9)
PebbleMapIteration/InputSize262144-16      22.3MB ± 0%     0.0MB ± 0%   -99.99%  (p=0.000 n=10+7)
PebbleMapIteration/InputSize1048576-16     89.3MB ± 0%     0.0MB ±26%  -100.00%  (p=0.000 n=10+10)

name                                     old allocs/op  new allocs/op  delta
RocksDBMapIteration/InputSize4096-16        8.20k ± 0%     0.00k ± 0%   -99.96%  (p=0.000 n=10+10)
RocksDBMapIteration/InputSize16384-16       41.0k ± 0%      0.0k ± 0%   -99.99%  (p=0.000 n=10+10)
RocksDBMapIteration/InputSize65536-16        172k ± 0%        0k ± 0%  -100.00%  (p=0.000 n=10+10)
RocksDBMapIteration/InputSize262144-16       696k ± 0%        0k ± 0%  -100.00%  (p=0.000 n=10+10)
RocksDBMapIteration/InputSize1048576-16     2.79M ± 0%     0.00M ± 0%  -100.00%  (p=0.000 n=9+9)
PebbleMapIteration/InputSize4096-16         8.20k ± 0%     0.01k ± 0%   -99.94%  (p=0.000 n=10+10)
PebbleMapIteration/InputSize16384-16        41.0k ± 0%      0.0k ± 0%   -99.99%  (p=0.000 n=10+10)
PebbleMapIteration/InputSize65536-16         172k ± 0%        0k ± 0%  -100.00%  (p=0.000 n=10+10)
PebbleMapIteration/InputSize262144-16        696k ± 0%        0k ± 0%  -100.00%  (p=0.000 n=10+10)
PebbleMapIteration/InputSize1048576-16      2.79M ± 0%     0.00M ± 9%  -100.00%  (p=0.000 n=10+10)
```

Release note: None

43207: roachprod/gce: use 2 SSDs for c2 machine types r=nvanbenschoten a=nvanbenschoten

This is required to spin up c2 instances, just as it is to spin up
n2 instances.

Release note: None

43214: sql: stop swallowing errors from UncachedPhysicalAccessor.IsValidSchema r=andreimatei a=ajwerner

Before this commit we'd swallow retriable errors during execution. This can
be very problematic as it can lead to lost writes and other general funkiness.
We're opting to not write a test for this specific case as there is ongoing
work to change interfaces to preclude this sort of bug.

Fixes #43067
Fixes #37883 (comment)

Release note: None

43219: pgwire/pgcode: fix DeprecatedInternalConnectionFailure r=nvanbenschoten a=nvanbenschoten

This was moved from "08006" in #41451, not "XXC03".

Release note: None

Co-authored-by: Peter Mattis <petermattis@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SQLState(08006) error causing HTTP 500 error upstream

4 participants