sql: use a different error code for communication failure by andreimatei · Pull Request #41451 · cockroachdb/cockroach

andreimatei · 2019-10-08T21:07:30Z

Before this patch, DistSQL would use the the Postgres
ConnectionFailure code when a network stream between processors on
different nodes would break. This was the wrong code to use; Postgres
uses this code for trouble with the client connection, not internal
problems. There's evidence that middleware treats this code as a signal
to tear down a connection (#31645). This patch switches to a new,
CRDB-specific error code in the "internal error" class.

Fixes #31645

Release note: None

cockroach-teamcity · 2019-10-08T21:07:37Z

This change is

bdarnell

Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)

knz · 2019-10-09T13:33:52Z

I'm OK with changing the code, but be mindful that all the XX codes are treated as "internal errors" and then:

cause a sentry report to be created
cause very verbose logging
cause the clients to receive a full stack trace in their error details

I suspect this is not what you're looking for? (I assume that distsql errors are transient and probably expected under production chaos)

edit: I was mistaken see below

andreimatei

Hmmm is all this really what we want for the other XX errors that we added? RangeUnavailable, CCLRequired, CCLValidLicenseRequired, TransactionCommittedWithSchemaChangeFailure? It doesn't sound to me like the verbose logging, stack traces and sentry are what we want for any of these...

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)

knz · 2019-10-09T16:48:21Z

I just checked the code, I was mistaken. Only XX000 (pgcode.Internal) cause sentry reports and a detailed stack trace to be sent to the client.

Then XXUUU also separately get reported to telemetry.

What I do know however is most clients will categorize all XX... errors are "irrecoverable" and will not try to engage in shenanigans to recover from it in the session.

knz · 2019-10-09T16:50:58Z

I still think you should create a regular code, if not in class 08 (conn exc), then in class 53 (resource exc).

knz · 2019-10-09T21:54:03Z

After careful reading of section 24.1 in ISO/IEC 9075-2:2016(E), and related documentation sections in the pg docs and at https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/codes/src/tpc/db2z_sqlstatevalues.html, we figured that:

both this new error and also the current RangeUnavailable error fall under class 58 - system errors.
to ensure that our codes don't overlap with pg, and understanding that pg only uses numeric codes or the letter P after the class prefix, we'll introduce prefix 58C for them.

Before this patch, DistSQL would use the the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (cockroachdb#31645). This patch switches to a new, CRDB-specific error code in the "system" class. For consistency, it moves the code for RangeUnavailable to the same class. The class used is class 58, used by both Postgres and DB2 for "system" errors. In Postgres, these errors are external to the database (e.g. IO). In DB2 (and now in CRDB), they're also internal (to the cluster). Fixes cockroachdb#31645 Release note: None

andreimatei

Long live 58C.
PTAL.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @knz)

andreimatei

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

33655: storage: skip TestWedgedReplicaDetection on test short r=andreimatei a=andreimatei Takes 10s. I've opened #33654 asking for an investigation. Release note: None 41391: roachtest: sort clusters in leftover clusters report r=andreimatei a=andreimatei They used to displayed in non-deterministic map iteration order. Release note: None 41401: roachtest: plumb the cluster id prefix from --cluster-id r=andreimatei a=andreimatei The flag was not hooked up to anything. Release justification: N/A Release note: None 41451: sql: use a different error code for communication failure r=andreimatei a=andreimatei Before this patch, DistSQL would use the the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (#31645). This patch switches to a new, CRDB-specific error code in the "internal error" class. Fixes #31645 Release note: None 41494: util/tracing: rename FormatRecordedSpans to Recording.String() r=andreimatei a=andreimatei Elevate the discoverability of FormatRecordedSpans() by making it the stringer for a Recording. Remove the inferior stringer I had previously added. Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

craig · 2019-10-18T00:34:34Z

Build succeeded

GitHub CI (Cockroach)

This was moved from "08006" in cockroachdb#41451, not "XXC03". Release note: None

Fixes cockroachdb/cockroach#36981. Fixes cockroachdb/cockroach#39618. Fixes cockroachdb/cockroach#40552. Fixes cockroachdb/cockroach#41735. cockroachdb/cockroach#41451 switched two forms of errors that can be thrown during chaos events over to a new error code class - 58, internal system errors. This commit updates `pqConnectionError` to consider this error code class as retry-worthy.

43145: storage/diskmap: remove SortedDiskMapIterator.{Key,Value} r=petermattis a=petermattis Remove `SortedDiskMapIterator.{Key,Value}` as these accessors are horribly slow due to performing an allocation on every call. Change the existing uses of these methods to `Unsafe{Key,Value}` adding copying when necessary. Most of the use cases were easy to fix, though note that `diskRowIterator.Row()` contains a TODO because the removal of the allocation there caused many test failures. The `PebbleMapIteration` benchmarks still show a regression in comparison to f5009c8. That regression is entirely due to Pebble's new memtable sizing heuristics which start with a small memtable size and dynamically grow the size up to the configured limit. Adding a knob to disable that behavior for the purposes of a benchmark does not seem worthwhile. ``` name old time/op new time/op delta RocksDBMapIteration/InputSize4096-16 1.18ms ± 3% 0.81ms ± 0% -31.56% (p=0.000 n=10+8) RocksDBMapIteration/InputSize16384-16 5.83ms ± 1% 4.14ms ± 3% -29.13% (p=0.000 n=9+10) RocksDBMapIteration/InputSize65536-16 24.8ms ± 1% 17.7ms ± 3% -28.54% (p=0.000 n=9+10) RocksDBMapIteration/InputSize262144-16 137ms ± 0% 105ms ± 2% -23.65% (p=0.000 n=9+9) RocksDBMapIteration/InputSize1048576-16 547ms ± 1% 430ms ± 1% -21.44% (p=0.000 n=8+9) PebbleMapIteration/InputSize4096-16 594µs ± 1% 323µs ± 2% -45.65% (p=0.000 n=9+9) PebbleMapIteration/InputSize16384-16 3.29ms ± 1% 2.15ms ± 1% -34.70% (p=0.000 n=10+9) PebbleMapIteration/InputSize65536-16 16.0ms ± 7% 11.2ms ±11% -30.26% (p=0.000 n=10+10) PebbleMapIteration/InputSize262144-16 96.7ms ± 3% 76.5ms ± 5% -20.88% (p=0.000 n=10+10) PebbleMapIteration/InputSize1048576-16 267ms ± 0% 185ms ± 1% -30.60% (p=0.000 n=9+10) name old alloc/op new alloc/op delta RocksDBMapIteration/InputSize4096-16 262kB ± 0% 0kB ± 0% -99.97% (p=0.000 n=10+10) RocksDBMapIteration/InputSize16384-16 1.31MB ± 0% 0.00MB ± 0% -99.99% (p=0.000 n=10+10) RocksDBMapIteration/InputSize65536-16 5.51MB ± 0% 0.00MB ± 3% -100.00% (p=0.000 n=10+10) RocksDBMapIteration/InputSize262144-16 22.3MB ± 0% 0.0MB ± 0% -100.00% (p=0.000 n=10+10) RocksDBMapIteration/InputSize1048576-16 89.4MB ± 0% 0.0MB ± 0% -100.00% (p=0.000 n=10+9) PebbleMapIteration/InputSize4096-16 263kB ± 0% 0kB ± 0% -99.91% (p=0.000 n=10+10) PebbleMapIteration/InputSize16384-16 1.31MB ± 0% 0.00MB ± 0% -99.98% (p=0.000 n=10+8) PebbleMapIteration/InputSize65536-16 5.50MB ± 0% 0.00MB ± 3% -99.99% (p=0.000 n=10+9) PebbleMapIteration/InputSize262144-16 22.3MB ± 0% 0.0MB ± 0% -99.99% (p=0.000 n=10+7) PebbleMapIteration/InputSize1048576-16 89.3MB ± 0% 0.0MB ±26% -100.00% (p=0.000 n=10+10) name old allocs/op new allocs/op delta RocksDBMapIteration/InputSize4096-16 8.20k ± 0% 0.00k ± 0% -99.96% (p=0.000 n=10+10) RocksDBMapIteration/InputSize16384-16 41.0k ± 0% 0.0k ± 0% -99.99% (p=0.000 n=10+10) RocksDBMapIteration/InputSize65536-16 172k ± 0% 0k ± 0% -100.00% (p=0.000 n=10+10) RocksDBMapIteration/InputSize262144-16 696k ± 0% 0k ± 0% -100.00% (p=0.000 n=10+10) RocksDBMapIteration/InputSize1048576-16 2.79M ± 0% 0.00M ± 0% -100.00% (p=0.000 n=9+9) PebbleMapIteration/InputSize4096-16 8.20k ± 0% 0.01k ± 0% -99.94% (p=0.000 n=10+10) PebbleMapIteration/InputSize16384-16 41.0k ± 0% 0.0k ± 0% -99.99% (p=0.000 n=10+10) PebbleMapIteration/InputSize65536-16 172k ± 0% 0k ± 0% -100.00% (p=0.000 n=10+10) PebbleMapIteration/InputSize262144-16 696k ± 0% 0k ± 0% -100.00% (p=0.000 n=10+10) PebbleMapIteration/InputSize1048576-16 2.79M ± 0% 0.00M ± 9% -100.00% (p=0.000 n=10+10) ``` Release note: None 43207: roachprod/gce: use 2 SSDs for c2 machine types r=nvanbenschoten a=nvanbenschoten This is required to spin up c2 instances, just as it is to spin up n2 instances. Release note: None 43214: sql: stop swallowing errors from UncachedPhysicalAccessor.IsValidSchema r=andreimatei a=ajwerner Before this commit we'd swallow retriable errors during execution. This can be very problematic as it can lead to lost writes and other general funkiness. We're opting to not write a test for this specific case as there is ongoing work to change interfaces to preclude this sort of bug. Fixes #43067 Fixes #37883 (comment) Release note: None 43219: pgwire/pgcode: fix DeprecatedInternalConnectionFailure r=nvanbenschoten a=nvanbenschoten This was moved from "08006" in #41451, not "XXC03". Release note: None Co-authored-by: Peter Mattis <petermattis@gmail.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

andreimatei requested review from bdarnell and knz October 8, 2019 21:07

bdarnell approved these changes Oct 8, 2019

View reviewed changes

andreimatei commented Oct 9, 2019

View reviewed changes

andreimatei force-pushed the sql.conn-failure-code branch from 1a3fd7f to c22e03f Compare October 9, 2019 22:28

andreimatei commented Oct 9, 2019

View reviewed changes

knz approved these changes Oct 10, 2019

View reviewed changes

andreimatei commented Oct 17, 2019

View reviewed changes

craig bot merged commit c22e03f into cockroachdb:master Oct 18, 2019

andreimatei deleted the sql.conn-failure-code branch October 21, 2019 20:19

nvb added a commit to nvb/cockroach that referenced this pull request Dec 17, 2019

pgwire/pgcode: fix DeprecatedInternalConnectionFailure

7227df4

This was moved from "08006" in cockroachdb#41451, not "XXC03". Release note: None

nvb mentioned this pull request Dec 17, 2019

pgwire/pgcode: fix DeprecatedInternalConnectionFailure #43219

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: use a different error code for communication failure#41451

sql: use a different error code for communication failure#41451
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:sql.conn-failure-code

andreimatei commented Oct 8, 2019

Uh oh!

cockroach-teamcity commented Oct 8, 2019

Uh oh!

bdarnell left a comment

Uh oh!

knz commented Oct 9, 2019 •

edited

Loading

Uh oh!

andreimatei left a comment

Uh oh!

knz commented Oct 9, 2019 •

edited

Loading

Uh oh!

knz commented Oct 9, 2019

Uh oh!

knz commented Oct 9, 2019

Uh oh!

andreimatei left a comment

Uh oh!

andreimatei left a comment

Uh oh!

craig bot commented Oct 18, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

andreimatei commented Oct 8, 2019

Uh oh!

cockroach-teamcity commented Oct 8, 2019

Uh oh!

bdarnell left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Oct 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Oct 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

knz commented Oct 9, 2019

Uh oh!

knz commented Oct 9, 2019

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

craig bot commented Oct 18, 2019

Build succeeded

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

knz commented Oct 9, 2019 •

edited

Loading

knz commented Oct 9, 2019 •

edited

Loading