
rpc: retain information about failed connections#99191

Merged
craig[bot] merged 17 commits into cockroachdb:master from
tbg:rpc-track-failed-connections
Jun 7, 2023
Conversation

@tbg
Member

@tbg tbg commented Mar 22, 2023

Prior to this change, rpc.Context did not remember a connection after it
failed. A connection attempt would be created whenever we didn't have a
functioning connection to the given (nodeID, targetAddress, class) and
callers would be multiplexed onto it as long as that attempt was not known to
have failed.
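The lookup-or-create multiplexing described above can be sketched roughly as follows. This is a simplified illustration with hypothetical names; the real rpc.Context types and fields differ:

```go
package main

import "fmt"

// connKey is a hypothetical stand-in for the (nodeID, targetAddress, class)
// triple that identifies a connection.
type connKey struct {
	nodeID int32
	addr   string
	class  string // connection class, e.g. "system" or "default"
}

type conn struct{ healthy bool }

type connMap struct {
	conns map[connKey]*conn
}

// getOrDial returns the existing attempt for the key, or starts a new one.
// Before this PR, failed attempts were deleted from the map, so a peer that
// was never dialed and one that just failed were indistinguishable here.
func (m *connMap) getOrDial(k connKey) *conn {
	if c, ok := m.conns[k]; ok {
		return c // multiplex this caller onto the in-flight attempt
	}
	c := &conn{healthy: true}
	m.conns[k] = c
	return c
}

func main() {
	m := &connMap{conns: map[connKey]*conn{}}
	k := connKey{1, "10.0.0.1:26257", "default"}
	a := m.getOrDial(k)
	b := m.getOrDial(k)
	fmt.Println(a == b) // true: both callers share the same attempt
}
```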

The main problem with this setup was that it made it difficult to provide good
observability, because a remote node that just hasn't been dialed yet and one
that is down but between attempts look the same. We could write code that
periodically tries to fully connect all nodes in the cluster to each other, but
even then exporting good metrics is challenging and while we're currently
comfortable with a densely connected cluster, that may change over time.

An additional challenge was presented by circuit breaking and logging of
attempts. Without state retained, we were limited to a simple retry scheme with
lots of redundant logging. It wasn't easy to log how long a connection had been
unhealthy (as an unhealthy connection was represented as an absent connection),
so in effect folks had to trawl through logs to grab the corresponding
timestamps of first and last failure.

Another piece of complexity was the RPC circuit breakers. These were
implemented at the NodeDialer level (i.e. one circuit breaker per
(NodeID, Class)) but kept in the rpc.Context (which generally operates on
(NodeID, Addr, Class)) because gossip also used them. The breaker library we
were using recruits client traffic for its probes, which also isn't ideal, as
it could periodically subject a SQL query to a timeout failure. We'd really
prefer it if the breakers probed without recruiting user traffic.

This PR addresses the above shortcomings:

  • state is now retained across restarts on a (NodeID, Addr, Class) basis in
    the `rpc.Context`. This has a few complications; for example, we need to handle
    decommissioned nodes, as well as nodes that restart under a new listening
    address.
  • the NodeDialer-level circuit breakers are removed.
  • we're no longer recruiting user traffic for probes. Instead, we adopt the
    util/circuit package already used for Replica-level circuit breaking. This
    library uses an async background probe to determine when to heal the breaker.

I also think the amount of incidental complexity in the rpc package is greatly
reduced as a result of this work.

I ran the failover/non-system/ suite of tests to validate this change. The results
look good.[1]

Epic: CRDB-21710
Release note (ui change): a "Networking" metrics dashboard was added.
Release note (ops change): CockroachDB now exposes the following metrics:

  • rpc.connection.{un,}healthy: gauges of healthy/unhealthy connections,
  • rpc.connection.{un,}healthy_nanos: gauges of the aggregate duration for which the currently {un,}healthy connections have been {un,}healthy,
  • grpc.connection.avg_round_trip_latency: the sum of weighted moving averages of round-trip latencies across all peers (it really only makes sense to consume this on a per-peer basis via Prometheus),
  • the counters rpc.connection.heartbeats and rpc.connection.failures.

When the server.child_metrics.enabled cluster setting is set, these metrics export per-peer values, meaning they can be used to derive a connectivity graph.

Release note (ops change): logging for intra-node RPC connection failures has been improved by reducing its frequency and improving its information content.

Footnotes

  1. https://gist.github.com/tbg/5f5c716fa2fe5a21eca4a0c3a4a35069


@tbg tbg force-pushed the rpc-track-failed-connections branch from 27b5b32 to 26cd800 on April 18, 2023 14:07
@tbg tbg force-pushed the rpc-track-failed-connections branch 4 times, most recently from ae2f39c to 33f44f7 on May 1, 2023 07:05
@tbg tbg force-pushed the rpc-track-failed-connections branch 6 times, most recently from 20f6390 to 3fa5742 on May 15, 2023 14:00
@tbg tbg force-pushed the rpc-track-failed-connections branch 2 times, most recently from 330f311 to 6301cc4 on May 16, 2023 07:38
@tbg tbg changed the title from "[dnm] rpc: track failed connections" to "rpc: retain information about failed connections" on May 16, 2023
@tbg tbg force-pushed the rpc-track-failed-connections branch from 6301cc4 to 6a0ef1c on May 16, 2023 22:34
@tbg tbg force-pushed the rpc-track-failed-connections branch from 6a0ef1c to a13af72 on May 17, 2023 17:12
@tbg tbg force-pushed the rpc-track-failed-connections branch from a13af72 to 41d8862 on May 18, 2023 13:07
craig bot pushed a commit that referenced this pull request May 19, 2023
103582: server: call VerifyClockOffset directly r=aliher1911 a=tbg

Extracted from #99191.

----

There was this weird setup where we were hiding it in a callback.
Let's just call it directly.

This was masking that fatalling on the result is actually turned off except
when starting a `Server` (i.e. outside of pure unit tests). Now there's
an explicit variable on the options to opt into the fatal error.

Now we also call it only when we want to: we didn't actually want to call it
when dealing with blocking requests. That didn't hurt, since blocking requests
don't change the state of the tracker, but still - now we're only checking when
something has changed, which is better.

Epic: None
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
craig bot pushed a commit that referenced this pull request May 19, 2023
103387: grpcutil: improve gRPC logging r=knz a=tbg

Extracted from #99191, where I ran a bunch of experiments to make sure the logging around connection failures is clean and helpful.

I did manual testing on the changes in this PR. It's tricky to test vmodule-depth-thingies programmatically and I don't want to spend time building new infra here.

- cli: don't lower gRPC severity threshold
- grpcutil: adopt DepthLoggerV2
- sidetransport: suppress logging when peer down

Release note: None
Epic: CRDB-21710


103492: kv: add assertion to prevent lock re-acquisition at prior epochs r=nvanbenschoten a=arulajmani

Lock re-acquisitions from requests that belong to prior epochs are not allowed. For replicated locks, mvccPutInternal rejects such requests with an error. However, for unreplicated locks, where the in-memory lock table is the source of truth, no such check happens. This patch adds one. The comment on the `lockHolderInfo.txn` field suggests this was always the intended behavior; it was just never enforced.

While here, we also add an assertion to ensure a replicated lock acquisition never calls into the lock table with a transaction belonging to a prior epoch. We expect this case to be handled in mvccPutInternal, so unlike the unreplicated case, this deserves to be an assertion failure.

This patch also updates a test that unintentionally re-acquired a lock using a txn from an older epoch. We also add a new test.

Informs: #102269

Release note: None

103533: allocator2: fix normalization for the no constraint case r=kvoli a=sumeerbhola

Informs #103320

Epic: CRDB-25222

Release note: None

103579: server: add comment pointing out TestGraphite slowness r=aliher1911 a=tbg

Epic: none
Release note: None


103581: server,rpc,circuit: assortment of small cleanups/improvements r=knz a=tbg

Extracted from #99191.

- server: fix a (harmless) test buglet
- rpc: small comment tweak
- rpc: fix incorrect comment
- circuit: provide type for the anonymous Signal interface
- rpc: rename a package
- rpc: minor Goland lint appeasement
- rpc: avoid shadowing `backoff` package
- rpc: make some unused args anonymous
- rpc: make it clear that TestStoreMetrics restarts stores

Release note: None
Epic: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
@tbg tbg force-pushed the rpc-track-failed-connections branch from 41d8862 to cddee5c on May 19, 2023 19:20
@tbg
Member Author

tbg commented Jun 7, 2023

bors r=aliher1911

@craig
Contributor

craig bot commented Jun 7, 2023

Build succeeded:

@yuzefovich
Member

Hey @tbg, there was a CI failure on staging that seems related to this change (one of the suspicious goroutines):

goroutine 67695 [select, 14 minutes]:
github.com/cockroachdb/cockroach/pkg/rpc.(*Connection).waitOrDefault(0xc01008d400, {0x7774070, 0xc00d06c930}, {0x0, 0x0}, {0x7745330, 0xb1b22a8})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/connection.go:79 +0x305
github.com/cockroachdb/cockroach/pkg/rpc.(*Connection).ConnectNoBreaker(0xc01008d400, {0x7774070, 0xc00d06c930})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/connection.go:145 +0x5e
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial(0xc006b701f8, {0x7774070, 0xc00d06c930}, 0x33333333?, {0x7747798?, 0xc00d8d5620}, 0x0, 0x0?)
	github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:175 +0x10f
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).DialNoBreaker(0xc006b701f8, {0x7774070, 0xc00d06c930}, 0x3b9aca00?, 0x0?)
	github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:119 +0x8d
github.com/cockroachdb/cockroach/pkg/sql/execinfra.GetConnForOutbox({0x7774070, 0xc00d06c930}, {0x7728d20, 0xc006b701f8}, 0x5d35d80?, 0x2540be400)
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/outboxbase.go:47 +0x215
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).Run.func1({0x7774070, 0xc00d06c930}, {0x7728d20?, 0xc006b701f8?}, 0x1?, 0x0?, {0x7774070, 0xc00d06c660}, 0xc01588bea0, 0xc010d45860, ...)
	github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:187 +0x65
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).Run(0xc010d45860, {0x7774070, 0xc00d06c660}, {0x7728d20, 0xc006b701f8}, 0x6459e8?, 0x0, 0xc0064528d0, 0xc7d098?)
	github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:228 +0x336
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).setupRemoteOutputStream.func1({0x7774070, 0xc00d06c660}, 0xc00b6f16c0?)
	github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:768 +0xac
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).accumulateAsyncComponent.func1.1()
	github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:730 +0x6b
created by github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).accumulateAsyncComponent.func1
	github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:728 +0xdd
No pass/skip/fail event found for test
=== RUN   TestTenantLogic_json_builtins
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/ccb100e0ba6b8b3b9d3e912d839f875a/logTestTenantLogic_json_builtins1377825662
    test_log_scope.go:81: use -show-logs to present logs inline
[16:44:09] setting distsql_workmem='29116B';
[16:44:09] setting distsql_workmem='29116B';
[16:44:10] --- done: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/3292/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/logictestccl/tests/3node-tenant/3node-tenant_test_/3node-tenant_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/json_builtins with config 3node-tenant: 288 tests, 0 failures
*
* INFO: slow quiesce. stack traces:
* goroutine 67277 [running]:

@tbg
Member Author

tbg commented Jun 14, 2023

Thanks I'll take a look tomorrow!

@tbg
Member Author

tbg commented Jun 15, 2023

This should hopefully fix it once merged: #104945

@data-matt

Folks, any chance of backporting this flag to 22.2? thanks

@tbg
Member Author

tbg commented Jul 27, 2023

No, this is a major change. We won't be able to backport it.

@data-matt

@tbg , the cluster setting server.child_metrics.enabled will be in 23.1+ ? I see references to 23.1 here.

Thanks cc @florence-crl

@erikgrinaker
Contributor

@tbg , the cluster setting server.child_metrics.enabled will be in 23.1+ ? I see references to 23.1 here.

The setting server.child_metrics.enabled has existed for years; it enables additional, possibly high-cardinality Prometheus metrics. For example, changefeeds appear to use it to expose additional per-changefeed metrics.

However, prior to 23.2, this setting does not expose additional RPC metrics. In 23.2, it will also expose per-peer connection metrics, as described in the release notes in this PR. This does not happen in 23.1, and cannot, since the necessary changes are too invasive.

The 23.1 reference you see here was for a minor bugfix that was extracted out of this PR, see #105626. It is not related to the functionality you're interested in.

@data-matt

Thanks for the clear answer @erikgrinaker.
