
rpc: avoid network IO in Dialer.ConnHealth #70017

Merged
craig[bot] merged 1 commit into cockroachdb:master from erikgrinaker:connhealth-nodial
Sep 21, 2021

Conversation

@erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented Sep 10, 2021

Dialer.ConnHealth is used to check whether a healthy RPC connection
exists to a given node, in order to avoid interacting with unavailable
nodes. However, this actually attempted to dial the node if no
connection was found, which can block for tens of seconds in the case of
an unresponsive node. This is problematic since it is used in
performance-critical code paths, including Raft command application.

This patch changes Dialer.ConnHealth to avoid dialing the node, and
adds a Context.ConnHealth helper with access to the RPC connection
registry. Because DistSQLPlanner relied on ConnHealth to dial the
remote node, it also adds Dialer.ConnHealthTryDial which retains
the old behavior for use in DistSQL until a better solution can be
implemented.

Resolves #69888.

Release note (bug fix): Avoid dialing nodes in performance-critical code
paths, which could cause substantial latency when encountering
unresponsive nodes (e.g. when a VM or server is shut down).

@erikgrinaker erikgrinaker requested a review from a team September 10, 2021 11:54
@erikgrinaker erikgrinaker self-assigned this Sep 10, 2021
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@knz knz left a comment

Thank you!

@erikgrinaker
Contributor Author

erikgrinaker commented Sep 10, 2021

@cockroachdb/sql-queries This change causes TestDistSQLRangeCachesIntegrationTest to fail: it expects the DistSQL plan to cover 4 nodes, but because there is no RPC connection to nodes 2 and 3, ConnHealth() fails and the nodes are not included in the plan. It seems like the DistSQL planner implicitly relies on ConnHealth() attempting to dial nodes, which is no longer the case with this change. A few options:

  1. Change distSQLNodeHealth.check or DistSQLPlanner to explicitly attempt to dial remote nodes every time it plans a DistSQL flow. This can introduce arbitrary latency up to tens of seconds if a node is unresponsive (which is the motivation for this change), although it would respect the circuit breaker and thus only occasionally attempt to dial nodes that are known to be unavailable.

  2. Rely on other components maintaining RPC connections, i.e. Raft. This would require us to change this test such that we ensure there are Raft interactions between all nodes before running the DistSQL flows.

  3. Implement some new, separate component that continually attempts to maintain RPC connections to all nodes. This is related to @tbg's proposal in rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419 (comment).

  4. Paint over the problem by making TestCluster.Start() explicitly connect all cluster nodes at the start, and hope that production clusters will somehow maintain RPC connections.

@erikgrinaker erikgrinaker requested review from a team September 10, 2021 14:48
@yuzefovich
Member

With regards to making TestDistSQLRangeCachesIntegrationTest pass, I think I'd be happy with any option. To me somehow modifying the test a bit so that those RPC connections are established sounds the easiest and least invasive, so I vote for option 2 if we're happy with the new behavior of the DistSQL planner.


Assuming we keep the new behavior of the DistSQL planner, I'm curious about the behavior in the following scenario: imagine we have a 3 node cluster with a table t with multiple ranges, and each node is the leaseholder for some of these ranges; now, if we execute SELECT * FROM t with node 1 as the gateway, and the DistSQL planner considers nodes 2 and 3 as unhealthy (because there are no RPC connections yet - how is this possible? did node 1 just restart?), we will plan a single TableReader on the gateway, and it will be performing the remote reads (let's assume the range cache is up to date). We will still need to set up the RPC connections to each of the remote nodes, right? If the range cache is not up to date, the DistSender will eventually try reading from the remote nodes, and it'll still need to set up the RPC connection, right?

If my logic is correct, it looks like by not establishing the RPC connections during the DistSQL physical planning, we're simply delaying that in this scenario.

@tbg
Member

tbg commented Sep 10, 2021 via email

@erikgrinaker
Contributor Author

To me somehow modifying the test a bit so that those RPC connections are established sounds the easiest and least invasive, so I vote for option 2 if we're happy with the new behavior of the DistSQL planner.

Yeah, so the problem with this is that we're not guaranteed that other components will actually maintain RPC connections. It may fix the test, but end up breaking production workloads in some corner cases.

I think the most conservative option would be 1, which is equivalent to what the current code already does. Long term, I think we want to do some variant of 3.

Assuming we keep the new behavior of the DistSQL planner, I'm curious about the behavior in the following scenario: imagine we have a 3 node cluster with a table t with multiple ranges, and each node is the leaseholder for some of these ranges; now, if we execute SELECT * FROM t with node 1 as the gateway, and the DistSQL planner considers nodes 2 and 3 as unhealthy (because there are no RPC connections yet - how is this possible? did node 1 just restart?), we will plan a single TableReader on the gateway, and it will be performing the remote reads (let's assume the range cache is up to date). We will still need to set up the RPC connections to each of the remote nodes, right? If the range cache is not up to date, the DistSender will eventually try reading from the remote nodes, and it'll still need to set up the RPC connection, right?

I believe that's correct. The DistSender gRPC transport does consider ConnHealth before sending batches to replicas, but it only uses the result to order the replicas such that known-healthy replicas are tried first, and then does an explicit node dial for each batch when sending. I think this should then also apply to the range cache, which uses the DistSender as a source of range data.
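That use of health as an ordering hint rather than a gate — known-healthy replicas tried first, but all replicas still tried and dialed explicitly — could look roughly like this. A minimal sketch: the real grpcTransport is more involved, and `healthy` here is a hypothetical stand-in for the ConnHealth lookup:

```go
package main

import (
	"fmt"
	"sort"
)

// healthy is a hypothetical health map: true means a live RPC connection
// to that node exists (a stand-in for ConnHealth returning nil).
var healthy = map[int]bool{1: true, 2: false, 3: true}

// orderByHealth moves replicas on known-healthy nodes to the front while
// preserving relative order among equals. Unhealthy replicas are not
// dropped: they are merely tried later, and each send still dials the
// node explicitly.
func orderByHealth(replicas []int) {
	sort.SliceStable(replicas, func(i, j int) bool {
		return healthy[replicas[i]] && !healthy[replicas[j]]
	})
}

func main() {
	replicas := []int{2, 1, 3}
	orderByHealth(replicas)
	fmt.Println(replicas) // [1 3 2]: healthy nodes 1 and 3 first, 2 last
}
```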

did we do an audit that makes sure that ConnHealth prompting a dial isn't used anywhere else?

Not fully, I was going to rely on a first CI pass to find any obvious regressions. I did a quick pass now:

  • Raft: Uses ConnHealth in the tick loop and proposal quota. The Raft transport will explicitly dial remote nodes when setting up an outbound queue, and the queue is set up (if missing) every time a message is sent.
  • DistSender: Uses ConnHealth in the grpcTransport to order replicas (healthy before unhealthy), but will try all replicas in order and explicitly dials each one when sending RPCs.
  • DistSQL: Uses ConnHealth during planning. As we see here, it relies on this to also dial, and does not dial by itself during planning.

It appears to me that DistSQL is the odd one out here; the other components have their own mechanisms to maintain RPC connections. Either adding an explicit dial (to be conservative) or relying on the DistSender to recover the connections would be viable, but I'd prefer the explicit dial for now.

Should we be conservative and put a non-blocking singleflight dial attempt into ConnHealth to avoid any problems? Or are we confident this isn't an issue?

I see that Context.grpcDialNodeInternal() uses a combination of sync.Map and sync.Once to only have one dial in flight per node/class, so I suppose that would be a viable hack until we can clean this up. It doesn't appear like it should be necessary given the above though.

I'm not at all confident that this isn't an issue though, since the RPC dial story is pretty messy, so I'd have to rely on tests to pick up any corner cases.

@erikgrinaker
Contributor Author

Opened #70111 to track a new component that maintains RPC connections (i.e. option 3 in #70017 (comment)).

@yuzefovich
Member

I see, thanks for all the context Erik. I agree with your assessment that option 1 seems like the safest and easiest choice for now.

@erikgrinaker
Contributor Author

Added ConnHealthTryDial() which retains the old behavior, and used it in DistSQLPlanner. PTAL.

Contributor

@knz knz left a comment

Reviewed 1 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 230 at r2 (raw file):

// down), and should be avoided in latency-sensitive code paths. Preferably,
// this should be replaced entirely by some other mechanism to maintain RPC
// connections such as https://github.com/cockroachdb/cockroach/issues/70111

nit (style):

// to maintain RPC connections.
// See also: https:// ...

@tbg tbg requested a review from knz September 20, 2021 12:34
Member

@tbg tbg left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @knz)


pkg/rpc/context.go, line 689 at r3 (raw file):

// ConnHealth returns nil if we have an open connection of the request
// class to the given node that succeeded on its most recent heartbeat.

// Returns ErrNoConnection if no connection to the node currently exists.

@erikgrinaker erikgrinaker force-pushed the connhealth-nodial branch 2 times, most recently from deb60be to ff37511 on September 20, 2021 16:14
@erikgrinaker
Copy link
Copy Markdown
Contributor Author

bors r=knz,tbg

@craig
Contributor

craig bot commented Sep 21, 2021

Build succeeded:



Development

Successfully merging this pull request may close these issues.

rpc: Dialer.ConnHealth must not cause network IO

5 participants