
rpc: avoid network IO in Dialer.ConnHealth #70017

Merged
craig[bot] merged 1 commit into cockroachdb:master from erikgrinaker:connhealth-nodial
Sep 21, 2021

Conversation

@erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented Sep 10, 2021

Dialer.ConnHealth is used to check whether a healthy RPC connection
exists to a given node, in order to avoid interacting with unavailable
nodes. However, this actually attempted to dial the node if no
connection was found, which can block for tens of seconds in the case of
an unresponsive node. This is problematic since it is used in
performance-critical code paths, including Raft command application.

This patch changes Dialer.ConnHealth to avoid dialing the node, and
adds a Context.ConnHealth helper with access to the RPC connection
registry. Because DistSQLPlanner relied on ConnHealth to dial the
remote node, it also adds Dialer.ConnHealthTryDial which retains
the old behavior for use in DistSQL until a better solution can be
implemented.

Resolves #69888.

Release note (bug fix): Avoid dialing nodes in performance-critical code
paths, which could cause substantial latency when encountering
unresponsive nodes (e.g. when a VM or server is shut down).

@erikgrinaker erikgrinaker requested a review from a team September 10, 2021 11:54
@erikgrinaker erikgrinaker self-assigned this Sep 10, 2021
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@knz knz left a comment

Thank you!

@erikgrinaker
Contributor Author

erikgrinaker commented Sep 10, 2021

@cockroachdb/sql-queries This change causes TestDistSQLRangeCachesIntegrationTest to fail: it expects the DistSQL plan to cover 4 nodes, but because there is no RPC connection to nodes 2 and 3, ConnHealth() fails and the nodes are not included in the plan. It seems like the DistSQL planner implicitly relies on ConnHealth() attempting to dial nodes, which is no longer the case with this change. A few options:

  1. Change distSQLNodeHealth.check or DistSQLPlanner to explicitly attempt to dial remote nodes every time it plans a DistSQL flow. This can introduce arbitrary latency up to tens of seconds if a node is unresponsive (which is the motivation for this change), although it would respect the circuit breaker and thus only occasionally attempt to dial nodes that are known to be unavailable.

  2. Rely on other components maintaining RPC connections, i.e. Raft. This would require us to change this test such that we ensure there are Raft interactions between all nodes before running the DistSQL flows.

  3. Implement some new, separate component that continually attempts to maintain RPC connections to all nodes. This is related to @tbg's proposal in rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419 (comment).

  4. Paint over the problem by making TestCluster.Start() explicitly connect all cluster nodes at the start, and hope that production clusters will somehow maintain RPC connections.

@erikgrinaker erikgrinaker requested review from a team September 10, 2021 14:48
@yuzefovich
Member

With regards to making TestDistSQLRangeCachesIntegrationTest pass, I think I'd be happy with any option. To me somehow modifying the test a bit so that those RPC connections are established sounds the easiest and least invasive, so I vote for option 2 if we're happy with the new behavior of the DistSQL planner.


Assuming we keep the new behavior of the DistSQL planner, I'm curious about the behavior in the following scenario: imagine we have a 3 node cluster with a table t with multiple ranges, and each node is the leaseholder for some of these ranges; now, if we execute SELECT * FROM t with node 1 as the gateway, and the DistSQL planner considers nodes 2 and 3 as unhealthy (because there are no RPC connections yet - how is this possible? did node 1 just restart?), we will plan a single TableReader on the gateway, and it will be performing the remote reads (let's assume the range cache is up to date). We will still need to set up the RPC connections to each of the remote nodes, right? If the range cache is not up to date, the DistSender will eventually try reading from the remote nodes, and it'll still need to set up the RPC connection, right?

If my logic is correct, it looks like by not establishing the RPC connections during the DistSQL physical planning, we're simply delaying that in this scenario.

@tbg
Member

tbg commented Sep 10, 2021 via email

@erikgrinaker
Contributor Author

To me somehow modifying the test a bit so that those RPC connections are established sounds the easiest and least invasive, so I vote for option 2 if we're happy with the new behavior of the DistSQL planner.

Yeah, so the problem with this is that we're not guaranteed that other components will actually maintain RPC connections. It may fix the test, but end up breaking production workloads in some corner cases.

I think the most conservative option would be 1, which is equivalent to what the current code already does. Long term, I think we want to do some variant of 3.

Assuming we keep the new behavior of the DistSQL planner, I'm curious about the behavior in the following scenario: imagine we have a 3 node cluster with a table t with multiple ranges, and each node is the leaseholder for some of these ranges; now, if we execute SELECT * FROM t with node 1 as the gateway, and the DistSQL planner considers nodes 2 and 3 as unhealthy (because there are no RPC connections yet - how is this possible? did node 1 just restart?), we will plan a single TableReader on the gateway, and it will be performing the remote reads (let's assume the range cache is up to date). We will still need to set up the RPC connections to each of the remote nodes, right? If the range cache is not up to date, the DistSender will eventually try reading from the remote nodes, and it'll still need to set up the RPC connection, right?

I believe that's correct. The DistSender gRPC transport does consider ConnHealth before sending batches to replicas, but it only uses the result to order the replicas such that known-healthy replicas are tried first, and then does an explicit node dial for each batch when sending. I think this should then also apply to the range cache, which uses the DistSender as a source of range data.
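That use of health as an ordering hint rather than a gate — known-healthy replicas tried first, but all replicas still tried and dialed explicitly — could look roughly like this. A minimal sketch: the real grpcTransport is more involved, and `healthy` here is a hypothetical stand-in for the ConnHealth lookup:

```go
package main

import (
	"fmt"
	"sort"
)

// healthy is a hypothetical health map: true means a live RPC connection
// to that node exists (a stand-in for ConnHealth returning nil).
var healthy = map[int]bool{1: true, 2: false, 3: true}

// orderByHealth moves replicas on known-healthy nodes to the front while
// preserving relative order among equals. Unhealthy replicas are not
// dropped: they are merely tried later, and each send still dials the
// node explicitly.
func orderByHealth(replicas []int) {
	sort.SliceStable(replicas, func(i, j int) bool {
		return healthy[replicas[i]] && !healthy[replicas[j]]
	})
}

func main() {
	replicas := []int{2, 1, 3}
	orderByHealth(replicas)
	fmt.Println(replicas) // [1 3 2]: healthy nodes 1 and 3 first, 2 last
}
```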

did we do an audit that makes sure that ConnHealth prompting a dial isn't used anywhere else?

Not fully, I was going to rely on a first CI pass to find any obvious regressions. I did a quick pass now:

  • Raft: Uses ConnHealth in the tick loop and proposal quota. The Raft transport will explicitly dial remote nodes when setting up an outbound queue, and the queue is set up (if missing) every time a message is sent.
  • DistSender: Uses ConnHealth in the grpcTransport to order replicas (healthy before unhealthy), but will try all replicas in order and explicitly dials each one when sending RPCs.
  • DistSQL: Uses ConnHealth during planning. As we see here, it relies on this to also dial, and does not dial by itself during planning.

It appears to me that DistSQL is the odd one out here; the other components have their own mechanisms to maintain RPC connections. Either adding an explicit dial (to be conservative) or relying on the DistSender to recover the connections would be viable, but I'd prefer the explicit dial for now.

Should we be conservative and put a non-blocking singleflight dial attempt into ConnHealth to avoid any problems? Or are we confident this isn't an issue?

I see that Context.grpcDialNodeInternal() uses a combination of sync.Map and sync.Once to only have one dial in flight per node/class, so I suppose that would be a viable hack until we can clean this up. It doesn't appear like it should be necessary given the above though.

I'm not at all confident that this isn't an issue though, since the RPC dial story is pretty messy, so I'd have to rely on tests to pick up any corner cases.

@erikgrinaker
Contributor Author

Opened #70111 to track a new component that maintains RPC connections (i.e. option 3 in #70017 (comment)).

@yuzefovich
Member

I see, thanks for all the context Erik. I agree with your assessment that option 1 seems like the safest and easiest choice for now.

@erikgrinaker
Contributor Author

Added ConnHealthTryDial() which retains the old behavior, and used it in DistSQLPlanner. PTAL.

Contributor

@knz knz left a comment

Reviewed 1 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 230 at r2 (raw file):

// down), and should be avoided in latency-sensitive code paths. Preferably,
// this should be replaced entirely by some other mechanism to maintain RPC
// connections such as https://github.com/cockroachdb/cockroach/issues/70111

nit (style):

// to maintain RPC connections.
// See also: https:// ...

@tbg tbg requested a review from knz September 20, 2021 12:34
Member

@tbg tbg left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @knz)


pkg/rpc/context.go, line 689 at r3 (raw file):

// ConnHealth returns nil if we have an open connection of the request
// class to the given node that succeeded on its most recent heartbeat.

// Returns ErrNoConnection if no connection to the node currently exists.

@erikgrinaker erikgrinaker force-pushed the connhealth-nodial branch 2 times, most recently from deb60be to ff37511 on September 20, 2021 16:14
@erikgrinaker
Copy link
Copy Markdown
Contributor Author

bors r=knz,tbg

@craig
Contributor

craig bot commented Sep 21, 2021

Build succeeded:



Development

Successfully merging this pull request may close these issues.

rpc: Dialer.ConnHealth must not cause network IO

5 participants