rpc: avoid network IO in Dialer.ConnHealth#70017
rpc: avoid network IO in Dialer.ConnHealth#70017craig[bot] merged 1 commit intocockroachdb:masterfrom
Dialer.ConnHealth#70017Conversation
07bf1d8 to
f22d265
Compare
|
@cockroachdb/sql-queries This change causes
|
|
With regards to making Assuming we keep the new behavior of the DistSQL planner, I'm curious about the behavior in the following scenario: imagine we have a 3 node cluster with a table If my logic is correct, it looks like by not establishing the RPC connections during the DistSQL physical planning, we're simply delaying that in this scenario. |
|
Just popping in here without all of the context (apologies) but did we do
an audit that makes sure that `ConnHealth` prompting a dial isn't used
anywhere else? Should we be conservative and put a non-blocking
singleflight dial attempt into ConnHealth to avoid any problems? Or are we
confident this isn't an issue?
|
Yeah, so the problem with this is that we're not guaranteed that other components will actually maintain RPC connections. It may fix the test, but end up breaking production workloads in some corner cases. I think the most conservative option would be 1, which is equivalent to what the current code already does. Long term, I think we want to do some variant of 3.
I believe that's correct. The
Not fully, I was going to rely on a first CI pass to find any obvious regressions. I did a quick pass now:
It appears to me like DistSQL is the odd one out here, the other components have mechanisms to maintain RPC connections. Either adding an explicit dial (to be conservative) or relying on DistSender to recover the connections should both be viable options, I think I'd prefer the explicit dial for now.
I see that I'm not at all confident that this isn't an issue though, since the RPC dial story is pretty messy, so I'd have to rely on tests to pick up any corner cases. |
|
Opened #70111 to track a new component that maintains RPC connections (i.e. option 3 in #70017 (comment)). |
|
I see, thanks for all the context Erik. I agree with your assessment that option 1 seems like the safest and easiest choice for now. |
f22d265 to
6f5a95e
Compare
|
Added |
knz
left a comment
There was a problem hiding this comment.
Reviewed 1 of 3 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/rpc/nodedialer/nodedialer.go, line 230 at r2 (raw file):
// down), and should be avoided in latency-sensitive code paths. Preferably, // this should be replaced entirely by some other mechanism to maintain RPC // connections such as https://github.com/cockroachdb/cockroach/issues/70111
nit (style):
// to maintain RPC connections.
// See also: https:// ...
6f5a95e to
1c8934a
Compare
tbg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @knz)
pkg/rpc/context.go, line 689 at r3 (raw file):
// ConnHealth returns nil if we have an open connection of the request // class to the given node that succeeded on its most recent heartbeat.
// Returns ErrNoConnection if no connection to the node currently exists.
deb60be to
ff37511
Compare
`Dialer.ConnHealth` is used to check whether a healthy RPC connection exists to a given node, in order to avoid interacting with unavailable nodes. However, this actually attempted to dial the node if no connection was found, which can block for tens of seconds in the case of an unresponsive node. This is problematic since it is used in performance-critical code paths, including Raft command application. This patch changes `Dialer.ConnHealth` to avoid dialing the node, and adds a `Context.ConnHealth` helper with access to the RPC connection registry. Because `DistSQLPlanner` relied on `ConnHealth` to dial the remote node, it also adds `Dialer.ConnHealthTryDial` which retains the old behavior for use in DistSQL until a better solution can be implemented. Release note (bug fix): Avoid dialing nodes in performance-critical code paths, which could cause substantial latency when encountering unresponsive nodes (e.g. when a VM or server is shut down).
ff37511 to
cee675a
Compare
|
bors r=knz,tbg |
|
Build succeeded: |
Dialer.ConnHealthis used to check whether a healthy RPC connectionexists to a given node, in order to avoid interacting with unavailable
nodes. However, this actually attempted to dial the node if no
connection was found, which can block for tens of seconds in the case of
an unresponsive node. This is problematic since it is used in
performance-critical code paths, including Raft command application.
This patch changes
Dialer.ConnHealthto avoid dialing the node, andadds a
Context.ConnHealthhelper with access to the RPC connectionregistry. Because
DistSQLPlannerrelied onConnHealthto dial theremote node, it also adds
Dialer.ConnHealthTryDialwhich retainsthe old behavior for use in DistSQL until a better solution can be
implemented.
Resolves #69888.
Release note (bug fix): Avoid dialing nodes in performance-critical code
paths, which could cause substantial latency when encountering
unresponsive nodes (e.g. when a VM or server is shut down).