sql: clean up health checks #127293
Conversation
Force-pushed from 93e394b to 3cdfae7
yuzefovich
left a comment
Nice! I do have some comments.
Reviewed 9 of 9 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @stevendanna)
pkg/sql/distsql_physical_planner.go line 119 at r2 (raw file):
// sqlNodeAvailable encapsulates the various node health checks to avoid
// planning on unhealthy nodes.
nit: s/unhealthy nodes/unhealthy SQL instances/.
pkg/sql/distsql_physical_planner.go line 1262 at r2 (raw file):
// such a node is unhealthy, the retry-as-local mechanism would retry
// the operation on the gateway.
if err := sqlConnHealth(sqlInstanceID, addr); err != nil && !errors.Is(err, rpc.ErrNotHeartbeated) {
Hm, if we get ErrNotHeartbeated we will optimistically consider the instance healthy and cache this NodeOk result in the nodeStatusesCache. Before this patch we would only do the former part (i.e. we wouldn't cache this check). Is this change intentional?
pkg/sql/distsql_physical_planner.go line 1269 at r2 (raw file):
}

// CheckKVNodeHealth returns whether the node with the given SQLInstanceID is
nit: s/SQLInstanceID/NodeID/.
pkg/sql/distsql_physical_planner.go line 1271 at r2 (raw file):
// CheckKVNodeHealth returns whether the node with the given SQLInstanceID is
// available for DistSQL planning based on a number of factors. This function
// checks either the kvConnHealth or the sqlConnHealth depending on whether an
nit: looks like the comment needs an update - we don't use sqlConnHealth here.
Force-pushed from 3cdfae7 to 7d4ad40
andrewbaptist
left a comment
TFTR! If you get a chance to make one more pass at it before you leave, that would be great! Otherwise I can find someone else to review.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1262 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm, if we get ErrNotHeartbeated we will optimistically consider the instance healthy and cache this NodeOk result in the nodeStatusesCache. Before this patch we would only do the former part (i.e. we wouldn't cache this check). Is this change intentional?
I'm looking at this again and better understand what it was trying to do. I ended up switching this condition (consider ErrNotHeartbeated as not OK and cache it for this plan). My understanding is that this cache only lives for the length of this planning session, and it seems strange that it would return different results across different calls.
I'm adding additional commits to this PR which remove the added method ConnHealthTryDialInstance. It wasn't needed, as we already have a dialer with an address resolver that should work. I also added one more commit that I think fixes a subtle bug in the health checks when the health of a node changed from healthy to unhealthy at just the right time. It probably wouldn't matter in practice, but I think the resulting code is easier to follow.
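The switched condition described here (treat ErrNotHeartbeated as not OK and cache it for the duration of the plan) can be sketched as follows; `checkAndCache`, `errNotHeartbeated`, and the status strings are hypothetical stand-ins, not the actual planner code:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotHeartbeated stands in for rpc.ErrNotHeartbeated.
var errNotHeartbeated = errors.New("not yet heartbeated")

type status string

// checkAndCache records the result in the per-plan cache regardless of which
// error was seen, so repeated calls within one planning session return a
// consistent answer instead of flipping between healthy and unhealthy.
func checkAndCache(id int, healthErr error, cache map[int]status) status {
	if s, ok := cache[id]; ok {
		return s
	}
	s := status("ok")
	if healthErr != nil { // ErrNotHeartbeated included: not OK
		s = status("unhealthy")
	}
	cache[id] = s
	return s
}

func main() {
	cache := map[int]status{}
	fmt.Println(checkAndCache(1, errNotHeartbeated, cache)) // unhealthy
	fmt.Println(checkAndCache(1, nil, cache))               // unhealthy: cached for this plan
}
```

The point of the change is the second call: even if the health check would now succeed, the cached answer is reused for the rest of this planning session.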
pkg/sql/distsql_physical_planner.go line 1269 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/SQLInstanceID/NodeID/.
Fixed
pkg/sql/distsql_physical_planner.go line 1271 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: looks like the comment needs an update - we don't use sqlConnHealth here.
Fixed
yuzefovich
left a comment
I have a couple more questions to consider. I'll also be somewhat available next week, so I might take another look. It would also be good for @shubhamdhama to take a look, given that he just modified this code and might have more context than me.
Reviewed 2 of 2 files at r3, 6 of 6 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @shubhamdhama, and @stevendanna)
-- commits line 9 at r3:
nit: s/or/of/.
-- commits line 10 at r3:
nit: I think you meant "we won't use this for DistSQL physical planning".
pkg/server/server_sql.go line 955 at r3 (raw file):
// NB: This will return ErrNotHeartbeated on the first call, but that's fine
// as we will plan without this node. The connection will still be
Do you know how long this connection, once established, will be maintained? I wonder whether it'll be possible that we won't ever use some of the SQL pods because of scaling up and down that would break the connections between them.
pkg/sql/distsql_physical_planner.go line 1683 at r5 (raw file):
if n.InstanceID == dsp.gatewaySQLInstanceID ||
	dsp.checkInstanceHealth(n.InstanceID, nodeStatusesCache) == NodeOK {
	instances[j] = n
Hm, I'm confused by the last commit - we're still modifying the original instances slice in-place on each call here, so it doesn't look like the commit message matches the code.
Force-pushed from 434b481 to ec44093
andrewbaptist
left a comment
Sounds good - I'm adding @shubhamdhama as a reviewer as well.
Also I wanted to try this out on a real cluster with SQL pods, but I'm not really sure how to do that... I wanted to make sure I didn't introduce any bugs as part of this refactoring.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @shubhamdhama, @stevendanna, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/or/of/.
Fixed
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I think you meant "we won't use this for DistSQL physical planning".
Yep - I cleaned this text up a little. Thanks!
pkg/server/server_sql.go line 955 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Do you know how long this connection, once established, will be maintained? I wonder whether it'll be possible that we won't ever use some of the SQL pods because of scaling up and down that would break the connections between them.
The connection SHOULD stay up forever once it is established. The only thing that breaks the connection is if the node goes down and we can't re-establish it. That said, there may be some timeouts that I'm not aware of that will break it. I'm not positive how to test this code. Maybe @shubhamdhama can help with that part as well?
pkg/sql/distsql_physical_planner.go line 119 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/unhealthy nodes/unhealthy SQL instances/.
Fixed (in previous revision, I'm not sure why my comment got lost)
pkg/sql/distsql_physical_planner.go line 1683 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm, I'm confused by the last commit - we're still modifying the original instances slice in-place on each call here, so it doesn't look like the commit message matches the code.
You're right (sometimes I hate the Go semantics for slices...). I'm going to drop this last commit; I don't want to make a copy, and it probably isn't important.
Previously we added health checks to validate the health of SQL pods in addition to KV pods. Ideally the KV checks should be handled by the liveness framework, while the SQL-SQL pod checks are outside the scope of the liveness framework. This change makes the code paths more explicit.

This commit slightly changes the behavior of the cache. Now, if there is no existing connection to a node, we will attempt to create the connection in the background but won't wait for it to be established, and therefore won't use the node for this query's DistSQL plan.

Epic: CRDB-39091
Informs: cockroachdb#120774
Release note: None
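The "attempt in the background, don't wait" behavior in that commit message can be sketched as follows; `dialInstance`, `healthyForPlanning`, and the `connected` map are hypothetical stand-ins, not CockroachDB APIs:

```go
package main

import (
	"fmt"
	"time"
)

// dialInstance stands in for establishing an RPC connection to a SQL
// instance; the sleep simulates a slow dial.
func dialInstance(id int) error {
	time.Sleep(10 * time.Millisecond)
	return nil
}

// healthyForPlanning reports whether the instance can be used for the
// current DistSQL plan. If no connection exists yet, it kicks off the dial
// in the background (warming it up for future plans) but does not wait,
// so this query plans without the instance.
func healthyForPlanning(id int, connected map[int]bool) bool {
	if connected[id] {
		return true
	}
	go func() { _ = dialInstance(id) }() // warm up asynchronously
	return false                         // skip for this query's plan
}

func main() {
	connected := map[int]bool{1: true}
	fmt.Println(healthyForPlanning(1, connected)) // true
	fmt.Println(healthyForPlanning(2, connected)) // false
}
```

The key property is that planning never blocks on an unestablished connection: the first plan after an instance appears simply excludes it.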
In a previous PR the method ConnHealthTryDialInstance was added. However, this is unnecessary if the `nodedialer.Dialer` is correctly initialized with a working `addressResolver`. The only use of this method already set up the resolver correctly.

Epic: CRDB-39091
Informs: cockroachdb#120774
Release note: None
Force-pushed from ec44093 to 7a75870
shubhamdhama
left a comment
(sorry for the delay, just back from a vacation)
I'm not sure about a real cluster, but for manual testing I used roachprod clusters.
# create the cluster
cluster=local-external &&
roachprod create -n 3 $cluster &&
roachprod put $cluster cockroach &&
roachprod start $cluster &&
roachprod start-sql thorium --storage-cluster $cluster:1-3 --external-nodes $cluster:1-3 &&
roachprod sql $cluster:1 --cluster thorium
Then in separate panes I ran
query="SELECT now(), info FROM [EXPLAIN (DISTSQL) SELECT sum(xsquared) FROM t] WHERE info LIKE 'Diagram%'"
roachprod sql $cluster:1 --cluster thorium -- -e $query --watch 300ms
and
roachprod stop $cluster:3
and observed the plans before and after running the above stop command.
If this is not what you are looking for, could you please clarify?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @yuzefovich)
pkg/server/server_sql.go line 955 at r3 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
The connection SHOULD stay up forever once it is established. The only thing that breaks the connection is if the node goes down and we can't re-establish it. That said, there may be some timeouts that I'm not aware of that will break it. I'm not positive how to test this code. Maybe @shubhamdhama can help with that part as well?
I think this part of the cleanup changes the behavior described in the deleted comment:
// Consider ErrNotHeartbeated as a temporary error (see its description) and
// avoid caching its result, as it can resolve to a more accurate result soon.
// It's more reasonable to plan on using this node, and if such a node is
// unhealthy, the retry-as-local mechanism would retry the operation on the
// gateway.
Is this intentional?
As for how to reproduce this: if you stress test TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode with your current changes, I think you will hit this issue. See the related PR #126976, where we started considering ErrNotHeartbeated as healthy but avoided caching it in nodeStatuses.
Previously we added health checks to validate the health of SQL pods in addition to KV pods. Ideally the KV checks should be handled by the liveness framework while the SQL-SQL pod checks are outside the scope of the liveness framework. This change makes the code paths more explicit.
Epic: CRDB-39091
Informs: #120774
Release note: None