sql: clean up health checks #127293

Open
andrewbaptist wants to merge 2 commits into cockroachdb:master from andrewbaptist:2024-07-16-sql-liveness

Conversation

@andrewbaptist

Previously we added health checks to validate the health of SQL pods in addition to KV pods. Ideally, KV checks should be handled by the liveness framework, while SQL-to-SQL pod checks are outside its scope. This change makes the code paths more explicit.

Epic: CRDB-39091

Informs: #120774

Release note: None

@cockroach-teamcity

This change is Reviewable

@andrewbaptist andrewbaptist requested a review from yuzefovich July 16, 2024 22:17
@andrewbaptist andrewbaptist force-pushed the 2024-07-16-sql-liveness branch from 93e394b to 3cdfae7 Compare July 17, 2024 13:33
@andrewbaptist andrewbaptist marked this pull request as ready for review July 17, 2024 13:38
@andrewbaptist andrewbaptist requested review from a team as code owners July 17, 2024 13:38
@andrewbaptist andrewbaptist requested review from stevendanna and removed request for a team July 17, 2024 13:38

@yuzefovich yuzefovich left a comment


Nice! I do have some comments.

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @stevendanna)


pkg/sql/distsql_physical_planner.go line 119 at r2 (raw file):

```go
	// sqlNodeAvailable encapsulates the various node health checks to avoid
	// planning on unhealthy nodes.
```

nit: s/unhealthy nodes/unhealthy SQL instances/.


pkg/sql/distsql_physical_planner.go line 1262 at r2 (raw file):

```go
	// such a node is unhealthy, the retry-as-local mechanism would retry
	// the operation on the gateway.
	if err := sqlConnHealth(sqlInstanceID, addr); err != nil && !errors.Is(err, rpc.ErrNotHeartbeated) {
```

Hm, if we get ErrNotHeartbeated we will optimistically consider the instance healthy and cache this NodeOk result in the nodeStatusesCache. Before this patch we would only do the former part (i.e. we wouldn't cache this check). Is this change intentional?


pkg/sql/distsql_physical_planner.go line 1269 at r2 (raw file):

```go
}

// CheckKVNodeHealth returns whether the node with the given SQLInstanceID is
```

nit: s/SQLInstanceID/NodeID/.


pkg/sql/distsql_physical_planner.go line 1271 at r2 (raw file):

```go
// CheckKVNodeHealth returns whether the node with the given SQLInstanceID is
// available for DistSQL planning based on a number of factors. This function
// checks either the kvConnHealth or the sqlConnHealth depending on whether an
```

nit: looks like the comment needs an update - we don't use sqlConnHealth here.

@andrewbaptist andrewbaptist force-pushed the 2024-07-16-sql-liveness branch from 3cdfae7 to 7d4ad40 Compare July 26, 2024 12:25

@andrewbaptist andrewbaptist left a comment


TFTR! If you get a chance to make one more pass at it before you leave, that would be great! Otherwise I can find someone else to review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @yuzefovich)


pkg/sql/distsql_physical_planner.go line 1262 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, if we get ErrNotHeartbeated we will optimistically consider the instance healthy and cache this NodeOk result in the nodeStatusesCache. Before this patch we would only do the former part (i.e. we wouldn't cache this check). Is this change intentional?

I'm looking at this again and better understand what it was trying to do. I ended up switching this condition (consider ErrNotHeartbeated as not OK and cache it for this plan). My understanding is that this cache only lives for the length of this planning session, and it seems strange that it would return different results across different calls.

I'm adding additional commits to this PR which remove the added method ConnHealthTryDialInstance. It wasn't needed as we already have a dialer with an address resolver that should work. I also added one more commit that I think fixes a subtle bug in the health checks if the health of a node changed from healthy to unhealthy at the right time. It probably wouldn't matter in practice, but I think the resulting code is easier to follow.
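For concreteness, the switched condition can be sketched as follows. This is a hypothetical stand-alone illustration, not the actual planner code: `checkInstanceHealth`, `NodeStatus`, and `errNotHeartbeated` here are made-up stand-ins for the real identifiers. The point is that a heartbeat-pending error now counts as unhealthy, and the verdict is cached so repeated calls within one planning pass agree.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotHeartbeated is a stand-in for rpc.ErrNotHeartbeated.
var errNotHeartbeated = errors.New("connection not yet heartbeated")

// NodeStatus mimics the planner's per-plan health verdict.
type NodeStatus int

const (
	NodeOK NodeStatus = iota
	NodeUnhealthy
)

// checkInstanceHealth sketches the new behavior: any connection error,
// including errNotHeartbeated, marks the instance unhealthy, and the verdict
// is cached for the lifetime of this planning pass.
func checkInstanceHealth(id int, connHealth func(int) error, cache map[int]NodeStatus) NodeStatus {
	if status, ok := cache[id]; ok {
		return status
	}
	status := NodeOK
	if err := connHealth(id); err != nil {
		status = NodeUnhealthy
	}
	cache[id] = status
	return status
}

func main() {
	cache := map[int]NodeStatus{}
	connHealth := func(id int) error {
		if id == 2 {
			return errNotHeartbeated // first heartbeat still pending
		}
		return nil
	}
	fmt.Println(checkInstanceHealth(1, connHealth, cache) == NodeOK)        // true
	fmt.Println(checkInstanceHealth(2, connHealth, cache) == NodeUnhealthy) // true: not-heartbeated now counts as unhealthy
	fmt.Println(cache[2] == NodeUnhealthy)                                  // true: and the verdict is cached for this plan
}
```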


pkg/sql/distsql_physical_planner.go line 1269 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/SQLInstanceID/NodeID/.

Fixed


pkg/sql/distsql_physical_planner.go line 1271 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: looks like the comment needs an update - we don't use sqlConnHealth here.

Fixed


@yuzefovich yuzefovich left a comment


I have a couple more questions to consider. I'll also be somewhat available next week, so I might take another look. It would also be good for @shubhamdhama to take a look, given that he just modified this code and might have more context than me.

Reviewed 2 of 2 files at r3, 6 of 6 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @shubhamdhama, and @stevendanna)


-- commits line 9 at r3:
nit: s/or/of/.


-- commits line 10 at r3:
nit: I think you meant "we won't use this for DistSQL physical planning".


pkg/server/server_sql.go line 955 at r3 (raw file):

```go
	// NB: This will return ErrNotHeartbeated on the first call, but that's fine
	// as we will plan without this node. The connection will still be
```

Do you know how long this connection, once established, will be maintained? I wonder whether it'll be possible that we won't ever use some of the SQL pods because of scaling up and down that would break the connections between them.


pkg/sql/distsql_physical_planner.go line 1683 at r5 (raw file):

```go
		if n.InstanceID == dsp.gatewaySQLInstanceID ||
			dsp.checkInstanceHealth(n.InstanceID, nodeStatusesCache) == NodeOK {
			instances[j] = n
```

Hm, I'm confused by the last commit - we're still modifying the original instances slice in-place on each call here, so it doesn't look like the commit message matches the code.

@andrewbaptist andrewbaptist force-pushed the 2024-07-16-sql-liveness branch from 434b481 to ec44093 Compare July 29, 2024 13:49

@andrewbaptist andrewbaptist left a comment


Sounds good - I'm adding @shubhamdhama as a reviewer as well.

Also I wanted to try this out on a real cluster with SQL pods, but I'm not really sure how to do that... I wanted to make sure I didn't introduce any bugs as part of this refactoring.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shubhamdhama, @stevendanna, and @yuzefovich)


-- commits line 9 at r3:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/or/of/.

Fixed


-- commits line 10 at r3:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think you meant "we won't use this for DistSQL physical planning".

Yep - I cleaned this text up a little. Thanks!


pkg/server/server_sql.go line 955 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do you know how long this connection, once established, will be maintained? I wonder whether it'll be possible that we won't ever use some of the SQL pods because of scaling up and down that would break the connections between them.

The connection SHOULD stay up forever once it is established. The only thing that breaks the connection is if the node goes down and we can't re-establish it. That said, there may be some timeouts that I'm not aware of that will break it. I'm not positive how to test this code. Maybe @shubhamdhama can help with that part as well?


pkg/sql/distsql_physical_planner.go line 119 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/unhealthy nodes/unhealthy SQL instances/.

Fixed (in previous revision, I'm not sure why my comment got lost)


pkg/sql/distsql_physical_planner.go line 1683 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, I'm confused by the last commit - we're still modifying the original instances slice in-place on each call here, so it doesn't look like the commit message matches the code.

You're right (sometimes I hate the Go semantics for immutable slices...). I'm going to drop this last commit. I don't want to make a copy, and it probably isn't important.
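For readers who haven't hit this pitfall: a Go function that compacts a slice in place, like the filtering loop quoted above, mutates the caller's backing array even though it returns a fresh slice header. A minimal stand-alone illustration (`filterHealthy` is a made-up name, not the planner code):

```go
package main

import "fmt"

// filterHealthy compacts IDs passing the health predicate to the front of the
// slice it receives. Because Go slices share a backing array, the caller's
// slice is mutated even though filterHealthy returns a new slice header.
func filterHealthy(instances []int, healthy func(int) bool) []int {
	j := 0
	for _, n := range instances {
		if healthy(n) {
			instances[j] = n
			j++
		}
	}
	return instances[:j]
}

func main() {
	all := []int{1, 2, 3, 4}
	live := filterHealthy(all, func(n int) bool { return n%2 == 1 })
	fmt.Println(live) // [1 3]
	fmt.Println(all)  // [1 3 3 4] -- the backing array was modified in place
}
```

Avoiding the mutation requires an explicit copy (e.g. building the result with `append` onto a nil slice), which is the trade-off being declined here.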

@andrewbaptist andrewbaptist requested review from yuzefovich and removed request for shubhamdhama July 29, 2024 13:50
Andrew Baptist added 2 commits July 29, 2024 10:13
Previously we added health checks to validate the health of SQL pods in
addition to KV pods. Ideally, KV checks should be handled by the liveness
framework, while SQL-to-SQL pod checks are outside its scope. This change
makes the code paths more explicit.

This commit slightly changes the behavior of the cache. Now if there is
no existing connection to a node, we will attempt to create the
connection in the background but won't wait for it to be established,
and therefore won't use that node for this query's DistSQL plan.

Epic: CRDB-39091

Informs: cockroachdb#120774

Release note: None
In a previous PR the method ConnHealthTryDialInstance was added. However,
this is unnecessary if the `nodedialer.Dialer` is correctly initialized
with a working `addressResolver`. The only use of this method already
set up the resolver correctly.

Epic: CRDB-39091

Informs: cockroachdb#120774

Release note: None
@andrewbaptist andrewbaptist force-pushed the 2024-07-16-sql-liveness branch from ec44093 to 7a75870 Compare July 29, 2024 14:22

@shubhamdhama shubhamdhama left a comment


(sorry for the delay, just back from a vacation)

I'm not sure about a real cluster, but for manual testing I used roachprod clusters.

```shell
# create the cluster
cluster=local-external &&
    roachprod create -n 3 $cluster &&
    roachprod put $cluster cockroach &&
    roachprod start $cluster &&
    roachprod start-sql thorium --storage-cluster $cluster:1-3 --external-nodes $cluster:1-3 &&
    roachprod sql $cluster:1 --cluster thorium
```

Then in separate panes I ran

```shell
query="SELECT now(), info FROM [EXPLAIN (DISTSQL) SELECT sum(xsquared) FROM t] WHERE info LIKE 'Diagram%'"
roachprod sql $cluster:1 --cluster thorium -- -e $query --watch 300ms
```

and

```shell
roachprod stop $cluster:3
```

and observed the plans before and after running the above stop command.

If this is not what you are looking for, could you please clarify?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @yuzefovich)


pkg/server/server_sql.go line 955 at r3 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

The connection SHOULD stay up forever once it is established. The only thing that breaks the connection is if the node goes down and we can't re-establish it. That said, there may be some timeouts that I'm not aware of that will break it. I'm not positive how to test this code. Maybe @shubhamdhama can help with that part as well?

I think this part of the cleanup changes the behavior from the one described in the deleted comment:

```go
			// Consider ErrNotHeartbeated as a temporary error (see its description) and
			// avoid caching its result, as it can resolve to a more accurate result soon.
			// It's more reasonable to plan on using this node, and if such a node is
			// unhealthy, the retry-as-local mechanism would retry the operation on the
			// gateway.
```

Is this intentional?

As for how to reproduce this: if you stress test TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode with your current changes, I think you will hit this issue. See the related PR #126976, where we started considering ErrNotHeartbeated as healthy but avoided caching it in nodeStatuses.
