sql: clean up health checks #127293

Open
andrewbaptist wants to merge 2 commits into cockroachdb:master from andrewbaptist:2024-07-16-sql-liveness

Conversation

@andrewbaptist

Previously we added health checks to validate the health of SQL pods in addition to KV pods. Ideally, KV checks should be handled by the liveness framework, while SQL-to-SQL pod checks are outside its scope. This change makes the code paths more explicit.

Epic: CRDB-39091

Informs: #120774

Release note: None

@cockroach-teamcity

This change is Reviewable

@andrewbaptist andrewbaptist requested a review from yuzefovich July 16, 2024 22:17
@andrewbaptist andrewbaptist force-pushed the 2024-07-16-sql-liveness branch from 93e394b to 3cdfae7 Compare July 17, 2024 13:33
@andrewbaptist andrewbaptist marked this pull request as ready for review July 17, 2024 13:38
@andrewbaptist andrewbaptist requested review from a team as code owners July 17, 2024 13:38
@andrewbaptist andrewbaptist requested review from stevendanna and removed request for a team July 17, 2024 13:38

@yuzefovich yuzefovich left a comment


Nice! I do have some comments.

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @stevendanna)


pkg/sql/distsql_physical_planner.go line 119 at r2 (raw file):

```go
	// sqlNodeAvailable encapsulates the various node health checks to avoid
	// planning on unhealthy nodes.
```

nit: s/unhealthy nodes/unhealthy SQL instances/.


pkg/sql/distsql_physical_planner.go line 1262 at r2 (raw file):

```go
	// such a node is unhealthy, the retry-as-local mechanism would retry
	// the operation on the gateway.
	if err := sqlConnHealth(sqlInstanceID, addr); err != nil && !errors.Is(err, rpc.ErrNotHeartbeated) {
```

Hm, if we get ErrNotHeartbeated we will optimistically consider the instance healthy and cache this NodeOk result in the nodeStatusesCache. Before this patch we would only do the former part (i.e. we wouldn't cache this check). Is this change intentional?


pkg/sql/distsql_physical_planner.go line 1269 at r2 (raw file):

```go
}

// CheckKVNodeHealth returns whether the node with the given SQLInstanceID is
```

nit: s/SQLInstanceID/NodeID/.


pkg/sql/distsql_physical_planner.go line 1271 at r2 (raw file):

```go
// CheckKVNodeHealth returns whether the node with the given SQLInstanceID is
// available for DistSQL planning based on a number of factors. This function
// checks either the kvConnHealth or the sqlConnHealth depending on whether an
```

nit: looks like the comment needs an update - we don't use sqlConnHealth here.

@andrewbaptist andrewbaptist force-pushed the 2024-07-16-sql-liveness branch from 3cdfae7 to 7d4ad40 Compare July 26, 2024 12:25

@andrewbaptist andrewbaptist left a comment


TFTR! If you get a chance to make one more pass at it before you leave, that would be great! Otherwise I can find someone else to review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @yuzefovich)


pkg/sql/distsql_physical_planner.go line 1262 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, if we get ErrNotHeartbeated we will optimistically consider the instance healthy and cache this NodeOk result in the nodeStatusesCache. Before this patch we would only do the former part (i.e. we wouldn't cache this check). Is this change intentional?

I'm looking at this again and better understand what it was trying to do. I ended up switching this condition (consider ErrNotHeartbeated as not OK and cache it for this plan). My understanding is that this cache only lives for the length of this planning session, and it seems strange that it would return different results across different calls.

I'm adding additional commits to this PR which remove the added method ConnHealthTryDialInstance. It wasn't needed as we already have a dialer with an address resolver that should work. I also added one more commit that I think fixes a subtle bug in the health checks if the health of a node changed from healthy to unhealthy at the right time. It probably wouldn't matter in practice, but I think the resulting code is easier to follow.
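For concreteness, the switched condition can be sketched as follows. This is a hypothetical stand-alone illustration, not the actual planner code: `checkInstanceHealth`, `NodeStatus`, and `errNotHeartbeated` here are made-up stand-ins for the real identifiers. The point is that a heartbeat-pending error now counts as unhealthy, and the verdict is cached so repeated calls within one planning pass agree.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotHeartbeated is a stand-in for rpc.ErrNotHeartbeated.
var errNotHeartbeated = errors.New("connection not yet heartbeated")

// NodeStatus mimics the planner's per-plan health verdict.
type NodeStatus int

const (
	NodeOK NodeStatus = iota
	NodeUnhealthy
)

// checkInstanceHealth sketches the new behavior: any connection error,
// including errNotHeartbeated, marks the instance unhealthy, and the verdict
// is cached for the lifetime of this planning pass.
func checkInstanceHealth(id int, connHealth func(int) error, cache map[int]NodeStatus) NodeStatus {
	if status, ok := cache[id]; ok {
		return status
	}
	status := NodeOK
	if err := connHealth(id); err != nil {
		status = NodeUnhealthy
	}
	cache[id] = status
	return status
}

func main() {
	cache := map[int]NodeStatus{}
	connHealth := func(id int) error {
		if id == 2 {
			return errNotHeartbeated // first heartbeat still pending
		}
		return nil
	}
	fmt.Println(checkInstanceHealth(1, connHealth, cache) == NodeOK)        // true
	fmt.Println(checkInstanceHealth(2, connHealth, cache) == NodeUnhealthy) // true: not-heartbeated now counts as unhealthy
	fmt.Println(cache[2] == NodeUnhealthy)                                  // true: and the verdict is cached for this plan
}
```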


pkg/sql/distsql_physical_planner.go line 1269 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/SQLInstanceID/NodeID/.

Fixed


pkg/sql/distsql_physical_planner.go line 1271 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: looks like the comment needs an update - we don't use sqlConnHealth here.

Fixed


@yuzefovich yuzefovich left a comment


I have a couple more questions to consider. I'll also be somewhat available next week, so I might take another look. It would also be good for @shubhamdhama to take a look, given that he just modified this code and might have more context than me.

Reviewed 2 of 2 files at r3, 6 of 6 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @shubhamdhama, and @stevendanna)


-- commits line 9 at r3:
nit: s/or/of/.


-- commits line 10 at r3:
nit: I think you meant "we won't use this for DistSQL physical planning".


pkg/server/server_sql.go line 955 at r3 (raw file):

```go
	// NB: This will return ErrNotHeartbeated on the first call, but that's fine
	// as we will plan without this node. The connection will still be
```

Do you know how long this connection, once established, will be maintained? I wonder whether it'll be possible that we won't ever use some of the SQL pods because of scaling up and down that would break the connections between them.


pkg/sql/distsql_physical_planner.go line 1683 at r5 (raw file):

```go
		if n.InstanceID == dsp.gatewaySQLInstanceID ||
			dsp.checkInstanceHealth(n.InstanceID, nodeStatusesCache) == NodeOK {
			instances[j] = n
```

Hm, I'm confused by the last commit - we're still modifying the original instances slice in-place on each call here, so it doesn't look like the commit message matches the code.

@andrewbaptist andrewbaptist force-pushed the 2024-07-16-sql-liveness branch from 434b481 to ec44093 Compare July 29, 2024 13:49

@andrewbaptist andrewbaptist left a comment


Sounds good - I'm adding @shubhamdhama as a reviewer as well.

Also I wanted to try this out on a real cluster with SQL pods, but I'm not really sure how to do that... I wanted to make sure I didn't introduce any bugs as part of this refactoring.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shubhamdhama, @stevendanna, and @yuzefovich)


-- commits line 9 at r3:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/or/of/.

Fixed


-- commits line 10 at r3:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think you meant "we won't use this for DistSQL physical planning".

Yep - I cleaned this text up a little. Thanks!


pkg/server/server_sql.go line 955 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do you know how long this connection, once established, will be maintained? I wonder whether it'll be possible that we won't ever use some of the SQL pods because of scaling up and down that would break the connections between them.

The connection SHOULD stay up forever once it is established. The only thing that breaks the connection is if the node goes down and we can't re-establish it. That said, there may be some timeouts that I'm not aware of that will break it. I'm not positive how to test this code. Maybe @shubhamdhama can help with that part as well?


pkg/sql/distsql_physical_planner.go line 119 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/unhealthy nodes/unhealthy SQL instances/.

Fixed (in previous revision, I'm not sure why my comment got lost)


pkg/sql/distsql_physical_planner.go line 1683 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, I'm confused by the last commit - we're still modifying the original instances slice in-place on each call here, so it doesn't look like the commit message matches the code.

You're right (sometimes I hate the Go semantics for immutable slices...). I'm going to drop this last commit. I don't want to make a copy, and it probably isn't important.
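For readers who haven't hit this pitfall: a Go function that compacts a slice in place, like the filtering loop quoted above, mutates the caller's backing array even though it returns a fresh slice header. A minimal stand-alone illustration (`filterHealthy` is a made-up name, not the planner code):

```go
package main

import "fmt"

// filterHealthy compacts IDs passing the health predicate to the front of the
// slice it receives. Because Go slices share a backing array, the caller's
// slice is mutated even though filterHealthy returns a new slice header.
func filterHealthy(instances []int, healthy func(int) bool) []int {
	j := 0
	for _, n := range instances {
		if healthy(n) {
			instances[j] = n
			j++
		}
	}
	return instances[:j]
}

func main() {
	all := []int{1, 2, 3, 4}
	live := filterHealthy(all, func(n int) bool { return n%2 == 1 })
	fmt.Println(live) // [1 3]
	fmt.Println(all)  // [1 3 3 4] -- the backing array was modified in place
}
```

Avoiding the mutation requires an explicit copy (e.g. building the result with `append` onto a nil slice), which is the trade-off being declined here.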

@andrewbaptist andrewbaptist requested review from yuzefovich and removed request for shubhamdhama July 29, 2024 13:50
Andrew Baptist added 2 commits July 29, 2024 10:13
Previously we added health checks to validate the health of SQL pods in
addition to KV pods. Ideally, KV checks should be handled by the liveness
framework, while SQL-to-SQL pod checks are outside its scope. This change
makes the code paths more explicit.

This commit slightly changes the behavior of the cache. Now if there is
no existing connection to a node, we will attempt to create the
connection in the background but won't wait for it to be established,
and therefore won't use that node for this query's DistSQL plan.

Epic: CRDB-39091

Informs: cockroachdb#120774

Release note: None
In a previous PR the method ConnHealthTryDialInstance was added. However,
this is unnecessary if the `nodedialer.Dialer` is correctly initialized
with a working `addressResolver`. The only use of this method already
set up the resolver correctly.

Epic: CRDB-39091

Informs: cockroachdb#120774

Release note: None
@andrewbaptist andrewbaptist force-pushed the 2024-07-16-sql-liveness branch from ec44093 to 7a75870 Compare July 29, 2024 14:22

@shubhamdhama shubhamdhama left a comment


(sorry for the delay, just back from a vacation)

I'm not sure about a real cluster, but for manual testing I used roachprod clusters.

```shell
# create the cluster
cluster=local-external &&
    roachprod create -n 3 $cluster &&
    roachprod put $cluster cockroach &&
    roachprod start $cluster &&
    roachprod start-sql thorium --storage-cluster $cluster:1-3 --external-nodes $cluster:1-3 &&
    roachprod sql $cluster:1 --cluster thorium
```

Then in separate panes I ran

```shell
query="SELECT now(), info FROM [EXPLAIN (DISTSQL) SELECT sum(xsquared) FROM t] WHERE info LIKE 'Diagram%'"
roachprod sql $cluster:1 --cluster thorium -- -e $query --watch 300ms
```

and

```shell
roachprod stop $cluster:3
```

and observed the plans before and after running the above stop command.

If this is not what you are looking for, could you please clarify?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @yuzefovich)


pkg/server/server_sql.go line 955 at r3 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

The connection SHOULD stay up forever once it is established. The only thing that breaks the connection is if the node goes down and we can't re-establish it. That said, there may be some timeouts that I'm not aware of that will break it. I'm not positive how to test this code. Maybe @shubhamdhama can help with that part as well?

I think this part of the cleanup changes the behavior from the one described in the deleted comment:

```go
			// Consider ErrNotHeartbeated as a temporary error (see its description) and
			// avoid caching its result, as it can resolve to a more accurate result soon.
			// It's more reasonable to plan on using this node, and if such a node is
			// unhealthy, the retry-as-local mechanism would retry the operation on the
			// gateway.
```

Is this intentional?

As for how to reproduce this: if you stress test TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode with your current changes, I think you will hit this issue. See the related PR #126976, where we started considering ErrNotHeartbeated as healthy but avoided caching it in nodeStatuses.
