[Core] Add TPU util to determine number of ready multi-host slices#61300

Merged
edoakes merged 10 commits into ray-project:master from ryanaoleary:ready-tpu-slices-util on Mar 12, 2026

Conversation

@ryanaoleary (Contributor)

Description

This PR adds a util to check for the number of alive, complete TPU slices in a RayCluster. This PR also adds better test coverage.

This utility is used in the Ray Train elastic policy to cap the number of workers that can be scaled by the AutoscalingCoordinator.

Related issues

#55162

Related PR: #61299
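As a rough illustration of the behavior described above (not the PR's actual code — the data shapes and the `count_ready_slices` helper are assumptions for this sketch), a multi-host slice counts as ready only when its alive nodes together report the slice's full expected chip count:

```python
# Illustrative sketch: count multi-host TPU slices that are fully ready.
# A slice is treated as ready only when its alive nodes collectively
# report the slice's full complement of TPU chips as available.
from collections import defaultdict

def count_ready_slices(nodes, chips_expected_per_slice):
    """nodes: list of dicts with hypothetical keys 'slice_id',
    'alive', and 'available_tpu_chips'."""
    chips_per_slice = defaultdict(int)
    for node in nodes:
        if node["alive"]:
            chips_per_slice[node["slice_id"]] += node["available_tpu_chips"]
    return sum(
        1 for total in chips_per_slice.values()
        if total >= chips_expected_per_slice
    )

nodes = [
    {"slice_id": "slice-0", "alive": True, "available_tpu_chips": 4},
    {"slice_id": "slice-0", "alive": True, "available_tpu_chips": 4},
    {"slice_id": "slice-1", "alive": True, "available_tpu_chips": 4},
    {"slice_id": "slice-1", "alive": False, "available_tpu_chips": 0},
]
print(count_ready_slices(nodes, chips_expected_per_slice=8))  # -> 1
```

Only slice-0 reaches the expected 8 chips; slice-1 has a dead node and counts as incomplete, so an elastic policy using this count would cap workers to one slice's worth.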

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner February 25, 2026 02:07
@ryanaoleary (Contributor, Author)

cc: @edoakes @MengjinYan

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new utility, get_num_ready_tpu_slices, to determine the number of alive and complete TPU slices within a Ray cluster. This is intended for use with the Ray Train elastic policy. The changes include the utility function itself and a comprehensive suite of tests covering various cluster states. The implementation is generally solid, but I've identified a potential race condition in an optimization that could lead to incorrect results. My feedback focuses on improving the correctness of the utility.

@ray-gardener ray-gardener bot added the community-contribution Contributed by the community label Feb 25, 2026
@liulehui (Contributor) left a comment

tyty!!

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from liulehui February 26, 2026 01:14
ryanaoleary and others added 3 commits March 4, 2026 01:34
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@siyuanfoundation (Contributor)

/lgtm

@MengjinYan (Contributor) left a comment

Thanks!

@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Mar 10, 2026
@ryanaoleary (Contributor, Author)

Thanks!

@MengjinYan I actually updated the util in the other PR based on the discussion here: #61299 (comment). I'm testing it to make sure the elastic training test passes, and then I'll update this PR to match — so this shouldn't be merged yet. The change is that we now call the State API to check the actual available resources on the nodes in each slice.

@liulehui (Contributor) left a comment

❤️

@ryanaoleary (Contributor, Author)

cc: @liulehui @MengjinYan I updated this PR based on the changes I made in #61299 to get it working. The util now calls the State API to check for actually available slices, rather than just alive ones.
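The distinction matters because a node can be alive while its TPU chips are claimed by another workload. A hypothetical sketch of consuming a per-node availability snapshot (the `{node_id: {resource: amount}}` shape and the values here are assumptions for illustration, not the State API's exact output):

```python
# Hypothetical snapshot of per-node available resources, shaped like
# {node_id: {resource_name: amount}}. Both nodes are alive, but only
# node-a has its TPU chips free.
node_avail_resources = {
    "node-a": {"TPU": 4.0, "CPU": 8.0},  # alive, TPUs free
    "node-b": {"TPU": 0.0, "CPU": 8.0},  # alive, TPUs in use
}

def has_free_tpus(avail, node_id, chips_needed):
    """True only if the node currently reports enough free TPU chips."""
    return avail.get(node_id, {}).get("TPU", 0.0) >= chips_needed

ready = [
    n for n in sorted(node_avail_resources)
    if has_free_tpus(node_avail_resources, n, chips_needed=4)
]
print(ready)  # -> ['node-a']
```

A liveness-only check would have reported both nodes, overcounting the slices available to the elastic policy.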

@liulehui (Contributor) left a comment

thank you!

if not pod_type:
    return 0

total_chips_expected = get_num_chips_from_topology(topology)
Review comment (Contributor):

nit: total_chips_expected_per_slice?

# Fetch live resource usage via the State API to ensure slices are idle.
from ray._private.state import available_resources_per_node

node_avail_resources = available_resources_per_node()
Review comment (Contributor):

nit: maybe move it to L310, closer to the per-node availability check?

@edoakes edoakes merged commit d334cd6 into ray-project:master Mar 12, 2026
6 checks passed

Labels: community-contribution (Contributed by the community), go (add ONLY when ready to merge, run all tests)
