[Core] Add TPU util to determine number of ready multi-host slices#61300
edoakes merged 10 commits into ray-project:master
Conversation
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
cc: @edoakes @MengjinYan |
Code Review
This pull request introduces a new utility, get_num_ready_tpu_slices, to determine the number of alive and complete TPU slices within a Ray cluster. This is intended for use with the Ray Train elastic policy. The changes include the utility function itself and a comprehensive suite of tests covering various cluster states. The implementation is generally solid, but I've identified a potential race condition in an optimization that could lead to incorrect results. My feedback focuses on improving the correctness of the utility.
/lgtm
@MengjinYan I actually updated the util in the other PR based on the discussion here: #61299 (comment). I'm testing it to make sure the elastic training test passes, and then I'll update it to match here, so this shouldn't be merged yet. The change is that we now call the State API to check the actual available resources on the nodes in the slice.
cc: @liulehui @MengjinYan I updated this PR based on the changes I made in #61299 to get it working. The util now calls the State API to check for actually available slices, rather than just alive ones.
if not pod_type:
    return 0

total_chips_expected = get_num_chips_from_topology(topology)
nit: total_chips_expected_per_slice?
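For context on the snippet under discussion: a TPU topology string such as "2x2x4" multiplies out to an expected chip count per slice. The helper below is a hypothetical stand-in for the `get_num_chips_from_topology` call shown in the diff, written as a minimal self-contained sketch; its name and the exact topology format handled are assumptions, not the PR's actual implementation.

```python
def parse_topology(topology: str) -> int:
    """Multiply out a TPU topology string, e.g. '2x2x4' -> 16 chips.

    Hypothetical stand-in for get_num_chips_from_topology from the diff.
    """
    total = 1
    for dim in topology.split("x"):
        total *= int(dim)
    return total
```

A slice would then count as "complete" only when the chips observed on its nodes sum to this expected total.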
# Fetch live resource usage via the State API to ensure slices are idle.
from ray._private.state import available_resources_per_node

node_avail_resources = available_resources_per_node()
nit: maybe move this to L310, closer to the per-node availability check?
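The per-node availability check discussed above can be sketched as follows. This is a self-contained illustration, not the PR's code: the slice grouping, the `"TPU"` resource key, and the function name are assumptions; the real util reads the mapping returned by `available_resources_per_node` as imported in the diff.

```python
def slice_is_idle(slice_node_ids, node_avail_resources, chips_per_node):
    """Return True only if every node in the slice still advertises
    its full TPU chip capacity as available (i.e. no chips are claimed)."""
    for node_id in slice_node_ids:
        avail = node_avail_resources.get(node_id, {})
        # A missing node or a partially claimed node means the slice is busy.
        if avail.get("TPU", 0) < chips_per_node:
            return False
    return True
```

This is why checking live availability (rather than just node aliveness) matters: a slice whose nodes are alive but already running workers should not be counted as ready.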
Description
This PR adds a util to determine the number of alive, complete TPU slices in a RayCluster, and adds better test coverage.
This utility is used in the Ray Train elastic policy to cap the number of workers that can be scaled by the AutoscalingCoordinator.
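Combining the two checks the PR describes, a hedged, self-contained sketch of the counting logic: a slice counts as ready only when it is complete (the chips observed across its nodes sum to the topology's expected total) and idle (every chip is still unclaimed). All names and data shapes here are illustrative assumptions, not the actual `get_num_ready_tpu_slices` implementation.

```python
def count_ready_slices(slices, expected_chips_per_slice):
    """Count slices that are both complete and idle.

    slices: mapping of slice name -> list of (total_chips, available_chips)
    tuples, one tuple per node in the slice (an assumed data shape).
    """
    ready = 0
    for nodes in slices.values():
        total = sum(t for t, _ in nodes)        # completeness check
        idle = all(t == a for t, a in nodes)    # idleness check
        if total == expected_chips_per_slice and idle:
            ready += 1
    return ready
```

The elastic policy could then cap its worker count at `ready * workers_per_slice`, which matches the stated purpose of bounding scaling decisions by the AutoscalingCoordinator.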
Related issues
#55162
Related PR: #61299