[Core] Add TPU util to determine number of ready multi-host slices#61300
edoakes merged 10 commits into ray-project:master
Conversation
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
cc: @edoakes @MengjinYan |
Code Review
This pull request introduces a new utility, get_num_ready_tpu_slices, to determine the number of alive and complete TPU slices within a Ray cluster. This is intended for use with the Ray Train elastic policy. The changes include the utility function itself and a comprehensive suite of tests covering various cluster states. The implementation is generally solid, but I've identified a potential race condition in an optimization that could lead to incorrect results. My feedback focuses on improving the correctness of the utility.
/lgtm
@MengjinYan I actually updated the util in the other PR based on the discussion here: #61299 (comment). I'm testing it to make sure the elastic training test passes, and then I'll update it to match here, so this shouldn't be merged yet. The change is that we now call the State API to check the actual available resources on the nodes in the slice.
cc: @liulehui @MengjinYan I updated this PR based on the changes I made in #61299 to get it working. The util now calls the State API to check for actually available slices, rather than just alive ones.
if not pod_type:
    return 0

total_chips_expected = get_num_chips_from_topology(topology)
nit: total_chips_expected_per_slice?
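For context on the snippet under discussion: a TPU topology string such as "2x2x4" multiplies out to an expected chip count per slice. The helper below is a hypothetical stand-in for the `get_num_chips_from_topology` call shown in the diff, written as a minimal self-contained sketch; its name and the exact topology format handled are assumptions, not the PR's actual implementation.

```python
def parse_topology(topology: str) -> int:
    """Multiply out a TPU topology string, e.g. '2x2x4' -> 16 chips.

    Hypothetical stand-in for get_num_chips_from_topology from the diff.
    """
    total = 1
    for dim in topology.split("x"):
        total *= int(dim)
    return total
```

A slice would then count as "complete" only when the chips observed on its nodes sum to this expected total.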
# Fetch live resource usage via the State API to ensure slices are idle.
from ray._private.state import available_resources_per_node

node_avail_resources = available_resources_per_node()
nit: maybe move this to L310, closer to the per-node availability check?
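The per-node availability check discussed above can be sketched as follows. This is a self-contained illustration, not the PR's code: the slice grouping, the `"TPU"` resource key, and the function name are assumptions; the real util reads the mapping returned by `available_resources_per_node` as imported in the diff.

```python
def slice_is_idle(slice_node_ids, node_avail_resources, chips_per_node):
    """Return True only if every node in the slice still advertises
    its full TPU chip capacity as available (i.e. no chips are claimed)."""
    for node_id in slice_node_ids:
        avail = node_avail_resources.get(node_id, {})
        # A missing node or a partially claimed node means the slice is busy.
        if avail.get("TPU", 0) < chips_per_node:
            return False
    return True
```

This is why checking live availability (rather than just node aliveness) matters: a slice whose nodes are alive but already running workers should not be counted as ready.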
Description
This PR adds a util to determine the number of alive, complete TPU slices in a RayCluster, and adds better test coverage.
This utility is used in the Ray Train elastic policy to cap the number of workers that can be scaled by the AutoscalingCoordinator.
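Combining the two checks the PR describes, a hedged, self-contained sketch of the counting logic: a slice counts as ready only when it is complete (the chips observed across its nodes sum to the topology's expected total) and idle (every chip is still unclaimed). All names and data shapes here are illustrative assumptions, not the actual `get_num_ready_tpu_slices` implementation.

```python
def count_ready_slices(slices, expected_chips_per_slice):
    """Count slices that are both complete and idle.

    slices: mapping of slice name -> list of (total_chips, available_chips)
    tuples, one tuple per node in the slice (an assumed data shape).
    """
    ready = 0
    for nodes in slices.values():
        total = sum(t for t, _ in nodes)        # completeness check
        idle = all(t == a for t, a in nodes)    # idleness check
        if total == expected_chips_per_slice and idle:
            ready += 1
    return ready
```

The elastic policy could then cap its worker count at `ready * workers_per_slice`, which matches the stated purpose of bounding scaling decisions by the AutoscalingCoordinator.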
Related issues
#55162
Related PR: #61299