
[Train] fail fast if pg can never be met #54402

Merged
matthewdeng merged 17 commits into ray-project:master from xinyuangui2:train-fast-fail-resource
Jul 31, 2025

Conversation

@xinyuangui2
Contributor

@xinyuangui2 xinyuangui2 commented Jul 8, 2025

Why are these changes needed?

Before waiting for the placement group to be ready, we check the cluster info to see whether this placement group can ever be satisfied. If it cannot, we directly throw WorkerGroupStartupFailedError, which is wrapped by ControllerError and raised.

(screenshot: error raised when the placement group can never be satisfied)
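The described pre-check can be sketched roughly as follows. This is a minimal illustration, not the actual Ray Train internals; the function name and parameters are hypothetical, modeled on the error message quoted below in the review:

```python
class WorkerGroupStartupFailedError(RuntimeError):
    """Raised when the worker group's placement group can never be satisfied."""


def check_placement_group_feasible(
    resources_per_worker: dict, num_workers: int, max_cluster_resources: dict
) -> None:
    """Fail fast if the cluster, even at its maximum size, can never provide
    the total resources the worker group's placement group requires."""
    for resource_name, required_amount in resources_per_worker.items():
        total_required_amount = required_amount * num_workers
        available_amount = max_cluster_resources.get(resource_name, 0)
        if total_required_amount > available_amount:
            raise WorkerGroupStartupFailedError(
                f"Insufficient cluster resources. Worker requires "
                f"{total_required_amount} {resource_name}, but cluster only "
                f"has {available_amount} available."
            )
```

Per the PR description, the raised error is then wrapped by ControllerError so the run ends in an errored state instead of waiting forever on scheduling.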

Related issue number

Closes #49372

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 marked this pull request as ready for review July 8, 2025 18:10
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner July 8, 2025 18:10
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner July 8, 2025 19:55
Contributor

@justinvyu justinvyu left a comment


Can you separate the core changes into another PR?

f"Insufficient cluster resources. Worker requires {total_required_amount} "
f"{resource_name}, but cluster only has {available_amount} available."
)
raise WorkerGroupStartupFailedError(error_msg)
Contributor

This would end up in the "scheduling failure retry". In this case retries won't help at all -- we need a way to fast fail the controller. We could accomplish this with a custom error type (e.g. InsufficientResourcesError) and have it bypass the failure retry logic.

Throwing out an alternative design idea: What about moving the check to happen on the ScalingConfig on the driver process?

Something like:

def fit(self):
    self._validate_scaling_config()  # raises immediately on the driver if not possible to schedule
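A self-contained sketch of this driver-side alternative, assuming a simplified stand-in for ScalingConfig; in a real Ray program the cluster snapshot would come from ray.cluster_resources(), but it is passed in explicitly here so the sketch stays runnable:

```python
from dataclasses import dataclass, field


@dataclass
class ScalingConfig:
    # Simplified stand-in for ray.train.ScalingConfig.
    num_workers: int
    resources_per_worker: dict = field(default_factory=lambda: {"CPU": 1})


def validate_scaling_config(config: ScalingConfig, cluster_resources: dict) -> None:
    """Raise immediately on the driver if the config can never be scheduled."""
    for name, amount in config.resources_per_worker.items():
        needed = amount * config.num_workers
        available = cluster_resources.get(name, 0)
        if needed > available:
            raise ValueError(
                f"ScalingConfig can never be satisfied: requires {needed} "
                f"{name}, but the cluster maximum is {available}."
            )
```

The trade-off discussed below still applies: failing on the driver is earlier and simpler, but the run would never reach the controller, so it would not be logged to the dashboard.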

Contributor Author

In worker_group.py, I am able to use [worker_group_context.resources_per_worker] * worker_group_context.num_workers to represent the required resource.

I am not sure if we can get this information accurately at the beginning of the fit function.

Contributor

The scaling config does have num_workers and resources_per_worker so it should be possible.

@matthewdeng Did you have an initial idea of whether this logic should happen before the controller gets created?

One benefit of the current way is that the run would still be logged to the dashboard.

Contributor

@justinvyu justinvyu Jul 10, 2025

If we go with the current method:

We need to handle the InsufficientResourcesError properly to move the controller into the ERRORED state. Right now it would just crash the controller task and exit ungracefully.


Contributor

Yeah I was thinking it would be handled within the Controller. Not super opinionated on whether it is done within the Controller object or the WorkerGroup object.

Contributor Author

Good call. This PR #54257 would catch the error and transition from SCHEDULING -> ERRORED.

xinyuangui2 and others added 3 commits July 8, 2025 23:31
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 9, 2025 21:59
@xinyuangui2
Contributor Author

Can you separate the core changes into another PR?

Done: #54455

@cszhu cszhu added community-contribution Contributed by the community train Ray Train Related Issue labels Jul 10, 2025
@xinyuangui2 xinyuangui2 requested a review from matthewdeng July 11, 2025 00:06
Contributor

@TimothySeah TimothySeah left a comment


lgtm with some nits

xinyuangui2 and others added 2 commits July 14, 2025 17:38
@xinyuangui2 xinyuangui2 requested a review from TimothySeah July 15, 2025 01:12
Contributor

@TimothySeah TimothySeah left a comment


LGTM but will defer to @justinvyu / @matthewdeng who have merge permissions.

@xinyuangui2 xinyuangui2 requested a review from justinvyu July 15, 2025 01:42
@xinyuangui2 xinyuangui2 requested a review from matthewdeng July 29, 2025 19:02
@xinyuangui2 xinyuangui2 changed the title [train] fail fast if pg cannot be met [Train] fail fast if pg can never be met Jul 29, 2025
Contributor

@justinvyu justinvyu left a comment


Thanks! Can we also add an example log in the PR description?

Comment on lines +227 to +228
total_required_amount = required_amount * num_workers
available_amount = max_cluster_resources.get(resource_name, 0)
Contributor

there's an edge case where you can't create the placement group onto the nodes even though the total is satisfied:

  • 4 nodes with 8 CPUs each --> 32 CPUs total
  • Each worker needs 5 CPUs
  • Each node can only fit 1 worker --> 4 workers max
  • So 6 workers (30 CPUs) can't be scheduled, even though 32 >= 30

This is not high priority since most workloads just require 1 GPU per worker, which doesn't run into this issue.
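The edge case above is a bin-packing problem: totals can suffice while no per-node assignment exists. A hedged sketch of the finer-grained per-node check (the node_resources input and the one-bundle-per-node counting are illustrative assumptions, not Ray's actual placement group scheduler):

```python
def max_schedulable_workers(node_resources: list, resources_per_worker: dict) -> int:
    """Count how many workers fit when each worker's resource bundle must
    land whole on a single node."""
    total = 0
    for node in node_resources:
        # A node fits as many workers as its scarcest required resource allows.
        total += min(
            int(node.get(name, 0) // amount)
            for name, amount in resources_per_worker.items()
        )
    return total


# 4 nodes x 8 CPUs = 32 CPUs total, but each 5-CPU worker needs one node:
nodes = [{"CPU": 8}] * 4
print(max_schedulable_workers(nodes, {"CPU": 5}))  # 4, not 6, despite 32 >= 30
```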

Contributor

+1 can we add a TODO for this? Though it might be more practical to wait for the validation to be completed at the placement group level instead.

Contributor Author

I created a ticket to track this. The more fine-grained resource validation needs to be done inside the placement group logic.

xinyuangui2 and others added 2 commits July 29, 2025 15:14
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 30, 2025 16:30
@matthewdeng matthewdeng removed the community-contribution Contributed by the community label Jul 31, 2025
Contributor

@matthewdeng matthewdeng left a comment


Really cool to see this going through the ControllerError handling flow!

@matthewdeng matthewdeng enabled auto-merge (squash) July 31, 2025 05:17
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jul 31, 2025
@matthewdeng matthewdeng merged commit 2af7913 into ray-project:master Jul 31, 2025
7 checks passed
avibasnet31 pushed a commit to avibasnet31/ray that referenced this pull request Aug 2, 2025
Before waiting for the placement group to be ready, we check the cluster
info to see if this placement group can be met. If not, we directly
throw `WorkerGroupStartupFailedError`. This will be wrapped by
[ControllerError](https://github.com/ray-project/ray/blob/3e44daaaf522d476ab75e955ca7f49ae3ffe082f/python/ray/train/v2/api/exceptions.py#L27)
and be raised.

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: avigyabb <avigyabb@stanford.edu>
avibasnet31 pushed a commit to avibasnet31/ray that referenced this pull request Aug 2, 2025
elliot-barn pushed a commit that referenced this pull request Aug 4, 2025
kamil-kaczmarek pushed a commit that referenced this pull request Aug 4, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Aug 5, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Aug 5, 2025
mjacar pushed a commit to mjacar/ray that referenced this pull request Aug 5, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RayTrain] ScalingConfig resources_per_worker input validation/error handling

5 participants