
[Train] fail fast if pg can never be met #54402

Merged
matthewdeng merged 17 commits into ray-project:master from xinyuangui2:train-fast-fail-resource
Jul 31, 2025

Conversation

@xinyuangui2
Contributor

@xinyuangui2 xinyuangui2 commented Jul 8, 2025

Why are these changes needed?

Before waiting for the placement group to be ready, we check the cluster info to see whether this placement group can ever be satisfied. If it cannot, we directly throw WorkerGroupStartupFailedError, which is wrapped by ControllerError and raised.

(screenshot: error raised when the placement group can never be satisfied)
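The described pre-check can be sketched roughly as follows. This is a minimal illustration, not the actual Ray Train internals; the function name and parameters are hypothetical, modeled on the error message quoted below in the review:

```python
class WorkerGroupStartupFailedError(RuntimeError):
    """Raised when the worker group's placement group can never be satisfied."""


def check_placement_group_feasible(
    resources_per_worker: dict, num_workers: int, max_cluster_resources: dict
) -> None:
    """Fail fast if the cluster, even at its maximum size, can never provide
    the total resources the worker group's placement group requires."""
    for resource_name, required_amount in resources_per_worker.items():
        total_required_amount = required_amount * num_workers
        available_amount = max_cluster_resources.get(resource_name, 0)
        if total_required_amount > available_amount:
            raise WorkerGroupStartupFailedError(
                f"Insufficient cluster resources. Worker requires "
                f"{total_required_amount} {resource_name}, but cluster only "
                f"has {available_amount} available."
            )
```

Per the PR description, the raised error is then wrapped by ControllerError so the run ends in an errored state instead of waiting forever on scheduling.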

Related issue number

Closes #49372

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 marked this pull request as ready for review July 8, 2025 18:10
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner July 8, 2025 18:10
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner July 8, 2025 19:55
Contributor

@justinvyu justinvyu left a comment


Can you separate the core changes into another PR?

f"Insufficient cluster resources. Worker requires {total_required_amount} "
f"{resource_name}, but cluster only has {available_amount} available."
)
raise WorkerGroupStartupFailedError(error_msg)
Contributor

This would end up in the "scheduling failure retry". In this case retries won't help at all -- we need a way to fast fail the controller. We could accomplish this with a custom error type (e.g. InsufficientResourcesError) and have it bypass the failure retry logic.

Throwing out an alternative design idea: What about moving the check to happen on the ScalingConfig on the driver process?

Something like:

def fit(self):
    self._validate_scaling_config()  # raises immediately on the driver if not possible to schedule
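A self-contained sketch of this driver-side alternative, assuming a simplified stand-in for ScalingConfig; in a real Ray program the cluster snapshot would come from ray.cluster_resources(), but it is passed in explicitly here so the sketch stays runnable:

```python
from dataclasses import dataclass, field


@dataclass
class ScalingConfig:
    # Simplified stand-in for ray.train.ScalingConfig.
    num_workers: int
    resources_per_worker: dict = field(default_factory=lambda: {"CPU": 1})


def validate_scaling_config(config: ScalingConfig, cluster_resources: dict) -> None:
    """Raise immediately on the driver if the config can never be scheduled."""
    for name, amount in config.resources_per_worker.items():
        needed = amount * config.num_workers
        available = cluster_resources.get(name, 0)
        if needed > available:
            raise ValueError(
                f"ScalingConfig can never be satisfied: requires {needed} "
                f"{name}, but the cluster maximum is {available}."
            )
```

The trade-off discussed below still applies: failing on the driver is earlier and simpler, but the run would never reach the controller, so it would not be logged to the dashboard.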

Contributor Author

In worker_group.py, I am able to use [worker_group_context.resources_per_worker] * worker_group_context.num_workers to represent the required resource.

I am not sure if we can get this information accurately at the beginning of the fit function.

Contributor

The scaling config does have num_workers and resources_per_worker so it should be possible.

@matthewdeng Did you have an initial idea of whether this logic should happen before the controller gets created?

One benefit of the current way is that the run would still be logged to the dashboard.

Contributor

@justinvyu justinvyu Jul 10, 2025

If we go with the current method:

We need to handle the InsufficientResourcesError properly to move the controller into the ERRORED state. Right now it would just crash the controller task and exit ungracefully.


Contributor

Yeah I was thinking it would be handled within the Controller. Not super opinionated on whether it is done within the Controller object or the WorkerGroup object.

Contributor Author

Good call. This PR #54257 would catch the error and transition from SCHEDULING -> ERRORED.

xinyuangui2 and others added 3 commits July 8, 2025 23:31
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 9, 2025 21:59
@xinyuangui2
Contributor Author

Can you separate the core changes into another PR?

Done: #54455

@cszhu cszhu added community-contribution Contributed by the community train Ray Train Related Issue labels Jul 10, 2025
@xinyuangui2 xinyuangui2 requested a review from matthewdeng July 11, 2025 00:06
Contributor

@TimothySeah TimothySeah left a comment


lgtm with some nits

xinyuangui2 and others added 2 commits July 14, 2025 17:38
@xinyuangui2 xinyuangui2 requested a review from TimothySeah July 15, 2025 01:12
Contributor

@TimothySeah TimothySeah left a comment


LGTM but will defer to @justinvyu / @matthewdeng who have merge permissions.

@xinyuangui2 xinyuangui2 requested a review from justinvyu July 15, 2025 01:42
@xinyuangui2 xinyuangui2 requested a review from matthewdeng July 29, 2025 19:02
@xinyuangui2 xinyuangui2 changed the title [train] fail fast if pg cannot be met [Train] fail fast if pg can never be met Jul 29, 2025
Contributor

@justinvyu justinvyu left a comment


Thanks! Can we also add an example log in the PR description?

Comment on lines +227 to +228
total_required_amount = required_amount * num_workers
available_amount = max_cluster_resources.get(resource_name, 0)
Contributor

there's an edge case where you can't create the placement group onto the nodes even though the total is satisfied:

  • 4 nodes with 8 CPUs each --> 32 CPUs total
  • Each worker needs 5 CPUs
  • Each node can only fit 1 worker --> 4 workers max
  • So 6 workers (30 CPUs) can't be scheduled, even though 32 >= 30

This is not high priority since most workloads just require 1 GPU per worker, which doesn't run into this issue.
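The edge case above is a bin-packing problem: totals can suffice while no per-node assignment exists. A hedged sketch of the finer-grained per-node check (the node_resources input and the one-bundle-per-node counting are illustrative assumptions, not Ray's actual placement group scheduler):

```python
def max_schedulable_workers(node_resources: list, resources_per_worker: dict) -> int:
    """Count how many workers fit when each worker's resource bundle must
    land whole on a single node."""
    total = 0
    for node in node_resources:
        # A node fits as many workers as its scarcest required resource allows.
        total += min(
            int(node.get(name, 0) // amount)
            for name, amount in resources_per_worker.items()
        )
    return total


# 4 nodes x 8 CPUs = 32 CPUs total, but each 5-CPU worker needs one node:
nodes = [{"CPU": 8}] * 4
print(max_schedulable_workers(nodes, {"CPU": 5}))  # 4, not 6, despite 32 >= 30
```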

Contributor

+1 can we add a TODO for this? Though it might be more practical to wait for the validation to be completed at the placement group level instead.

Contributor Author

I created a ticket to track this. The more fine-grained resource validation needs to be done inside the placement group logic.

xinyuangui2 and others added 2 commits July 29, 2025 15:14
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 30, 2025 16:30
@matthewdeng matthewdeng removed the community-contribution Contributed by the community label Jul 31, 2025
Contributor

@matthewdeng matthewdeng left a comment


Really cool to see this going through the ControllerError handling flow!

@matthewdeng matthewdeng enabled auto-merge (squash) July 31, 2025 05:17
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jul 31, 2025
@matthewdeng matthewdeng merged commit 2af7913 into ray-project:master Jul 31, 2025
7 checks passed
avibasnet31 pushed a commit to avibasnet31/ray that referenced this pull request Aug 2, 2025
Before waiting for the placement group to be ready, we check the cluster
info to see if this placement group can be met. If not, we directly
throw `WorkerGroupStartupFailedError`. This will be wrapped by
[ControllerError](https://github.com/ray-project/ray/blob/3e44daaaf522d476ab75e955ca7f49ae3ffe082f/python/ray/train/v2/api/exceptions.py#L27)
and be raised.

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: avigyabb <avigyabb@stanford.edu>
avibasnet31 pushed a commit to avibasnet31/ray that referenced this pull request Aug 2, 2025
elliot-barn pushed a commit that referenced this pull request Aug 4, 2025
kamil-kaczmarek pushed a commit that referenced this pull request Aug 4, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Aug 5, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Aug 5, 2025
mjacar pushed a commit to mjacar/ray that referenced this pull request Aug 5, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RayTrain] ScalingConfig resources_per_worker input validation/error handling

5 participants