[Train] fail fast if pg can never be met #54402
matthewdeng merged 17 commits into ray-project:master
Conversation
Signed-off-by: xgui <xgui@anyscale.com>
justinvyu
left a comment
Can you separate the core changes into another PR?
    f"Insufficient cluster resources. Worker requires {total_required_amount} "
    f"{resource_name}, but cluster only has {available_amount} available."
)
raise WorkerGroupStartupFailedError(error_msg)
This would end up in the "scheduling failure retry" path. In this case retries won't help at all -- we need a way to fail the controller fast. We could accomplish this with a custom error type (e.g. InsufficientResourcesError) and have it bypass the failure retry logic.
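A minimal sketch of that idea, with hypothetical names (`InsufficientResourcesError` and the dispatch function are illustrative, not the actual Ray Train API):

```python
class InsufficientResourcesError(RuntimeError):
    """Raised when the cluster can never satisfy the requested resources.

    Hypothetical: unlike transient scheduling failures, this error should
    bypass the scheduling-failure retry loop and fail the controller fast.
    """


def on_startup_failure(error: Exception, retries_left: int) -> str:
    """Illustrative dispatch for the controller's failure-handling logic."""
    if isinstance(error, InsufficientResourcesError):
        # Retrying cannot help: the request exceeds what the cluster can
        # ever provide, so fail fast instead of burning through retries.
        return "fail_fast"
    return "retry" if retries_left > 0 else "fail"
```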
Throwing out an alternative design idea: What about moving the check to happen on the ScalingConfig on the driver process?
Something like:
def fit(self):
    self._validate_scaling_config()  # raises immediately on the driver if not possible to schedule
In worker_group.py, I am able to use `[worker_group_context.resources_per_worker] * worker_group_context.num_workers` to represent the required resources.
I am not sure if we can get this information accurately at the beginning of the fit function.
The scaling config does have num_workers and resources_per_worker so it should be possible.
@matthewdeng Did you have an initial idea of whether this logic should happen before the controller gets created?
One benefit of the current way is that the run would still be logged to the dashboard.
If we go with the current method:
We need to handle the InsufficientResourcesError properly to move the controller into the ERRORED state. Right now it would just crash the controller task and exit ungracefully.
Oh I see you did it in the other PR: https://github.com/ray-project/ray/pull/54257/files#diff-4161b23c45b953b8c90f938eb49acf72441246ee7ca70d889c1be17d967417caR43-R48
Yeah I was thinking it would be handled within the Controller. Not super opinionated on whether it is done within the Controller object or the WorkerGroup object.
Good call. This PR #54257 would catch the error and transfer from SCHEDULING -> ERROR.
Done: #54455
LGTM but will defer to @justinvyu / @matthewdeng who have merge permissions.
justinvyu
left a comment
Thanks! Can we also add an example log in the PR description?
total_required_amount = required_amount * num_workers
available_amount = max_cluster_resources.get(resource_name, 0)
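Filled out as a self-contained sketch (the error class is redefined locally here; this mirrors the diff under review rather than the exact merged code):

```python
class WorkerGroupStartupFailedError(RuntimeError):
    """Local stand-in for the Ray Train v2 exception of the same name."""


def check_resources_feasible(resources_per_worker, num_workers, max_cluster_resources):
    """Fail fast if the aggregate request can never fit in the cluster."""
    for resource_name, required_amount in resources_per_worker.items():
        total_required_amount = required_amount * num_workers
        available_amount = max_cluster_resources.get(resource_name, 0)
        if total_required_amount > available_amount:
            raise WorkerGroupStartupFailedError(
                f"Insufficient cluster resources. Worker requires "
                f"{total_required_amount} {resource_name}, but cluster only "
                f"has {available_amount} available."
            )
```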
There's an edge case where you can't create the placement group onto the nodes even though the total is satisfied:
- 4 nodes with 8 CPUs each --> 32 CPUs total
- Each worker needs 5 CPUs
- Each node can only fit 1 worker --> 4 workers max
- So 6 workers (30 CPUs) can't be scheduled, even though 32 >= 30
This is not high priority since most workloads just require 1 GPU per worker, which doesn't run into this issue.
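The edge case can be demonstrated with a simple per-node counting sketch (illustrative only; the real fix would live in the placement group scheduler):

```python
def max_schedulable_workers(node_capacities, per_worker):
    """Count workers that fit when each worker's bundle must land on one node.

    node_capacities: list of {resource name: amount} dicts, one per node.
    per_worker: {resource name: amount} required by a single worker.
    """
    total = 0
    for node in node_capacities:
        total += min(
            int(node.get(name, 0) // amount)
            for name, amount in per_worker.items()
        )
    return total


# 4 nodes x 8 CPUs = 32 CPUs total, but each node fits only one 5-CPU
# worker, so only 4 workers are schedulable even though 32 >= 30 (6 x 5).
nodes = [{"CPU": 8}] * 4
assert max_schedulable_workers(nodes, {"CPU": 5}) == 4
```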
+1 can we add a TODO for this? Though it might be more practical to wait for the validation to be completed at the placement group level instead.
I created a ticket to track this. The more fine-grained resource validation needs to be done inside the placement group logic.
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
matthewdeng
left a comment
Really cool to see this going through the ControllerError handling flow!
Why are these changes needed?
Before waiting for the placement group to be ready, we check the cluster info to see if this placement group can be met. If not, we directly throw `WorkerGroupStartupFailedError`. This will be wrapped by [ControllerError](https://github.com/ray-project/ray/blob/3e44daaaf522d476ab75e955ca7f49ae3ffe082f/python/ray/train/v2/api/exceptions.py#L27) and be raised.
Related issue number
Closes #49372
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.