[train] Use FailurePolicy to handle resize failure #54257
xinyuangui2 wants to merge 24 commits into ray-project:master
Conversation
Force-pushed from 04dd54e to bfb7e42
justinvyu left a comment:
We need to change TrainingFailedError to also accept the controller-level worker group scheduling error, and I think it's better to actually decouple the two status classes. cc @matthewdeng
Also, note that this PR adds a new controller state transition, SCHEDULING -> ERRORED (previously the controller only went from RUNNING -> ERRORED).
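The transition change mentioned above can be sketched as a small state table. This is a hedged illustration: the state names and the `ALLOWED_TRANSITIONS` mapping are hypothetical, not Ray Train's actual identifiers.

```python
from enum import Enum


class ControllerState(Enum):
    # Hypothetical state names mirroring the transition discussed above;
    # the real controller may use different identifiers.
    SCHEDULING = "SCHEDULING"
    RUNNING = "RUNNING"
    ERRORED = "ERRORED"


# Before this PR the controller only went RUNNING -> ERRORED; the PR adds
# SCHEDULING -> ERRORED so worker group scheduling failures can surface.
ALLOWED_TRANSITIONS = {
    ControllerState.SCHEDULING: {ControllerState.RUNNING, ControllerState.ERRORED},
    ControllerState.RUNNING: {ControllerState.ERRORED},
}
```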
python/ray/train/v2/_internal/execution/failure_handling/default.py
Good call. For now I am hacking by setting it to {0: controller_error}. Added a TODO on it.
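The workaround described in the reply above can be sketched as follows; the variable names are illustrative, not the PR's actual code.

```python
# The temporary hack: a controller-level error is wrapped in the
# {rank: error} dict shape that worker failures use, under a
# placeholder rank of 0.
controller_error = RuntimeError("worker group startup failed")
# TODO (from the PR): decouple controller errors from the rank mapping.
errors_by_rank = {0: controller_error}
```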
Force-pushed from 08048e6 to 44c25d6
matthewdeng left a comment:
Looking pretty good!
python/ray/train/v2/_internal/execution/controller/controller.py
In the current implementation there is a 1:1 mapping between TrainingFailedError:RESTART and ControllerError:RESCHEDULE, but I don't think we want to enforce this - generally the FailurePolicy should own the entirety of the decision making/validation logic. I can imagine that in the future there are cases where ControllerError can return RESTART as well.
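One way the decoupled decision-making described above could look, as a sketch: the policy owns the whole decision and may return RESTART for either error kind. The class, method, and enum names here are assumptions, not Ray Train's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureDecision(Enum):
    RESTART = "RESTART"
    RAISE = "RAISE"


@dataclass
class IllustrativeFailurePolicy:
    """Sketch of a policy that decides for worker AND controller errors."""

    max_failures: int
    _failure_count: int = 0

    def make_decision(
        self,
        training_failed_error: Optional[Exception] = None,
        controller_failed_error: Optional[Exception] = None,
    ) -> FailureDecision:
        # Exactly one of the two error kinds should be set per call.
        assert (training_failed_error is None) != (controller_failed_error is None)
        self._failure_count += 1
        if self._failure_count > self.max_failures:
            return FailureDecision.RAISE
        # Note: nothing forces a 1:1 mapping here -- a controller error can
        # also come out as RESTART, which is the point of the comment above.
        return FailureDecision.RESTART
```

A usage sketch: `IllustrativeFailurePolicy(max_failures=1)` returns RESTART for the first failure of either kind and RAISE afterwards, with no per-error-type special casing.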
Force-pushed from 60c463c to c8b7472
Signed-off-by: xgui <xgui@anyscale.com>
Force-pushed from c8b7472 to 73e1f5f
…#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:

* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if too many worker group startup attempts fail (see #54257), the controller can error out.

Signed-off-by: xgui <xgui@anyscale.com>
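A minimal sketch of the error hierarchy described in the commit message above; the constructor signatures and attribute names are assumptions, not Ray Train's exact API.

```python
class RayTrainError(Exception):
    """Base class for Ray Train errors (sketch)."""


class TrainingFailedError(RayTrainError):
    """Train worker failures: holds a dict mapping worker rank -> error."""

    def __init__(self, worker_failures: dict):
        self.worker_failures = worker_failures
        super().__init__(f"{len(worker_failures)} worker(s) failed")


class ControllerError(RayTrainError):
    """Train driver (TrainController) failures: wraps the controller error."""

    def __init__(self, controller_failure: Exception):
        self.controller_failure = controller_failure
        super().__init__(f"Controller failed: {controller_failure!r}")
```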
Moved to #54833
Why are these changes needed?
This PR enables the `FailurePolicy` to handle worker group resize/startup failures, instead of retrying indefinitely. Previously, startup errors (`WorkerGroupStartupTimeoutError`, `WorkerGroupStartupFailedError`) would always retry without limit, ignoring the configured failure policy.

Changes:

* In `_execute_failure_decision`, failure policies now receive either `training_failed_error` or `controller_failed_error`. These two parameters help them decide the next state.
* …`training_failed_error` and `controller_failed_error`.

Related issue number
Checks
* I've signed off every commit (`git commit -s`) in this PR.
* I've run `scripts/format.sh` to lint the changes in this PR.
* …method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.