[Core] Add fallback_strategy API to Task/Actor remote options #56374
edoakes merged 14 commits into ray-project:master
Code Review
This pull request introduces a fallback_strategy API to the remote options for Tasks and Actors, enhancing scheduling flexibility. The implementation is well-structured, adding validation logic and comprehensive tests for the new feature. My review focuses on improving documentation consistency and correcting a minor inaccuracy in a test comment to ensure clarity and maintainability.
There seems to be a lint failure. The Java test failure should be unrelated. cc: @ryanaoleary
There were some pydoc lint failures, should be fixed with d377805
Force-pushed from daed44c to fe58310.
Test failures @ryanaoleary
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Update python/ray/actor.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Update python/ray/actor.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Update python/ray/tests/test_label_scheduling.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Update python/ray/_private/worker.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Add remaining changes Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Remove arg in create_actor Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Update python/ray/actor.py Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Update python/ray/actor.py Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Update python/ray/_private/worker.py Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
fix lint and pydoc Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Undo pydoclint change Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Update pydoc baseline Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Fix label util function Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Fix test and lint Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Fix remote function pydoc Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Undo pydoc lint baseline changes Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Fix baseline Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Fix pydoc lint errors Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Re-run linters and regenerate baseline Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Force-pushed from fe58310 to 4f943a3.
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@MengjinYan I think this PR is ready to merge since CI is now passing.
@MengjinYan @ryanaoleary from the REP,
@edoakes Sorry I must have misread it when I was implementing it, so rather than a where it's just a list of label selectors that we iterate through to try, it should be formatted like this: If the above looks correct I'll fix this PR to follow that format asap.
@ryanaoleary yes that's right. And the reason for this is to open the door to supporting coordinated fallback with other overrides (like resources) in the future.
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Sounds good, should be fixed with 9cca330.
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary could you update the PR description with a few examples of how
Sounds good, updated the PR description with an example and explanation.
The failing CI test appears unrelated and passes when I run it locally with this PR:
merged master to re-run CI, enabling auto-merge |
…project#56374)

This PR contains only the Python changes from ray-project#56369, adding `fallback_strategy` as an option to the remote decorator of Tasks/Actors. A fallback strategy consists of a list of dicts of decorator options. The options in each dict are evaluated together, and the first satisfied strategy dict is scheduled. With this PR, the only supported option is `label_selector`.

Example using `fallback_strategy` to schedule on different instance types:

```
import ray


@ray.remote(
    label_selector={"instance_type": "m5.16xlarge"},
    fallback_strategy=[
        # Fall back to selector for a "m5.large" instance type if "m5.16xlarge"
        # cannot be satisfied.
        {"label_selector": {"instance_type": "m5.large"}},
        # Finally, fall back to an empty set of labels (no constraints) if
        # neither desired m5 type can be satisfied.
        {"label_selector": {}},
    ],
)
class A:
    pass
```

In the above example, the `label_selector` field is tried first. Then the scheduler iterates through each dict in `fallback_strategy` and attempts to schedule using the label selector specified there (first `{"instance_type": "m5.large"}` and then the empty set). The first satisfied `label_selector` is the one used for scheduling.

ray-project#51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
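The evaluation order described in the commit message above can be sketched in plain Python. This is only an illustration and not Ray's scheduler: `matches` and `pick_selector` are hypothetical helpers, node labels are modeled as plain dicts, a selector counts as satisfied when some node carries every requested label with an equal value, and resource availability and any richer selector syntax are ignored.

```python
from typing import Dict, List, Optional

Selector = Dict[str, str]


def matches(node_labels: Dict[str, str], selector: Selector) -> bool:
    """Hypothetical equality-only match: every selector key must equal the node's label."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def pick_selector(
    label_selector: Selector,
    fallback_strategy: List[Dict[str, Selector]],
    nodes: List[Dict[str, str]],
) -> Optional[Selector]:
    """Return the first selector that some node satisfies, or None if none does."""
    # The primary label_selector is tried first, then each fallback dict in order.
    candidates = [label_selector] + [f["label_selector"] for f in fallback_strategy]
    for selector in candidates:
        if any(matches(labels, selector) for labels in nodes):
            return selector
    return None


# Assumed cluster for illustration: a single m5.large node.
nodes = [{"instance_type": "m5.large", "region": "us-west-2"}]
chosen = pick_selector(
    {"instance_type": "m5.16xlarge"},
    [
        {"label_selector": {"instance_type": "m5.large"}},
        {"label_selector": {}},
    ],
    nodes,
)
# chosen == {"instance_type": "m5.large"}: the primary selector fails, the first fallback matches.
```

In this sketch the trailing empty selector acts as a catch-all, since an empty set of constraints is matched by any node that exists.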
Why are these changes needed?
This PR contains only the Python changes from #56369, adding `fallback_strategy` as an option to the remote decorator of Tasks/Actors. A fallback strategy consists of a list of dicts of decorator options. The options in each dict are evaluated together, and the first satisfied strategy dict is scheduled. With this PR, the only supported option is `label_selector`.
Example using `fallback_strategy` to schedule on different instance types: an actor is declared with `label_selector={"instance_type": "m5.16xlarge"}` and `fallback_strategy=[{"label_selector": {"instance_type": "m5.large"}}, {"label_selector": {}}]`.
In the above example, the `label_selector` field is tried first. Then the scheduler iterates through each dict in `fallback_strategy` and attempts to schedule using the label selector specified there (first `{"instance_type": "m5.large"}` and then the empty set). The first satisfied `label_selector` is the one used for scheduling.
Related issue number
#51564
Checks
- … `git commit -s`) in this PR.
- … `scripts/format.sh` to lint the changes in this PR.
- … method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
Note
Introduces experimental `fallback_strategy` (list of label selectors) for task/actor scheduling, with validation and unit tests.
- Add `fallback_strategy` to `_common_options` in `python/ray/_common/ray_option_utils.py`; implement `validate_fallback_strategy` in `python/ray/_private/label_utils.py`.
- Update the `ray.remote` overload and docstrings in `python/ray/_private/worker.py`, `python/ray/actor.py`, and `python/ray/remote_function.py` to accept and propagate `fallback_strategy` alongside `label_selector`.
- Add `test_decorator_fallback_strategy_args` in `python/ray/tests/test_actor.py` and `test_validate_fallback_strategy` in `python/ray/tests/test_label_utils.py`.
- Update `ci/lint/pydoclint-baseline.txt` for new/changed docstrings.
Written by Cursor Bugbot for commit 9ef798c. This will update automatically on new commits.
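The note above mentions a `validate_fallback_strategy` helper without showing it. The following is a rough sketch of what shape validation for this option might look like, assuming the list-of-dicts format agreed on in the review thread. The function name `check_fallback_strategy`, its return convention (an error string or `None`), and the exact checks are assumptions for illustration and are not taken from the Ray source.

```python
from typing import Any, Optional

# Per the PR description, label_selector is the only option supported so far.
SUPPORTED_OPTIONS = {"label_selector"}


def check_fallback_strategy(fallback_strategy: Any) -> Optional[str]:
    """Hypothetical validator: return an error message, or None if the value looks valid."""
    if fallback_strategy is None:
        return None  # The option is optional.
    if not isinstance(fallback_strategy, list):
        return "fallback_strategy must be a list of dicts"
    for i, entry in enumerate(fallback_strategy):
        if not isinstance(entry, dict):
            return f"fallback_strategy[{i}] must be a dict"
        unknown = sorted(str(key) for key in set(entry) - SUPPORTED_OPTIONS)
        if unknown:
            return f"fallback_strategy[{i}] has unsupported options: {unknown}"
        selector = entry.get("label_selector")
        if not isinstance(selector, dict):
            return f"fallback_strategy[{i}]['label_selector'] must be a dict"
        for key, value in selector.items():
            if not isinstance(key, str) or not isinstance(value, str):
                return f"fallback_strategy[{i}] label keys and values must be strings"
    return None
```

Rejecting unknown keys up front keeps the dict format open for future options (such as the resource overrides mentioned in the review) while making it explicit that only `label_selector` is honored in this sketch.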