[Core] Add fallback strategy scheduling logic #56369

Merged: edoakes merged 36 commits into ray-project:master from ryanaoleary:fallback-selector-api on Oct 30, 2025
Conversation

@ryanaoleary (Contributor) commented Sep 9, 2025

Why are these changes needed?

This PR updates the cluster resource scheduler logic to account for the list of `LabelSelector`s specified by `fallback_strategy`, falling back to each fallback `LabelSelector` in order until one is satisfied when selecting the best node. Fallback selectors are supported by considering them in order in the cluster resource scheduler, reusing the existing label selector logic in `IsFeasible` and `IsAvailable` and returning the first valid node from `GetBestSchedulableNode`.
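The selection loop described above can be sketched in Python. This is a simplified model only: the real logic lives in the C++ scheduler (`IsFeasible`/`IsAvailable`/`GetBestSchedulableNode`), and the node labels, selector matching, and function names here are illustrative assumptions.

```python
# Simplified sketch of in-order fallback selection over label selectors.
# Nodes and selectors are plain dicts; exact-match semantics are assumed
# for illustration (the real selectors also support operators like in()).

def selector_satisfied(node_labels, selector):
    """A node satisfies a selector if every key/value pair matches."""
    return all(node_labels.get(k) == v for k, v in selector.items())

def get_best_schedulable_node(nodes, label_selector, fallback_strategy):
    """Try the primary selector, then each fallback selector in order;
    return the first node satisfying the earliest satisfiable selector."""
    selectors = [label_selector] + [f["label_selector"] for f in fallback_strategy]
    for selector in selectors:
        for node_id, labels in nodes.items():
            if selector_satisfied(labels, selector):
                return node_id, selector
    return None, None

nodes = {
    "node-a": {"instance_type": "m5.large"},
    "node-b": {"instance_type": "t3.micro"},
}
node, used = get_best_schedulable_node(
    nodes,
    {"instance_type": "m5.16xlarge"},  # primary selector: not satisfiable here
    [
        {"label_selector": {"instance_type": "m5.large"}},  # first fallback
        {"label_selector": {}},  # final fallback: no constraints
    ],
)
# The primary selector fails, so the first fallback selects node-a.
```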

Related issue number

#51564

Checks

  • I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary ryanaoleary marked this pull request as ready for review September 9, 2025 12:04
@ryanaoleary ryanaoleary requested review from a team, edoakes and jjyao as code owners September 9, 2025 12:04
@ryanaoleary (Contributor, Author) commented:

The python changes are in this separate PR: #56374 which can be merged first. This PR will then contain only the C++ scheduling logic changes. cc: @MengjinYan

@ryanaoleary ryanaoleary changed the title [Core] Add fallback strategy API and scheduling logic [Core] Add fallback strategy scheduling logic Sep 9, 2025
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Sep 9, 2025
@MengjinYan (Contributor) left a comment:


Haven't looked at the tests yet, and there also seem to be conflicts that need to be fixed.

@jjyao (Contributor) commented Sep 22, 2025

@ryanaoleary could you rebase?


@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Sep 24, 2025
@edoakes edoakes removed the community-contribution Contributed by the community label Sep 26, 2025
edoakes added a commit that referenced this pull request Oct 7, 2025
This PR contains only the python changes from
#56369, adding
`fallback_strategy` as an option to the remote decorator of
Tasks/Actors. Fallback strategy consists of a list of dict of decorator
options. The dict of decorator options are evaluated together, and the
first satisfied strategy dict is scheduled. With this PR, the only
supported option is `label_selector`.

Example using `fallback_strategy` to schedule on different instance
types:
```
@ray.remote(
    label_selector={"instance_type": "m5.16xlarge"},
    fallback_strategy=[
        # Fall back to selector for a "m5.large" instance type if "m5.16xlarge"
        # cannot be satisfied.
        {"label_selector": {"instance_type": "m5.large"}},
        # Finally, fall back to an empty set of labels (no constraints) if
        # neither desired m5 type can be satisfied.
        {"label_selector": {}},
    ],
)
class A:
    pass
```

In the above example, the `label_selector` field is tried first. Then, the scheduler iterates through each dict in `fallback_strategy` and attempts to schedule using the label selector specified there (first `{"instance_type": "m5.large"}`, then the empty set). The first satisfied `label_selector` is used.

#51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
@ryanaoleary ryanaoleary requested a review from MengjinYan October 9, 2025 21:23
@ryanaoleary ryanaoleary force-pushed the fallback-selector-api branch from ed07570 to 7a0f94b Compare October 9, 2025 21:24
@ryanaoleary (Contributor, Author) commented:

cc: @MengjinYan rebased and fixed all the merge conflicts

joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Oct 13, 2025

justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025

Add tests and fix scheduling logic

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

remove cgroup change

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Fix merge

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@MengjinYan (Contributor) commented:

cc: @edoakes @jjyao in case you'd like to take a look as well.

@ryanaoleary (Contributor, Author) commented:

The CI test failure seems to be legit:

[2025-10-27T21:40:02Z] =================================== FAILURES ===================================
[2025-10-27T21:40:02Z] ____________ test_task_scheduled_on_node_with_label_selector[True] _____________

...

[2025-10-27T21:40:02Z] E               AssertionError: Task 'task_1' has an incorrect label selector. Expected: {'region': 'in(us-east1,me-central1)'}, Got: {'region': 'in(me-central1,us-east1)'}
[2025-10-27T21:40:02Z] E               assert {'region': 'i...l1,us-east1)'} == {'region': 'i...me-central1)'}
[2025-10-27T21:40:02Z] E                 Differing items:
[2025-10-27T21:40:02Z] E                 {'region': 'in(me-central1,us-east1)'} != {'region': 'in(us-east1,me-central1)'}
[2025-10-27T21:40:02Z] E                 Full diff:
[2025-10-27T21:40:02Z] E                 - {'region': 'in(us-east1,me-central1)'}
[2025-10-27T21:40:02Z] E                 ?                ---------
[2025-10-27T21:40:02Z] E                 + {'region': 'in(me-central1,us-east1)'}
[2025-10-27T21:40:02Z] E                 ?                           +++++++++

Oh, this is because we sort the values inside `in()` in `ToStringMap` when passing the message back from the scheduler. I can remove the sorting and the test should pass.
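Since the values inside `in(...)` come from an unordered set, a test can also compare them order-insensitively rather than relying on exact string equality. A minimal sketch, assuming the `in(v1,v2,...)` string format shown in the failure above; the helper names are hypothetical, not part of Ray's test utilities:

```python
def parse_selector_value(value):
    """Parse a label-selector value: 'in(a,b)' becomes a frozenset of its
    values, so ordering does not matter; anything else stays a plain string."""
    if value.startswith("in(") and value.endswith(")"):
        return frozenset(value[3:-1].split(","))
    return value

def selectors_equal(expected, actual):
    """Compare two selector dicts, ignoring the order of in(...) values."""
    if expected.keys() != actual.keys():
        return False
    return all(
        parse_selector_value(expected[k]) == parse_selector_value(actual[k])
        for k in expected
    )

# The two strings from the failing assertion compare equal this way:
assert selectors_equal(
    {"region": "in(us-east1,me-central1)"},
    {"region": "in(me-central1,us-east1)"},
)
```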

@MengjinYan (Contributor) left a comment:


Just one minor comment on the tests.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
…llbackOption

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary (Contributor, Author) commented Oct 28, 2025


I think we actually want the sorting since the values are stored in an unordered set, but I fixed the tests to account for this in: bdf0ef8.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@MengjinYan (Contributor) left a comment:


Thanks! Just one minor followup question on the ToProto function.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary (Contributor, Author) commented:

> That makes sense. Thanks for explaining!
> And if that's the case, wondering if we should also update the `ToProto` function in `LabelSelector` to use an output parameter?

Yeah, I think we should, since we're writing to the label selector field of a larger proto and this avoids copies. Done in 8376578 and re-tested; the CI failure seemed unrelated (in test_gcs_fault_tolerance.py), so I'll update the branch and try again.

ryanaoleary and others added 4 commits October 29, 2025 01:51
@MengjinYan (Contributor) commented:

cc: @jjyao @edoakes

@edoakes (Collaborator) left a comment:


Looks great. Only nits that can be addressed in follow up PRs.



def test_fallback_strategy(cluster_with_labeled_nodes):
# Create a RayCluster with labelled nodes.

Suggested change
# Create a RayCluster with labelled nodes.
# Create a RayCluster with labeled nodes.

).remote()

# Assert that the actor was scheduled on the expected node.
assert ray.get(label_selector_actor.get_node_id.remote(), timeout=5) == gpu_node

timeout is a little tight (CI can be slow), would loosen it

Comment on lines +142 to +147
# Assert that the actor was scheduled on the expected node.
assert ray.get(label_selector_actor.get_node_id.remote(), timeout=5) in {
node_1,
node_2,
node_3,
}

consider making the test more deterministic by scheduling 3 actors in parallel, each that occupies all CPUs on each node, and asserting that all 3 nodes are occupied by one of the actors


Great tests, thanks!

Comment on lines +297 to +298
std::vector<std::reference_wrapper<const LabelSelector>> label_selectors;
label_selectors.push_back(std::cref(lease_spec.GetLabelSelector()));

I've never seen reference_wrapper and std::cref before -- is this just a special way to have a vector of const refs and avoid copying into the vector?

@ryanaoleary (Contributor, Author) replied:

Yeah, exactly. `std::vector<const LabelSelector&>` wouldn't compile, so I used those helpers. I only wanted to store references to avoid unnecessarily copying into one list when we already had the original objects.

requires_object_store_memory);

// Use the label selector from the highest-priority fallback that was feasible.
// There must be at least one feasible node and selector.

Suggested change
// There must be at least one feasible node and selector.
// There must be at least one feasible node and selector, else we would have returned early above.

@edoakes edoakes merged commit ed49a53 into ray-project:master Oct 30, 2025
6 checks passed
@ryanaoleary (Contributor, Author) commented:

> Looks great. Only nits that can be addressed in follow-up PRs.

Nice, thank you. I'll fix the typo/comment and update the tests in a follow-up PR.

YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants