[Core] Add fallback strategy support to placement groups#59024
ryanaoleary wants to merge 57 commits into ray-project:master
Conversation
cc: @MengjinYan
Code Review
This pull request introduces a fallback_strategy for placement groups, allowing for alternative scheduling options if the primary configuration is not feasible. The changes are well-implemented across the Python API, Cython layer, and C++ core, including updates to the GCS scheduler. The addition of comprehensive unit and integration tests ensures the new functionality is robust. The code is well-structured, and I've identified one minor opportunity for code simplification to enhance maintainability.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.

not stale
MengjinYan left a comment
Some high level comments. Will take a more detailed look after those are resolved. Thanks!
MengjinYan left a comment
Thanks for the patience! Added one high level comment.
src/ray/gcs/gcs_placement_group.h
Outdated
By the new semantics of the bundles field, I think the field should be empty in the beginning. And probably same with the strategy if we add it to the placement group scheduling options
Tried adding this and it runs into a lot of issues with the python client and the autoscaler, since they expect bundles to be populated for the pending placement group. To fix it we end up having to modify how bundles get cached and essentially return scheduling_strategy index 0 (the primary strategy) anytime bundles would get called anyway.
The change I tried to implement ended up being pretty expansive; is this something we should try to add in a follow-up PR? Or do you know if I might be missing something and there's an easier way to support the new semantics of the field?
Understood. I think it is a very involved change because the semantic changes.
The current placement group management logic in all places assumes that there is only one set of requirements for each placement group so the requirements and the state tracking are all in the bundles field.
At the same time, with the fallback semantic, while the user-facing API remains the same (bundles will still be used for specifying the highest-priority resource requirement), we changed the assumption in the GCS placement group management logic and split the original bundles field in PlacementGroupTableData into:
- scheduling_strategy, which only contains the scheduling requirements for the bundles, and
- bundles, which only tracks the bundle id and the corresponding node id for the currently chosen placement group. Only bundle_id and node_id need to be populated, and we might need an additional field to indicate the index of the chosen set of requirements in scheduling_strategy.
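The field split described above could be sketched roughly as follows. These are plain Python dataclasses standing in for the actual protos, and any name not mentioned in the discussion (e.g. chosen_strategy_index) is hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bundle:
    bundle_id: int
    node_id: Optional[str] = None  # populated once the bundle is placed


@dataclass
class SchedulingOption:
    # One set of resource requirements, e.g. [{"CPU": 2}, {"CPU": 2}].
    bundles: List[dict]
    bundle_label_selector: List[dict] = field(default_factory=list)


@dataclass
class PlacementGroupTableData:
    # All candidate requirement sets; index 0 is the primary option.
    scheduling_strategy: List[SchedulingOption]
    # Only the chosen bundles' ids/node ids; empty until a strategy is picked.
    bundles: List[Bundle] = field(default_factory=list)
    # Hypothetical extra field: index into scheduling_strategy of the chosen
    # option, -1 before any strategy has been selected.
    chosen_strategy_index: int = -1
```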
Looking into the code, I think on a high level, the semantic changes means:
- The user-facing APIs should remain the same in terms of the bundles field and just add an additional fallback_options field:
  - semantics in user-facing APIs (placement_group.py) shouldn't change
  - semantics in PlacementGroupSpecification, BundleSpecification, etc. shouldn't change
- On the GCS side, the semantics in GcsPlacementGroup should change to the new semantics. This means:
(1) The operations related to the bundles field (GetBundles, GetUnplacedBundles, etc.) should be changed to the semantics of assuming it holds the chosen resource requirements; cached_bundle_specs_ should be the chosen bundle specs.
(2) The placement group scheduling logic needs to consider all the resource requirements in scheduling_strategy when the whole placement group needs to be scheduled.
(3) The logic to sync the pending placement groups from GCS to autoscaler needs to provide the list of resource requirements
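A rough sketch of what the fallback-aware scheduling in (2) might look like. All names here are hypothetical, and try_reserve stands in for the real feasibility/commit step inside the GCS scheduler:

```python
def schedule_unplaced_bundles(scheduling_strategy, try_reserve):
    """Try each scheduling option in priority order (index 0 is the
    primary bundles) until one can be reserved; return its index,
    or -1 if no option is currently feasible."""
    for i, option in enumerate(scheduling_strategy):
        if try_reserve(option["bundles"]):
            return i  # this option becomes the active strategy
    return -1  # infeasible for now; the placement group stays pending
```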
In terms of the PRs, it looks like the current PR handles (2) and the related changes to update the PlacementGroupTableData and PlacementGroupSpec data model. And since the current PR didn't change the existing behavior of having one set of requirements in bundles, I think we can add the additional changes that cover (1) and (3) in separate PRs.
Let me know whether the above makes sense or I missed some aspects in handling the bundles semantic changes.
cc: @edoakes for visibility and if you have additional comment.
Makes sense, thanks for outlining it; that made it a lot clearer to implement. I went through and think that with af673db I've implemented everything from (2) and part of (1). I changed bundles to remain empty until an active strategy has been selected, but updated GetBundles to return the highest-priority strategy when bundles is empty so that it still works with pending PGs. I think a follow-up PR would still be needed so that this field fully follows the new semantics and returns an empty list before a strategy has been chosen (and also updates all downstream consumers to check for
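The GetBundles fallback described above (return the highest-priority option while bundles is still empty) might look roughly like this; the dict shapes are illustrative, not the real proto accessors:

```python
def get_bundles(pg_table_data):
    """Return the chosen bundles; for pending placement groups with no
    active strategy yet, fall back to scheduling_strategy index 0."""
    if pg_table_data["bundles"]:
        return pg_table_data["bundles"]
    # No strategy chosen yet: return the primary option so existing
    # consumers (Python client, autoscaler) keep working.
    return pg_table_data["scheduling_strategy"][0]["bundles"]
```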
For the autoscaler, I updated GetPendingGangResourceRequests, since it reads from the proto, to check scheduling_strategy(0) for PENDING groups; then a follow-up PR that addresses the autoscaling side can have it actually consider the full list of fallback options when scheduling.
af673db to 5650ee9
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…ng fallback Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…behavior Signed-off-by: ryanaoleary <ryanaoleary@google.com>
In dbf8ff5, I added logic to the legacy autoscaler to read index 0 of the scheduling options when bundles is empty; this was previously causing e2e autoscaler tests to fail, but only for the v1 autoscaler.
…d tests Signed-off-by: ryanaoleary <ryanaoleary@google.com>
    self._active_strategy_index_cache = active_index
    return active_index
    ...
    return -1
GCS RPC on every task submission to PG
High Severity
_get_active_strategy_index calls self.wait(timeout_seconds=0) unconditionally on every invocation, which is a GCS RPC. This method is called from both check_placement_group_index and _validate_resource_shape during task/actor submission, adding at least 2 GCS RPCs per submission. Previously, validation used only cached data after the first call. The cached _active_strategy_index_cache is checked after the wait call, so even a cache hit still pays the RPC cost.
The options are that we either:
- Keep the rpc call, since we need to be able to validate the task against the actual scheduling option that is chosen. Since the scheduling option can change if all of the PG bundles get rescheduled, we can't just cache the value once and then use it.
- Validate against all scheduling options (primary + fallback), rather than the active strategy. This is more permissive and could lead to tasks being accepted that cannot be scheduled on the active strategy.
- Change the scheduler logic again so that we do not consider fallbacks when rescheduling all bundles. We only try all of the scheduling options (primary + fallback) once when initially scheduling, and from that point on the active strategy is locked. This reduces the value of fallback strategies, because if a node is preempted we don't support the PG falling back to a strategy it could schedule on, but it simplifies the scheduler and Python logic.
Thanks for the detailed options!
I don't think we should add additional gRPC calls to GCS on every task/actor submission for the validity check. At the same time, I think reconsidering all the fallbacks during whole-placement-group scheduling is still valuable.
So I think 2 could be the best solution. At the same time, we need to make sure to output a clear log message in the infeasible case so that users are aware of the situation.
Hope that makes sense.
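Option 2 (validate against all scheduling options rather than the active one) could be sketched like this; the dict shapes and function name are illustrative, not the actual Ray validation code:

```python
def validate_resource_shape(task_resources, scheduling_strategy):
    """Accept the task if its resource demand fits some bundle in any
    scheduling option (primary or fallback). This avoids a GCS RPC to
    look up the active strategy, at the cost of being more permissive."""
    def fits(bundle, demand):
        # Every demanded resource must be available in this bundle.
        return all(bundle.get(k, 0) >= v for k, v in demand.items())

    return any(
        any(fits(bundle, task_resources) for bundle in option["bundles"])
        for option in scheduling_strategy
    )
```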


Description
This PR adds a fallback_strategy argument to placement groups. For Ray placement groups, fallback_strategy is a list of scheduling option dicts, where the arguments can be bundles and (optionally) bundle_label_selector, to try in order until one can be scheduled. This PR also updates the GCS scheduler to consider these fallbacks when calling ScheduleUnplacedBundles. If a fallback strategy is chosen, the bundles in the placement group table data will be updated to the selected value.

Edit: We've updated the proto fields: bundles now only shows the actively chosen bundles for scheduling, and scheduling_strategy contains the list of all scheduling options (primary bundles and fallback bundles).

Related issues
#51564
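For illustration, the fallback_strategy argument shape described above might look like this. The keys come from the PR description; the label key/value and resource amounts are made up, and the exact API surface may differ from this sketch:

```python
# Each dict is one scheduling option, tried in order after the primary
# `bundles` argument fails to schedule; `bundle_label_selector` is
# optional per option.
fallback_strategy = [
    {"bundles": [{"CPU": 2}, {"CPU": 2}]},
    {
        "bundles": [{"CPU": 1}, {"CPU": 1}],
        "bundle_label_selector": [
            {"market-type": "spot"},  # hypothetical node label
            {"market-type": "spot"},
        ],
    },
]
# This list would then be passed as the fallback_strategy argument when
# creating the placement group (per this PR; not yet in released Ray).
```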