[Core] Add fallback strategy support to placement groups#59024
ryanaoleary wants to merge 57 commits into ray-project:master
Conversation
cc: @MengjinYan
Code Review
This pull request introduces a fallback_strategy for placement groups, allowing for alternative scheduling options if the primary configuration is not feasible. The changes are well-implemented across the Python API, Cython layer, and C++ core, including updates to the GCS scheduler. The addition of comprehensive unit and integration tests ensures the new functionality is robust. The code is well-structured, and I've identified one minor opportunity for code simplification to enhance maintainability.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.

not stale
MengjinYan left a comment
Some high level comments. Will take a more detailed look after those are resolved. Thanks!
MengjinYan left a comment
Thanks for the patience! Added one high level comment.
src/ray/gcs/gcs_placement_group.h
Outdated
By the new semantics of the bundles field, I think the field should be empty in the beginning. And probably same with the strategy if we add it to the placement group scheduling options
Tried adding this and it runs into a lot of issues with the python client and the autoscaler, since they expect bundles to be populated for the pending placement group. To fix it we end up having to modify how bundles get cached and essentially return scheduling_strategy index 0 (the primary strategy) anytime bundles would get called anyway.
The change I tried to implement ended up being pretty expansive; is this something we should try to add in a follow-up PR? Or do you know if I might be missing something and there's an easier way to support the new semantics of the field?
Understood. I think it is a very involved change because the semantic changes.
The current placement group management logic in all places assumes that there is only one set of requirements for each placement group so the requirements and the state tracking are all in the bundles field.
At the same time, with the fallback semantic, while the user-facing API remains the same (bundles will still be used for specifying the highest-priority resource requirement), we changed the assumption in the GCS placement group management logic and split the original bundles field in PlacementGroupTableData into:
- scheduling_strategy, which only contains the scheduling requirements for the bundles, and
- bundles, which only tracks the bundle id and the corresponding node id for the currently chosen placement group. Only bundle_id and node_id need to be populated, and we might need an additional field to indicate the index of the chosen set of requirements in scheduling_strategy.
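The field split described above could be sketched roughly as follows. These are plain Python dataclasses standing in for the actual protos, and any name not mentioned in the discussion (e.g. chosen_strategy_index) is hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bundle:
    bundle_id: int
    node_id: Optional[str] = None  # populated once the bundle is placed


@dataclass
class SchedulingOption:
    # One set of resource requirements, e.g. [{"CPU": 2}, {"CPU": 2}].
    bundles: List[dict]
    bundle_label_selector: List[dict] = field(default_factory=list)


@dataclass
class PlacementGroupTableData:
    # All candidate requirement sets; index 0 is the primary option.
    scheduling_strategy: List[SchedulingOption]
    # Only the chosen bundles' ids/node ids; empty until a strategy is picked.
    bundles: List[Bundle] = field(default_factory=list)
    # Hypothetical extra field: index into scheduling_strategy of the chosen
    # option, -1 before any strategy has been selected.
    chosen_strategy_index: int = -1
```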
Looking into the code, I think on a high level, the semantic changes means:
- The user-facing APIs should remain the same in terms of the bundles field and just add an additional fallback_options field:
  - semantics in user-facing APIs (placement_group.py) shouldn't change
  - semantics in PlacementGroupSpecification, BundleSpecification, etc. shouldn't change
- On the GCS side, the semantics in GcsPlacementGroup should change to the new semantics. This means:
(1) The operations related to the bundles field (GetBundles, GetUnplacedBundles, etc.) should be changed to the semantics of assuming it holds the chosen resource requirements; cached_bundle_specs_ should be the chosen bundle specs.
(2) The placement group scheduling logic needs to consider all the resource requirements in scheduling_strategy when the whole placement group needs to be scheduled.
(3) The logic to sync the pending placement groups from GCS to autoscaler needs to provide the list of resource requirements
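A rough sketch of what the fallback-aware scheduling in (2) might look like. All names here are hypothetical, and try_reserve stands in for the real feasibility/commit step inside the GCS scheduler:

```python
def schedule_unplaced_bundles(scheduling_strategy, try_reserve):
    """Try each scheduling option in priority order (index 0 is the
    primary bundles) until one can be reserved; return its index,
    or -1 if no option is currently feasible."""
    for i, option in enumerate(scheduling_strategy):
        if try_reserve(option["bundles"]):
            return i  # this option becomes the active strategy
    return -1  # infeasible for now; the placement group stays pending
```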
In terms of the PRs, it looks like the current PR handles (2) and the related changes to update the PlacementGroupTableData and PlacementGroupSpec data model. And since the current PR didn't change the existing behavior of having one set of requirements in bundles, I think we can add the additional changes that cover (1) and (3) in separate PRs.
Let me know whether the above makes sense or I missed some aspects in handling the bundles semantic changes.
cc: @edoakes for visibility and if you have additional comment.
Makes sense, thanks for outlining it; that made it a lot clearer to implement. I went through and think that with af673db I've implemented everything from (2) and part of (1). I changed bundles to remain empty until an active strategy has been selected, but updated GetBundles to return the highest-priority strategy when bundles is empty so that it still works with pending PGs. I think a follow-up PR would still be needed so that this field fully follows the new semantics and returns an empty list before a strategy has been chosen (and also updates all downstream consumers to check for
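The GetBundles fallback described above (return the highest-priority option while bundles is still empty) might look roughly like this; the dict shapes are illustrative, not the real proto accessors:

```python
def get_bundles(pg_table_data):
    """Return the chosen bundles; for pending placement groups with no
    active strategy yet, fall back to scheduling_strategy index 0."""
    if pg_table_data["bundles"]:
        return pg_table_data["bundles"]
    # No strategy chosen yet: return the primary option so existing
    # consumers (Python client, autoscaler) keep working.
    return pg_table_data["scheduling_strategy"][0]["bundles"]
```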
For the autoscaler, I updated GetPendingGangResourceRequests, since it reads from the proto, to check scheduling_strategy(0) for PENDING groups; then a follow-up PR that addresses the autoscaling side can have it actually consider the full list of fallback options when scheduling.
af673db to 5650ee9
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…ng fallback Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…behavior Signed-off-by: ryanaoleary <ryanaoleary@google.com>
In dbf8ff5, I added logic to the legacy autoscaler to read index 0 of the scheduling options when bundles is empty; this was previously causing e2e autoscaler tests to fail, but only for the v1 autoscaler.
…d tests Signed-off-by: ryanaoleary <ryanaoleary@google.com>
    self._active_strategy_index_cache = active_index
    return active_index
    ...
    return -1
GCS RPC on every task submission to PG
High Severity
_get_active_strategy_index calls self.wait(timeout_seconds=0) unconditionally on every invocation, which is a GCS RPC. This method is called from both check_placement_group_index and _validate_resource_shape during task/actor submission, adding at least 2 GCS RPCs per submission. Previously, validation used only cached data after the first call. The cached _active_strategy_index_cache is checked after the wait call, so even a cache hit still pays the RPC cost.
The options are that we either:
- Keep the rpc call, since we need to be able to validate the task against the actual scheduling option that is chosen. Since the scheduling option can change if all of the PG bundles get rescheduled, we can't just cache the value once and then use it.
- Validate against all scheduling options (primary + fallback), rather than the active strategy. This is more permissive and could lead to tasks being accepted that cannot be scheduled on the active strategy.
- Change the scheduler logic again so that we do not consider fallbacks when rescheduling all bundles. We only try all of the scheduling options (primary + fallback) once when initially scheduling, and from that point on the active strategy is locked. This reduces the value of fallback strategies, because if a node is preempted we don't support the PG falling back to a strategy it could schedule on, but it simplifies the scheduler and Python logic.
Thanks for the detailed options!
I don't think we should add additional gRPC calls to GCS on every task/actor submission for the validity check. At the same time, I think reconsidering all the fallbacks during whole-placement-group scheduling is still valuable.
So I think 2 could be the best solution. At the same time, we need to make sure to output a clear log message in the infeasible case so that users are aware of the situation.
Hope that makes sense.
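Option 2 (validate against all scheduling options rather than the active one) could be sketched like this; the dict shapes and function name are illustrative, not the actual Ray validation code:

```python
def validate_resource_shape(task_resources, scheduling_strategy):
    """Accept the task if its resource demand fits some bundle in any
    scheduling option (primary or fallback). This avoids a GCS RPC to
    look up the active strategy, at the cost of being more permissive."""
    def fits(bundle, demand):
        # Every demanded resource must be available in this bundle.
        return all(bundle.get(k, 0) >= v for k, v in demand.items())

    return any(
        any(fits(bundle, task_resources) for bundle in option["bundles"])
        for option in scheduling_strategy
    )
```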


Description
This PR adds a fallback_strategy argument to placement groups. For Ray placement groups, fallback_strategy is a list of scheduling option dicts, where the arguments can be bundles and (optionally) bundle_label_selector, to try in order until one can be scheduled. This PR also updates the GCS scheduler to consider these fallbacks when calling ScheduleUnplacedBundles. If a fallback strategy is chosen, the bundles in the placement group table data will be updated to the selected value.

Edit: We've updated the proto fields: bundles now only shows the actively chosen bundles for scheduling, and scheduling_strategy contains the list of all scheduling options (primary bundles and fallback bundles).

Related issues
#51564
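For illustration, the fallback_strategy argument shape described above might look like this. The keys come from the PR description; the label key/value and resource amounts are made up, and the exact API surface may differ from this sketch:

```python
# Each dict is one scheduling option, tried in order after the primary
# `bundles` argument fails to schedule; `bundle_label_selector` is
# optional per option.
fallback_strategy = [
    {"bundles": [{"CPU": 2}, {"CPU": 2}]},
    {
        "bundles": [{"CPU": 1}, {"CPU": 1}],
        "bundle_label_selector": [
            {"market-type": "spot"},  # hypothetical node label
            {"market-type": "spot"},
        ],
    },
]
# This list would then be passed as the fallback_strategy argument when
# creating the placement group (per this PR; not yet in released Ray).
```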