[Core] Cancel lease requests before returning a PG bundle#45919
Merged
jjyao merged 9 commits into ray-project:master on Jun 16, 2024
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
jjyao
commented
Jun 13, 2024
Comment on lines +686 to +695

    // Cancel lease requests related to unused bundles
    cluster_task_manager_->CancelTasks(
        [&](const RayTask &task) {
          const auto bundle_id = task.GetTaskSpecification().PlacementGroupBundleId();
          return !bundle_id.first.IsNil() && 0 == in_use_bundles.count(bundle_id);
        },
        rpc::RequestWorkerLeaseReply::SCHEDULING_CANCELLED_INTENDED,
        "The task is cancelled because it uses placement group bundles that are not "
        "registered to GCS. It can happen upon GCS restart.");
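
The predicate-based cancellation in the hunk above can be illustrated with a small standalone sketch. Everything here is a hypothetical stand-in rather than Ray's actual API: `BundleId`, `LeaseRequest`, `LeaseQueue`, `CancelIf`, and `CancelRequestsForUnusedBundles` are invented for illustration, and an empty placement group id plays the role of `IsNil()`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a bundle id: (placement group id, bundle index).
using BundleId = std::pair<std::string, int>;

// Hypothetical pending lease request. An empty placement group id means the
// request does not target any placement group bundle.
struct LeaseRequest {
  std::string name;
  BundleId bundle_id;
};

// Hypothetical queue of pending lease requests.
class LeaseQueue {
 public:
  void Add(LeaseRequest request) { pending_.push_back(std::move(request)); }

  // Removes every pending request that matches `predicate` and returns the
  // number of requests that were cancelled.
  std::size_t CancelIf(const std::function<bool(const LeaseRequest &)> &predicate) {
    std::size_t cancelled = 0;
    for (auto it = pending_.begin(); it != pending_.end();) {
      if (predicate(*it)) {
        it = pending_.erase(it);
        ++cancelled;
      } else {
        ++it;
      }
    }
    return cancelled;
  }

  std::size_t Size() const { return pending_.size(); }

 private:
  std::vector<LeaseRequest> pending_;
};

// Same shape as the predicate in the diff: cancel a request only if it targets
// a bundle and that bundle is not in the in-use set.
std::size_t CancelRequestsForUnusedBundles(LeaseQueue &queue,
                                           const std::set<BundleId> &in_use_bundles) {
  return queue.CancelIf([&](const LeaseRequest &request) {
    return !request.bundle_id.first.empty() &&
           in_use_bundles.count(request.bundle_id) == 0;
  });
}
```

In this sketch, requests that do not belong to any placement group are left alone, which is what the `!bundle_id.first.IsNil()` check in the diff appears to guard against.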
Comment on lines +1902 to +1910

    // Cancel lease requests related to the placement group to be removed.
    cluster_task_manager_->CancelTasks(
        [&](const RayTask &task) {
          const auto bundle_id = task.GetTaskSpecification().PlacementGroupBundleId();
          return bundle_id.first == bundle_spec.PlacementGroupId();
        },
        rpc::RequestWorkerLeaseReply::SCHEDULING_CANCELLED_PLACEMENT_GROUP_REMOVED,
        "");
Contributor
Side comment: we found multiple similar cases where, when we want to kill all workers under a predicate (e.g. job died, root detached actor died, PG died, ...), we have to do it in multiple places.
I wonder if we can have a unified mechanism to do the killing for all...
jjyao
commented
Jun 13, 2024
    thread.join();

    RayConfig::instance().initialize(promise.get_future().get());
    ray::asio::testing::init();
This makes sure we can set the testing_asio_delay_us env var through _system_configs.
rynewang
reviewed
Jun 13, 2024
rynewang
approved these changes
Jun 14, 2024
Why are these changes needed?
To successfully return a PG bundle (CancelResourceReserve and ReleaseUnusedBundles), the bundle resource needs to be completely free (i.e. total == available). To ensure that, the raylet first destroys the leased workers that are currently using the PG bundle resources so that these resources can be freed. However, this alone cannot guarantee that all bundle resources are freed: a lease request that is still popping a worker has also already acquired the bundle resources, so these lease requests need to be cancelled as well.

After this PR, we don't need the retry in #42942.
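
The accounting argument above can be sketched with a toy model. This is only an illustration under assumed names: `BundleResource`, `ToyRaylet`, `MakeExample`, and the integer resource units are all hypothetical and do not correspond to Ray's actual classes.

```cpp
#include <cassert>

// Hypothetical accounting for one bundle, in whole resource units.
struct BundleResource {
  int total = 0;
  int available = 0;

  // The bundle can be returned only when nothing holds its resources.
  bool CanReturn() const { return total == available; }
};

// Hypothetical raylet model with two kinds of holders of bundle resources.
struct ToyRaylet {
  BundleResource bundle;
  int held_by_leased_workers = 0;
  // Resources already acquired by lease requests that are still waiting for
  // a worker to be popped.
  int held_by_pending_lease_requests = 0;

  // Frees the resources held by leased workers.
  void DestroyLeasedWorkers() {
    bundle.available += held_by_leased_workers;
    held_by_leased_workers = 0;
  }

  // Frees the resources held by in-flight lease requests.
  void CancelPendingLeaseRequests() {
    bundle.available += held_by_pending_lease_requests;
    held_by_pending_lease_requests = 0;
  }
};

// Builds an example: 4 units total, 2 held by leased workers, 1 held by a
// pending lease request, 1 unit free.
ToyRaylet MakeExample() {
  ToyRaylet raylet;
  raylet.bundle.total = 4;
  raylet.bundle.available = 1;
  raylet.held_by_leased_workers = 2;
  raylet.held_by_pending_lease_requests = 1;
  return raylet;
}
```

In this model, calling only `DestroyLeasedWorkers()` leaves `CanReturn()` false, because the pending lease request still holds one unit. Calling `CancelPendingLeaseRequests()` as well brings `available` back up to `total`.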
Related issue number
Closes #45642
Checks
… git commit -s) in this PR.
… scripts/format.sh to lint the changes in this PR.
… method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.