[Core] Fix placement group leaks (#42942)#43097
Merged
aslonnie merged 1 commit intoray-project:releases/2.9.3from Feb 13, 2024
Merged
[Core] Fix placement group leaks (#42942)#43097aslonnie merged 1 commit intoray-project:releases/2.9.3from
aslonnie merged 1 commit intoray-project:releases/2.9.3from
Conversation
Root cause: When we schedule a task, we allocate resources and start a worker. If cancel bundle request is received before a worker is started, there's a leak because cancel bundle kills a worker and returns resources from "workers that are already started". It fixes the issue by retrying cancellation. This also means If a worker starts late (it has 60 seconds timeout), retry can fail as we have max number of retry. We retry for a very long time (10 * registration timeout), so it is unlikely to happen Alternatively, to improve the consistency, we can also do register removed pg and keep deleting resources (with a reconciler) until it is fully gone. register leased workers before workers are started. This can be a better solution, but the implication of this change is unknown. I chose the current solution as it is needed for a network issue as well.
zhe-thoughts
approved these changes
Feb 12, 2024
jjyao
approved these changes
Feb 12, 2024
Contributor
Author
|
the minimal test failures are happening in the master too (it is a dependency issue). Please ignore and merge the PR! |
Contributor
|
Only failures are the known issues, not related. Force merging |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root cause: When we schedule a task, we allocate resources and start a worker. If cancel bundle request is received before a worker is started, there's a leak because cancel bundle kills a worker and returns resources from "workers that are already started".
It fixes the issue by retrying cancellation. This also means
If a worker starts late (it has 60 seconds timeout), retry can fail as we have max number of retry. We retry for a very long time (10 * registration timeout), so it is unlikely to happen Alternatively, to improve the consistency, we can also do
register removed pg and keep deleting resources (with a reconciler) until it is fully gone. register leased workers before workers are started. This can be a better solution, but the implication of this change is unknown. I chose the current solution as it is needed for a network issue as well.
Why are these changes needed?
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.