[Core] Fix placement group leaks (#42942) #43097

Merged
aslonnie merged 1 commit into ray-project:releases/2.9.3 from rkooo567:releases/2.9.3
Feb 13, 2024

@rkooo567
Contributor

Root cause: When we schedule a task, we allocate resources and start a worker. If a cancel-bundle request is received before the worker has started, resources leak, because cancel bundle kills workers and returns resources only for "workers that are already started".

This PR fixes the issue by retrying cancellation. This also means:

  • If a worker starts late (it has a 60-second timeout), the retry can fail because there is a maximum number of retries. We retry for a very long time (10 * registration timeout), so this is unlikely to happen.

Alternatively, to improve consistency, we could also:

  • Register the removed pg and keep deleting its resources (with a reconciler) until it is fully gone.
  • Register leased workers before the workers are started. This may be a better solution, but the implications of this change are unknown.

I chose the current solution because it is needed for a network issue as well.
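To make the race concrete, here is a minimal toy sketch in Python. It is not Ray's raylet code: `ToyNode`, `cancel_with_retry`, `on_wait`, and the bundle ID `"pg-1"` are hypothetical names invented for illustration, and the bookkeeping is greatly simplified. It only models the idea described above: a single cancel that sees no started worker releases nothing, while a retried cancel eventually catches the worker once it has started, unless the attempt budget runs out first.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for per-node resource bookkeeping (not Ray's API)."""

    free_cpus: int
    # bundle_id -> CPUs held by workers that have finished starting
    started: dict = field(default_factory=dict)
    # bundle_id -> CPUs allocated to workers that are still starting
    starting: dict = field(default_factory=dict)

    def lease_worker(self, bundle_id, cpus):
        # Resources are allocated before the worker process is up.
        self.free_cpus -= cpus
        self.starting[bundle_id] = cpus

    def worker_started(self, bundle_id):
        self.started[bundle_id] = self.starting.pop(bundle_id)

    def cancel_bundle(self, bundle_id):
        """Release resources of started workers only.

        Returns True when nothing for this bundle is still pending.
        """
        if bundle_id in self.started:
            self.free_cpus += self.started.pop(bundle_id)
        return bundle_id not in self.starting


def cancel_with_retry(node, bundle_id, max_attempts, on_wait):
    """Retry cancellation up to max_attempts times; on_wait models the delay."""
    for attempt in range(max_attempts):
        if node.cancel_bundle(bundle_id):
            return True
        on_wait(attempt)
    return False


def demo():
    # Single cancel arriving before the worker starts: 2 CPUs are never returned.
    leaky = ToyNode(free_cpus=4)
    leaky.lease_worker("pg-1", 2)
    leaky.cancel_bundle("pg-1")
    leaky.worker_started("pg-1")

    # Retried cancel: the worker finishes starting between attempts.
    fixed = ToyNode(free_cpus=4)
    fixed.lease_worker("pg-1", 2)

    def on_wait(attempt):
        if attempt == 1:
            fixed.worker_started("pg-1")

    ok = cancel_with_retry(fixed, "pg-1", max_attempts=10, on_wait=on_wait)
    return leaky.free_cpus, ok, fixed.free_cpus


def demo_budget_exhausted():
    # The worker never starts within the attempt budget, so the retry gives up.
    node = ToyNode(free_cpus=4)
    node.lease_worker("pg-1", 2)
    ok = cancel_with_retry(node, "pg-1", max_attempts=3, on_wait=lambda attempt: None)
    return ok, node.free_cpus
```

In this sketch, `demo()` shows the leaky node ending with fewer free CPUs than it started with, while the retried cancellation restores all of them. `demo_budget_exhausted()` corresponds to the caveat above about a worker that starts too late: with the stated 60-second timeout and a retry window of 10 * registration timeout, that window would be roughly 600 seconds.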

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 requested a review from a team as a code owner February 12, 2024 14:29
@rkooo567
Contributor Author

The minimal test failures are also happening on master (it is a dependency issue). Please ignore them and merge the PR!

@rynewang
Contributor

rynewang commented Feb 13, 2024

The only failures are known issues and are not related to this PR. Force merging.

@aslonnie aslonnie merged commit a4a43c1 into ray-project:releases/2.9.3 Feb 13, 2024

5 participants