[Core] Fix placement group leaks (#42942) by rkooo567 · Pull Request #43097 · ray-project/ray

rkooo567 · 2024-02-12T14:29:14Z

Root cause: When we schedule a task, we allocate resources and start a worker. If a cancel bundle request is received before the worker has started, resources leak, because cancel bundle only kills workers and returns resources for "workers that are already started".

This PR fixes the issue by retrying the cancellation. This also means:

- If a worker starts late (it has a 60-second timeout), the retry can fail because there is a maximum number of retries. We retry for a very long time (10 * registration timeout), so this is unlikely to happen.

Alternatively, to improve consistency, we could also:

- Register the removed pg and keep deleting its resources (with a reconciler) until it is fully gone.
- Register leased workers before the workers are started.

This could be a better solution, but the implications of the change are unknown. I chose the current solution because it is needed for a network issue as well.
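To make the race and the retry concrete, here is a minimal toy model in Python. It is only a sketch of the idea described above, not Ray's actual code (which is C++): `ToyNode`, `cancel_with_retry`, and the `between_attempts` hook are all hypothetical names invented for illustration, and the real fix waits between attempts rather than calling a hook.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node's resource bookkeeping (not Ray's raylet)."""

    available: int
    # worker_id -> (bundle_id, amount) for workers that have not started yet
    pending: dict = field(default_factory=dict)
    # worker_id -> (bundle_id, amount) for workers that are already started
    started: dict = field(default_factory=dict)

    def lease(self, worker_id, bundle_id, amount):
        # Resources are allocated at scheduling time, before the worker is up.
        self.available -= amount
        self.pending[worker_id] = (bundle_id, amount)

    def worker_started(self, worker_id):
        self.started[worker_id] = self.pending.pop(worker_id)

    def cancel_bundle(self, bundle_id):
        """Kill started workers of the bundle and return their resources.

        Returns True only when no worker of the bundle remains, pending or started.
        """
        for worker_id, (bid, amount) in list(self.started.items()):
            if bid == bundle_id:
                del self.started[worker_id]
                self.available += amount
        return not any(bid == bundle_id for bid, _ in self.pending.values())


def cancel_with_retry(node, bundle_id, max_retries, between_attempts=lambda attempt: None):
    """Retry cancellation up to max_retries times; give up (and leak) after that."""
    for attempt in range(max_retries):
        if node.cancel_bundle(bundle_id):
            return True
        between_attempts(attempt)  # a real system would wait here instead
    return False


node = ToyNode(available=4)
node.lease("w1", "bundle-0", 2)

# A single cancel before the worker starts returns nothing: the leak.
print(node.cancel_bundle("bundle-0"), node.available)

# With retries, the late-starting worker is eventually seen and cleaned up.
def start_late(attempt):
    if attempt == 2:
        node.worker_started("w1")

print(cancel_with_retry(node, "bundle-0", max_retries=10, between_attempts=start_late), node.available)
```

If the worker never starts within `max_retries` attempts, `cancel_with_retry` returns `False` and the resources stay allocated, which mirrors the caveat above about a bounded number of retries.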

Why are these changes needed?

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

rkooo567 · 2024-02-13T05:58:13Z

The minimal test failures are happening on master too (it is a dependency issue). Please ignore them and merge the PR!

rynewang · 2024-02-13T19:10:41Z

Only failures are the known issues, not related. Force merging

rkooo567 assigned jjyao and zhe-thoughts Feb 12, 2024

rkooo567 requested a review from a team as a code owner February 12, 2024 14:29

zhe-thoughts approved these changes Feb 12, 2024

jjyao approved these changes Feb 12, 2024

rkooo567 assigned rynewang Feb 12, 2024

aslonnie merged commit a4a43c1 into ray-project:releases/2.9.3 from rkooo567:releases/2.9.3 Feb 13, 2024
