Skip to content

[PlacementGroup]Fix the bug that actor does not restart#13321

Closed
ffbin wants to merge 1 commit intoray-project:masterfrom
antgroup:fix_placement_group_destroy_worker_bug
Closed

[PlacementGroup]Fix the bug that actor does not restart#13321
ffbin wants to merge 1 commit intoray-project:masterfrom
antgroup:fix_placement_group_destroy_worker_bug

Conversation

@ffbin
Copy link
Copy Markdown
Contributor

@ffbin ffbin commented Jan 10, 2021

Why are these changes needed?

If an actor uses a placeemt group bundle, the actor will not restart when the bundle is cancelled.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ffbin ffbin requested a review from rkooo567 January 10, 2021 12:18
@rkooo567
Copy link
Copy Markdown
Contributor

I think we discussed this before, but this is intentional. Imagine your placement group is removed, and that destroys the actor. In this case, restarting actor doesn't mean anything because it can never be scheduled.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2021
@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 15, 2021

I think we discussed this before, but this is intentional. Imagine your placement group is removed, and that destroys the actor. In this case, restarting actor doesn't mean anything because it can never be scheduled.

If GCS restarts during PG creation, GCS will release useless bundles, which will cause actor not to restart.

@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Jan 15, 2021

Hmm I dont' think I follow this sentence; If GCS restarts during PG creation, GCS will release useless bundles, which will cause actor not to restart.

But regardless, this change is dangerous. This will make the user experience really bad (because whenever user removes the placement group, they will get tons of spam messages about unintentional worker crashes)

@rkooo567
Copy link
Copy Markdown
Contributor

why is the actor scheduled on a useless bundle in the first place?

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 17, 2021

why is the actor scheduled on a useless bundle in the first place?

PG and actor are created asynchronously. If GCS restarts in the process of creating a PG, in order to avoid PG bundle resource leakage, GCS will inform raylet to release the bundles not recorded in GCS. However, the bundle may already be used by an actor, so canceling the bundle will cause the actor to exit internally and not restart again.

@rkooo567
Copy link
Copy Markdown
Contributor

It is still unacceptable to have this change because fate-sharing is a feature of pg, and it is definitely intentional (and otherwise we will spam users with wrong info). What about this? We modify the remove bundle signature to have “intentional_bundle_removal” default true, and when we delete bundles cuz of pg fault tolernace protocol, we set this false?

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 17, 2021

@rkooo567 #13320 and this PR will cause the following problems: detached actor A creates PG and uses PG to create detached actor B. If actor A restarts, it will cause PG to be removed, and PG will kill actor B when it is deleted. Since actor B is marked as an internal exit, actor B will not be restarted.

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 17, 2021

It is still unacceptable to have this change because fate-sharing is a feature of pg, and it is definitely intentional (and otherwise we will spam users with wrong info). What about this? We modify the remove bundle signature to have “intentional_bundle_removal” default true, and when we delete bundles cuz of pg fault tolernace protocol, we set this false?

SGTM, My current implementation is a temporary solution, mainly to quickly solve the problems mentioned above.

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 17, 2021

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 20, 2021

I think #13527 will fix this bug, so close this PR.

@ffbin ffbin closed this Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants