Skip to content

[PlacementGroup]Fix destroy placement group bug#13320

Closed
ffbin wants to merge 4 commits intoray-project:masterfrom
antgroup:fix_destroy_owned_placement_group_bug
Closed

[PlacementGroup]Fix destroy placement group bug#13320
ffbin wants to merge 4 commits intoray-project:masterfrom
antgroup:fix_destroy_owned_placement_group_bug

Conversation

@ffbin
Copy link
Copy Markdown
Contributor

@ffbin ffbin commented Jan 10, 2021

Why are these changes needed?

At present, if the placement group creator actor restarts, the placement group will be destroyed. We should destroy placement group only when the placement group creator actor is dead.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

const PlacementGroupID &placement_group_id,
StatusCallback on_placement_group_removed) {
RAY_CHECK(on_placement_group_removed);
RAY_LOG(INFO) << "Removing placement group " << placement_group_id;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The printing frequency of these logs is very low. In order to facilitate the problem location, it is set to INFO.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I still feel like this log is unnecessary. Isn't it better adding a log when the placement group is actually deleted instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Jan 12, 2021

Hmm I am not sure if this semantic is a good idea. Think about this scenario.

  • Create an actor
  • In the actor constructor we create a new placement group.

In this case, isn't it creating placement groups whenever we restart the actor? So if we restart 5 times, there are 4 placement groups that are not removed and retrieveable, but we don't need them because the actor created a new one.

Copy link
Copy Markdown
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request change until the discussion is resolved.

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 12, 2021

In this case, isn't it creating placement groups whenever we restart the actor? So if we restart 5 times, there are 4 placement groups that are not removed and retrieveable, but we don't need them because the actor created a new one.

Hi @rkooo567 , I actually encountered this problem in the production environment, two detached actors (A and B). Actor A creates PG, and then uses PG to create actor B. Actor A is killed, causing PG to be deleted and actor B to exit without restarting. So I think this is definitely a problem that needs to be solved, but the scenario you mentioned really can't be solved very well. I generally suggest users not to create PG in the constructor, because at present, the actor constructor fails and users can't perceive it.

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 14, 2021

Hi @rkooo567 Pls help take a look, thx :)

Copy link
Copy Markdown
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update a doc to specify this behavior? (saying if the actor is restarted, the placement group created from it is not removed).

It'll be also great if you add a warning not to create a placement group in a constructor.

cc @richardliaw Will this behavior change cause any issue?

const PlacementGroupID &placement_group_id,
StatusCallback on_placement_group_removed) {
RAY_CHECK(on_placement_group_removed);
RAY_LOG(INFO) << "Removing placement group " << placement_group_id;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I still feel like this log is unnecessary. Isn't it better adding a log when the placement group is actually deleted instead?

@ffbin ffbin requested a review from rkooo567 January 15, 2021 13:04
@rkooo567
Copy link
Copy Markdown
Contributor

rkooo567 commented Jan 15, 2021

Hey @ffbin I thought more about it, and I think we shouldn't push this change. I think the scenario you mentioned is not a bug because we have the exact same behavior for actor handles. (imagine two detached actors. actor A creates a actor handle C and pass it to actor B. Current pg behavior is consistent with this).

I think we cannot "encourage" users not to create pgs in the constructor. This is a really bad UX since many users don't see docs, and this can lead them to unexpected placement group leaks which are nearly impossible to debug in their end.

I think the right solution is to keep the current behavior, but we implement this (#10508), so that users can clearly know why the actor B was killed, or we just don't fate-share actors with the placement group.

But since we have diverged opinions, I'd like to hear what @ericl thinks about this.

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Jan 15, 2021

It sounds like what you actually want here is to "detach" the placement group.

Should we just support lifetime="detached" for placement group explicitly?

@rkooo567
Copy link
Copy Markdown
Contributor

That's a good idea. Serve also asked about the need for detached placement groups (for better fault tolerance). And I think that's the best solution here I believe.

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 17, 2021

That's a good idea. Serve also asked about the need for detached placement groups (for better fault tolerance). And I think that's the best solution here I believe.

Support lifetime="detached" for placement group is a good idea, i will open a new PR for it.

@rkooo567
Copy link
Copy Markdown
Contributor

Can we close PR and instead merging the detached pg PR?

@ffbin
Copy link
Copy Markdown
Contributor Author

ffbin commented Jan 21, 2021

Close this PR and instead merging the detached pg PR.

@ffbin ffbin closed this Jan 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants