Skip to content

Add a flag to disable reconstruction for a killed actor#7346

Merged
raulchen merged 12 commits intoray-project:masterfrom
antgroup:java_kill_actor_fix
Mar 13, 2020
Merged

Add a flag to disable reconstruction for a killed actor#7346
raulchen merged 12 commits intoray-project:masterfrom
antgroup:java_kill_actor_fix

Conversation

@kfstorm
Copy link
Copy Markdown
Member

@kfstorm kfstorm commented Feb 27, 2020

Add local_raylet_client_->Disconnect() in CoreWorker::HandleKillActor to mark the actor as dead instead of reconstructing, if the no_reconstruction flag is set to true.

Currently the flag is exposed in Java only.

@AmplabJenkins
Copy link
Copy Markdown

Can one of the admins verify this patch?

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22489/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22492/
Test FAILed.

@kfstorm kfstorm changed the title [Java] A killed actor should not be reconstructed. A killed actor should not be reconstructed. Feb 27, 2020
@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Feb 27, 2020

@kfstorm I'm not sure about this change actually. KillActor is meant for situations in which the actor is hanging and the user might want to kill a hanging actor and have it reconstructed to be usable again. If you want the actor not to be reconstructed, it should go through the normal actor termination path (in python, sending actor.__ray_terminate__.remote()

@jovany-wang
Copy link
Copy Markdown
Contributor

jovany-wang commented Feb 28, 2020

ps:Not related to this PR.

I'don't know if it's too late to suggest renaming it to exitActorGracefully or some words which can indicates the API is used for killing actor with not reconstructing. I guess it would be more obvious.

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Feb 28, 2020

@kfstorm based on the discussion on #7360, could you make this optional? That is, add a flag (that's off by default) that calls local_raylet_client_->Disconnect().

@mitar
Copy link
Copy Markdown
Member

mitar commented Feb 29, 2020

I think the best would be to have API:

  • ray.kill_actor (calls current kill + disconnect so that it is not restarted)
  • ray.restart_actor (calls current kill)
  • ray.terminate_actor (calls __ray_terminate__)

@kfstorm
Copy link
Copy Markdown
Member Author

kfstorm commented Mar 1, 2020

@edoakes Sounds good. I'll update this PR. @mitar What's the semantics of invoking ray.restart_actor against a non-reconstructable actor?

@mitar
Copy link
Copy Markdown
Member

mitar commented Mar 1, 2020

What's the semantics of invoking ray.restart_actor against a non-reconstructable actor?

It could be that it re-creates with initial state? Instead of reconstructed state? (I am assuming what reconstructable actor is, I am not really familiar with that.)

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Mar 1, 2020

@kfstorm I don't think we want to expose this in the python API without more discussion about the API/semantics, so maybe add the flag but don't make it public yet?

@kfstorm kfstorm changed the title A killed actor should not be reconstructed. Add a flag to disable reconstruction for a killed actor Mar 4, 2020
@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22717/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22734/
Test PASSed.

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Mar 4, 2020

@kfstorm looks good, but see my comment on #7434.

*/
public static void killActor(RayActor<?> actor) {
runtime.killActor(actor);
public static void killActor(RayActor<?> actor, boolean noReconstruction) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also move this API to RayActor?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public static class KillerActor {

public void kill(RayActor<?> actor, boolean noReconstruction) {
Ray.killActor(actor, noReconstruction);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've merged the new API, can you rebase and update this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


worker = ray.worker.get_global_worker()
worker.core_worker.kill_actor(actor._ray_actor_id)
worker.core_worker.kill_actor(actor._ray_actor_id, False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expose an option to this kill method? It can default to False.

Copy link
Copy Markdown
Contributor

@raulchen raulchen Mar 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoakes just saw your comment above. I think it makes more sense to add an option here and deprecate __ray_terminate__.

Because, 1) having two different APIs (kill and __ray_terminate__) is confusing to users. 2) __ray_terminate__ is also an actor method, if the actor is hanging, it won't take effect.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that having ray.kill and actor.__ray_terminate__ is confusing. They have similar effects but with different approaches. We can combine them into one API.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree - ray.kill with a "force kill" flag seems best to me

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22794/
Test FAILed.

Comment on lines -980 to +982
RAY_CHECK_OK(KillActor(actor_id, /*intentional=*/true));
RAY_CHECK_OK(
KillActor(actor_id, /*force_kill=*/true, /*no_reconstruction=*/false));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephanie-wang Maybe the second argument should be false? Do you try to destroy the actor gracefully?

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23034/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23061/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23090/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23092/
Test FAILed.

@raulchen
Copy link
Copy Markdown
Contributor

Can you merge master and resolve the conflicts?

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23138/
Test PASSed.

@raulchen raulchen merged commit d6e8f47 into ray-project:master Mar 13, 2020
@raulchen raulchen deleted the java_kill_actor_fix branch March 13, 2020 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants