Skip to content

[core] Ref counting for actor handles#7434

Merged
stephanie-wang merged 23 commits intoray-project:masterfrom
stephanie-wang:actor-handle-ref-counting
Mar 11, 2020
Merged

[core] Ref counting for actor handles#7434
stephanie-wang merged 23 commits intoray-project:masterfrom
stephanie-wang:actor-handle-ref-counting

Conversation

@stephanie-wang
Copy link
Copy Markdown
Contributor

Why are these changes needed?

Before, actors were kept alive as long as the creator's reference to the actor was active. This meant that if the creator exited, the actor might exit even though other handles to the actor were still active.

This PR reuses the ref counting for normal ObjectIDs to implement ref counting for actor handles by adding an ObjectID to the Python ActorHandle object. After this PR, an actor will be kept alive as the process that created it is still alive and there are any references to the actor's handle in scope.

Because distributed reference counting (#6945) is still feature-flagged off, this will only cover cases where an actor handle is passed by the original creator. If the actor handle is passed by a task/actor that did not originally create it, then the actor may exit early.

Related issue number

Doesn't fix it yet, but one step closer to #6370
Closes #3472

Checks

@AmplabJenkins
Copy link
Copy Markdown

Can one of the admins verify this patch?

status = local_raylet_client_->SubmitTask(task_spec);
}

*actor_object_id = return_ids[0];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that this value will always be the same as ObjectID::ForTaskReturn(TaskID::ForActorCreationTask(actor_id), 0)? I didn't see any random bits in this.

If so, maybe we can generate the id on the fly instead of storing it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh hmm yeah I didn't see that function. That should work for C++.

We'll still have to store it in Python, though, since we're relying on the ObjectID's python ref to track when the local ref goes out of scope and when the handle gets passed into a task.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22692/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22693/
Test FAILed.

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Mar 4, 2020 via email

@stephanie-wang
Copy link
Copy Markdown
Contributor Author

Could you do the callback when the actor handle goes out of scope and treat it as the id?

Yes, but that won't cover the cases where the actor handle gets passed or nested inside other objects. If you have suggestions on how to do it, I can try it, but this seemed much simpler to me.

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Mar 4, 2020 via email

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22714/
Test FAILed.

@stephanie-wang
Copy link
Copy Markdown
Contributor Author

stephanie-wang commented Mar 4, 2020 via email

Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

The changes to the Exit() logic look nice too - might be better to separate that into a different PR though. I think @kfstorm was going to add the same logic in #7346. Maybe you could separate the Exit() logic here into a separate PR and then both this and #7346 can rebase off of that?

LMK when you update with change that Eric suggested and I'll look again.

Comment on lines +783 to +785
state = worker.core_worker.serialize_actor_handle(
self._ray_actor_id)
state = (state, self._ray_actor_creation_return_id)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
state = worker.core_worker.serialize_actor_handle(
self._ray_actor_id)
state = (state, self._ray_actor_creation_return_id)
state = (worker.core_worker.serialize_actor_handle(
self._ray_actor_id), self._ray_actor_creation_return_id)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So cool you can do multi-line suggestions now!

@stephanie-wang
Copy link
Copy Markdown
Contributor Author

@edoakes I think the overlap with #7346 looks small enough that I'd rather just leave the changes in this PR and resolve the conflict later, if that's okay with you.

This PR is ready to review again.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22758/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22760/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22777/
Test FAILed.

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Mar 5, 2020

Sure that's fine - it's nice to try to separate independent/standalone changes into separate PRs whenever possible but obviously not a big deal in this case given that both changes are relatively small. Having another look now.

Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good except for the comment I left about overwriting the pending force kills.

Also, one thing to keep in mind is it might be nice to extend the changes here to add an API to check when an actor creation task has finished without submitting a new task. Are the currently changes compatible with that? Seems like we just need to be able to resolve the object ID associated with an actor handle in the same way that we do for serialized object IDs. This actually might have been easier with the previous iteration that stored the object ID :/

return obj._serialization_helper(True)
serialized, actor_handle_id = obj._serialization_helper()
# Update ref counting for the actor handle
if self.is_in_band_serialization():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this code block is verbatim the same as in the object_id_serializer - might be nice to separate into another method so they maintain parity?

bool force_kill) {
absl::MutexLock lock(&mu_);
pending_force_kills_.insert(actor_id);
pending_force_kills_[actor_id] = force_kill;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding this correctly, I think it would cause issues for the following example:

a = Actor.remote()
hanging_id = a.hang_forever.remote()
ray.kill(a)
del a
ray.get(hanging_id)

In this case the user would expect hanging_id to return an error because it was cancelled by the force kill, but the code here may overwrite the pending force kill with a graceful one, leading the actor to hang.

@stephanie-wang
Copy link
Copy Markdown
Contributor Author

Everything looks good except for the comment I left about overwriting the pending force kills.

Also, one thing to keep in mind is it might be nice to extend the changes here to add an API to check when an actor creation task has finished without submitting a new task. Are the currently changes compatible with that? Seems like we just need to be able to resolve the object ID associated with an actor handle in the same way that we do for serialized object IDs. This actually might have been easier with the previous iteration that stored the object ID :/

Er I'm not sure if I totally understand, but yes, I was thinking that eventually we would like to do something similar where we "resolve" an actor handle by asking its owner about the status of the actor creation task. I was actually thinking this separation would be good so that it's clear what the differences are between resolving actor handles vs normal objects.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22784/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22786/
Test FAILed.

@zhijunfu zhijunfu self-requested a review March 6, 2020 00:44
Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Should wait for @zhijunfu to have a look before merging.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22791/
Test FAILed.

@zhijunfu
Copy link
Copy Markdown
Contributor

zhijunfu commented Mar 6, 2020

LGTM! Should wait for @zhijunfu to have a look before merging.

Thanks, took a quick look ant it looks good to me.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22823/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22905/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22913/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22973/
Test FAILed.

@stephanie-wang stephanie-wang merged commit fdb5285 into ray-project:master Mar 11, 2020
@stephanie-wang stephanie-wang deleted the actor-handle-ref-counting branch March 11, 2020 00:46
@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22982/
Test FAILed.

@jovany-wang
Copy link
Copy Markdown
Contributor

jovany-wang commented Mar 12, 2020

Note that this PR broke the streaming CI.

@ffbin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Actor handle GC is problematic.

6 participants