[core] Ref counting for actor handles by stephanie-wang · Pull Request #7434 · ray-project/ray

stephanie-wang · 2020-03-04T00:52:14Z

Why are these changes needed?

Before, actors were kept alive as long as the creator's reference to the actor was active. This meant that if the creator exited, the actor might exit even though other handles to the actor were still active.

This PR reuses the ref counting for normal ObjectIDs to implement ref counting for actor handles by adding an ObjectID to the Python ActorHandle object. After this PR, an actor will be kept alive as long as the process that created it is still alive and there are any references to the actor's handle in scope.

Because distributed reference counting (#6945) is still feature-flagged off, this will only cover cases where an actor handle is passed by the original creator. If the actor handle is passed by a task/actor that did not originally create it, then the actor may exit early.
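As a rough illustration of the lifetime rule described above, here is a minimal, self-contained sketch of a handle reference counter that fires a callback when the last handle goes away. `HandleRefCounter`, `add_ref`, `remove_ref`, and `on_zero` are hypothetical names for this toy example only; they are not Ray's API and do not reflect how the CoreWorker implements it.

```python
from typing import Callable, Dict


class HandleRefCounter:
    """Toy reference counter for actor handles (illustrative only)."""

    def __init__(self, on_zero: Callable[[str], None]) -> None:
        self._counts: Dict[str, int] = {}
        self._on_zero = on_zero

    def add_ref(self, actor_id: str) -> None:
        # A new handle to the actor came into scope.
        self._counts[actor_id] = self._counts.get(actor_id, 0) + 1

    def remove_ref(self, actor_id: str) -> None:
        # A handle went out of scope; notify once no handles remain.
        remaining = self._counts[actor_id] - 1
        if remaining == 0:
            del self._counts[actor_id]
            self._on_zero(actor_id)
        else:
            self._counts[actor_id] = remaining


terminated = []
counter = HandleRefCounter(on_zero=terminated.append)
counter.add_ref("actor-1")     # the creator's handle
counter.add_ref("actor-1")     # a handle passed to a task
counter.remove_ref("actor-1")  # the creator drops its handle
still_alive = terminated == []
counter.remove_ref("actor-1")  # the last handle goes out of scope
```

In this sketch the actor is only reported for termination after the second `remove_ref`, which mirrors the intent that an actor outlives the creator's own handle while other handles remain.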

Related issue number

Doesn't fix it yet, but one step closer to #6370
Closes #3472

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.


AmplabJenkins · 2020-03-04T00:52:35Z

Can one of the admins verify this patch?

ericl · 2020-03-04T01:18:10Z

src/ray/core_worker/core_worker.cc

+    status = local_raylet_client_->SubmitTask(task_spec);
  }
+
+  *actor_object_id = return_ids[0];


Is it true that this value will always be the same as ObjectID::ForTaskReturn(TaskID::ForActorCreationTask(actor_id), 0)? I didn't see any random bits in this.

If so, maybe we can generate the id on the fly instead of storing it.

Oh hmm yeah I didn't see that function. That should work for C++.

We'll still have to store it in Python, though, since we're relying on the ObjectID's python ref to track when the local ref goes out of scope and when the handle gets passed into a task.
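To illustrate why an ID without random bits does not need to be stored, here is a small sketch of a deterministic derivation. The hashing scheme and the name `creation_return_id` are assumptions made for this example; this is not the layout that `ObjectID::ForTaskReturn` actually uses.

```python
import hashlib


def creation_return_id(actor_id: bytes, return_index: int = 0) -> bytes:
    # Hypothetical derivation: hash the actor ID together with the return
    # index. The same inputs always give the same ID, so callers can
    # recompute it on demand instead of keeping a copy.
    digest = hashlib.sha1(actor_id + return_index.to_bytes(4, "little"))
    return digest.digest()


actor_id = b"\x01" * 16
first = creation_return_id(actor_id)
second = creation_return_id(actor_id)
other = creation_return_id(b"\x02" * 16)
```

Recomputing the ID works on the C++ side, but as noted above the Python side would still need a live object whose lifetime can be tracked.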

AmplabJenkins · 2020-03-04T01:31:36Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22692/
Test FAILed.

AmplabJenkins · 2020-03-04T01:45:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22693/
Test FAILed.

ericl · 2020-03-04T06:22:49Z

Could you do the callback when the actor handle goes out of scope and treat it as the id?


stephanie-wang · 2020-03-04T06:29:32Z

> Could you do the callback when the actor handle goes out of scope and treat it as the id?

Yes, but that won't cover the cases where the actor handle gets passed or nested inside other objects. If you have suggestions on how to do it, I can try it, but this seemed much simpler to me.

ericl · 2020-03-04T06:41:10Z

Right, I guess I was thinking you can synthesize the object id on the fly in hooks for those cases too. Not sure how much code that adds.
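One way to picture the "hooks" idea is a serialization hook that runs whenever a handle is pickled, including when it is nested inside another object. The sketch below uses only the standard `pickle` module; `ToyActorHandle` and `serialized_counts` are hypothetical stand-ins and say nothing about how Ray's serializer is actually wired up.

```python
import pickle
from collections import Counter

# Hypothetical stand-in for a table of "handle was serialized" events.
serialized_counts = Counter()


class ToyActorHandle:
    def __init__(self, actor_id: str) -> None:
        self.actor_id = actor_id

    def __reduce__(self):
        # pickle calls this whenever the handle is serialized, even when
        # the handle is nested inside a dict, list, or other object.
        serialized_counts[self.actor_id] += 1
        return (ToyActorHandle, (self.actor_id,))


handle = ToyActorHandle("actor-1")
payload = pickle.dumps({"nested": [handle]})
restored = pickle.loads(payload)
```

Because the hook sees the actor ID at serialization time, an ID for ref counting could in principle be synthesized there rather than stored on the handle; how much extra code that needs in practice is the open question in this thread.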


AmplabJenkins · 2020-03-04T07:41:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22714/
Test FAILed.

stephanie-wang · 2020-03-04T17:47:07Z

Hmm let me try it and see. It might be good to separate out the logic about actor handles more.


edoakes

Looks great!

The changes to the Exit() logic look nice too - might be better to separate that into a different PR though. I think @kfstorm was going to add the same logic in #7346. Maybe you could separate the Exit() logic here into a separate PR and then both this and #7346 can rebase off of that?

LMK when you update with the change that Eric suggested and I'll look again.

edoakes · 2020-03-04T21:19:18Z

python/ray/actor.py

+            state = worker.core_worker.serialize_actor_handle(
+                self._ray_actor_id)
+            state = (state, self._ray_actor_creation_return_id)


Suggested change

-            state = worker.core_worker.serialize_actor_handle(
-                self._ray_actor_id)
-            state = (state, self._ray_actor_creation_return_id)
+            state = (worker.core_worker.serialize_actor_handle(
+                self._ray_actor_id), self._ray_actor_creation_return_id)

So cool you can do multi-line suggestions now!


stephanie-wang · 2020-03-05T07:43:34Z

@edoakes I think the overlap with #7346 looks small enough that I'd rather just leave the changes in this PR and resolve the conflict later, if that's okay with you.

This PR is ready to review again.

AmplabJenkins · 2020-03-05T07:48:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22758/
Test FAILed.

AmplabJenkins · 2020-03-05T08:11:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22760/
Test FAILed.

AmplabJenkins · 2020-03-05T20:23:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22777/
Test FAILed.

edoakes · 2020-03-05T22:01:53Z

Sure that's fine - it's nice to try to separate independent/standalone changes into separate PRs whenever possible but obviously not a big deal in this case given that both changes are relatively small. Having another look now.

edoakes

Everything looks good except for the comment I left about overwriting the pending force kills.

Also, one thing to keep in mind is it might be nice to extend the changes here to add an API to check when an actor creation task has finished without submitting a new task. Are the current changes compatible with that? Seems like we just need to be able to resolve the object ID associated with an actor handle in the same way that we do for serialized object IDs. This actually might have been easier with the previous iteration that stored the object ID :/

edoakes · 2020-03-05T22:05:02Z

python/ray/serialization.py

-            return obj._serialization_helper(True)
+            serialized, actor_handle_id = obj._serialization_helper()
+            # Update ref counting for the actor handle
+            if self.is_in_band_serialization():


Looks like this code block is verbatim the same as in the object_id_serializer - might be nice to separate into another method so they maintain parity?

edoakes · 2020-03-05T22:09:52Z

src/ray/core_worker/transport/direct_actor_transport.cc

+                                                     bool force_kill) {
  absl::MutexLock lock(&mu_);
-  pending_force_kills_.insert(actor_id);
+  pending_force_kills_[actor_id] = force_kill;


If I'm understanding this correctly, I think it would cause issues for the following example:

    a = Actor.remote()
    hanging_id = a.hang_forever.remote()
    ray.kill(a)
    del a
    ray.get(hanging_id)

In this case the user would expect hanging_id to return an error because it was cancelled by the force kill, but the code here may overwrite the pending force kill with a graceful one, leading the actor to hang.
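The concern can be sketched with a plain dictionary: if a later graceful request simply overwrites the entry, an earlier force kill is lost. One possible guard, shown below, is to make the force flag "sticky". `pending_force_kills` here is a Python dict standing in for the C++ map, and `request_kill` is a hypothetical helper; this is not necessarily the fix the PR adopted.

```python
from typing import Dict

# Python stand-in for the map from actor ID to "force kill?" flag.
pending_force_kills: Dict[str, bool] = {}


def request_kill(actor_id: str, force_kill: bool) -> None:
    # Keep an earlier force kill sticky: a later graceful request
    # must not downgrade it.
    pending_force_kills[actor_id] = (
        pending_force_kills.get(actor_id, False) or force_kill
    )


request_kill("actor-1", force_kill=True)   # e.g. an explicit kill request
request_kill("actor-1", force_kill=False)  # e.g. the handle went out of scope
```

With a plain assignment the second call would leave the entry as `False`; with the guard it stays `True`, so the pending task would still be cancelled.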

stephanie-wang · 2020-03-05T22:40:00Z

> Everything looks good except for the comment I left about overwriting the pending force kills.
>
> Also, one thing to keep in mind is it might be nice to extend the changes here to add an API to check when an actor creation task has finished without submitting a new task. Are the current changes compatible with that? Seems like we just need to be able to resolve the object ID associated with an actor handle in the same way that we do for serialized object IDs. This actually might have been easier with the previous iteration that stored the object ID :/

Er I'm not sure if I totally understand, but yes, I was thinking that eventually we would like to do something similar where we "resolve" an actor handle by asking its owner about the status of the actor creation task. I was actually thinking this separation would be good so that it's clear what the differences are between resolving actor handles vs normal objects.

AmplabJenkins · 2020-03-05T23:46:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22784/
Test FAILed.

AmplabJenkins · 2020-03-05T23:49:32Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22786/
Test FAILed.

edoakes

LGTM! Should wait for @zhijunfu to have a look before merging.

AmplabJenkins · 2020-03-06T02:07:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22791/
Test FAILed.

zhijunfu · 2020-03-06T12:34:06Z

> LGTM! Should wait for @zhijunfu to have a look before merging.

Thanks, took a quick look and it looks good to me.

AmplabJenkins · 2020-03-06T21:41:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22823/
Test FAILed.

AmplabJenkins · 2020-03-09T19:33:33Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22905/
Test FAILed.

AmplabJenkins · 2020-03-09T22:16:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22913/
Test FAILed.

AmplabJenkins · 2020-03-10T17:53:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22973/
Test FAILed.

AmplabJenkins · 2020-03-11T01:11:44Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22982/
Test FAILed.

jovany-wang · 2020-03-12T05:29:28Z

Note that this PR broke the streaming CI.

@ffbin

stephanie-wang added 8 commits February 27, 2020 15:20

tmp

3d41665

Move Exit handler into CoreWorker, exit once owner's ref count goes to 0

b90b7eb

Merge remote-tracking branch 'upstream/master' into actor-handle-ref-counting

f1daefe

fix build

61be31d

Remove __ray_terminate__ and add test case for distributed ref counting

f82d716

lint

c4a0b0d

Remove unused

3ff3199

Fixes for detached actor, duplicate actor handles

03ec020

Remove unused

06bf466

ericl reviewed Mar 4, 2020

Remove creation return ID

99912d3

edoakes reviewed Mar 4, 2020

edoakes mentioned this pull request Mar 4, 2020

Add a flag to disable reconstruction for a killed actor #7346

Merged

stephanie-wang added 4 commits March 4, 2020 23:12

Remove ObjectIDs from python, set references in CoreWorker

1816b1e

Merge remote-tracking branch 'upstream/master' into actor-handle-ref-counting

e3e2765

Fix crash

6c6553e

Fix memory crash

fac81f5

Fix tests

86889e6

edoakes reviewed Mar 5, 2020

fix

cdc90ac

fixes

4c90cc9

zhijunfu self-requested a review March 6, 2020 00:44

fix tests

d0873fb

edoakes approved these changes Mar 6, 2020

fix java build

5cb747d

fix build

822409d

fix

7711f02

check status

081871a

check status

0d19366

stephanie-wang merged commit fdb5285 into ray-project:master Mar 11, 2020

stephanie-wang deleted the actor-handle-ref-counting branch March 11, 2020 00:46

kfstorm mentioned this pull request Sep 23, 2020

Support actor handle ref counting for Java #10976

Closed
