Distributed ref counting for serialized ObjectIDs #6945
stephanie-wang merged 88 commits into ray-project:master
…rrower's ref count goes to 0
…o unset contained_in when popping refs
skip adding ownership info if we already have it to handle duplicate refs
- register handler for WaitForRefRemoved
- don't create a python reference for arg IDs
- pass in client factory into ReferenceCounter
- fix bad decrement in PopBorrowerRefs

- don't decrement for IDs on dependency resolution, wait until task finished
- add object IDs that were inlined when building the arguments to the task spec, pin these on the task executor until task finishes
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
@kfstorm, we're trying to merge this ASAP for 0.8.2, but I believe it's breaking a Java test on Travis. Can you take a look when you get a chance? Thanks!

@stephanie-wang I've noticed that the Java CI has not been stable recently.

Good to know, thanks! I didn't look through all the runs, but I think it's been failing pretty consistently on one test, and it looks related to this PR. Here is the relevant output from that run:
Why are these changes needed?
This implements distributed reference counting for object IDs that are serialized and passed to another process. It also implements reference counting for object IDs that get nested in another object.
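As a rough illustration of the bookkeeping described above, the following toy model tracks one owner's view of an object ID: its local references, the processes that received a serialized copy (borrowers), and the outer objects the ID is nested in. This is only a sketch of the idea, not the ReferenceCounter from this PR; the ToyOwner class, its method names, and the string IDs are all hypothetical.

```python
class ToyOwner:
    """Hypothetical single-process model of an owner's reference table."""

    def __init__(self):
        self.local_refs = {}    # object ID -> count of local references
        self.borrowers = {}     # object ID -> set of borrowing worker names
        self.contained_in = {}  # inner object ID -> set of outer object IDs

    def add_local_ref(self, oid):
        self.local_refs[oid] = self.local_refs.get(oid, 0) + 1

    def remove_local_ref(self, oid):
        self.local_refs[oid] = max(self.local_refs.get(oid, 0) - 1, 0)

    def add_borrower(self, oid, worker):
        # Called when the ID is serialized and passed to another process.
        self.borrowers.setdefault(oid, set()).add(worker)

    def remove_borrower(self, oid, worker):
        # Called when the borrower reports that its ref count went to 0.
        self.borrowers.get(oid, set()).discard(worker)

    def nest(self, inner, outer):
        # Record that `inner` was stored inside the object `outer`.
        self.contained_in.setdefault(inner, set()).add(outer)

    def in_scope(self, oid, _seen=None):
        # An ID is in scope if it has a local ref, a borrower, or is
        # (transitively) contained in an object that is in scope.
        seen = set() if _seen is None else _seen
        if oid in seen:  # guard against containment cycles
            return False
        seen.add(oid)
        if self.local_refs.get(oid, 0) > 0 or self.borrowers.get(oid):
            return True
        return any(self.in_scope(outer, seen)
                   for outer in self.contained_in.get(oid, ()))


owner = ToyOwner()
owner.add_local_ref("x")
owner.add_borrower("x", "worker-1")  # "x" was passed to another process
owner.remove_local_ref("x")          # the creator drops its own reference
still_needed = owner.in_scope("x")   # the borrower keeps "x" in scope
```

In this model an ID only goes out of scope once every local reference, every borrower, and every containing object is gone, which mirrors the guarantee the PR is working toward.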
There should be no change in memory management behavior from the user's perspective. By default, all workers will maintain these new ref counts, but they will not be considered when deciding whether an object is still in scope. Therefore, the current behavior is that if the creator of an object ID still has a local reference to the object, then at least one copy of the object will be available in the cluster. The creator of an object ID is the process that called ray.put() or that submitted the task that returns the object ID.
The creator's local reference count includes: ObjectID.
The new reference counts added by this PR include:
Eventually, object pinning will also consider the new reference counts when deciding when an object stored in plasma can be evicted. The guarantee will be that if any process in a cluster has a reference to the object ID or an object that transitively contains the object ID, then at least one copy of the object will be available in the cluster. To turn on this behavior, set the distributed_ref_counting_enabled flag to 1 in the internal config passed to ray.init().
Checks
scripts/format.sh to lint the changes in this PR.