Distributed ref counting for serialized ObjectIDs #6945

Merged
stephanie-wang merged 88 commits into ray-project:master from stephanie-wang:ref-counting
Feb 19, 2020

@stephanie-wang
Contributor

@stephanie-wang stephanie-wang commented Jan 28, 2020

Why are these changes needed?

This implements distributed reference counting for object IDs that are serialized and passed to another process. It also implements reference counting for object IDs that get nested in another object.

There should be no change in memory management behavior from the user's perspective. By default, all workers will maintain these new ref counts, but the counts will not be considered when deciding whether an object is still in scope. The current behavior therefore stays the same: if the creator of an object ID still has a local reference to the object, then at least one copy of the object will be available in the cluster. The creator of an object ID is the process that called ray.put() or that submitted the task that returns the object ID. The creator's local reference count includes:

  1. Local Python references to the ObjectID.
  2. Number of pending tasks submitted by the creator that depend on the object.

The new reference counts added by this PR include:

  1. Number of objects that contain the ObjectID that are still in scope.
  2. List of processes that have a local reference to the ObjectID.
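
As a rough illustration of how the four counts above relate, here is a minimal Python sketch. It is not Ray's implementation: the `Reference` class, its field names, and the `in_scope` helper are hypothetical and only mirror the two lists in this description.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Reference:
    """Toy per-object record. All names here are illustrative, not Ray's."""
    # Existing counts kept by the creator of the ObjectID.
    local_python_refs: int = 0   # local Python references to the ObjectID
    pending_task_deps: int = 0   # pending tasks from the creator that depend on the object
    # New counts described in this PR.
    contained_in: Set[str] = field(default_factory=set)  # in-scope objects containing the ID
    borrowers: Set[str] = field(default_factory=set)     # processes holding a local reference

    def local_count(self) -> int:
        """The creator's local reference count."""
        return self.local_python_refs + self.pending_task_deps

    def in_scope(self, distributed_ref_counting_enabled: bool = False) -> bool:
        """Decide whether the object is still in scope under this toy model."""
        if self.local_count() > 0:
            return True
        if not distributed_ref_counting_enabled:
            # Default: the new counts are maintained but not consulted.
            return False
        return bool(self.contained_in or self.borrowers)


# The creator drops its last local reference while another process still borrows the ID.
ref = Reference(local_python_refs=1, borrowers={"worker-2"})
ref.local_python_refs -= 1
out_of_scope_by_default = not ref.in_scope()
kept_with_flag = ref.in_scope(distributed_ref_counting_enabled=True)
```

In this sketch the object goes out of scope by default once the creator's local count reaches zero, and stays in scope only when the new counts are consulted.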

Eventually, object pinning will also consider the new reference counts when deciding when an object stored in plasma can be evicted. The guarantee will be that if any process in a cluster has a reference to the object ID or an object that transitively contains the object ID, then at least one copy of the object will be available in the cluster. To turn on this behavior, set the distributed_ref_counting_enabled flag to 1 in the internal config passed to ray.init(), like so:

import json
import ray

ray.init(_internal_config=json.dumps({
    "distributed_ref_counting_enabled": 1,
}))
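
The transitive guarantee described above can be pictured as a reachability check over a "contains" relation. The following is only a sketch under that assumption: the `pinned_objects` function and the string IDs are hypothetical, and the actual protocol in this PR is distributed rather than a single in-memory graph walk.

```python
from typing import Dict, Set


def pinned_objects(held: Set[str], contains: Dict[str, Set[str]]) -> Set[str]:
    """Return every object ID reachable from the IDs that some process holds.

    `held` is the set of IDs referenced by any process in the cluster.
    `contains` maps an object ID to the IDs nested inside that object.
    """
    reachable: Set[str] = set()
    stack = list(held)
    while stack:
        object_id = stack.pop()
        if object_id in reachable:
            continue  # already visited; also guards against cycles
        reachable.add(object_id)
        stack.extend(contains.get(object_id, ()))
    return reachable


# "outer" contains "middle", which contains "inner"; "unrelated" has no holder.
contains = {"outer": {"middle"}, "middle": {"inner"}, "unrelated": set()}
still_needed = pinned_objects({"outer"}, contains)
```

Here holding only `outer` keeps `middle` and `inner` available as well, while `unrelated` is not in the result and would be eligible for eviction in this toy model.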

Checks

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/21125/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/21259/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/21345/
Test FAILed.

- register handler for WaitForRefRemoved
- don't create a python reference for arg IDs
- pass in client factory into ReferenceCounter
- fix bad decrement in PopBorrowerRefs
- don't decrement for IDs on dependency resolution, wait until task finished
- add object IDs that were inlined when building the arguments to the task spec, pin these on the task executor until task finishes
stephanie-wang and others added 8 commits February 17, 2020 18:21
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22056/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22058/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22057/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22060/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22078/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22085/
Test FAILed.

@stephanie-wang
Contributor Author

@kfstorm, we're trying to merge this ASAP for 0.8.2, but I believe it's breaking a Java test on Travis. Can you take a look when you get a chance? Thanks!

Logs: https://api.travis-ci.com/v3/job/288659419/log.txt

@kfstorm
Member

kfstorm commented Feb 19, 2020

@stephanie-wang I've noticed that the Java CI has not been stable recently. testInWorker is one of the flaky tests. Did you rerun it? If you retried several times and still couldn't get it to pass, please just ignore it. I'll spend some time investigating these flaky tests.

@stephanie-wang stephanie-wang merged commit f76ce83 into ray-project:master Feb 19, 2020
@stephanie-wang
Contributor Author

> @stephanie-wang I've noticed that the Java CI has not been stable recently. testInWorker is one of the flaky tests. Did you rerun it? If you retried several times and still couldn't get it to pass, please just ignore it. I'll spend some time investigating these flaky tests.

Good to know, thanks! I didn't look through all the runs, but I think it's been failing pretty consistently on one test, and it looks related to this PR. Here is the relevant output from that run:

FAILED: 
org.ray.api.exception.RayWorkerException: The worker died unexpectedly while executing this task.
	at org.ray.runtime.object.ObjectSerializer.deserialize(ObjectSerializer.java:48)
	at org.ray.runtime.object.ObjectStore.get(ObjectStore.java:99)
	at org.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:78)
	at org.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:72)
	at org.ray.api.Ray.get(Ray.java:71)
	at org.ray.runtime.object.RayObjectImpl.get(RayObjectImpl.java:37)
	at org.ray.api.test.MultiThreadingTest.testInWorker(MultiThreadingTest.java:134)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:86)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:643)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:820)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1128)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
	at org.testng.TestRunner.privateRun(TestRunner.java:782)
	at org.testng.TestRunner.run(TestRunner.java:632)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:366)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:361)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:319)
	at org.testng.SuiteRunner.run(SuiteRunner.java:268)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1244)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1169)
	at org.testng.TestNG.run(TestNG.java:1064)
	at org.testng.TestNG.privateMain(TestNG.java:1385)
	at org.testng.TestNG.main(TestNG.java:1354)

6 participants