[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope by stephanie-wang · Pull Request #21719 · ray-project/ray

stephanie-wang · 2022-01-20T01:54:33Z

Why are these changes needed?

When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
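To make the window concrete, here is a minimal sketch of the two orderings, written as a toy model rather than Ray's actual code. `ToyReferenceCounter`, `create_ref_before_fix`, `create_ref_after_fix`, and the `async_check` callback are all hypothetical names introduced for illustration; the callback stands in for any async protocol that asks whether the object is in scope between creation and the frontend's increment.

```python
class ToyReferenceCounter:
    """Toy stand-in for the backend reference table (not Ray's real class)."""

    def __init__(self):
        self._local_refs = {}

    def add_owned_object(self, object_id, add_local_ref=False):
        # Register a newly created object, optionally with its first local ref.
        self._local_refs[object_id] = 1 if add_local_ref else 0

    def add_local_ref(self, object_id):
        self._local_refs[object_id] += 1

    def local_ref_count(self, object_id):
        return self._local_refs.get(object_id, 0)

    def in_scope(self, object_id):
        return self.local_ref_count(object_id) > 0


def create_ref_before_fix(counter, object_id, async_check):
    # Backend registers the object with a ref count of 0.
    counter.add_owned_object(object_id)
    # Hypothetical async check that happens to run inside the window.
    observed = async_check(object_id)
    # Frontend increments the initial local ref afterwards.
    counter.add_local_ref(object_id)
    return observed


def create_ref_after_fix(counter, object_id, async_check):
    # Backend registers the object and its first local ref in one step.
    counter.add_owned_object(object_id, add_local_ref=True)
    observed = async_check(object_id)
    # Frontend skips the initial increment, so the count stays at 1.
    return observed


counter = ToyReferenceCounter()
old_seen = create_ref_before_fix(counter, "obj-old", counter.in_scope)
new_seen = create_ref_after_fix(counter, "obj-new", counter.in_scope)
print(old_seen, new_seen)  # prints "False True"
```

In both orderings the final local ref count is 1; only the value seen by a check that runs in between differs.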

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

stephanie-wang · 2022-01-24T22:23:05Z

I still need to fix up the ref counting in Java and C++ language frontends, but the Python version is ready for review.

stephanie-wang · 2022-01-24T22:35:31Z

Oh turns out the C++ frontend doesn't actually do ref counting.

ericl

Looks good, but I don't have the state to give a really detailed review. Maybe trigger nightly tests to flush out any edge cases?

ericl · 2022-01-25T00:00:44Z

python/ray/_raylet.pyx

+        prepare_args_internal(core_worker, language, args, args_vector,
+                              function_descriptor, put_arg_ids)
+    except Exception as e:
+        # An error occurred during arg serialization. We must remove the


Hmm, is it possible the refs haven't been added yet?

stephanie-wang · 2022-01-27

We don't add to this list unless we've already incremented the local ref. I updated the variable name to clarify.
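The invariant discussed in this thread can be sketched as follows. This is a toy model under stated assumptions, not the code in `_raylet.pyx`: `prepare_args`, `submit_task`, `incremented_put_arg_ids`, and the plain dict of ref counts are hypothetical, and a `TypeError` stands in for an arbitrary serialization failure. An ID is appended to the list only after its local ref has been incremented, so the error path can safely decrement every ID in the list.

```python
def prepare_args(ref_counts, args, incremented_put_arg_ids):
    # Toy argument preparation: each argument becomes a "put" object.
    for index, arg in enumerate(args):
        if not isinstance(arg, (bytes, str, int)):
            # Stand-in for an error raised during arg serialization.
            raise TypeError(f"cannot serialize argument {index}")
        object_id = f"put-arg-{index}"
        # Increment first, then record the ID, so the list never holds
        # an ID whose local ref has not been added yet.
        ref_counts[object_id] = ref_counts.get(object_id, 0) + 1
        incremented_put_arg_ids.append(object_id)


def submit_task(ref_counts, args):
    incremented_put_arg_ids = []
    try:
        prepare_args(ref_counts, args, incremented_put_arg_ids)
    except Exception:
        # Undo only the refs that were actually added before the failure.
        for object_id in incremented_put_arg_ids:
            ref_counts[object_id] -= 1
            if ref_counts[object_id] == 0:
                del ref_counts[object_id]
        raise
    return incremented_put_arg_ids


ref_counts = {}
ids = submit_task(ref_counts, [b"payload", 7])
print(ids, ref_counts)
```

If the second argument fails to serialize, the ref added for the first one is released before the exception propagates, so no entry is left behind.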

stephanie-wang · 2022-01-27T07:03:31Z

Ran nightly, I think all of these test failures are from master:

https://buildkite.com/ray-project/periodic-ci/builds/2519

ericl · 2022-01-27T21:10:02Z

LINT errors, etc. Can you rebase?

bveeramani · 2022-01-30T05:07:17Z

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

Install Black

pip install -I black==21.12b0

Format changed files with Black

curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh

Commit your changes.

git add --all
git commit -m "Format Python code with Black"

Merge master into your branch.

git pull upstream master

Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

stephanie-wang · 2022-02-04T01:29:55Z

Test failures look unrelated (on the last commit it passed all but Java, and Java's now passing).

…ct from going out of scope (#22120)

When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.

This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.

stephanie-wang added 3 commits January 19, 2022 15:43

Always add local ref for put objects

b9db323

Add local ref with owned object

317872c

update

cef0ec0

stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2022

stephanie-wang mentioned this pull request Jan 20, 2022

[core] Recover spilled objects that are lost during node failure #21485

Merged

6 tasks

stephanie-wang added 2 commits January 24, 2022 11:27

Merge remote-tracking branch 'upstream/master' into pin-object-refs

1bc6709

Fix tests

9113269

stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 24, 2022

stephanie-wang assigned ericl Jan 24, 2022

java

1be519e

ericl reviewed Jan 25, 2022

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 25, 2022

comments

565ab25

stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 27, 2022

lint

18195e7

ericl approved these changes Jan 27, 2022

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 27, 2022

stephanie-wang added 3 commits January 27, 2022 14:50

fixes

ee85aa4

Merge remote-tracking branch 'upstream/master' into pin-object-refs

fe8bf62

fix

69d7cf3

stephanie-wang added 3 commits February 3, 2022 12:00

x

920660f

Merge remote-tracking branch 'upstream/master' into pin-object-refs

e2835a1

x

f04522c

stephanie-wang merged commit e3af828 into ray-project:master Feb 4, 2022

stephanie-wang deleted the pin-object-refs branch February 4, 2022 01:31

rkooo567 added a commit to rkooo567/ray that referenced this pull request Feb 4, 2022

Revert "[core] Increment ref count when creating an ObjectRef to prev…

ecf55cd

…ent object from going out of scope (ray-project#21719)" This reverts commit e3af828.

rkooo567 mentioned this pull request Feb 4, 2022

Revert "[core] Increment ref count when creating an ObjectRef to prev… #22106

Merged

6 tasks

stephanie-wang mentioned this pull request Feb 4, 2022

[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope #22120

Merged

stephanie-wang merged 14 commits into ray-project:master from stephanie-wang:pin-object-refs

3 participants
