[RFC] Reference counting bug when the object ref transits the same worker as a nested return and then arg#19910
Conversation
| // Don't clear stored_in values, which may be from previous tasks that | ||
| // created this same object id. | ||
| RAY_CHECK(GetAndClearLocalBorrowersInternal(borrowed_id, &borrowed_refs, | ||
| /*clear_stored=*/false)) |
| // This should be the first time that we have stored this object ID | ||
| // inside this return ID. | ||
| RAY_CHECK(inserted); | ||
| RAY_UNUSED(inner_it->second.stored_in_objects.emplace(object_id, owner_address)); |
There was a problem hiding this comment.
Haven't looked into why this check removal was needed.
|
I'm a bit worried this will cause a leak in other cases where we actually should clear the stored_in_objects field. When exactly would it get cleared in non-ownership transfer cases? What if we send the caller's address and outer ID to the new owner during the AssignObjectOwner RPC? Then the new owner should send WaitForRefRemoved to both the previous owner and the task caller, and we wouldn't have a race condition. |
Hmm, could we address this by only clearing certain entries from stored_in_objects (e.g., filtering on where the owner == the owner of the current task?)
That would race though, since the outer owner isn't aware of the object until the task finishes, unless we add logic to fix that up in this case. I also think it's fundamentally not addressing the issue of store_in being incorrectly cleared and returned to the wrong process. |
That sounds a bit complicated... I think that case happens normally too, not just when there is ownership change.
I don't think that's true, the new owner tells the outer object owner that it has a nested ObjectID during the WaitForRefRemoved RPC (that's why we have the contained_in_id arg). |
| // foreign owner from learning about the parent task borrowing this value. | ||
| if (!it->second.foreign_owner_already_monitoring) { | ||
| it->second.stored_in_objects.clear(); | ||
| } |
|
Per offline discussion, updated to track when an object is created with a foreign owner process explicitly, and skip clearing in that case. This is a light-weight version of tracking what metadata was added by the task specifically; that might be a needed refactoring but would be a much larger change. Open to suggestions on how to write a unit tests here; otherwise this is covered by test_dataset.py::test_callable_classes which will immediately fail without this PR. |
It should be possible to write this with a unit test. Let me know if you need help with it. Here's an example that might be useful to copy. |
| owner->rc_.RemoveLocalReference(return_id2, nullptr); | ||
| ASSERT_FALSE(owner->rc_.HasReference(inner_id)); | ||
| ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id)); | ||
| ASSERT_FALSE(caller->rc_.HasReference(inner_id)); |
There was a problem hiding this comment.
@stephanie-wang need a bit a bit of help figuring out why this last assert is failing.
There was a problem hiding this comment.
I'm not sure about this line, but the line above should fail because the foreign_owner has to send another WaitForRefRemoved RPC to caller (can trigger the handler with caller->FlushBorrowerCallbacks()).
There was a problem hiding this comment.
It turns out this was a real reference leak (had to change the logic to not return the borrower reference in the foreign owner case).
stephanie-wang
left a comment
There was a problem hiding this comment.
The fix looks good! I had some questions about the test case.
| owner->rc_.RemoveLocalReference(return_id2, nullptr); | ||
| ASSERT_FALSE(owner->rc_.HasReference(inner_id)); | ||
| ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id)); | ||
| ASSERT_FALSE(caller->rc_.HasReference(inner_id)); |
There was a problem hiding this comment.
I'm not sure about this line, but the line above should fail because the foreign_owner has to send another WaitForRefRemoved RPC to caller (can trigger the handler with caller->FlushBorrowerCallbacks()).
ericl
left a comment
There was a problem hiding this comment.
Updated, thanks for the tips on the test!
| owner->rc_.RemoveLocalReference(return_id2, nullptr); | ||
| ASSERT_FALSE(owner->rc_.HasReference(inner_id)); | ||
| ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id)); | ||
| ASSERT_FALSE(caller->rc_.HasReference(inner_id)); |
There was a problem hiding this comment.
It turns out this was a real reference leak (had to change the logic to not return the borrower reference in the foreign owner case).
| it->second.borrowers.clear(); | ||
| it->second.stored_in_objects.clear(); | ||
| if (for_ref_removed || !it->second.foreign_owner_already_monitoring) { | ||
| borrowed_refs->emplace(object_id, it->second); |
There was a problem hiding this comment.
Had to also move this into the conditional, to avoid a reference leak.
stephanie-wang
left a comment
There was a problem hiding this comment.
Nice! It's great to have more unit testing on this codepath.
…ct from going out of scope (#21719) When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope. This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not. This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
…ct from going out of scope (#22120) When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope. This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not. This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs. This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.
…ct from going out of scope (ray-project#22120) When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope. This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in ray-project#19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not. This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in ray-project#19910, but generalized to all initial ObjectRefs. This is a re-merge for ray-project#21719 with a fix for removing the owned object ref if creation fails.
Why are these changes needed?
This fixes a reference counting bug in the following scenario.
return [ray.put(x, _owner=block_owner)]. We record in the ref table forxthat the caller owns the object thatxis stored in (stored_in=[caller]).xas an argument. After the task finishes, we callPopAndClearLocalBorrowers, which erases thestored_inmap of the reference table and sends it to the caller. But the caller is not the owner ofx--- the real owner ofxhas yet to send theWaitForRefRemovedRPC.WaitForRefRemovedRPC from the block_owner arrives at the same worker. There are no local refs toxany more, and thestored_inmap is empty. The owner thinksxcan be deleted, however the caller still has a reference tox---> raising ReferenceCountingAssertionError.The proposed fix here is to not clear
stored_inin PopAndClearLocalBorrowers().Related issue number
This fixes test_dataset_pipeline.py::pipeline_actors and test_dataset.py::test_callable_classes (both mixing tasks/actors returning/consuming the same object ref) in #19907.