[core] Move out-of-memory handling into the plasma store and support async object creation #12186
//python/ray/tests:test_reference_counting TIMEOUT in 3 out of 3 in 315.0s
Hi @stephanie-wang, I got the following errors when call
@stephanie-wang I believe this PR breaks Java CI. I see this in the CI results of the last commit of this PR:
Can you post an issue with more information to reproduce?
@stephanie-wang Please take a look at the Java CI in this link: https://github.com/ray-project/ray/runs/1482982242, or you can find similar failures in recent master builds.
…support async object creation (ray-project#12186)" This reverts commit 443339a.
After cleaning up all of the bazel cache and rebuilding, it works now.
Why are these changes needed?
This moves out-of-memory handling into the plasma store instead of letting the plasma clients retry, to improve the system's ability to detect and respond to OOM conditions (#11772). All retry logic, LRU eviction logic, etc., now lives in the plasma store.
This also requires making the object Create request to the object store async so that: […] Create while another thread is trying to release memory.
To fix 1, this PR adds an option to fulfill or fail a creation request immediately. To fix 2, this PR adds a "request ID" to creation replies that the worker uses to contact the plasma store again.
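For readers who want a concrete picture of the flow described above, here is a minimal toy sketch in Python. It is not the plasma store's actual code or API: `ToyStore`, `CreateReply`, `immediately`, `retry`, and `release` are hypothetical names chosen for illustration, and the eviction policy is a simplified LRU over released objects. It only shows the idea that the store owns eviction and OOM decisions, that a creation request can either be fulfilled or failed immediately, and that a reply carries a request ID the client can use to contact the store again.

```python
import itertools
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreateReply:
    request_id: int               # hypothetical: ID used to ask the store again
    ready: bool = False           # True once memory was allocated
    error: Optional[str] = None   # set only when the request failed outright


class ToyStore:
    """Toy stand-in for a store that handles OOM itself (not Ray's API)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.objects = OrderedDict()  # object_id -> size, oldest first
        self.in_use = set()           # objects a client has not released yet
        self.pending = {}             # request_id -> (object_id, size)
        self._next_id = itertools.count(1)

    def _try_allocate(self, object_id: str, size: int) -> bool:
        free = self.capacity - sum(self.objects.values())
        evictable = [o for o in self.objects if o not in self.in_use]
        # Give up early if even evicting every released object is not enough.
        if free + sum(self.objects[o] for o in evictable) < size:
            return False
        # Evict released objects, oldest first, until the new object fits.
        for victim in evictable:
            if free >= size:
                break
            free += self.objects.pop(victim)
        self.objects[object_id] = size
        self.in_use.add(object_id)
        return True

    def create(self, object_id: str, size: int, immediately: bool = False) -> CreateReply:
        request_id = next(self._next_id)
        if self._try_allocate(object_id, size):
            return CreateReply(request_id, ready=True)
        if immediately:
            # Fulfill-or-fail mode: never queue the request.
            return CreateReply(request_id, error="out of memory")
        self.pending[request_id] = (object_id, size)
        return CreateReply(request_id)

    def retry(self, request_id: int) -> CreateReply:
        object_id, size = self.pending[request_id]
        if self._try_allocate(object_id, size):
            del self.pending[request_id]
            return CreateReply(request_id, ready=True)
        return CreateReply(request_id)

    def release(self, object_id: str) -> None:
        self.in_use.discard(object_id)


store = ToyStore(capacity=100)
first = store.create("a", 80)                         # fits right away
second = store.create("b", 50)                        # queued: "a" is still in use
fail_fast = store.create("c", 50, immediately=True)   # fails instead of queuing
store.release("a")                                    # another thread frees memory
done = store.retry(second.request_id)                 # store evicts "a", then allocates "b"
```

Because the client is never blocked inside `create`, it stays free to release objects between the first reply and the later `retry` call, which is the situation the async request is meant to handle.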
Related issue number
Closes #11772 and #11994.
Checks
I've run scripts/format.sh to lint the changes in this PR.
I thought about ways to split this PR up, but it's a little challenging because I don't think the system would be stable if we only moved OOM handling into the plasma store but didn't address 1 or 2.