Skip to content

[core] Fix error handling for plasma put errors#56070

Merged
edoakes merged 5 commits intoray-project:masterfrom
codope:fix-err-handling-PutInLocalPlasmaStore
Sep 9, 2025
Merged

[core] Fix error handling for plasma put errors#56070
edoakes merged 5 commits intoray-project:masterfrom
codope:fix-err-handling-PutInLocalPlasmaStore

Conversation

@codope
Copy link
Copy Markdown
Contributor

@codope codope commented Aug 29, 2025

Why are these changes needed?

Followup to #55367 (review)

  • map Status -> ErrorType via MapStatusToErrorType (IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY, OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED).
  • no longer log‑and‑forget on any HandleTaskReturn error.
  • For streaming generator returns we now always put the error into the in‑memory store first (unblocks waiters) and only then best‑effort put to plasma; failures there are just warnings.

@codope codope requested a review from a team as a code owner August 29, 2025 03:45
@codope codope added the go add ONLY when ready to merge, run all tests label Aug 29, 2025
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves error handling for plasma put errors by centralizing status-to-error-type mapping and ensuring errors are propagated correctly instead of being logged and forgotten. The changes are logical and enhance robustness. I've suggested a small refactoring to reduce code duplication, but overall the changes look good.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Aug 29, 2025
@codope codope force-pushed the fix-err-handling-PutInLocalPlasmaStore branch 2 times, most recently from 00f2cd2 to 3d5dd29 Compare September 1, 2025 07:51
@jjyao jjyao requested a review from dayshah September 3, 2025 05:02
@codope codope force-pushed the fix-err-handling-PutInLocalPlasmaStore branch 2 times, most recently from 89b79c2 to 96efc56 Compare September 4, 2025 11:06
Copy link
Copy Markdown
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjyao @edoakes @israbbani I don't have any good ideas on this but we have store_in_plasma_ids so that on reconstruction attempts we can put objects or failures back into plasma because getters expect the object to be in plasma. But if we fail to put in plasma for the 3 status reasons, we're kind of putting those getters into a state where they could be waiting forever. Open to ideas on a simple solution for this.

I think at least for obj store full we should retry until we can, like it should hit fallback...

@codope codope force-pushed the fix-err-handling-PutInLocalPlasmaStore branch 4 times, most recently from 9bbc064 to 8daebfa Compare September 5, 2025 15:13
@codope codope force-pushed the fix-err-handling-PutInLocalPlasmaStore branch from 8daebfa to 7c0ae9b Compare September 8, 2025 09:53
Copy link
Copy Markdown
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice lgtm

in_memory_store_.Put(error, dynamic_return_id);
}
} else {
in_memory_store_.Put(error, dynamic_return_id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did this else exist before your original pr?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ry on all errs

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@codope codope force-pushed the fix-err-handling-PutInLocalPlasmaStore branch from 7c0ae9b to 831821a Compare September 9, 2025 09:04
@edoakes edoakes merged commit f17ba98 into ray-project:master Sep 9, 2025
5 checks passed
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
## Why are these changes needed?

Followup to
ray-project#55367 (review)

- map Status -> ErrorType via MapStatusToErrorType
(IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY,
OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED).
- no longer log‑and‑forget on any HandleTaskReturn error.
- For streaming generator returns we now always put the error into the
in‑memory store first (unblocks waiters) and only then best‑effort put
to plasma; failures there are just warnings.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: zac <zac@anyscale.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
## Why are these changes needed?

Followup to
#55367 (review)

- map Status -> ErrorType via MapStatusToErrorType
(IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY,
OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED).
- no longer log‑and‑forget on any HandleTaskReturn error.
- For streaming generator returns we now always put the error into the
in‑memory store first (unblocks waiters) and only then best‑effort put
to plasma; failures there are just warnings.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
## Why are these changes needed?

Followup to
ray-project#55367 (review)

- map Status -> ErrorType via MapStatusToErrorType
(IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY,
OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED).
- no longer log‑and‑forget on any HandleTaskReturn error.
- For streaming generator returns we now always put the error into the
in‑memory store first (unblocks waiters) and only then best‑effort put
to plasma; failures there are just warnings.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
## Why are these changes needed?

Followup to
ray-project#55367 (review)

- map Status -> ErrorType via MapStatusToErrorType
(IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY,
OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED).
- no longer log‑and‑forget on any HandleTaskReturn error.
- For streaming generator returns we now always put the error into the
in‑memory store first (unblocks waiters) and only then best‑effort put
to plasma; failures there are just warnings.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants