[core] Fix error handling for plasma put errors#56070
[core] Fix error handling for plasma put errors#56070edoakes merged 5 commits intoray-project:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request improves error handling for plasma put errors by centralizing status-to-error-type mapping and ensuring errors are propagated correctly instead of being logged and forgotten. The changes are logical and enhance robustness. I've suggested a small refactoring to reduce code duplication, but overall the changes look good.
00f2cd2 to
3d5dd29
Compare
89b79c2 to
96efc56
Compare
There was a problem hiding this comment.
@jjyao @edoakes @israbbani I don't have any good ideas on this but we have store_in_plasma_ids so that on reconstruction attempts we can put objects or failures back into plasma because getters expect the object to be in plasma. But if we fail to put in plasma for the 3 status reasons, we're kind of putting those getters into a state where they could be waiting forever. Open to ideas on a simple solution for this.
I think at least for obj store full we should retry until we can, like it should hit fallback...
9bbc064 to
8daebfa
Compare
8daebfa to
7c0ae9b
Compare
| in_memory_store_.Put(error, dynamic_return_id); | ||
| } | ||
| } else { | ||
| in_memory_store_.Put(error, dynamic_return_id); |
There was a problem hiding this comment.
did this else exist before your original pr?
There was a problem hiding this comment.
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ry on all errs Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
7c0ae9b to
831821a
Compare
## Why are these changes needed? Followup to ray-project#55367 (review) - map Status -> ErrorType via MapStatusToErrorType (IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY, OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED). - no longer log‑and‑forget on any HandleTaskReturn error. - For streaming generator returns we now always put the error into the in‑memory store first (unblocks waiters) and only then best‑effort put to plasma; failures there are just warnings. --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: zac <zac@anyscale.com>
## Why are these changes needed? Followup to #55367 (review) - map Status -> ErrorType via MapStatusToErrorType (IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY, OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED). - no longer log‑and‑forget on any HandleTaskReturn error. - For streaming generator returns we now always put the error into the in‑memory store first (unblocks waiters) and only then best‑effort put to plasma; failures there are just warnings. --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
## Why are these changes needed? Followup to ray-project#55367 (review) - map Status -> ErrorType via MapStatusToErrorType (IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY, OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED). - no longer log‑and‑forget on any HandleTaskReturn error. - For streaming generator returns we now always put the error into the in‑memory store first (unblocks waiters) and only then best‑effort put to plasma; failures there are just warnings. --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
## Why are these changes needed? Followup to ray-project#55367 (review) - map Status -> ErrorType via MapStatusToErrorType (IOError->LOCAL_RAYLET_DIED, ObjectStoreFull/Transient->OUT_OF_MEMORY, OutOfDisk->OUT_OF_DISK_ERROR; default->WORKER_DIED). - no longer log‑and‑forget on any HandleTaskReturn error. - For streaming generator returns we now always put the error into the in‑memory store first (unblocks waiters) and only then best‑effort put to plasma; failures there are just warnings. --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Why are these changes needed?
Followup to #55367 (review)