Retry when push failed #12872
fishbone wants to merge 1 commit into ray-project:master from fishbone:fix-push-failure
|
Run this script on prod pkg: Sometimes, it'll hang there like this: This happens because for Will you suggest adding such info into grpc? Besides this, I also noticed one other issue. The hang doesn't always happen, and when it doesn't, it shows logs like this: I think this is the one I encountered before, when I thought the hang had disappeared. From the log, it looks like the object somehow got sealed/added first, which makes the first wait return the object in ready status. Then an exception got thrown later. I haven't investigated this one yet; do you have any ideas about it? |
      RAY_LOG(DEBUG) << "Push for " << dest_id << ", " << obj_id
                     << " completed, remaining: " << NumPushesInFlight();
    }
  } else {
I plan to resend only on errors like status.IsOutOfMemory() || status.IsTimedOut() || status.IsObjectStoreFull() || status.IsTransientObjectStoreFull(), but it looks like we convert every grpc::Status to ray::Status::IOError. Extra information needs to be returned.
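To make the concern concrete, here is a minimal standalone sketch of the retry filter described above. It does not use Ray's real `ray::Status` class: `StatusCode`, `PushStatus`, `IsRetryable`, and `CollapseToIOError` are hypothetical names invented for illustration. It only shows why collapsing every failure into an IOError would defeat a filter that keys on the specific error kind.

```cpp
#include <string>

// Hypothetical stand-in for a status type; NOT Ray's actual ray::Status.
enum class StatusCode {
  OK,
  OutOfMemory,
  TimedOut,
  ObjectStoreFull,
  TransientObjectStoreFull,
  IOError,
};

struct PushStatus {
  StatusCode code;
  std::string message;
};

// Retry only on errors that are plausibly transient (assumed policy).
bool IsRetryable(const PushStatus &status) {
  switch (status.code) {
    case StatusCode::OutOfMemory:
    case StatusCode::TimedOut:
    case StatusCode::ObjectStoreFull:
    case StatusCode::TransientObjectStoreFull:
      return true;
    default:
      return false;
  }
}

// Models the conversion described in the comment: any non-OK status
// becomes a generic IOError, so the original error kind is lost.
PushStatus CollapseToIOError(const PushStatus &status) {
  if (status.code == StatusCode::OK) {
    return status;
  }
  return PushStatus{StatusCode::IOError, status.message};
}
```

With this sketch, a timed-out push is retryable before the conversion but not after it, which is why the error kind would have to be carried through the RPC layer for the filter to work.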
|
Weird... I started from a new environment and installed a nightly build, and both cases just disappeared :( |
|
I'm reopening this PR. The nightly build I was using somehow dates back to Nov 3 when I check Btw, I reverted PR #12335 and at least the hang no longer happens. Sometimes the first wait still returns the item as ready. I suppose that's because the current API reports items as ready if they exist anywhere in the cluster. After introducing fetch_local, I think this needs to be fixed too. |
|
@wuisawesome can you look into the hang (see test case above)? it seems we are missing the push retry in #12335 |
|
I see this message repeated 13 times in the raylet log: nothing after it. |
|
Problem is the retry timer isn't getting reset. I'll submit a PR shortly. |
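As a rough illustration of this class of bug (not the actual raylet code or the fix in the follow-up PR), the sketch below simulates a one-shot retry timer driven by a fake clock. `RetryTimer` and `CountRetries` are hypothetical names, and the 100 ms period and 10 ms tick are arbitrary assumptions. If the timer is not re-armed after it fires, the retry runs once and then nothing happens again, which would look like a hang.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical one-shot timer driven by an explicit clock value.
class RetryTimer {
 public:
  RetryTimer(int64_t period_ms, std::function<void()> on_fire)
      : period_ms_(period_ms), on_fire_(std::move(on_fire)) {}

  // Schedule the next firing one period after `now_ms`.
  void Arm(int64_t now_ms) {
    deadline_ms_ = now_ms + period_ms_;
    armed_ = true;
  }

  // Fire the callback if the deadline has passed. The timer is one-shot:
  // it stays disarmed afterwards unless `rearm` is true.
  void Tick(int64_t now_ms, bool rearm) {
    if (!armed_ || now_ms < deadline_ms_) {
      return;
    }
    armed_ = false;
    on_fire_();
    if (rearm) {
      Arm(now_ms);
    }
  }

 private:
  int64_t period_ms_;
  std::function<void()> on_fire_;
  int64_t deadline_ms_ = 0;
  bool armed_ = false;
};

// Simulate one second of wall-clock time in 10 ms steps and count
// how many times the retry callback runs.
int CountRetries(bool rearm) {
  int retries = 0;
  RetryTimer timer(/*period_ms=*/100, [&retries]() { ++retries; });
  timer.Arm(/*now_ms=*/0);
  for (int64_t now_ms = 0; now_ms <= 1000; now_ms += 10) {
    timer.Tick(now_ms, rearm);
  }
  return retries;
}
```

In this simulation, `CountRetries(false)` fires the callback once and then goes quiet, while `CountRetries(true)` keeps retrying every period.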
|
Thanks for working on the fix. Please let me know when you've fixed this and I'll rebase my PR on yours. |
|
@ahbone here's the PR: #12907 Thanks for the nice report/reproduction! |
|
np |
Why are these changes needed?
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.