Adding retry for unhandled io errors when sending requests #468

Merged
jgodlew merged 4 commits into main from jgodlew/retry-io-errors
Sep 10, 2025

Conversation

@jgodlew
Contributor

@jgodlew jgodlew commented Aug 22, 2025

Allows us to retry requests that fail with an I/O error while being sent. For example, on macOS, when we see:

No buffer space available (os error 55)

when downloading from S3, we should wait and retry once the system has more network resources.
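
A minimal sketch of the wait-and-retry idea, written in Python for illustration only; the actual change is not shown in this thread. `send_with_retry`, `RETRYABLE_ERRNOS`, and the backoff parameters are hypothetical names and values, not part of the library's API.

```python
import errno
import time

# Hypothetical set of errno values treated as transient. ENOBUFS is the
# "No buffer space available" error (os error 55 on macOS).
RETRYABLE_ERRNOS = {errno.ENOBUFS}


def send_with_retry(send, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Call send(); on a retryable OSError, wait and try again.

    Makes at most max_retries + 1 attempts. Non-retryable errors and the
    final failure are re-raised unchanged.
    """
    for attempt in range(max_retries + 1):
        try:
            return send()
        except OSError as err:
            if err.errno not in RETRYABLE_ERRNOS or attempt == max_retries:
                raise
            # Exponential backoff gives the OS time to free network buffers.
            sleep(base_delay * (2 ** attempt))
```

The `sleep` argument is injectable only so that the backoff can be exercised without real delays.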

@jgodlew jgodlew requested review from assafvayner and seanses August 22, 2025 19:54
@seanses
Collaborator

seanses commented Aug 25, 2025

I just ran a test on this PR, and it seems that the "os error 55" is not caught or retried: https://github.com/huggingface/xet-scenario-tests/actions/runs/17216048201/job/48839818576#step:13:196

@jgodlew
Contributor Author

jgodlew commented Sep 10, 2025

@seanses the errors are now being caught and retried properly: ci-run. The logs downloaded from that run contain Retry attempt #... messages and no ERRORs.

However, in testing, I noticed that retrying 5 times was sometimes insufficient. Lowering the number of concurrent range gets (export HF_XET_NUM_CONCURRENT_RANGE_GETS=32) should reduce the number of connections established, and with it the likelihood of running out of available mbufs on macOS (the likely source of the os error 55 / ENOBUFS).
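
To illustrate why a lower limit helps, here is a rough sketch of capping the number of range fetches in flight, in Python for illustration only. Only the environment variable name comes from this thread; `concurrency_limit`, `fetch_ranges`, and the default of 16 are hypothetical and do not describe how the library reads or applies the setting.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def concurrency_limit(default=16):
    """Read the cap from the environment; `default` is a made-up value."""
    return int(os.environ.get("HF_XET_NUM_CONCURRENT_RANGE_GETS", default))


def fetch_ranges(ranges, fetch, limit=None):
    """Fetch every range with at most `limit` fetches in flight at once.

    Fewer concurrent fetches means fewer open connections, and so less
    pressure on the OS network buffers. Results keep the input order.
    """
    limit = limit or concurrency_limit()
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(fetch, ranges))
```

In this sketch, exporting `HF_XET_NUM_CONCURRENT_RANGE_GETS=32` makes `concurrency_limit()` return 32, so `fetch_ranges` never runs more than 32 fetches at a time.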

@jgodlew jgodlew merged commit f24c97e into main Sep 10, 2025
6 checks passed
@jgodlew jgodlew deleted the jgodlew/retry-io-errors branch September 10, 2025 23:44
3 participants