fix: per-request timeout for shard uploads (XET-885) #685
Conversation
@hoytak Do you think I should add a config override for this so users can set an env variable to configure this?
Nice work, thanks! Would be nice if there was a patch PyPI release for this soon!
If this fixes it, then LGTM! A config var would make sense -- the max timeout, e.g. a large-shard timeout? There are situations where 5 min may not be enough.
This PR sets the max time to 10m (clamped at 600s), which I imagine should be enough for a 128MB shard, which I believe is the max shard size. I am building out some tests for this and then will merge.
Fortunately @XciD did some actual testing with this patch and learned that the per-request timeout does not take precedence over the overall read_timeout. If the overall read_timeout is set, then a longer per-request timeout doesn't matter. Need to rethink this fix. Also, the timeout is failing silently; need to add testing to confirm that when a timeout occurs we actually fail the upload with a reasonable explanation. More in the internal Slack thread: https://huggingface.slack.com/archives/C087TU2FE3G/p1772992380446739?thread_ts=1772768281.424279&cid=C087TU2FE3G
Shard uploads can exceed the global 120s read_timeout when the server processes large shards with many file entries. Add a size-proportional per-request timeout (~1s/KB, clamped to [120s, 600s]) using reqwest's RequestBuilder::timeout() so only shard uploads get the extended window.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
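The size-proportional clamp described in this commit message can be sketched as a small helper (the function name is hypothetical; the real code lives in the cas_client crate):

```rust
use std::time::Duration;

/// Scale the shard upload timeout with payload size: roughly 1s per KB,
/// clamped to the [120s, 600s] window from the commit message. A 128MB
/// shard (the max size mentioned in the discussion) hits the 600s ceiling.
fn shard_upload_timeout(shard_len_bytes: usize) -> Duration {
    let secs = (shard_len_bytes / 1024) as u64; // ~1s per KB
    Duration::from_secs(secs.clamp(120, 600))
}
```

Small shards get the floor (120s, the old global read_timeout), so only genuinely large shards see an extended window.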
reqwest's WASM client does not support per-request `.timeout()` on RequestBuilder, which breaks the WASM build after the shard upload timeout was added. Gate the `Duration` import and the `.timeout()` call behind `#[cfg(not(target_family = "wasm"))]` so the timeout applies only to native builds where it is needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
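The cfg-gating shape can be illustrated with a stand-in builder type (reqwest itself is not used here, so the sketch stays self-contained; on the real reqwest type the `timeout` method simply does not exist for wasm targets):

```rust
use std::time::Duration;

// Stand-in for reqwest::RequestBuilder, just to show the shape of the fix.
struct RequestBuilder {
    per_request_timeout: Option<Duration>,
}

impl RequestBuilder {
    fn new() -> Self {
        Self { per_request_timeout: None }
    }

    // Mirrors the real situation: this method is only available on native
    // targets, which is why the call site below must be cfg-gated too.
    #[cfg(not(target_family = "wasm"))]
    fn timeout(mut self, t: Duration) -> Self {
        self.per_request_timeout = Some(t);
        self
    }
}

fn build_shard_upload_request() -> RequestBuilder {
    let req = RequestBuilder::new();
    // Native builds attach the extended per-request timeout; the wasm build
    // compiles this call out entirely.
    #[cfg(not(target_family = "wasm"))]
    let req = req.timeout(Duration::from_secs(600));
    req
}
```

Gating both the import and the call keeps the wasm build free of unused-import warnings as well as the missing-method error.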
…885)

reqwest's per-request timeout() only overrides ClientBuilder::timeout(), NOT ClientBuilder::read_timeout(). They are independent mechanisms polled as separate futures. The previous approach of setting per-request timeout() was ineffective — the client's read_timeout(120s) still killed uploads when the CAS server stalled during shard processing.

Fix: create a non-cached reqwest client per shard upload with the dynamic timeout (scaled at ~1s/KB, clamped to [120s, 600s]) set as its read_timeout. Store construction parameters (auth, session_id, socket path, headers) in RemoteClient to build shard-specific clients on demand.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-885)

PR #648 refactored LocalTestServer to use a builder pattern. Update the shard upload timeout integration tests to use LocalTestServerBuilder instead of LocalTestServer::start(). Also call set_api_delay_range directly on the server instead of going through .client().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
db6b448 to 81431fc
Did you find why the error was silenced? Worth investigating whether we are silencing some other errors.
Instead of creating a new reqwest client per shard upload request, store a single no-read-timeout client in RemoteClient at construction time. This avoids the overhead of building a new client per request while still solving the core problem: reqwest's per-request timeout() does NOT override the client-level read_timeout(), so shard uploads need a separate client without read_timeout.

Changes:
- Renamed build_auth_http_client_with_read_timeout to build_auth_http_client_no_read_timeout (no timeout param needed)
- Added shard_upload_http_client field to RemoteClient (cfg'd out on wasm)
- Removed per-upload client construction from upload_shard()
- Removed shard_read_timeout config field (no longer needed)
- Removed stored construction params from RemoteClient (no longer needed)
- Updated integration test: slow server test now asserts success (the whole point is shard uploads should wait as long as needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
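The construction-time arrangement in this commit can be sketched with stand-in types; in the real code both fields wrap reqwest clients behind auth middleware, and the field names follow the commit message:

```rust
use std::time::Duration;

// Stand-in for an HTTP client configuration; the real code holds a
// middleware-wrapped reqwest::Client here.
#[derive(Debug, PartialEq)]
struct HttpClient {
    read_timeout: Option<Duration>,
}

struct RemoteClient {
    // Standard client: read_timeout enforced (120s by default).
    http_client: HttpClient,
    // Dedicated shard-upload client: no read_timeout, built once at
    // construction time and reused for every shard upload.
    #[cfg(not(target_family = "wasm"))]
    shard_upload_http_client: HttpClient,
}

impl RemoteClient {
    fn new() -> Self {
        Self {
            http_client: HttpClient {
                read_timeout: Some(Duration::from_secs(120)),
            },
            #[cfg(not(target_family = "wasm"))]
            shard_upload_http_client: HttpClient { read_timeout: None },
        }
    }
}
```

Building the second client once at construction trades a small amount of memory for avoiding per-request client setup, while keeping the shard path free of the 120s read_timeout.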
From the analysis, it should have reported an ERROR message in the logs after all retries were exhausted. Each retry should be in the logs at INFO level. What is less clear is what makes its way back to Python, but the testing you were doing was in Rust, right? Here is the writeup for the swallowed errors from Claude:
It was Rust, but in an e2e test where I didn't have the logs. So I was maybe blocked in a retry loop.
hoytak left a comment
LGTM, let's get this in.
Fixes XET-885
Summary
Shard uploads to CAS can take a long time due to server-side processing (DynamoDB writes scale with file entry count). The default `read_timeout(120s)` on the reqwest client kills these uploads.

Key insight: reqwest's per-request `RequestBuilder::timeout()` does NOT override the client-level `read_timeout()` — they are independent mechanisms polled as separate futures. So the original approach of using per-request timeouts was ineffective.

Fix: Create a dedicated `shard_upload_http_client` on `RemoteClient` with no `read_timeout`, built once at construction time and reused for all shard uploads. All other settings (connect timeout, pool config, auth middleware) are identical to the standard client.

Changes

- `cas_client/src/http_client.rs`
  - `reqwest_client_no_read_timeout()` — creates a reqwest client with no `read_timeout`
  - `build_auth_http_client_no_read_timeout()` — public API wrapping it with middleware
- `cas_client/src/remote_client.rs`
  - `shard_upload_http_client` field on `RemoteClient` (cfg'd out on wasm)
  - `upload_shard()` uses the pre-built no-timeout client instead of building one per request
- `cas_client/tests/test_shard_upload_timeout.rs`
- `xet_config/src/groups/client.rs`
  - Removed `shard_read_timeout` config field (no longer needed)

Verification
```bash
# All tests pass
cargo test -p cas_client  # 87 lib + 2 integration + 2 doc tests

# Clippy clean
cargo clippy -p cas_client -- -D warnings

# Format clean
cargo +nightly fmt --check
```