fix: per-request timeout for shard uploads (XET-885)#685

Merged
rajatarya merged 9 commits into main from rajat/shard-upload-per-request-timeout on Mar 11, 2026
Conversation

@rajatarya (Collaborator) commented Mar 6, 2026

Fixes XET-885

Summary

Shard uploads to CAS can take a long time due to server-side processing (DynamoDB writes scale with the file entry count). The default read_timeout (120s) on the reqwest client kills these uploads.

Key insight: reqwest's per-request RequestBuilder::timeout() does NOT override the client-level read_timeout() — they are independent mechanisms polled as separate futures. So the original approach of using per-request timeouts was ineffective.

Fix: Create a dedicated shard_upload_http_client on RemoteClient with no read_timeout, built once at construction time and reused for all shard uploads. All other settings (connect timeout, pool config, auth middleware) are identical to the standard client.

Changes

cas_client/src/http_client.rs

  • Added reqwest_client_no_read_timeout() — creates a reqwest client with no read_timeout
  • Added build_auth_http_client_no_read_timeout() — public API wrapping it with middleware
  • 4 unit tests for the new builder

cas_client/src/remote_client.rs

  • Added shard_upload_http_client field to RemoteClient (cfg'd out on wasm)
  • upload_shard() uses the pre-built no-timeout client instead of building one per request

cas_client/tests/test_shard_upload_timeout.rs

  • Updated: slow server test now asserts success (shard uploads should wait as long as needed)

xet_config/src/groups/client.rs

  • Removed shard_read_timeout config field (no longer needed)

Verification

```bash
# All tests pass
cargo test -p cas_client   # 87 lib + 2 integration + 2 doc tests

# Clippy clean
cargo clippy -p cas_client -- -D warnings

# Format clean
cargo +nightly fmt --check
```

@rajatarya rajatarya requested review from assafvayner and hoytak March 6, 2026 05:24
@rajatarya
Collaborator Author

@hoytak Do you think I should add a config override so users can set an env variable to configure this?

@danielhanchen

Nice work, thanks! Would be nice if there were a patch PyPI release for this soon!

@hoytak
Collaborator

hoytak commented Mar 7, 2026

If this fixes it, then LGTM!

A config var would make sense -- the max timeout -- e.g. large shard timeout? There are situations where 5 min may not be enough.

@rajatarya
Collaborator Author

> If this fixes it, then LGTM!
>
> A config var would make sense -- the max timeout -- e.g. large shard timeout? There are situations where 5 min may not be enough.

This PR caps the max time at 10m (clamped at 600s), which I imagine should be enough for a 128MB shard, which I believe is the max shard size.

I am building out some tests for this and then will merge.

@rajatarya
Collaborator Author

Fortunately @XciD did some actual testing with this patch and learned that the per-request timeout does not take precedence over the overall read_timeout. If the overall read_timeout is set, a longer per-request timeout doesn't matter.

Need to rethink this fix. Also, the timeout is failing silently; need to add testing to confirm that when a timeout occurs, the upload actually fails with a reasonable explanation.

More in internal Slack thread: https://huggingface.slack.com/archives/C087TU2FE3G/p1772992380446739?thread_ts=1772768281.424279&cid=C087TU2FE3G


rajatarya and others added 8 commits March 9, 2026 12:54
Shard uploads can exceed the global 120s read_timeout when the server
processes large shards with many file entries. Add a size-proportional
per-request timeout (~1s/KB, clamped to [120s, 600s]) using reqwest's
RequestBuilder::timeout() so only shard uploads get the extended window.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
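The size-proportional heuristic in this commit message can be sketched as follows (hypothetical function name; the exact rounding in the crate may differ):

```rust
use std::time::Duration;

// ~1s per KB of shard data, clamped to [120s, 600s], per the commit message.
fn shard_upload_timeout(shard_len_bytes: u64) -> Duration {
    let secs = shard_len_bytes / 1024; // ~1s per KB
    Duration::from_secs(secs.clamp(120, 600))
}
```

So a 64KB shard still gets the 120s floor, while anything over ~600KB hits the 600s ceiling. (This is the approach the later commits abandon in favor of a client with no read_timeout at all.)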
reqwest's WASM client does not support per-request `.timeout()` on
RequestBuilder, which breaks the WASM build after the shard upload
timeout was added. Gate the `Duration` import and the `.timeout()`
call behind `#[cfg(not(target_family = "wasm"))]` so the timeout
applies only to native builds where it is needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
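The gating pattern described in this commit, as a dependency-free sketch (the `RequestBuilder` below is a stand-in, not reqwest's):

```rust
use std::time::Duration;

// Stand-in builder: on wasm, reqwest's RequestBuilder has no .timeout(),
// so here the method itself is only compiled for native targets.
struct RequestBuilder {
    timeout: Option<Duration>,
}

impl RequestBuilder {
    fn new() -> Self {
        RequestBuilder { timeout: None }
    }

    #[cfg(not(target_family = "wasm"))]
    fn timeout(mut self, d: Duration) -> Self {
        self.timeout = Some(d);
        self
    }
}

fn build_shard_request() -> RequestBuilder {
    let builder = RequestBuilder::new();
    // The call site is gated the same way, so wasm builds never
    // reference the missing method.
    #[cfg(not(target_family = "wasm"))]
    let builder = builder.timeout(Duration::from_secs(300)); // illustrative
    builder
}
```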
…885)

reqwest's per-request timeout() only overrides ClientBuilder::timeout(),
NOT ClientBuilder::read_timeout(). They are independent mechanisms polled
as separate futures. The previous approach of setting per-request timeout()
was ineffective — the client's read_timeout(120s) still killed uploads
when the CAS server stalled during shard processing.

Fix: create a non-cached reqwest client per shard upload with the dynamic
timeout (scaled at ~1s/KB, clamped to [120s, 600s]) set as its
read_timeout. Store construction parameters (auth, session_id, socket
path, headers) in RemoteClient to build shard-specific clients on demand.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-885)

PR #648 refactored LocalTestServer to use a builder pattern. Update
the shard upload timeout integration tests to use LocalTestServerBuilder
instead of LocalTestServer::start(). Also call set_api_delay_range
directly on the server instead of going through .client().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rajatarya rajatarya force-pushed the rajat/shard-upload-per-request-timeout branch from db6b448 to 81431fc Compare March 9, 2026 19:58
@XciD
Member

XciD commented Mar 10, 2026

Did you find why the error was silenced? Worth investigating whether we are silencing some other errors.
It could have been a retry policy failing over and over in my case, but worth checking.

Instead of creating a new reqwest client per shard upload request,
store a single no-read-timeout client in RemoteClient at construction
time. This avoids the overhead of building a new client per request
while still solving the core problem: reqwest's per-request timeout()
does NOT override the client-level read_timeout(), so shard uploads
need a separate client without read_timeout.

Changes:
- Renamed build_auth_http_client_with_read_timeout to
  build_auth_http_client_no_read_timeout (no timeout param needed)
- Added shard_upload_http_client field to RemoteClient (cfg'd out on wasm)
- Removed per-upload client construction from upload_shard()
- Removed shard_read_timeout config field (no longer needed)
- Removed stored construction params from RemoteClient (no longer needed)
- Updated integration test: slow server test now asserts success
  (the whole point is shard uploads should wait as long as needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rajatarya rajatarya requested review from XciD and hoytak March 11, 2026 04:36
@rajatarya
Collaborator Author

> Did you find why the error was silenced? Worth investigating whether we are silencing some other errors. It could have been a retry policy failing over and over in my case, but worth checking.

From the analysis, it should have reported an ERROR message in the logs after all retries were exhausted. Each retry should be in the logs at INFO level. What is less clear is what makes its way back to Python - but the testing you were doing was in Rust, right?

Here is the writeup for the swallowed errors from Claude:

  Why Timeout Errors Appear Silent

  The error path is: reqwest timeout → RetryWrapper → upload_shard → shard_interface → file_upload_session::finalize → upload_commit → user.

  The error does propagate — it's not silently discarded in code. But here's why it appeared silent:

  Root Cause: Timeouts are classified as Retryable::Transient

  In on_request_failure (line 473):

```rust
if error.is_timeout() || is_connect {
    Some(Retryable::Transient)
}
```

  This means:

  1. Each timeout is logged at info level, not error — process_error_response logs transient errors with log_as_info = true (line 115). If the operator or user doesn't have info-level logging enabled, they see nothing during the retry attempts.
  2. The retry loop silently retries 3 times (default retry_max_attempts). With the old 120s read_timeout, that's up to 6 minutes of silent retries before failing. Each retry hits the same timeout.
  3. The final "No more retries; aborting" IS logged at error level (line 320), but only after exhausting all retries.
  4. From the user's perspective: the upload hangs for ~6 minutes (3 × 120s timeouts), then fails with a generic CasClientError that wraps the reqwest timeout. The error message may not clearly indicate "shard upload timed out" — it's wrapped in layers of error types.

  Contributing factors:

  - No progress indication during shard upload — unlike xorb uploads which have progress callbacks, shard upload has no intermediate progress reporting. The user sees the upload "freeze" with no explanation.
  - Default log level may hide the info-level retry messages — if INFORMATION_LOG_LEVEL is DEBUG (which it is by default, line 28-29 of lib.rs), the retry messages are at debug level, invisible to most users.
  - The error message itself may be unclear — a reqwest timeout wraps into CasClientError::from(reqwest_middleware::Error) which may not say "shard upload timed out after 120s" in a user-friendly way.
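The classification step above can be sketched like this (`Retryable` mirrors the reqwest-retry enum; the function is a simplified stand-in, not the actual cas_client code):

```rust
#[derive(Debug, PartialEq)]
enum Retryable {
    Transient, // retried quietly at info/debug level
    Fatal,     // surfaced immediately
}

// Simplified stand-in for the on_request_failure logic described above.
fn classify(is_timeout: bool, is_connect: bool) -> Option<Retryable> {
    if is_timeout || is_connect {
        // Each such failure is logged at info level and retried; only
        // "No more retries; aborting" after the last attempt is an ERROR.
        Some(Retryable::Transient)
    } else {
        None // other errors fall through to different handling
    }
}
```

With 3 attempts against a 120s read_timeout each, that is the ~6 minutes of apparent silence described above.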

@rajatarya rajatarya requested a review from seanses March 11, 2026 04:41
@XciD
Member

XciD commented Mar 11, 2026

> From the analysis it should have reported an ERROR message in the logs after all retries were exhausted. Each retry should be in the logs at INFO level. What is less clear is what makes its way back to Python - but for the testing you were doing that was in Rust, right?

It was Rust, but in an e2e test where I didn't have the logs. So I was maybe blocked on a retry loop.

@hoytak (Collaborator) left a comment

LGTM, let's get this in.

@rajatarya rajatarya merged commit 83a2827 into main Mar 11, 2026
7 checks passed
@rajatarya rajatarya deleted the rajat/shard-upload-per-request-timeout branch March 11, 2026 16:05