fix: per-request timeout for shard uploads (XET-885)#685

Merged
rajatarya merged 9 commits into main from rajat/shard-upload-per-request-timeout on Mar 11, 2026
Conversation

@rajatarya (Collaborator) commented Mar 6, 2026

Fixes XET-885

Summary

Shard uploads to CAS can take a long time due to server-side processing (DynamoDB writes scale with the file entry count). The default read_timeout (120s) on the reqwest client kills these uploads.

Key insight: reqwest's per-request RequestBuilder::timeout() does NOT override the client-level read_timeout() — they are independent mechanisms polled as separate futures. So the original approach of using per-request timeouts was ineffective.

Fix: Create a dedicated shard_upload_http_client on RemoteClient with no read_timeout, built once at construction time and reused for all shard uploads. All other settings (connect timeout, pool config, auth middleware) are identical to the standard client.

Changes

cas_client/src/http_client.rs

  • Added reqwest_client_no_read_timeout() — creates a reqwest client with no read_timeout
  • Added build_auth_http_client_no_read_timeout() — public API wrapping it with middleware
  • 4 unit tests for the new builder

cas_client/src/remote_client.rs

  • Added shard_upload_http_client field to RemoteClient (cfg'd out on wasm)
  • upload_shard() uses the pre-built no-timeout client instead of building one per request

cas_client/tests/test_shard_upload_timeout.rs

  • Updated: slow server test now asserts success (shard uploads should wait as long as needed)

xet_config/src/groups/client.rs

  • Removed shard_read_timeout config field (no longer needed)

Verification

```bash
# All tests pass
cargo test -p cas_client   # 87 lib + 2 integration + 2 doc tests

# Clippy clean
cargo clippy -p cas_client -- -D warnings

# Format clean
cargo +nightly fmt --check
```

@rajatarya rajatarya requested review from assafvayner and hoytak March 6, 2026 05:24
@rajatarya
Collaborator Author

@hoytak Do you think I should add a config override so users can set an env variable to configure this?

@danielhanchen

Nice work, thanks! Would be nice if there were a patch PyPI release for this soon!

@hoytak
Collaborator

hoytak commented Mar 7, 2026

If this fixes it, then LGTM!

A config var would make sense -- the max timeout -- e.g. large shard timeout? There are situations where 5 min may not be enough.

@rajatarya
Collaborator Author

> If this fixes it, then LGTM!
>
> A config var would make sense -- the max timeout -- e.g. large shard timeout? There are situations where 5 min may not be enough.

This PR caps the max time at 10m (clamped at 600s), which I imagine should be enough for a 128MB shard, which I believe is the max shard size.

I am building out some tests for this and then will merge.

@rajatarya
Collaborator Author

Fortunately @XciD did some actual testing with this patch and learned that the per-request timeout does not take precedence over the overall read_timeout. If the overall read_timeout is set, a longer per-request timeout doesn't matter.

Need to rethink this fix. Also, the timeout is failing silently; need to add testing to confirm that when a timeout occurs, the upload actually fails with a reasonable explanation.

More in internal Slack thread: https://huggingface.slack.com/archives/C087TU2FE3G/p1772992380446739?thread_ts=1772768281.424279&cid=C087TU2FE3G


rajatarya and others added 8 commits March 9, 2026 12:54
Shard uploads can exceed the global 120s read_timeout when the server
processes large shards with many file entries. Add a size-proportional
per-request timeout (~1s/KB, clamped to [120s, 600s]) using reqwest's
RequestBuilder::timeout() so only shard uploads get the extended window.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
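The size-proportional heuristic in this commit message can be sketched as follows (hypothetical function name; the exact rounding in the crate may differ):

```rust
use std::time::Duration;

// ~1s per KB of shard data, clamped to [120s, 600s], per the commit message.
fn shard_upload_timeout(shard_len_bytes: u64) -> Duration {
    let secs = shard_len_bytes / 1024; // ~1s per KB
    Duration::from_secs(secs.clamp(120, 600))
}
```

So a 64KB shard still gets the 120s floor, while anything over ~600KB hits the 600s ceiling. (This is the approach the later commits abandon in favor of a client with no read_timeout at all.)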
reqwest's WASM client does not support per-request `.timeout()` on
RequestBuilder, which breaks the WASM build after the shard upload
timeout was added. Gate the `Duration` import and the `.timeout()`
call behind `#[cfg(not(target_family = "wasm"))]` so the timeout
applies only to native builds where it is needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
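The gating pattern described in this commit, as a dependency-free sketch (the `RequestBuilder` below is a stand-in, not reqwest's):

```rust
use std::time::Duration;

// Stand-in builder: on wasm, reqwest's RequestBuilder has no .timeout(),
// so here the method itself is only compiled for native targets.
struct RequestBuilder {
    timeout: Option<Duration>,
}

impl RequestBuilder {
    fn new() -> Self {
        RequestBuilder { timeout: None }
    }

    #[cfg(not(target_family = "wasm"))]
    fn timeout(mut self, d: Duration) -> Self {
        self.timeout = Some(d);
        self
    }
}

fn build_shard_request() -> RequestBuilder {
    let builder = RequestBuilder::new();
    // The call site is gated the same way, so wasm builds never
    // reference the missing method.
    #[cfg(not(target_family = "wasm"))]
    let builder = builder.timeout(Duration::from_secs(300)); // illustrative
    builder
}
```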
…885)

reqwest's per-request timeout() only overrides ClientBuilder::timeout(),
NOT ClientBuilder::read_timeout(). They are independent mechanisms polled
as separate futures. The previous approach of setting per-request timeout()
was ineffective — the client's read_timeout(120s) still killed uploads
when the CAS server stalled during shard processing.

Fix: create a non-cached reqwest client per shard upload with the dynamic
timeout (scaled at ~1s/KB, clamped to [120s, 600s]) set as its
read_timeout. Store construction parameters (auth, session_id, socket
path, headers) in RemoteClient to build shard-specific clients on demand.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-885)

PR #648 refactored LocalTestServer to use a builder pattern. Update
the shard upload timeout integration tests to use LocalTestServerBuilder
instead of LocalTestServer::start(). Also call set_api_delay_range
directly on the server instead of going through .client().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rajatarya rajatarya force-pushed the rajat/shard-upload-per-request-timeout branch from db6b448 to 81431fc Compare March 9, 2026 19:58
@XciD
Member

XciD commented Mar 10, 2026

Did you find why the error was silenced? Worth investigating whether we are silencing some other errors.
It could have been a retry policy failing over and over in my case, but worth checking.

Instead of creating a new reqwest client per shard upload request,
store a single no-read-timeout client in RemoteClient at construction
time. This avoids the overhead of building a new client per request
while still solving the core problem: reqwest's per-request timeout()
does NOT override the client-level read_timeout(), so shard uploads
need a separate client without read_timeout.

Changes:
- Renamed build_auth_http_client_with_read_timeout to
  build_auth_http_client_no_read_timeout (no timeout param needed)
- Added shard_upload_http_client field to RemoteClient (cfg'd out on wasm)
- Removed per-upload client construction from upload_shard()
- Removed shard_read_timeout config field (no longer needed)
- Removed stored construction params from RemoteClient (no longer needed)
- Updated integration test: slow server test now asserts success
  (the whole point is shard uploads should wait as long as needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rajatarya rajatarya requested review from XciD and hoytak March 11, 2026 04:36
@rajatarya
Collaborator Author

> Did you find why the error was silenced? Worth investigating whether we are silencing some other errors. It could have been a retry policy failing over and over in my case, but worth checking.

From the analysis, it should have reported an ERROR message in the logs after all retries were exhausted. Each retry should be in the logs at INFO level. What is less clear is what makes its way back to Python - but the testing you were doing was in Rust, right?

Here is the writeup for the swallowed errors from Claude:

  Why Timeout Errors Appear Silent

  The error path is: reqwest timeout → RetryWrapper → upload_shard → shard_interface → file_upload_session::finalize → upload_commit → user.

  The error does propagate — it's not silently discarded in code. But here's why it appeared silent:

  Root Cause: Timeouts are classified as Retryable::Transient

  In on_request_failure (line 473):

```rust
if error.is_timeout() || is_connect {
    Some(Retryable::Transient)
}
```

  This means:

  1. Each timeout is logged at info level, not error — process_error_response logs transient errors with log_as_info = true (line 115). If the operator or user doesn't have info-level logging enabled, they see nothing during the retry attempts.
  2. The retry loop silently retries 3 times (default retry_max_attempts). With the old 120s read_timeout, that's up to 6 minutes of silent retries before failing. Each retry hits the same timeout.
  3. The final "No more retries; aborting" IS logged at error level (line 320), but only after exhausting all retries.
  4. From the user's perspective: the upload hangs for ~6 minutes (3 × 120s timeouts), then fails with a generic CasClientError that wraps the reqwest timeout. The error message may not clearly indicate "shard upload timed out" — it's wrapped in layers of error types.

  Contributing factors:

  - No progress indication during shard upload — unlike xorb uploads which have progress callbacks, shard upload has no intermediate progress reporting. The user sees the upload "freeze" with no explanation.
  - Default log level may hide the info-level retry messages — if INFORMATION_LOG_LEVEL is DEBUG (which it is by default, line 28-29 of lib.rs), the retry messages are at debug level, invisible to most users.
  - The error message itself may be unclear — a reqwest timeout wraps into CasClientError::from(reqwest_middleware::Error) which may not say "shard upload timed out after 120s" in a user-friendly way.
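The classification step above can be sketched like this (`Retryable` mirrors the reqwest-retry enum; the function is a simplified stand-in, not the actual cas_client code):

```rust
#[derive(Debug, PartialEq)]
enum Retryable {
    Transient, // retried quietly at info/debug level
    Fatal,     // surfaced immediately
}

// Simplified stand-in for the on_request_failure logic described above.
fn classify(is_timeout: bool, is_connect: bool) -> Option<Retryable> {
    if is_timeout || is_connect {
        // Each such failure is logged at info level and retried; only
        // "No more retries; aborting" after the last attempt is an ERROR.
        Some(Retryable::Transient)
    } else {
        None // other errors fall through to different handling
    }
}
```

With 3 attempts against a 120s read_timeout each, that is the ~6 minutes of apparent silence described above.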

@rajatarya rajatarya requested a review from seanses March 11, 2026 04:41
@XciD
Member

XciD commented Mar 11, 2026

> From the analysis it should have reported an ERROR message in the logs after all retries were exhausted. Each retry should be in the logs at INFO level. What is less clear is what makes its way back to Python - but for the testing you were doing that was in Rust, right?

It was Rust, but in an e2e test where I didn't have the logs. So I was maybe blocked on a retry loop.

@hoytak (Collaborator) left a comment

LGTM, let's get this in.

@rajatarya rajatarya merged commit 83a2827 into main Mar 11, 2026
7 checks passed
@rajatarya rajatarya deleted the rajat/shard-upload-per-request-timeout branch March 11, 2026 16:05