fix(registry): surface retry warnings and cap timeout retries at 1 #331
Conversation
Retries on tarball/packument failures previously logged at debug level, hidden from users at the default loglevel. A flaky CDN edge that took several retries to recover (or failed entirely) gave no visible signal that recovery was attempted. Switch retry logs to warn so the user sees "retrying HTTP request after ..." in the install output.

Also cap timeout-shaped retries on tarball downloads at 1 attempt so a degraded upstream can't drag the install through `fetchTimeout × (retries + 1)` of wall-clock; other failure modes (5xx, 429, transport, body decode) still retry the full `fetchRetries` budget.

Expand the default tracing filter to include every `aube_*` library crate so library-emitted warns reach users without `AUBE_LOG` overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile Summary

Promotes all retry log events from debug to warn. Confidence Score: 5/5. Safe to merge — logic is correct, tests cover the critical cap paths, and the scope is well-contained to the tarball retry function. No bugs found. No files require special attention.

Reviews (4). Last reviewed commit: "[autofix.ci] apply automated fixes".
The previous test only exercised the connection/headers-stage timeout path (the outer `Err` arm of `send().await`). The original reproducer was a body-read timeout — server sent headers, then trickled the tarball body until reqwest's timeout fired mid-stream. That path goes through the `read_body_capped` → `timeout_retry_exhausted` guard, which had no direct test.

Added a raw-TCP listener test that sends 200 + Content-Length headers immediately, then stalls the body. Asserts only 2 connections are accepted regardless of `fetchRetries`. Renamed the existing test to clarify it covers the headers-stage path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit c89195f.
The previous implementation gated the timeout retry cap on the global `attempt` counter, which increments for every error type. If a 503 occurred on attempt 0 and a timeout on attempt 1, the cap fired immediately (`1 >= TIMEOUT_RETRY_CAP`) and the user got zero timeout retries instead of the intended one.

Track timeout retries in a separate counter that only increments when the failure being retried is timeout-shaped. Non-timeout failures (5xx, 429, transport errors, body decode) no longer consume the timeout budget.

Added a mixed-error test that pins the 503-then-timeout-then-timeout sequence at 3 server requests, and folded the previously-split match arms in `retry_bytes_body_read` into a single arm per error site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Summary

- Retry log events (`retrying HTTP request after ...`) now surface at `warn` instead of `debug`, so a user watching an install can see that recovery was attempted on a flaky upstream.
- Timeout-shaped retries on tarball downloads are capped at 1 extra attempt, so a degraded upstream can no longer drag the install through `fetchTimeout × (retries + 1)` of wall-clock. Other failure modes (5xx, 429, transport, body-decode) still retry the full `fetchRetries` budget.
- The default tracing filter now includes every `aube_*` library crate so library warns reach the user at the default loglevel without needing `AUBE_LOG` overrides.

Why

Reproducer was a workerd binary tarball that trickled for ~70s and then failed with `error decoding response body`. Retries already happened internally, but the user had no visible signal — and on a sustained CDN regression the existing policy would let three timed-out attempts compound into a multi-minute hang.

Notes for reviewers

- The timeout cap applies only to `retry_bytes_body_read` (tarball path), not to packument metadata loops. Metadata fetches are small/fast and the existing retry budget is cheap there. Easy to extend later if we hit a flake on the metadata side.
- `fetchRetries`, `fetchTimeout`, and `fetchMinSpeedKiBps` semantics are unchanged for pnpm parity.
- `tarball_timeout_retries_at_most_once_even_with_high_retry_budget` pins the cap behavior: with `retries=5` and a server that always times out, the mock sees exactly 2 requests.

Test plan

- `cargo clippy --all-targets -- -D warnings`
- `cargo test --workspace`
- `cargo fmt --check`
- `aube install` against a project with `@cloudflare/workerd-*` and observe a clean install
- `AUBE_FETCH_TIMEOUT=50` against a real registry and confirm exactly 2 attempts before failure

🤖 Generated with Claude Code
Note: Medium Risk

Changes retry behavior for tarball downloads by limiting timeout-shaped retries, which can alter install success/failure timing on slow networks; also increases log verbosity at default levels and broadens default crate filtering.
Overview

- Improves install-time observability by emitting registry retry messages at `warn` (instead of `debug`) across metadata and tarball fetch retry loops.
- Caps timeout-shaped retries during tarball downloads to one extra attempt (independent of `fetchRetries`) to avoid long hangs on repeated timeouts, and adds end-to-end tests covering header-timeout, body-read timeout, and mixed 503+timeout scenarios.
- Expands the default `AUBE_LOG` filter to include all `aube_*` library crates so warnings from those crates are visible without custom log config.

Reviewed by Cursor Bugbot for commit 3bb6a6e.