Skip to content

fix(registry): surface retry warnings and cap timeout retries at 1#331

Merged
jdx merged 4 commits intomainfrom
claude/vibrant-sutherland-f7df3c
Apr 27, 2026
Merged

fix(registry): surface retry warnings and cap timeout retries at 1#331
jdx merged 4 commits intomainfrom
claude/vibrant-sutherland-f7df3c

Conversation

@jdx
Copy link
Copy Markdown
Contributor

@jdx jdx commented Apr 26, 2026

Summary

  • Retry log lines (retrying HTTP request after ...) now surface at warn instead of debug, so a user watching an install can see that recovery was attempted on a flaky upstream.
  • Cap timeout-shaped retries on tarball downloads at one extra attempt — a degraded CDN edge can no longer keep the install hung for fetchTimeout × (retries + 1) of wall-clock. Other failure modes (5xx, 429, transport, body-decode) still retry the full fetchRetries budget.
  • Expand the default tracing filter to include every aube_* library crate so library warns reach the user at the default loglevel without needing AUBE_LOG overrides.

Why

Reproducer was a workerd binary tarball that trickled for ~70s and then failed with error decoding response body. Retries already happened internally, but the user had no visible signal — and on a sustained CDN regression the existing policy would let three timed-out attempts compound into a multi-minute hang.

Notes for reviewers

  • Timeout cap is applied only to retry_bytes_body_read (tarball path), not to packument metadata loops. Metadata fetches are small/fast and the existing retry budget is cheap there. Easy to extend later if we hit a flake on the metadata side.
  • fetchRetries, fetchTimeout, and fetchMinSpeedKiBps semantics are unchanged for pnpm parity.
  • New unit test tarball_timeout_retries_at_most_once_even_with_high_retry_budget pins the cap behavior: with retries=5 and a server that always times out, the mock sees exactly 2 requests.

Test plan

  • cargo clippy --all-targets -- -D warnings
  • cargo test --workspace
  • cargo fmt --check
  • Smoke: run aube install against a project with @cloudflare/workerd-* and observe a clean install
  • Synthetic verification: drop AUBE_FETCH_TIMEOUT=50 against a real registry and confirm exactly 2 attempts before failure

🤖 Generated with Claude Code


Note

Medium Risk
Changes retry behavior for tarball downloads by limiting timeout-shaped retries, which can alter install success/failure timing on slow networks; also increases log verbosity at default levels and broadens default crate filtering.

Overview
Improves install-time observability by emitting registry retry messages at warn (instead of debug) across metadata and tarball fetch retry loops.

Caps timeout-shaped retries during tarball downloads to one extra attempt (independent of fetchRetries) to avoid long hangs on repeated timeouts, and adds end-to-end tests covering header-timeout, body-read timeout, and mixed 503+timeout scenarios.

Expands the default AUBE_LOG filter to include all aube_* library crates so warnings from those crates are visible without custom log config.

Reviewed by Cursor Bugbot for commit 3bb6a6e. Bugbot is set up for automated code reviews on this repo. Configure here.

Retries on tarball/packument failures previously logged at debug level,
hidden from users at the default loglevel. A flaky CDN edge that took
several retries to recover (or failed entirely) gave no visible signal
that recovery was attempted. Switch retry logs to warn so the user sees
"retrying HTTP request after ..." in the install output.

Also cap timeout-shaped retries on tarball downloads at 1 attempt so a
degraded upstream can't drag the install through `fetchTimeout × (retries
+ 1)` of wall-clock; other failure modes (5xx, 429, transport, body
decode) still retry the full `fetchRetries` budget.

Expand the default tracing filter to include every aube_* library crate
so library-emitted warns reach users without `AUBE_LOG` overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

Promotes all retry log events from debug to warn so users can observe recovery attempts at the default log level, caps timeout-shaped retries in retry_bytes_body_read at one extra attempt via a dedicated timeout_retries counter that is independent of the global attempt index, and widens the default AUBE_LOG filter to cover every aube_* library crate. The timeout cap logic is correctly applied to both the header-timeout (Err arm of send().await) and body-read-timeout (read_body_capped inner match) code paths, and three new tests pin the cap behavior across all three scenarios (headers-only timeout, body-read timeout, and mixed 503 + timeout).

Confidence Score: 5/5

Safe to merge — logic is correct, tests cover the critical cap paths, and the scope is well-contained to the tarball retry function.

No bugs found. The timeout_retries counter is correctly scoped and incremented in both timeout-failure arms; the early return before tracing::warn! when the cap fires is intentional and clean. All aube_* crates are accounted for in the expanded filter. Three targeted tests provide good confidence in the cap semantics.

No files require special attention.

Important Files Changed

Filename Overview
crates/aube-registry/src/client.rs All tracing::debug! retry calls promoted to warn; TIMEOUT_RETRY_CAP = 1 added and correctly enforced in both the header-timeout (Err arm) and body-read-timeout (Okread_body_capped arm) paths of retry_bytes_body_read; three new unit tests cover header-timeout cap, body-read-timeout cap, and mixed 503+timeout ordering
crates/aube/src/main.rs Default AUBE_LOG filter expanded to include all ten aube_* library crates; list matches the complete set of crates under crates/

Reviews (4): Last reviewed commit: "[autofix.ci] apply automated fixes" | Re-trigger Greptile

Comment thread crates/aube-registry/src/client.rs
The previous test only exercised the connection/headers-stage timeout
path (the outer Err arm of `send().await`). The original reproducer was
a body-read timeout — server sent headers, then trickled the tarball
body until reqwest's timeout fired mid-stream. That path goes through
the `read_body_capped` -> `timeout_retry_exhausted` guard which had no
direct test.

Added a raw-TCP listener test that sends 200 + Content-Length headers
immediately, then stalls the body. Asserts only 2 connections are
accepted regardless of `fetchRetries`. Renamed the existing test to
clarify it covers the headers-stage path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c89195f. Configure here.

Comment thread crates/aube-registry/src/client.rs Outdated
jdx and others added 2 commits April 26, 2026 18:57
The previous implementation gated the timeout retry cap on the global
`attempt` counter, which increments for every error type. If a 503
occurred on attempt=0 and a timeout on attempt=1, the cap fired
immediately (1 >= TIMEOUT_RETRY_CAP) and the user got zero timeout
retries instead of the intended one.

Track timeout retries in a separate counter that only increments when
the failure being retried is timeout-shaped. Non-timeout failures (5xx,
429, transport errors, body decode) no longer consume the timeout
budget.

Added a mixed-error test that pins the 503-then-timeout-then-timeout
sequence at 3 server requests, and folded the previously-split match
arms in `retry_bytes_body_read` into a single arm per error site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jdx jdx merged commit 2378abf into main Apr 27, 2026
15 checks passed
@jdx jdx deleted the claude/vibrant-sutherland-f7df3c branch April 27, 2026 00:08
@greptile-apps greptile-apps Bot mentioned this pull request Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant