Skip to content
This repository was archived by the owner on May 14, 2026. It is now read-only.

fix(tarball): retry transient fetch errors with exponential backoff#301

Merged
zkochan merged 4 commits into
mainfrom
fix-259
Apr 26, 2026
Merged

fix(tarball): retry transient fetch errors with exponential backoff#301
zkochan merged 4 commits into
mainfrom
fix-259

Conversation

@zkochan

@zkochan zkochan commented Apr 26, 2026

Copy link
Copy Markdown
Member

Summary

A single transient network blip during a pacquet install (connection
reset, TLS error, timeout, DNS hiccup, registry 5xx) used to abort the
entire install. pnpm has retried these for years; pacquet now does
too, with the same retry classification pnpm uses.

What changes

  • fetch_and_extract_with_retry wraps the full tarball pipeline
    — request + body buffer + integrity check + gzip decode + tar
    extract — in one retried closure. Mirrors pnpm's
    remoteTarballFetcher.ts:
    the body fetch and addFilesFromTarball (integrity + extraction)
    share a single retry boundary, so a flaky transfer that survives
    TCP framing but fails the SHA-512 hash or trips gzip / tar parsing
    recovers via re-fetch instead of aborting the install.

  • is_transient_error matches pnpm's policy exactly: only HTTP
    401, 403, 404 fail fast (per remoteTarballFetcher.ts:77-85).
    Everything else — arbitrary 4xx (e.g. 410), 5xx, network reset,
    timeout, integrity mismatch, gzip / tar parse error, allocator
    refusal — retries.

  • RetryOpts mirrors @zkochan/retry's formula
    (min(min_timeout * factor.pow(attempt), max_timeout),
    randomize: false) with pnpm's defaults: 2 retries, factor 10,
    10s floor, 60s cap. Same values as
    network/fetch/src/fetch.ts
    and
    config/config/src/index.ts.

  • HTTP status check added to the fetch path. Previously a 4xx
    error body was downloaded and handed to the integrity check as if
    it were a tarball; now non-2xx surfaces as
    TarballError::HttpStatus.

  • Config plumbing: Npmrc gains fetch_retries,
    fetch_retry_factor, fetch_retry_mintimeout,
    fetch_retry_maxtimeout. Read only from pnpm-workspace.yaml
    as camelCase keys (fetchRetries, fetchRetryFactor, …) — pnpm 11s
    isIniConfigKey
    excludes the fetch-retry* family from NPM_AUTH_SETTINGS, so a
    fetch-retries=… line in .npmrc is ignored upstream and is
    ignored here too. Conversion to RetryOpts is centralised in
    package-manager::retry_config::retry_opts_from_config.

  • Permit lifecycle: the post-download semaphore permit is
    acquired inside each attempt so a backoff sleep doesnt park one
    of the small pool of permits and block other downloads. The
    store-index writers queue call moves outside the retry loop
    so transient failures dont enqueue a half-built row that a
    successful retry would duplicate.

Closes #259.

Test plan

  • cargo nextest run -p pacquet-tarball — 25 tests including
    6 new mockito-driven retry tests:
    retries_then_succeeds_on_transient_5xx,
    retries_integrity_mismatch_until_exhausted,
    retries_other_4xx_codes,
    fails_fast_on_404,
    retry_exhaustion_returns_last_error,
    zero_retries_makes_a_single_attempt — plus three unit tests for
    the backoff formula and the is_transient_error classification
    policy.
  • cargo nextest run -p pacquet-npmrc — defaults match pnpm
    (fetch_retries_defaults_match_pnpm); yaml override applies
    (parses_fetch_retry_settings_from_yaml_and_applies); .npmrc
    values are ignored (fetch_retry_keys_in_npmrc_are_ignored).
  • just ready — 226 tests pass, clippy clean.

Copilot AI review requested due to automatic review settings April 26, 2026 09:04

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds pnpm-compatible retry behavior to the tarball download pipeline so transient network failures and retryable HTTP statuses don’t abort large installs, and wires the retry knobs through Npmrc and pnpm-workspace.yaml.

Changes:

  • Introduces RetryOpts and a fetch_tarball_with_retry loop with exponential backoff plus HTTP status classification.
  • Adds explicit non-2xx handling (TarballError::HttpStatus) so error bodies aren’t treated as tarballs.
  • Plumbs retry configuration through Npmrc/workspace YAML into the package-manager tarball download call sites, and adds tests/bench updates (mockito).

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tasks/micro-benchmark/src/main.rs Passes default retry options into the benchmark tarball download path.
deny.toml Updates license note for mockito usage.
crates/tarball/src/lib.rs Implements retry/backoff + HTTP status handling; adds tests for policy and backoff formula; adds RetryOpts API.
crates/tarball/Cargo.toml Adds mockito as a dev-dependency for new retry tests.
crates/package-manager/src/install_package_from_registry.rs Threads retry opts from config into tarball downloads (registry install path).
crates/package-manager/src/install_package_by_snapshot.rs Threads retry opts from config into tarball downloads (snapshot install path).
crates/npmrc/src/workspace_yaml.rs Adds workspace YAML (camelCase) overrides for fetch retry settings and tests them.
crates/npmrc/src/lib.rs Adds Npmrc fields + defaults for fetch retry settings and tests defaults.
crates/npmrc/src/custom_deserializer.rs Adds default providers + u32 deserializer for new config fields.
Cargo.lock Locks new dev dependency (mockito).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/npmrc/src/lib.rs Outdated
Comment thread crates/package-manager/src/install_package_from_registry.rs Outdated
Comment thread crates/package-manager/src/install_package_by_snapshot.rs Outdated
@codecov

codecov Bot commented Apr 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.47355% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.34%. Comparing base (d885847) to head (b071d8b).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/tarball/src/lib.rs 97.77% 7 Missing ⚠️
crates/npmrc/src/custom_deserializer.rs 66.66% 6 Missing ⚠️
...package-manager/src/install_package_by_snapshot.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #301      +/-   ##
==========================================
+ Coverage   88.92%   89.34%   +0.41%     
==========================================
  Files          61       62       +1     
  Lines        5850     6201     +351     
==========================================
+ Hits         5202     5540     +338     
- Misses        648      661      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Apr 26, 2026

Copy link
Copy Markdown

Micro-Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.02     14.3±0.42ms   302.4 KB/sec    1.00     14.0±0.18ms   309.4 KB/sec

@zkochan zkochan marked this pull request as draft April 26, 2026 09:12
@github-actions

github-actions Bot commented Apr 26, 2026

Copy link
Copy Markdown

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command Mean [s] Min [s] Max [s] Relative
pacquet@HEAD 3.259 ± 0.101 3.126 3.397 1.03 ± 0.04
pacquet@main 3.167 ± 0.065 3.088 3.292 1.00
pnpm 6.978 ± 0.080 6.846 7.112 2.20 ± 0.05
BENCHMARK_REPORT.json
{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 3.2590209025999997,
      "stddev": 0.10085894487248812,
      "median": 3.2477698236,
      "user": 2.56501376,
      "system": 3.71205634,
      "min": 3.1263637821,
      "max": 3.3968742881,
      "times": [
        3.3968742881,
        3.3785545421,
        3.2552757891,
        3.3351852741,
        3.1263637821,
        3.1425716021,
        3.1441140801,
        3.2292467591,
        3.2402638581,
        3.3417590511
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 3.1670690879,
      "stddev": 0.06507459339948952,
      "median": 3.1781605220999998,
      "user": 2.5493166599999997,
      "system": 3.6835714399999993,
      "min": 3.0875849821,
      "max": 3.2918285321,
      "times": [
        3.1295410781,
        3.1047402711,
        3.2918285321,
        3.0875849821,
        3.1948203441,
        3.1693517391,
        3.1869693051,
        3.0898106941,
        3.1931567031,
        3.2228872301
      ]
    },
    {
      "command": "pnpm",
      "mean": 6.978279460200001,
      "stddev": 0.07978711300491353,
      "median": 6.970969398100001,
      "user": 9.761149459999999,
      "system": 4.643840840000001,
      "min": 6.8455627011,
      "max": 7.1120239691,
      "times": [
        7.0259703751,
        6.907816153100001,
        6.9114835301,
        6.981271111100001,
        7.0628057221,
        7.1120239691,
        7.0226244301000005,
        6.9525689251000005,
        6.960667685100001,
        6.8455627011
      ]
    }
  ]
}

Scenario: Frozen Lockfile (Hot Cache)

Command Mean [ms] Min [ms] Max [ms] Relative
pacquet@HEAD 768.2 ± 37.2 735.0 858.3 1.00 ± 0.06
pacquet@main 766.2 ± 23.8 746.7 819.3 1.00
pnpm 2818.6 ± 75.9 2685.5 2935.7 3.68 ± 0.15
BENCHMARK_REPORT.json
{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 0.7682318555000001,
      "stddev": 0.03723167382603488,
      "median": 0.7567669875,
      "user": 0.25785386,
      "system": 1.59361306,
      "min": 0.7349992355000001,
      "max": 0.8582601525000001,
      "times": [
        0.8582601525000001,
        0.7789673185000001,
        0.7640413675000001,
        0.7945433495,
        0.7369715185000001,
        0.7349992355000001,
        0.7494926075000001,
        0.7420998735000001,
        0.7484615725000001,
        0.7744815595000001
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 0.7662386366000001,
      "stddev": 0.023762577247759417,
      "median": 0.7572026740000001,
      "user": 0.26837115999999994,
      "system": 1.60400356,
      "min": 0.7466548175000001,
      "max": 0.8192907275000001,
      "times": [
        0.7736350415000001,
        0.7570125805000001,
        0.7466548175000001,
        0.7573927675000001,
        0.7954770585000001,
        0.7616217935000001,
        0.7489555445000001,
        0.7468740365000001,
        0.8192907275000001,
        0.7554719985000001
      ]
    },
    {
      "command": "pnpm",
      "mean": 2.8186429752000004,
      "stddev": 0.07589682842334912,
      "median": 2.8201898275,
      "user": 3.3987815600000006,
      "system": 2.2929008599999996,
      "min": 2.6854513064999996,
      "max": 2.9357386585,
      "times": [
        2.7568931525,
        2.9357386585,
        2.9091428505,
        2.8515235564999997,
        2.7752680395,
        2.8385724915,
        2.8018071635,
        2.7673782035,
        2.6854513064999996,
        2.8646543295
      ]
    }
  ]
}

zkochan added a commit that referenced this pull request Apr 26, 2026
Both `install_package_by_snapshot` and `install_package_from_registry`
were spelling out the same `Npmrc` -> `RetryOpts` field-by-field
mapping. Move it to a shared `retry_config::retry_opts_from_config`
helper so the two install paths can't drift if a knob ever changes
(Copilot review on #301).

Also tighten the `Npmrc::fetch_retries` doc — only the tarball path is
wired up today; metadata fetches in `crates/registry` still go without
retry. Note the gap so a future contributor doesn't read the doc and
assume coverage we don't yet have.
@zkochan zkochan requested a review from Copilot April 26, 2026 10:31

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
zkochan added a commit that referenced this pull request Apr 26, 2026
The doc on `DownloadTarballToStore::retry_opts` still described the
narrower "5xx + {408,409,420,429}" policy from the first commit;
since the second commit, the actual policy is "every failure except
HTTP 401/403/404". Update to match (Copilot review on #301).
@zkochan zkochan requested a review from Copilot April 26, 2026 10:42
@zkochan zkochan marked this pull request as ready for review April 26, 2026 10:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tarball/src/lib.rs
zkochan added a commit that referenced this pull request Apr 26, 2026
Lets reqwest/hyper return the connection to the keep-alive pool when
we abort on a non-success status, instead of closing it. On 5xx + retry
that means we keep the warm TCP+TLS instead of paying for a fresh
handshake. Capped at 64 KiB and gated on Content-Length, since hyper
only reuses the connection once the body is fully consumed — a partial
drain wouldn't help and would buffer pathological responses.

Copilot review on #301.
@zkochan zkochan requested a review from a team April 26, 2026 16:11
zkochan added a commit that referenced this pull request Apr 26, 2026
- `crates/tarball`: link `Display` to `std::fmt::Display`. Bare
  `[\`Display\`]` resolved to `derive_more::Display` (the macro
  brought into scope at the top of the file), not the trait the
  doc comment was actually describing.
- `crates/network`: stop claiming the timeout rationale relies on
  a tarball-side retry loop. #301 ("retry transient fetch errors
  with exponential backoff") is still open, so no retry exists
  today; reword to make clear the 5-minute cap is here to catch
  stuck sockets rather than to mask short-lived failures.

@KSXGitHub KSXGitHub left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some opinions of mine.

Comment thread crates/npmrc/src/custom_deserializer.rs
Comment thread crates/npmrc/src/workspace_yaml.rs
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs
Copilot AI review requested due to automatic review settings April 26, 2026 20:01
zkochan added a commit that referenced this pull request Apr 26, 2026
The timeout paragraph cited #263 (the benchmark-hang incident) and
#301 (retry-loop tracking issue). Both will rot: #263 is a closed
artifact and #301 will land separately, at which point the
"pacquet does not yet retry" sentence becomes wrong. Rephrase
without issue numbers so the rationale is timeless.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tarball/src/lib.rs Outdated

@KSXGitHub KSXGitHub left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zkochan Please wait for the AI to complete the task before merging this PR.

Comment thread crates/tarball/src/lib.rs Outdated

@KSXGitHub KSXGitHub left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code, I forgot. Please use 10 characters for commit SHA as per the new AGENTS.md in main.

Copilot AI review requested due to automatic review settings April 26, 2026 20:24
Wraps the full tarball pipeline (request + body buffer + integrity
check + gzip decode + tar extract) in pnpm's retry policy so a single
flaky transfer doesn't abort the install. Mirrors
`fetching/tarball-fetcher/src/remoteTarballFetcher.ts`: only HTTP 401,
403, 404 fail fast; everything else — arbitrary 4xx, 5xx, network
reset, timeout, integrity mismatch, gzip / tar parse error — retries
until the budget is exhausted.

`RetryOpts` mirrors `@zkochan/retry`'s formula
(`min(min_timeout * factor.pow(attempt), max_timeout)`,
`randomize: false`) with pnpm's defaults: 2 retries, factor 10,
10 s floor, 60 s cap. Conversion from `Npmrc` is centralised in
`package-manager::retry_config::retry_opts_from_config`.

Config plumbing: `Npmrc` gains `fetch_retries`, `fetch_retry_factor`,
`fetch_retry_mintimeout`, `fetch_retry_maxtimeout`. Read only from
`pnpm-workspace.yaml` as camelCase keys — pnpm 11's `isIniConfigKey`
excludes the `fetch-retry*` family from `NPM_AUTH_SETTINGS`, so a
`fetch-retries=…` line in `.npmrc` is ignored upstream and is
ignored here too.

HTTP status check: previously a 4xx error body was downloaded and
handed to the integrity check as if it were a tarball; now non-2xx
surfaces as `TarballError::HttpStatus`, with small error bodies
drained (capped at 64 KiB, gated on Content-Length) so reqwest/hyper
can return the connection to the keep-alive pool on retry.

Permit lifecycle: the post-download semaphore permit is acquired
inside each attempt so a backoff sleep doesn't park one of the
small pool of permits and block other downloads. The store-index
writer's queue call moves outside the retry loop so transient
failures don't enqueue a half-built row that a successful retry
would duplicate.

Closes #259.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/registry/src/package.rs Outdated
Comment thread crates/registry/src/package_version.rs Outdated
Comment thread crates/tarball/src/lib.rs
Comment thread crates/network/src/lib.rs Outdated
Comment thread crates/network/src/lib.rs
…ay_for

Per CODE_STYLE_GUIDE.md and KSXGitHub's review on #301: keep the
expression chain reading left-to-right rather than wrapping the
receiver in a `u64::try_from(...)` call.
Copilot AI review requested due to automatic review settings April 26, 2026 20:34
…ract_once

After the rebase onto current main, fetch_and_extract_once lost the
rationale for why network and post_download permits are released and
re-acquired across body-buffering vs. spawn_blocking. Port the
explanation that lived in download_tarball_to_store before the retry
refactor moved this code: the ordering is the deliberate fix from
perf(tarball) a43ca32 — gating body streaming with the smaller
post_download cap would pin network_concurrency permits and serialize
fetches behind decompression.

Surfaces here as a Copilot review concern on this PR.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tarball/src/lib.rs
Copilot flagged that `factor == 0` makes `delay_for` return 0 ms for
every retry past the first, effectively spinning. That matches pnpm /
`@zkochan/retry` exactly (`Math.pow(0, n>0) === 0`), so silently
clamping in `delay_for` would diverge from upstream. The retry total
is bounded by `retries` regardless, so the worst case is "no backoff
between attempts" rather than an unbounded loop.

Document this and the other degenerate cases (`factor == 1`,
`max_timeout < min_timeout`) on the struct so future readers don't
think it's an oversight, and call out where stricter validation
should live if pnpm ever grows it.
@zkochan zkochan merged commit 6a6eee3 into main Apr 26, 2026
14 checks passed
@zkochan zkochan deleted the fix-259 branch April 26, 2026 20:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tarball fetch has no retry on transient network errors

3 participants