Skip to content

pacquet: registry metadata fetch issues a single request — port pnpm's retry policy #11841

Description

@KSXGitHub

Summary

fetch_full_metadata in pacquet-resolving-npm-resolver issues exactly one reqwest::Client::send().await and treats any error as fatal. Pnpm's TypeScript implementation routes every registry fetch through make-fetch-happen, which retries transient network errors (fetchRetries, default 2, exponential backoff). The gap is a known follow-up — pacquet's own Config::fetch_retries doc comment at 8695496 says:

Today this only gates the pacquet-tarball download path; crates/registry's metadata fetches still issue a single request. Threading the same retry policy through the registry client is a follow-up.

This issue tracks that follow-up.

Where the gap is

The metadata fetcher at pacquet/crates/resolving-npm-resolver/src/fetch_full_metadata.rs:73-78 (8695496):

let response = request
    .send()
    .await
    .map_err(|error| FetchMetadataError::Network { url: url.clone(), error })?
    .error_for_status()
    .map_err(|error| FetchMetadataError::Network { url: url.clone(), error })?;

No wrapper, no retry, no backoff. The two install_package_* call sites that download tarballs do honor retries — they pass retry_opts_from_config(config) into the tarball path — so the infrastructure exists; it just isn't threaded through to the metadata client.

For comparison, pnpm's fetchFromRegistry.ts at 2a9bd897bf wraps every registry call (metadata and tarball) in make-fetch-happen with the same fetchRetries/fetchRetryFactor/fetchRetryMintimeout/fetchRetryMaxtimeout policy.

Symptom

The user-visible failure is a Failed to fetch metadata from <url>: error sending request for url (<url>) error that aborts the entire install. The wrapped reqwest error covers transient TCP-level failures the make-fetch-happen equivalent would silently retry — chiefly stale keep-alive sockets, which pacquet's network crate documents as a known mode at pacquet/crates/network/src/lib.rs:130-140 (8695496):

pool_idle_timeout(4s) matches agentkeepalive's default freeSocketTimeout. Most CDN / load-balancer edges in front of registry.npmjs.org close idle sockets after 5–15s without sending FIN that hyper notices; a pool TTL above that lets pacquet reuse a half-dead socket and surface the next request as a generic "error sending request for url".

The 4 s pool TTL narrows the race window but doesn't close it; on the tarball download path the RetryOpts wrapper absorbs the same class of error transparently. The metadata path doesn't, so a single stale-socket reuse fails the install.

Reproduction

# Build pacquet@main (8695496 or later)
cargo build --release --bin=pacquet

# Stand up @pnpm/registry-mock 6.0.0's verdaccio (or any verdaccio with
# `proxy: npmjs` against registry.npmjs.org)
node node_modules/.pnpm/node_modules/verdaccio/bin/verdaccio \
    --config registry-mock-config.yaml --listen 4873 &

# Set up the integrated-benchmark `alotta-files` fixture as a clean
# install (no lockfile)
cp pacquet/tasks/integrated-benchmark/src/fixtures/package.json .
cat > .npmrc <<EOF
registry=http://localhost:4873/
auto-install-peers=true
ignore-scripts=true
lockfile=false
EOF
cat > pnpm-workspace.yaml <<EOF
storeDir: ./store-dir
registry: http://localhost:4873/
autoInstallPeers: true
ignoreScripts: true
lockfile: false
EOF

# Run; flakes intermittently with `error sending request for url`
target/release/pacquet install

In a 16-run sample against a freshly-started verdaccio, 1 cold run failed with the symptom above; 15 subsequent warm-pool runs all succeeded with rc=0. The bug isn't deterministic — it's a TCP-pool race — but the per-request failure probability compounded across the ~hundreds of metadata fetches a clean install of alotta-files makes is high enough to flake CI scenarios that go through this path.

Scenarios this affects

Scenarios that don't use a lockfile (clean install, full resolution from package.json) issue hundreds of metadata fetches. Frozen-lockfile scenarios fetch zero metadata — they consume the lockfile's pinned snapshots directly — so they don't expose the gap. This is why the existing CI runs frozen-lockfile and frozen-lockfile-hot-cache cleanly, while attempts to add clean-install / full-resolution to the per-PR integrated-benchmark workflow flake on Benchmark 1: pacquet@HEAD's first hyperfine command.

Suggested shape of the fix

  1. Reuse RetryOpts (or factor it up) so the metadata client gets the same fetch_retries / fetch_retry_factor / fetch_retry_mintimeout / fetch_retry_maxtimeout policy. Default = 2 retries, factor 10, mintimeout 10000 ms, maxtimeout 60000 ms — matching pnpm.
  2. Retries should be scoped to genuinely transient errors (network-level reqwest::Error whose source is a hyper/IO error, plus 5xx / 408 / 429). 4xx other than 408/429 should still be fatal so a misspelt pkg name fails fast.
  3. Honor Retry-After on 429 / 503 if pnpm does (worth a quick check of make-fetch-happen's policy).
  4. Thread retries through both the full and the abbreviated metadata path (both currently bypass it).

Out of scope

  • The pool_idle_timeout value itself. Reducing it narrows the race but trades TCP-handshake overhead for robustness; the right fix is the retry layer that pnpm has. Tracking that here would conflate two changes.
  • PR perf(pacquet): close the warm-cache resolve gap to pnpm CLI #11837's prefetch wiring. That PR resolves and downloads tarballs in parallel; it doesn't touch the metadata-fetch site. The two changes are orthogonal.

Written by an agent (Claude Code, claude-opus-4-7).

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions