Skip to content
This repository was archived by the owner on May 14, 2026. It is now read-only.

fix(network): set explicit timeouts on default reqwest client; enable CI tracing#267

Merged
zkochan merged 3 commits into
mainfrom
fix/network-timeouts-and-ci-trace
Apr 24, 2026
Merged

fix(network): set explicit timeouts on default reqwest client; enable CI tracing#267
zkochan merged 3 commits into
mainfrom
fix/network-timeouts-and-ci-trace

Conversation

@zkochan

@zkochan zkochan commented Apr 24, 2026

Copy link
Copy Markdown
Member

What this fixes

CI runs of Integrated-Benchmark frequently hang at:

Compiling pacquet-cli v0.0.1 ...
    Finished `release` profile [optimized] target(s) in 2m 50s
Benchmark 1: pacquet@HEAD
Error: The action 'Benchmark: Frozen Lockfile' has timed out after 10 minutes.

With --show-output already enabled, the absence of any pacquet stderr after Benchmark 1: means the install process genuinely went silent — no progress, no error, no panic. That rules out "bench is slow" and points at something blocking inside pacquet with no deadline on it.

Root cause

crates/network/src/lib.rs:31 hands back Client::new(), which has no connect/request/pool-idle timeouts. The sibling from_client docstring even called this out as a footgun:

Primarily useful for tests that need custom timeouts (a default Client has no connect / request timeout, so a firewall that silently drops packets instead of refusing them can stall the test suite for TCP-retry worth of time).

Every http_client.get(tarball_url).send() in crates/tarball/src/lib.rs:310 is therefore a potential forever-hang. The bench's verdaccio is a single Node process handling 16 concurrent requests from pacquet; on a GHA runner it's easy for the event loop to stall (GC, uplink hiccup, TCP packet loss without RST) for longer than our patience. Pacquet then sits on the half-open socket until the GHA job timeout, which is exactly the symptom.

Pacquet has no retry on transient network failures yet (tracked separately in #259), so even a transient stall becomes a permanent hang in the current code.

Changes

  1. ThrottledClient::new_from_cpu_count now builds its Client with explicit defaults:
    • connect_timeout(10s) — generous for localhost, snaps on resource-exhausted runners.
    • timeout(60s) — per-request deadline. Typical npm tarball < 5 MB, so 60s on localhost has huge margin; a real hang turns into TarballError::FetchTarball and hyperfine reports the non-zero exit instead of silent-hang-until-timeout.
    • pool_idle_timeout(30s) — evicts connections that verdaccio closed but we didn't notice.
  2. from_client docstring updated — it's still the escape hatch for tests that want different (usually shorter) timeouts, not "the only way to get any timeouts at all".
  3. CI bench step now runs with TRACE=pacquet=info. pacquet's tracing subscriber is gated on the TRACE env var; without it, --show-output only ever captured an empty stream. With it, the next stall will show which phase the stuck install was in ("Download completed", "New cache", "Reusing cached CAFS entry") before going silent.

What this doesn't fix

This converts hangs into fast failures rather than preventing the stall itself. If verdaccio genuinely starts dropping requests, pacquet will now surface that as a FetchTarball error → hyperfine exits non-zero → the step fails visibly, but the underlying resource-exhaustion or verdaccio-instability problem still needs addressing. Options once we have the first post-fix log:

Test plan

  • cargo check --workspace --all-targets.
  • cargo test -p pacquet-network -p pacquet-tarball -p pacquet-package-manager — no test regressions (tests that use fast_fail_client with sub-second timeouts still work; they set their own Client::builder).
  • Next CI Integrated-Benchmark run — expected outcome: either the bench runs green (hang was network-stall-related and 60s timeout now catches it), or it fails with TarballError::FetchTarball plus a tracing trail showing the last phase before the stall. Either way, actionable rather than silent.

`ThrottledClient::new_from_cpu_count` handed back `Client::new()`,
which has no connect / request / pool-idle timeouts at all. The
existing docstring on the sibling `from_client` already called this
out as a footgun — that note stays relevant for tests, but the
production path can't afford "wait forever".

On CI this is how `integrated-benchmark` hangs at "Benchmark 1:
pacquet@HEAD" until the 10-min GHA step cap kicks in: the bench's
single-process verdaccio occasionally stalls (GC pause, uplink
hiccup, TCP loss without RST) and pacquet sits on the half-open
socket indefinitely, emitting nothing even under `--show-output`
(see #263 investigation notes).

Set explicit defaults:

* `connect_timeout(10s)` — generous for localhost, snaps on
  resource-exhausted runners.
* `timeout(60s)` — per-request deadline. Typical npm tarball is
  under 5 MB, so 60 s on localhost is comfortable; a real hang
  turns into `TarballError::FetchTarball` and hyperfine reports
  the non-zero exit instead of sitting silent.
* `pool_idle_timeout(30s)` — evict conns verdaccio closed but we
  didn't notice.

Retries on transient failures are tracked separately in #259.

Also enable pacquet tracing on the CI bench step via
`TRACE=pacquet=info`, so the next time one of these stalls makes
it through despite the new timeouts the step log shows which phase
pacquet was in (download, extraction, import, index write) before
it went silent. `--show-output` was already in place; without a
tracing filter it only ever captured an empty stream.

Refs #263.
`[`new_from_cpu_count`]` doesn't resolve from inside the `impl` block
because rustdoc needs a path; `Self::new_from_cpu_count` does. The
Doc CI job runs with `RUSTDOCFLAGS=-D warnings` so the broken link
was a hard error. No behavioural change.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent silent CI hangs during the Integrated-Benchmark by adding explicit timeouts to pacquet’s default reqwest client and by enabling pacquet tracing in the benchmark workflow so future stalls produce actionable logs.

Changes:

  • Build the default ThrottledClient with explicit connect/request/pool-idle timeouts instead of Client::new().
  • Update from_client docs to reflect that new_from_cpu_count now sets defaults and from_client is the override hook (especially for tests).
  • Set TRACE=pacquet=info for the integrated benchmark step to emit tracing output during CI runs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
crates/network/src/lib.rs Switches default HTTP client construction to Client::builder() with explicit timeouts and updates related docs.
.github/workflows/integrated-benchmark.yml Enables pacquet tracing for the benchmark step via TRACE env var to improve CI diagnosability.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/network/src/lib.rs
Comment thread crates/network/src/lib.rs Outdated
@codecov

codecov Bot commented Apr 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.65%. Comparing base (edd07c3) to head (6123b61).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #267      +/-   ##
==========================================
- Coverage   89.75%   89.65%   -0.10%     
==========================================
  Files          58       58              
  Lines        4527     4533       +6     
==========================================
+ Hits         4063     4064       +1     
- Misses        464      469       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

60 seconds is too aggressive for a default that governs real
registry traffic, not just the bench's localhost verdaccio. Large
npm tarballs can legitimately take more than a minute over a slow
network, and pacquet has no retry-with-backoff yet (#259) — a
60-second timeout would turn slow-but-progressing downloads into
outright install failures.

5 minutes keeps the \"catch truly stuck sockets\" property (still
well inside the bench's step budget and the CLI's reasonable-user
patience) while comfortably accommodating slow real-world
connections.

Also reword the doc to call out that the value is the default for
registry traffic too, not just the bench, and that making it
user-configurable is the natural next step once retries exist.

Addresses Copilot review on #267.
@github-actions

Copy link
Copy Markdown

Micro-Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.00    39.6±52.22ms   109.6 KB/sec    1.27   50.1±104.69ms    86.5 KB/sec

@github-actions

Copy link
Copy Markdown

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command Mean [s] Min [s] Max [s] Relative
pacquet@HEAD 1.255 ± 0.029 1.223 1.301 1.00
pacquet@main 1.266 ± 0.033 1.215 1.318 1.01 ± 0.04
pnpm 2.667 ± 0.038 2.618 2.746 2.13 ± 0.06
BENCHMARK_REPORT.json
{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 1.2547423268000002,
      "stddev": 0.029335149948103264,
      "median": 1.2513892324000002,
      "user": 0.41937515999999997,
      "system": 2.4228385999999995,
      "min": 1.2228729024,
      "max": 1.3014688984000002,
      "times": [
        1.2644831614,
        1.3014688984000002,
        1.2270366804000001,
        1.2910592794,
        1.2239778654000002,
        1.2335402064,
        1.2786038864000002,
        1.2228729024,
        1.2382953034000002,
        1.2660850844
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 1.2659456904000002,
      "stddev": 0.033120048720833045,
      "median": 1.2573686989000001,
      "user": 0.40887506,
      "system": 2.4628332,
      "min": 1.2149860354000002,
      "max": 1.3175138554,
      "times": [
        1.2547471794,
        1.3175138554,
        1.2517915944,
        1.2149860354000002,
        1.2529949564,
        1.2315173324000002,
        1.2599902184000003,
        1.3073690614000002,
        1.2682369754,
        1.3003096954000002
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pnpm",
      "mean": 2.6674933733,
      "stddev": 0.03789681013172619,
      "median": 2.6657566009,
      "user": 3.3339536599999997,
      "system": 2.2899702,
      "min": 2.6180032334,
      "max": 2.7461594114,
      "times": [
        2.6412264264,
        2.6466241854,
        2.6870982074,
        2.6547795574,
        2.6849891614,
        2.6272391404,
        2.7461594114,
        2.6920807654,
        2.6767336444,
        2.6180032334
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    }
  ]
}

@zkochan zkochan merged commit 6e3174c into main Apr 24, 2026
13 of 14 checks passed
@zkochan zkochan deleted the fix/network-timeouts-and-ci-trace branch April 24, 2026 11:57
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants