fix(tarball): run post-download pipeline on blocking pool by zkochan · Pull Request #269 · pnpm/pacquet

zkochan · 2026-04-24T12:38:30Z

Summary

The frozen-lockfile benchmark job has been flaky, with the 10-minute step timeout firing during Benchmark 1: pacquet@HEAD (e.g. this run's log). Diagnosis:

All 8112 tarballs reached Download completed — network side was fine (the reqwest timeouts from fix(network): set explicit timeouts on default reqwest client; enable CI tracing #267 did their job).
Only 6997 reached Checksum verified — ~1115 stuck in the post-download stage.
Last tracing event at 12:07:55, then 3m46s of total silence before the runner SIGKILLed the job. pacquet was still alive in the orphan-process list.

The stuck stage was this block in crates/tarball/src/lib.rs:

let cas_paths = tokio::task::spawn(async move {
    package_integrity.check(&response)...;            // SHA-512 over full tarball
    let mut archive = decompress_gzip(&response...);  // gzip inflate
    for entry in entries {                            // per-file SHA-512 + blocking write
        ...
        store_dir.write_cas_file(&buffer, ...)...;
    }
    tokio::task::spawn_blocking(...).await ...;       // SQLite INSERT
})

The body is entirely synchronous CPU + blocking I/O with no .await until the final nested spawn_blocking. Every tarball pins a tokio reactor worker for its full duration. On a 2-core GHA runner that caps concurrency at ~2 tarballs, and the tail of large packages (100-200 KB compressed, MB+ inflated, dozens of files each) overruns the 10-minute step budget. The per-second throughput trace makes it obvious:

time	downloaded	verified
12:07:27-51	~350/s	~350/s
12:07:53	750	237 (consumer falling behind)
12:07:54	526	0
12:07:55 → 12:11:41	0	0

Change

Outer tokio::task::spawn(async move { ... }) → tokio::task::spawn_blocking(move || { ... }). The work now runs on tokio's blocking pool (default 512 threads) instead of the 2 reactor workers.
The inner spawn_blocking around the SQLite write is no longer useful — we're already on the blocking pool — so it's inlined as a direct synchronous call.
Comments updated, including the scope comment (the !Send tar::Archive drop-ordering reasoning no longer applies; the scope is kept for memory efficiency since the inflated buffer can be multi-MB).

What this does NOT address

CreateVirtualDirBySnapshot.run() in crates/package-manager/src/install_package_by_snapshot.rs is still called synchronously from async context, and internally uses rayon::par_iter (create_cas_files, create_symlink_layout), which also blocks a tokio worker. That's a smaller window than the tarball pipeline (dozens of files vs. the full SHA + gunzip + per-file loop), but if CI still flakes after this, that's the next thing to wrap in spawn_blocking.

Test plan

cargo check --workspace — clean.
cargo test -p pacquet-tarball — 7/7 pass (including packages_under_orgs_should_work, which actually downloads from npmjs.org).
cargo test -p pacquet-package-manager — 28/28 pass (including should_install_dependencies, the end-to-end install flow).
CI run of the Benchmark: Frozen Lockfile job — this is the real verification; the hang is timing-dependent and only triggers under the 1352-snapshot fixture's load profile.

The `frozen-lockfile` benchmark job has been flaky on CI with the 10-minute step timeout firing during `Benchmark 1: pacquet@HEAD`. The logs show all downloads completing from the network's point of view but ~1115 tarballs stuck between "Download completed" and "Checksum verified", then ~3m46s of total tracing silence before the runner kills the process. Cause: the post-download work (SHA-512 of the full tarball, gzip inflate, per-file SHA-512, `write_cas_file`, SQLite index INSERT) ran inside `tokio::task::spawn(async move { ... })`. The body contains no `.await` until the final nested `spawn_blocking` for the SQLite write, so each tarball pins a tokio reactor worker for its whole duration. On a 2-core GHA runner that caps throughput at ~2 tarballs in flight, and the tail of large packages (100-200 KB compressed, MB+ inflated, dozens of files each) overruns the CI step budget. Swap the outer `spawn` to `spawn_blocking` so the work runs on tokio's blocking pool (default 512 threads) instead. The nested `spawn_blocking` around the SQLite write is no longer useful once we're already on the blocking pool, so inline it.

codecov · 2026-04-24T12:43:39Z

Codecov Report

❌ Patch coverage is 94.18605% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.52%. Comparing base (5d06cd5) to head (de63632).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/tarball/src/lib.rs	94.18%	5 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #269   +/-   ##
=======================================
  Coverage   89.52%   89.52%           
=======================================
  Files          58       58           
  Lines        4619     4630   +11     
=======================================
+ Hits         4135     4145   +10     
- Misses        484      485    +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-04-24T12:44:56Z

Micro-Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.00     14.0±0.22ms   309.0 KB/sec    1.01     14.2±0.54ms   305.1 KB/sec

Copilot

Pull request overview

This PR addresses CI flakiness in the tarball download/extract pipeline by moving the CPU-bound + blocking post-download processing off Tokio’s reactor workers and onto the blocking thread pool, preventing the async runtime from being pinned during checksum/decompression/CAS writes and SQLite updates.

Changes:

Run the full post-download tarball pipeline inside tokio::task::spawn_blocking instead of tokio::task::spawn.
Inline the SQLite index write (remove the now-redundant nested spawn_blocking).
Update comments to reflect the new execution model and rationale.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# Conflicts: # crates/tarball/src/lib.rs

Two changes addressing Copilot review comments on #269: * Gate the post-download `spawn_blocking` body with a global `Semaphore(num_cpus * 2, floor 4)`. The body is CPU-bound (SHA-512 over the whole tarball + per-file SHA-512 + gzip inflate) with some blocking FS I/O, and the default 512-thread blocking pool let `try_join_all` fan out hundreds of these at once — good for I/O wait, disastrous for CPU work on a 2-core runner. The permit is held across the `spawn_blocking.await` so no task enters the body until one is free. * Replace `.expect(\"tarball-processing blocking task panicked\")` on the spawn_blocking `JoinHandle` with `.map_err(TarballError::TaskJoin)?`. Runtime cancellation or a blocking-task panic now surfaces as an error rather than unwinding the whole install. `TarballError::TaskJoin` already exists for this.

Copilot

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

github-actions · 2026-04-24T13:25:40Z

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command	Mean [s]	Min [s]	Max [s]	Relative
`pacquet@HEAD`	2.664 ± 0.069	2.562	2.790	1.03 ± 0.04
`pacquet@main`	2.596 ± 0.064	2.509	2.710	1.00
`pnpm`	5.876 ± 0.064	5.803	5.998	2.26 ± 0.06

BENCHMARK_REPORT.json

{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 2.66391291706,
      "stddev": 0.06850634720034338,
      "median": 2.66042407316,
      "user": 2.7733771999999997,
      "system": 2.1696338400000004,
      "min": 2.56184656316,
      "max": 2.78953393016,
      "times": [
        2.78953393016,
        2.6591420671600003,
        2.56184656316,
        2.66170607916,
        2.75062312816,
        2.67064702416,
        2.64672620416,
        2.6704100891600002,
        2.65410467916,
        2.57438940616
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 2.59560776136,
      "stddev": 0.06397719749447535,
      "median": 2.57909789716,
      "user": 2.7739412999999997,
      "system": 2.13686164,
      "min": 2.50921192316,
      "max": 2.71001720916,
      "times": [
        2.71001720916,
        2.54344085416,
        2.58397976916,
        2.57421602516,
        2.60857426916,
        2.6530442081600003,
        2.50921192316,
        2.54783886316,
        2.55532571616,
        2.67042877616
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pnpm",
      "mean": 5.87616439526,
      "stddev": 0.06385839524139639,
      "median": 5.87455807816,
      "user": 8.685683000000001,
      "system": 2.72505444,
      "min": 5.802875316160001,
      "max": 5.99753820716,
      "times": [
        5.99753820716,
        5.82255908716,
        5.85518752716,
        5.89392862916,
        5.89710545416,
        5.802875316160001,
        5.90644541416,
        5.94764547516,
        5.8170060581600005,
        5.82135278416
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    }
  ]
}

Addresses Copilot review on #269: `post_download_semaphore().acquire()` used to run *after* `.bytes().await`, so under `try_join_all` fan-out a fast registry could leave hundreds of fully-buffered tarball bodies sitting in RAM waiting for a permit to process — a significant peak-RSS risk on large installs. Split the HTTP call into `send()` (headers) and `bytes()` (body) and acquire the permit between them. The permit now covers both buffering and processing, so the number of concurrently-buffered bodies is bounded by the post-download cap (`num_cpus * 2`, floor 4). HTTP headers-only concurrency is still gated separately by `ThrottledClient` so the cheap head phase can still fan out.

zkochan added 2 commits April 24, 2026 14:37

style(tarball): apply rustfmt

8dbc930

zkochan requested a review from Copilot April 24, 2026 12:52

Copilot started reviewing on behalf of zkochan April 24, 2026 12:52 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Comment thread crates/tarball/src/lib.rs

Comment thread crates/tarball/src/lib.rs Outdated

Comment thread crates/tarball/src/lib.rs

Comment thread crates/tarball/src/lib.rs

Comment thread crates/tarball/src/lib.rs

zkochan added 2 commits April 24, 2026 15:09

merge main after #265/#268/#264 landed

87c5fe1

# Conflicts: # crates/tarball/src/lib.rs

zkochan mentioned this pull request Apr 24, 2026

tarball: propagate malformed-tarball errors instead of panicking in per-entry loop #270

Closed

zkochan requested a review from Copilot April 24, 2026 13:20

Copilot started reviewing on behalf of zkochan April 24, 2026 13:20 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Comment thread crates/tarball/src/lib.rs Outdated

zkochan merged commit ad3a582 into main Apr 24, 2026
14 checks passed

zkochan deleted the timeout-issue branch April 24, 2026 13:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(tarball): run post-download pipeline on blocking pool#269

fix(tarball): run post-download pipeline on blocking pool#269
zkochan merged 5 commits into
mainfrom
timeout-issue

zkochan commented Apr 24, 2026

Uh oh!

codecov Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

github-actions Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

zkochan commented Apr 24, 2026

Summary

Change

What this does NOT address

Test plan

Uh oh!

codecov Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Micro-Benchmark Results

Linux

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

github-actions Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Apr 24, 2026 •

edited

Loading

github-actions Bot commented Apr 24, 2026 •

edited

Loading

github-actions Bot commented Apr 24, 2026 •

edited

Loading