Skip to content
This repository was archived by the owner on May 14, 2026. It is now read-only.

fix(tarball): run post-download pipeline on blocking pool#269

Merged
zkochan merged 5 commits into
mainfrom
timeout-issue
Apr 24, 2026
Merged

fix(tarball): run post-download pipeline on blocking pool#269
zkochan merged 5 commits into
mainfrom
timeout-issue

Conversation

@zkochan

@zkochan zkochan commented Apr 24, 2026

Copy link
Copy Markdown
Member

Summary

The frozen-lockfile benchmark job has been flaky, with the 10-minute step timeout firing during Benchmark 1: pacquet@HEAD (e.g. this run's log). Diagnosis:

  • All 8112 tarballs reached Download completed — network side was fine (the reqwest timeouts from fix(network): set explicit timeouts on default reqwest client; enable CI tracing #267 did their job).
  • Only 6997 reached Checksum verified — ~1115 stuck in the post-download stage.
  • Last tracing event at 12:07:55, then 3m46s of total silence before the runner SIGKILLed the job. pacquet was still alive in the orphan-process list.

The stuck stage was this block in crates/tarball/src/lib.rs:

let cas_paths = tokio::task::spawn(async move {
    package_integrity.check(&response)...;            // SHA-512 over full tarball
    let mut archive = decompress_gzip(&response...);  // gzip inflate
    for entry in entries {                            // per-file SHA-512 + blocking write
        ...
        store_dir.write_cas_file(&buffer, ...)...;
    }
    tokio::task::spawn_blocking(...).await ...;       // SQLite INSERT
})

The body is entirely synchronous CPU + blocking I/O with no .await until the final nested spawn_blocking. Every tarball pins a tokio reactor worker for its full duration. On a 2-core GHA runner that caps concurrency at ~2 tarballs, and the tail of large packages (100-200 KB compressed, MB+ inflated, dozens of files each) overruns the 10-minute step budget. The per-second throughput trace makes it obvious:

time downloaded verified
12:07:27-51 ~350/s ~350/s
12:07:53 750 237 (consumer falling behind)
12:07:54 526 0
12:07:55 → 12:11:41 0 0

Change

  • Outer tokio::task::spawn(async move { ... })tokio::task::spawn_blocking(move || { ... }). The work now runs on tokio's blocking pool (default 512 threads) instead of the 2 reactor workers.
  • The inner spawn_blocking around the SQLite write is no longer useful — we're already on the blocking pool — so it's inlined as a direct synchronous call.
  • Comments updated, including the scope comment (the !Send tar::Archive drop-ordering reasoning no longer applies; the scope is kept for memory efficiency since the inflated buffer can be multi-MB).

What this does NOT address

CreateVirtualDirBySnapshot.run() in crates/package-manager/src/install_package_by_snapshot.rs is still called synchronously from async context, and internally uses rayon::par_iter (create_cas_files, create_symlink_layout), which also blocks a tokio worker. That's a smaller window than the tarball pipeline (dozens of files vs. the full SHA + gunzip + per-file loop), but if CI still flakes after this, that's the next thing to wrap in spawn_blocking.

Test plan

  • cargo check --workspace — clean.
  • cargo test -p pacquet-tarball — 7/7 pass (including packages_under_orgs_should_work, which actually downloads from npmjs.org).
  • cargo test -p pacquet-package-manager — 28/28 pass (including should_install_dependencies, the end-to-end install flow).
  • CI run of the Benchmark: Frozen Lockfile job — this is the real verification; the hang is timing-dependent and only triggers under the 1352-snapshot fixture's load profile.

zkochan added 2 commits April 24, 2026 14:37
The `frozen-lockfile` benchmark job has been flaky on CI with the 10-minute
step timeout firing during `Benchmark 1: pacquet@HEAD`. The logs show all
downloads completing from the network's point of view but ~1115 tarballs
stuck between "Download completed" and "Checksum verified", then ~3m46s
of total tracing silence before the runner kills the process.

Cause: the post-download work (SHA-512 of the full tarball, gzip inflate,
per-file SHA-512, `write_cas_file`, SQLite index INSERT) ran inside
`tokio::task::spawn(async move { ... })`. The body contains no `.await`
until the final nested `spawn_blocking` for the SQLite write, so each
tarball pins a tokio reactor worker for its whole duration. On a 2-core
GHA runner that caps throughput at ~2 tarballs in flight, and the tail
of large packages (100-200 KB compressed, MB+ inflated, dozens of files
each) overruns the CI step budget.

Swap the outer `spawn` to `spawn_blocking` so the work runs on tokio's
blocking pool (default 512 threads) instead. The nested `spawn_blocking`
around the SQLite write is no longer useful once we're already on the
blocking pool, so inline it.
@codecov

codecov Bot commented Apr 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.18605% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.52%. Comparing base (5d06cd5) to head (de63632).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/tarball/src/lib.rs 94.18% 5 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #269   +/-   ##
=======================================
  Coverage   89.52%   89.52%           
=======================================
  Files          58       58           
  Lines        4619     4630   +11     
=======================================
+ Hits         4135     4145   +10     
- Misses        484      485    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Apr 24, 2026

Copy link
Copy Markdown

Micro-Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.00     14.0±0.22ms   309.0 KB/sec    1.01     14.2±0.54ms   305.1 KB/sec

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses CI flakiness in the tarball download/extract pipeline by moving the CPU-bound + blocking post-download processing off Tokio’s reactor workers and onto the blocking thread pool, preventing the async runtime from being pinned during checksum/decompression/CAS writes and SQLite updates.

Changes:

  • Run the full post-download tarball pipeline inside tokio::task::spawn_blocking instead of tokio::task::spawn.
  • Inline the SQLite index write (remove the now-redundant nested spawn_blocking).
  • Update comments to reflect the new execution model and rationale.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tarball/src/lib.rs
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs
Comment thread crates/tarball/src/lib.rs
Comment thread crates/tarball/src/lib.rs
zkochan added 2 commits April 24, 2026 15:09
# Conflicts:
#	crates/tarball/src/lib.rs
Two changes addressing Copilot review comments on #269:

* Gate the post-download `spawn_blocking` body with a global
  `Semaphore(num_cpus * 2, floor 4)`. The body is CPU-bound (SHA-512
  over the whole tarball + per-file SHA-512 + gzip inflate) with some
  blocking FS I/O, and the default 512-thread blocking pool let
  `try_join_all` fan out hundreds of these at once — good for I/O
  wait, disastrous for CPU work on a 2-core runner. The permit is
  held across the `spawn_blocking.await` so no task enters the body
  until one is free.

* Replace `.expect(\"tarball-processing blocking task panicked\")` on
  the spawn_blocking `JoinHandle` with
  `.map_err(TarballError::TaskJoin)?`. Runtime cancellation or a
  blocking-task panic now surfaces as an error rather than unwinding
  the whole install. `TarballError::TaskJoin` already exists for this.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/tarball/src/lib.rs Outdated
@github-actions

github-actions Bot commented Apr 24, 2026

Copy link
Copy Markdown

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command Mean [s] Min [s] Max [s] Relative
pacquet@HEAD 2.664 ± 0.069 2.562 2.790 1.03 ± 0.04
pacquet@main 2.596 ± 0.064 2.509 2.710 1.00
pnpm 5.876 ± 0.064 5.803 5.998 2.26 ± 0.06
BENCHMARK_REPORT.json
{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 2.66391291706,
      "stddev": 0.06850634720034338,
      "median": 2.66042407316,
      "user": 2.7733771999999997,
      "system": 2.1696338400000004,
      "min": 2.56184656316,
      "max": 2.78953393016,
      "times": [
        2.78953393016,
        2.6591420671600003,
        2.56184656316,
        2.66170607916,
        2.75062312816,
        2.67064702416,
        2.64672620416,
        2.6704100891600002,
        2.65410467916,
        2.57438940616
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 2.59560776136,
      "stddev": 0.06397719749447535,
      "median": 2.57909789716,
      "user": 2.7739412999999997,
      "system": 2.13686164,
      "min": 2.50921192316,
      "max": 2.71001720916,
      "times": [
        2.71001720916,
        2.54344085416,
        2.58397976916,
        2.57421602516,
        2.60857426916,
        2.6530442081600003,
        2.50921192316,
        2.54783886316,
        2.55532571616,
        2.67042877616
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pnpm",
      "mean": 5.87616439526,
      "stddev": 0.06385839524139639,
      "median": 5.87455807816,
      "user": 8.685683000000001,
      "system": 2.72505444,
      "min": 5.802875316160001,
      "max": 5.99753820716,
      "times": [
        5.99753820716,
        5.82255908716,
        5.85518752716,
        5.89392862916,
        5.89710545416,
        5.802875316160001,
        5.90644541416,
        5.94764547516,
        5.8170060581600005,
        5.82135278416
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    }
  ]
}

Addresses Copilot review on #269: `post_download_semaphore().acquire()`
used to run *after* `.bytes().await`, so under `try_join_all` fan-out a
fast registry could leave hundreds of fully-buffered tarball bodies
sitting in RAM waiting for a permit to process — a significant peak-RSS
risk on large installs.

Split the HTTP call into `send()` (headers) and `bytes()` (body) and
acquire the permit between them. The permit now covers both buffering
and processing, so the number of concurrently-buffered bodies is
bounded by the post-download cap (`num_cpus * 2`, floor 4). HTTP
headers-only concurrency is still gated separately by `ThrottledClient`
so the cheap head phase can still fan out.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants