fix(tarball): run post-download pipeline on blocking pool#269
Conversation
The `frozen-lockfile` benchmark job has been flaky on CI with the 10-minute
step timeout firing during `Benchmark 1: pacquet@HEAD`. The logs show all
downloads completing from the network's point of view but ~1115 tarballs
stuck between "Download completed" and "Checksum verified", then ~3m46s
of total tracing silence before the runner kills the process.
Cause: the post-download work (SHA-512 of the full tarball, gzip inflate,
per-file SHA-512, `write_cas_file`, SQLite index INSERT) ran inside
`tokio::task::spawn(async move { ... })`. The body contains no `.await`
until the final nested `spawn_blocking` for the SQLite write, so each
tarball pins a tokio reactor worker for its whole duration. On a 2-core
GHA runner that caps throughput at ~2 tarballs in flight, and the tail
of large packages (100-200 KB compressed, MB+ inflated, dozens of files
each) overruns the CI step budget.
Swap the outer `spawn` to `spawn_blocking` so the work runs on tokio's
blocking pool (default 512 threads) instead. The nested `spawn_blocking`
around the SQLite write is no longer useful once we're already on the
blocking pool, so inline it.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #269 +/- ##
=======================================
Coverage 89.52% 89.52%
=======================================
Files 58 58
Lines 4619 4630 +11
=======================================
+ Hits 4135 4145 +10
- Misses 484 485 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Micro-Benchmark ResultsLinux |
There was a problem hiding this comment.
Pull request overview
This PR addresses CI flakiness in the tarball download/extract pipeline by moving the CPU-bound + blocking post-download processing off Tokio’s reactor workers and onto the blocking thread pool, preventing the async runtime from being pinned during checksum/decompression/CAS writes and SQLite updates.
Changes:
- Run the full post-download tarball pipeline inside
tokio::task::spawn_blockinginstead oftokio::task::spawn. - Inline the SQLite index write (remove the now-redundant nested
spawn_blocking). - Update comments to reflect the new execution model and rationale.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two changes addressing Copilot review comments on #269: * Gate the post-download `spawn_blocking` body with a global `Semaphore(num_cpus * 2, floor 4)`. The body is CPU-bound (SHA-512 over the whole tarball + per-file SHA-512 + gzip inflate) with some blocking FS I/O, and the default 512-thread blocking pool let `try_join_all` fan out hundreds of these at once — good for I/O wait, disastrous for CPU work on a 2-core runner. The permit is held across the `spawn_blocking.await` so no task enters the body until one is free. * Replace `.expect(\"tarball-processing blocking task panicked\")` on the spawn_blocking `JoinHandle` with `.map_err(TarballError::TaskJoin)?`. Runtime cancellation or a blocking-task panic now surfaces as an error rather than unwinding the whole install. `TarballError::TaskJoin` already exists for this.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Integrated-Benchmark Report (Linux)Scenario: Frozen Lockfile
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 2.66391291706,
"stddev": 0.06850634720034338,
"median": 2.66042407316,
"user": 2.7733771999999997,
"system": 2.1696338400000004,
"min": 2.56184656316,
"max": 2.78953393016,
"times": [
2.78953393016,
2.6591420671600003,
2.56184656316,
2.66170607916,
2.75062312816,
2.67064702416,
2.64672620416,
2.6704100891600002,
2.65410467916,
2.57438940616
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
},
{
"command": "pacquet@main",
"mean": 2.59560776136,
"stddev": 0.06397719749447535,
"median": 2.57909789716,
"user": 2.7739412999999997,
"system": 2.13686164,
"min": 2.50921192316,
"max": 2.71001720916,
"times": [
2.71001720916,
2.54344085416,
2.58397976916,
2.57421602516,
2.60857426916,
2.6530442081600003,
2.50921192316,
2.54783886316,
2.55532571616,
2.67042877616
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
},
{
"command": "pnpm",
"mean": 5.87616439526,
"stddev": 0.06385839524139639,
"median": 5.87455807816,
"user": 8.685683000000001,
"system": 2.72505444,
"min": 5.802875316160001,
"max": 5.99753820716,
"times": [
5.99753820716,
5.82255908716,
5.85518752716,
5.89392862916,
5.89710545416,
5.802875316160001,
5.90644541416,
5.94764547516,
5.8170060581600005,
5.82135278416
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
} |
Addresses Copilot review on #269: `post_download_semaphore().acquire()` used to run *after* `.bytes().await`, so under `try_join_all` fan-out a fast registry could leave hundreds of fully-buffered tarball bodies sitting in RAM waiting for a permit to process — a significant peak-RSS risk on large installs. Split the HTTP call into `send()` (headers) and `bytes()` (body) and acquire the permit between them. The permit now covers both buffering and processing, so the number of concurrently-buffered bodies is bounded by the post-download cap (`num_cpus * 2`, floor 4). HTTP headers-only concurrency is still gated separately by `ThrottledClient` so the cheap head phase can still fan out.
Summary
The
frozen-lockfilebenchmark job has been flaky, with the 10-minute step timeout firing duringBenchmark 1: pacquet@HEAD(e.g. this run's log). Diagnosis:Download completed— network side was fine (the reqwest timeouts from fix(network): set explicit timeouts on default reqwest client; enable CI tracing #267 did their job).Checksum verified— ~1115 stuck in the post-download stage.12:07:55, then 3m46s of total silence before the runner SIGKILLed the job.pacquetwas still alive in the orphan-process list.The stuck stage was this block in
crates/tarball/src/lib.rs:The body is entirely synchronous CPU + blocking I/O with no
.awaituntil the final nestedspawn_blocking. Every tarball pins a tokio reactor worker for its full duration. On a 2-core GHA runner that caps concurrency at ~2 tarballs, and the tail of large packages (100-200 KB compressed, MB+ inflated, dozens of files each) overruns the 10-minute step budget. The per-second throughput trace makes it obvious:Change
tokio::task::spawn(async move { ... })→tokio::task::spawn_blocking(move || { ... }). The work now runs on tokio's blocking pool (default 512 threads) instead of the 2 reactor workers.spawn_blockingaround the SQLite write is no longer useful — we're already on the blocking pool — so it's inlined as a direct synchronous call.!Sendtar::Archivedrop-ordering reasoning no longer applies; the scope is kept for memory efficiency since the inflated buffer can be multi-MB).What this does NOT address
CreateVirtualDirBySnapshot.run()incrates/package-manager/src/install_package_by_snapshot.rsis still called synchronously from async context, and internally usesrayon::par_iter(create_cas_files,create_symlink_layout), which also blocks a tokio worker. That's a smaller window than the tarball pipeline (dozens of files vs. the full SHA + gunzip + per-file loop), but if CI still flakes after this, that's the next thing to wrap inspawn_blocking.Test plan
cargo check --workspace— clean.cargo test -p pacquet-tarball— 7/7 pass (includingpackages_under_orgs_should_work, which actually downloads from npmjs.org).cargo test -p pacquet-package-manager— 28/28 pass (includingshould_install_dependencies, the end-to-end install flow).Benchmark: Frozen Lockfilejob — this is the real verification; the hang is timing-dependent and only triggers under the 1352-snapshot fixture's load profile.