perf: batch store-index prefetch + drive warm link phase on rayon#292
Conversation
…m link phase via single rayon par_iter Two structural changes that together cut warm-cache install wall time on a 1352-package lockfile from ~10.85s to ~6.5s, beating pnpm v11 (~8.5s) by ~23% on the same machine. 1. `pacquet_tarball::prefetch_cas_paths` runs every store-index SELECT + integrity check the lockfile is going to need in one upfront `spawn_blocking` task, returning a `cache_key → cas_paths` map. `DownloadTarballToStore` now consults this map before falling through to its per-call `load_cached_cas_paths`. Eliminates the 1352 separate `spawn_blocking` round-trips a warm install was doing — sample-profiling showed those each accumulating 20-60 ms of OS scheduling overhead while their actual work is sub-ms. 2. In `create_virtual_store::run`, partition snapshots into "warm" (prefetched) and "cold" (cache miss). The warm batch runs entirely on rayon via one big `par_iter` over the snapshot list — bypasses tokio futures, `try_join_all`, and per-snapshot `rayon::join`. The cold batch keeps the existing `try_join_all` + download path. Previously every per-snapshot tokio future called sync `rayon::join` from a tokio worker; with up to 10 such futures progressing concurrently and each one's inner par_iter saturating the rayon pool, the pool ended up processing one snapshot at a time despite tokio's worker count (`sum-of-link ≈ wall`, i.e. 1× parallelism). Going straight to one big `par_iter` lets rayon schedule across all 1352 snapshots as one work-stealing graph, the shape pnpm's piscina pool gives implicitly. Plus minor tuning: `crates/cli/src/lib.rs::configure_rayon_pool` sizes the global rayon pool at `availableParallelism * 2`. The link phase is dominated by `clonefile` syscalls that block on the kernel metadata journal, not CPU; oversubscribing CPUs keeps more syscalls in flight without losing wall time. Sweep on alot7: 4→9.5s, 10→9.4s, 20→9.25s, 50→9.42s, 100→10.2s — 2× was the knee. Honours an explicit `RAYON_NUM_THREADS` override. Verified `just ready` (typos / fmt / check / nextest / lint), and that `pacquet install` produces an identical `node_modules` to pnpm on the same lockfile (1495 package.json, 400M).
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #292 +/- ##
==========================================
- Coverage 89.46% 88.60% -0.86%
==========================================
Files 61 61
Lines 5355 5662 +307
==========================================
+ Hits 4791 5017 +226
- Misses 564 645 +81 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Micro-Benchmark ResultsLinux |
There was a problem hiding this comment.
Pull request overview
This PR optimizes warm-cache install performance by (1) batching store-index cache lookups into a single prefetch step and (2) running the warm link/import phase as one rayon-parallel batch, plus configuring rayon’s global thread pool for syscall-heavy linking.
Changes:
- Add
prefetch_cas_paths+PrefetchedCasPathsand plumb a prefetched lookup map intoDownloadTarballToStore. - Rework
CreateVirtualStoreto prefetch cache keys once, split warm vs cold snapshots, and process the warm batch via a singlerayon::par_iter. - Configure rayon global pool sizing in the CLI (2× CPU count, unless
RAYON_NUM_THREADSis set).
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tasks/micro-benchmark/src/main.rs | Updates tarball benchmark to populate the new prefetched_cas_paths field. |
| crates/tarball/src/lib.rs | Introduces PrefetchedCasPaths + prefetch_cas_paths and consults prefetched results in DownloadTarballToStore. |
| crates/package-manager/src/create_virtual_store.rs | Adds batched prefetch and warm/cold partitioning; runs warm installs on rayon. |
| crates/package-manager/src/install_package_from_registry.rs | Sets prefetched_cas_paths: None for registry installs (non-prefetch path). |
| crates/package-manager/src/install_package_by_snapshot.rs | Threads prefetched map through snapshot installs. |
| crates/cli/src/lib.rs | Builds rayon global pool early and sizes it for syscall-bound link work. |
| crates/cli/Cargo.toml | Adds rayon and num_cpus dependencies. |
| Cargo.lock | Updates lockfile for the added dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Integrated-Benchmark Report (Linux)Scenario: Frozen Lockfile
BENCHMARK_REPORT.json{
"results": [
{
"command": "pacquet@HEAD",
"mean": 2.92706444658,
"stddev": 0.1202433825786987,
"median": 2.9144155669800003,
"user": 2.5687368800000003,
"system": 3.51470238,
"min": 2.7391405574800003,
"max": 3.1694147954800003,
"times": [
3.1694147954800003,
3.06216673548,
2.85642181648,
2.8737411884800004,
2.9670828994800003,
2.9295761754800003,
2.8400883724800003,
2.8992549584800003,
2.9337569664800003,
2.7391405574800003
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
},
{
"command": "pacquet@main",
"mean": 2.83942509298,
"stddev": 0.13281203972612804,
"median": 2.81692019448,
"user": 2.54584968,
"system": 3.3272771800000003,
"min": 2.70832509448,
"max": 3.08801054648,
"times": [
2.8090819864800003,
2.70832509448,
3.08801054648,
2.83769050048,
2.75718810748,
2.73013330848,
3.06577862948,
2.7471772104800003,
2.82475840248,
2.8261071434800002
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
},
{
"command": "pnpm",
"mean": 6.3481577938800005,
"stddev": 0.05547111378880295,
"median": 6.34119294048,
"user": 8.65827898,
"system": 4.32689188,
"min": 6.25613580348,
"max": 6.44547117148,
"times": [
6.40425251348,
6.44547117148,
6.32917973148,
6.307393369480001,
6.3592412974800006,
6.30205319548,
6.25613580348,
6.33987547148,
6.39546497548,
6.34251040948
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
} |
- Drop the store-index mutex between the SELECT loop and per-package integrity checks in `prefetch_cas_paths`. The original implementation held the lock across the whole batch, including the filesystem-bound `check_pkg_files_integrity` work; restructured so the lock is held only for the SELECTs. - Wrap prefetched values in `Arc<HashMap<…>>`. The cold-batch fallback in `DownloadTarballToStore::run_without_mem_cache` previously had to deep-clone the per-file map on every prefetched hit to honour the function's owned-`HashMap` return type; cloning the `Arc` first drops that cost to one heap-allocation regardless of file count. - Dedup `cache_keys` before calling `prefetch_cas_paths`. Lockfiles with peer-dependency variants of the same package collapse to one cache key (the key uses `metadata_key.without_peer()`), so the unsorted vec was making the prefetch redo identical SELECT + integrity-check work for every variant. - Add a unit test for the prefetched short-circuit: `reuses_prefetched_cas_paths_when_provided` proves `run_without_mem_cache` returns the prefetched map without touching SQLite or the network. - Doc fix: the rayon-pool comment said `2 × availableParallelism` but the code uses `num_cpus::get()`. Reword to match the workspace's existing `num_cpus` convention (e.g. `tarball::post_download_semaphore`).
|
The automated benchmarks don't reflect these improvements. We'll need to add more scenarios. I was working on this for a couple of days - it was the hardest to fix. This is a headless install with hot cache. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- `(**cas_paths).clone()` in the cold-fallback path is a deep clone of the per-file `HashMap`, not an `Arc::clone`. Reword the comment to match: the `Arc` wrapping in `PrefetchedCasPaths` saves the deep clone on the new warm-batch path (which uses `cas_paths.as_ref()`), not on this fallback path. Propagating the `Arc` through `DownloadTarballToStore::run_without_mem_cache`'s return type would be a wider refactor. - Gate the warm-batch `par_iter` on `tokio::task::block_in_place` when running on a multi-thread runtime, falling back to a plain inline call on `current_thread`. The previous unconditional `par_iter` from inside an `async fn` blocked one tokio worker for the whole warm batch — fine on the production multi-thread runtime, but on a 2-core / 2-worker runtime it would halve async progress for the rest of the install. `block_in_place` migrates any other futures off the calling worker first, but it panics on `current_thread` (which is what `#[tokio::test]` uses by default), so the flavor check keeps the existing test suite working. - Switch `configure_rayon_pool` from `num_cpus::get()` to `std::thread::available_parallelism()`. `num_cpus` reports the host's logical CPU count regardless of cgroup / CPU-quota limits, which on a quota-limited runner could spin up more rayon threads than the kernel will actually schedule onto our cores. Drop the `num_cpus` dep from `pacquet-cli` since this was its only use.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Extract `snapshot_cache_key` helper for the `(integrity, pkg_id)` → cache-key derivation. Both call sites (the upfront `cache_keys` build and the warm/cold partition loop) now go through the same helper, so a future change to the resolution-type handling or key shape stays in one place. A drift between the two loops would silently misclassify warm entries as cold and quietly halve install speed. - Document the intentional `.max(4)` floor in `configure_rayon_pool`. Even on a quota-limited 1-2-CPU runner, dropping below 4 puts rayon back into the regime where one thread blocks on `clonefile` while the next ready snapshot can't start; the metadata-journal bottleneck doesn't shrink with quota, so a small intentional oversubscription is the better trade. Both the function-level doc and an inline comment by the `.max(4)` call explain the choice.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Previously `CreateVirtualStore::run` walked `snapshots` twice and
called `snapshot_cache_key` on each pass — once to build the deduped
prefetch input, once to partition warm vs cold. Each call allocates
two strings (the integrity copy and the `format!`-built cache key),
so the second walk was paying ~2 × snapshots × {2 String allocs} for
no benefit.
Single-pass shape:
1. Walk `snapshots` once, build a `Vec<(snapshot_key, snapshot,
Option<String>)>` that owns the per-snapshot cache key.
2. Collect deduped `&str` refs into a `Vec<&str>`, sort + dedup,
then collect to owned `String` for the prefetch boundary —
so the prefetch input contains one `String` per *unique* key
rather than one per snapshot (peer-variant lockfiles already
gave us cheap dedup; this also removes the dedup-stage clones).
3. Partition by re-using the stashed `Option<String>` on each
entry: a `cache_key.as_deref().and_then(|k| prefetched.get(k))`
lookup, no key recomputation.
Net: one `snapshot_cache_key` call per snapshot (was two), and the
prefetch sees a deduped owned `Vec<String>` of size
`unique_cache_keys` (was `snapshots.len()` before its own internal
sort/dedup). Bench is unchanged within noise on the alot7 sandbox
(~6.18 s warm); the win is allocation pressure on huge lockfiles
where peer-variant duplication is high.
Copilot follow-up review on #292.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Move `configure_rayon_pool` to after `CliArgs::parse()`. clap's `--help` / `--version` paths exit inside `parse`, so they never reach the rayon pool init now — `pacquet --help` no longer spawns worker threads it'll throw away. Other subcommand paths still pay the (small) init cost; we can move it into the install path specifically as a follow-up if non-install commands are ever added that don't need rayon. - Drop the local `CasPathsArc` / `WarmEntry` aliases in `create_virtual_store::run`. They duplicated the value type of `pacquet_tarball::PrefetchedCasPaths`; if that map's value type changes, the local aliases would silently drift. Letting `warm`'s element type be inferred from its `push` calls keeps the binding to the upstream type. - Add three unit tests for `prefetch_cas_paths`: * `prefetch_cas_paths_returns_hits_for_live_index_rows` — happy path: index row written, CAFS blob present, verify on; result contains the expected file map. * `prefetch_cas_paths_omits_failed_integrity_entries` — index row whose digest matches no on-disk file; verify on; row is dropped from the result rather than surfacing a half-populated map. * `prefetch_cas_paths_skips_filesystem_checks_when_verify_disabled` — same shape as the failing-integrity test but with `verify_store_integrity = false`; the row surfaces (no fs checks), matching pnpm's `verify-store-integrity: false` semantics.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Change `snapshot_cache_key` from `Option<String>` to `Result<Option<String>, CreateVirtualStoreError>`: - Missing metadata, missing tarball integrity, and currently- unsupported `directory` / `git` resolutions all return `Err(...)` with the same error variants the cold batch's `InstallPackageBySnapshot::run` would surface anyway. - `Ok(Some(key))` for valid tarball / registry resolutions (today the only shape that participates in the CAFS prefetch). - `Ok(None)` is reserved for any future legitimate non-CAFS resolution variant, so adding one later doesn't force a wider refactor. The orchestrator validates every snapshot before starting the warm batch — `snapshot_entries` is built with `.collect::<Result<_, _>>()?` — so a malformed lockfile fails immediately instead of after the ~6-second warm rayon batch. Previously those snapshots were collapsed into `None` and shoved into the cold batch, which only runs *after* warm completes; the same error fired but the user waited through a wasted warm phase first (Copilot review on #292).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
The Copilot comment on Rust's match ergonomics destructure |
…ssion (#298) ## Summary Closes #294. Replaces the per-key SELECT loop in `prefetch_cas_paths` with a single batched `SELECT key, data FROM package_index WHERE key IN (?, ?, …)`, chunked at 999 placeholders. SQLite walks the `package_index` PK B-tree once per chunk instead of once per key, collapsing N round-trips across the SQLite mutex into one query. This drops the ~50 ms of all-miss query overhead the empty-store install paid after #292 introduced the install-wide prefetch — the cold-cache regression the issue reports. ## Changes - `StoreIndex::get_many(&[String]) -> Result<HashMap<String, PackageFilesIndex>, StoreIndexError>` — batched lookup, chunked at `GET_MANY_CHUNK = 999` (safe vs. legacy `SQLITE_MAX_VARIABLE_NUMBER` builds; rusqlite's bundled SQLite caps at 32766). - `prefetch_cas_paths` Phase 1 now calls `guard.get_many(&cache_keys)` instead of looping `guard.get(&cache_key)` per key. ## Behaviour notes - Decode failures on a row are logged at `debug!` and skipped, matching `load_cached_cas_paths`'s `.ok()?` stance on the per-key path — a malformed row is a cache miss, not a batch failure. - A SQLite error during the batched read bubbles up as `StoreIndexError::Read`; `prefetch_cas_paths` already turns that into a soft fallback (per-snapshot lookups).
Summary
Two structural changes that together cut warm-cache install wall time on a 1352-package lockfile from ~10.85s → ~6.5s, beating pnpm v11.0.0-rc.5 (~8.5s on the same machine) by ~23%.
mainbaselineMeasured on macOS (10-core M-class, APFS, store on same volume), warm cache, 8 runs each,
ulimit -n 65536.Changes
1. Batched store-index prefetch (
crates/tarball/src/lib.rs::prefetch_cas_paths)Walks every
(integrity, pkg_id)the lockfile mentions in onespawn_blockingtask at install start, runs all integrity checks, and returns acache_key → cas_pathsHashMap.DownloadTarballToStoreconsults this map before falling through to the per-callload_cached_cas_paths.This eliminates 1352 separate
spawn_blockinground-trips a warm install was doing. Sample-profiling showed those each accumulating 20-60 ms of OS scheduling overhead while their actual work (≈40 µs SELECT + per-file integrity stats) is sub-ms — the queue across the default 512-thread blocking pool was dwarfing the work itself.2. Warm batch via one global
rayon::par_iter(crates/package-manager/src/create_virtual_store.rs)Partition snapshots into "warm" (prefetched) and "cold" (cache miss). The warm batch bypasses tokio futures entirely:
The cold batch keeps the existing
try_join_all+ download path (rare on warm install, common on cold).Why this matters: previously every per-snapshot tokio future called sync
rayon::joinfrom a tokio worker. With up to 10 such futures progressing concurrently and each one's innerpar_itersaturating the rayon pool, the pool ended up processing one snapshot at a time despite tokio's worker count —sum-of-link ≈ wall, i.e. effectively 1× parallelism. Going straight to one bigpar_iterlets rayon schedule across all 1352 snapshots as one work-stealing graph, the shape pnpm's piscina pool gives implicitly.3. Rayon pool sized at
availableParallelism * 2(crates/cli/src/lib.rs::configure_rayon_pool)The link phase is dominated by
clonefilesyscalls that block on the kernel's metadata journal, not CPU work. Oversubscribing CPUs gives more in-flight syscalls without losing wall time. Sweep on alot7:2x was the knee. Honours an explicit
RAYON_NUM_THREADSenv var.Test plan
just ready(typos, fmt, check, full nextest, lint) — clean.pacquet install --frozen-lockfileonalot7produces an identicalnode_modulesto pnpm: 1495package.jsonfiles, 400M.