Continuation tracker for the perf work landed in #11856. The current state and the most promising next steps are below. This picks up where I left off; the file:line hooks should make the rest runnable in a fresh session.
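Where we are after #11856
Numbers below are from the `alotta-files` fixture with the verdaccio mock (`just registry-mock launch`), 5 runs / 2 warmups, on a local machine.
User CPU is now lower than pnpm's (~6 s vs ~7 s on clean-install). The remaining gap is almost entirely `sys` time, i.e. file-system syscalls during the install/import phase:

| Scenario | pacquet sys | pnpm sys |
| --- | --- | --- |
| clean-install | 42 s | 20 s |
| full-resolution | 27 s | 10 s |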
A `sample(1)` trace during install (release-debug build) shows the hot spots on tokio worker threads, ranked:
1. `store_dir::write_cas_file`: Sha512 + `open(O_CREAT|O_EXCL)` + write per file.
2. `tarball::extract_tarball_entries`: per-entry tar read + buffer + dispatch into `write_cas_file`.
3. `fs::ensure_file::write_atomic`: atomic CAS write path.
4. `store_dir::store_index::StoreIndex::get`: SQLite lookups per snapshot.
5. `store_dir::check_pkg_files_integrity::check_file`: per-file `fs::metadata` on warm reinstall.
Concrete next steps (ranked by expected impact)
1. Sequential per-tarball file write loop in `extract_tarball_entries`
`pacquet/crates/tarball/src/lib.rs:529-678` walks every tar entry in a single loop and calls `StoreDir::write_cas_file` synchronously for each. Within a tarball, files are written serially; only across tarballs is there parallelism. For a 100-file tarball, that's 100 sequential `open/write/close` syscall triples on one thread.
Investigate `tar::Entries` parallelism. The crate's iterator is sequential by construction (each `Entry` borrows the underlying reader's position), so we can't trivially `par_iter`. Two plausible shapes: (a) collect tar entries into an owned `Vec<(path, mode, Vec<u8>)>` first, then run `write_cas_file` in parallel with rayon; (b) keep the read loop sequential but move `Sha512::digest` + the write into a channel-fed rayon pipeline so disk I/O and hashing overlap.
Memory budget matters: large tarballs (e.g. `@babel/standalone`) can be >10 MB, so collecting all entries up front would spike RSS. A bounded channel that owns the `Vec<u8>` for one entry at a time keeps RSS flat.
pnpm's `store/cafs/src/addFilesFromTarball.ts` does the same sequential walk but the JIT + libuv worker pool give it implicit per-file parallelism we don't get from a single tokio worker calling a sync extract loop. Worth checking whether `extract_tarball_entries` actually runs inside `spawn_blocking` today (it should — verify via `pacquet/crates/tarball/src/lib.rs` ≈ `run_without_mem_cache`).
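A minimal sketch of shape (b), to make the memory argument concrete. It uses only `std` (a bounded `sync_channel` plus plain threads) rather than rayon, and `OwnedEntry`, `hash_and_write`, and `extract_pipelined` are hypothetical names, not pacquet's API. The hash is a `DefaultHasher` stand-in for `Sha512::digest` + `write_cas_file`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// One owned tar entry. Hypothetical shape, not pacquet's actual type.
pub struct OwnedEntry {
    pub path: String,
    pub mode: u32,
    pub data: Vec<u8>,
}

/// Stand-in for `Sha512::digest` + `write_cas_file` so the sketch needs only std.
fn hash_and_write(entry: &OwnedEntry) -> u64 {
    let mut hasher = DefaultHasher::new();
    entry.data.hash(&mut hasher);
    hasher.finish()
}

/// Reads entries sequentially on the calling thread and hands them to
/// `workers` threads through a bounded channel. At most `capacity` entries
/// are queued, so buffered bytes stay bounded regardless of tarball size.
pub fn extract_pipelined<I>(entries: I, workers: usize, capacity: usize) -> Vec<(String, u64)>
where
    I: Iterator<Item = OwnedEntry>,
{
    let (tx, rx) = sync_channel::<OwnedEntry>(capacity);
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut done = Vec::new();
                loop {
                    // The guard is dropped at the end of this statement, so the
                    // lock is held while receiving but not while hashing/writing.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(entry) => {
                            let digest = hash_and_write(&entry);
                            done.push((entry.path, digest));
                        }
                        Err(_) => break, // sender dropped: no more entries
                    }
                }
                done
            })
        })
        .collect();

    // Sequential read loop: `send` blocks once the queue is full (backpressure).
    for entry in entries {
        tx.send(entry).expect("all workers exited early");
    }
    drop(tx); // closes the channel so workers terminate

    let mut results: Vec<(String, u64)> = handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker panicked"))
        .collect();
    results.sort();
    results
}
```

Peak buffered data is roughly `capacity + workers + 1` entries: the queue, one entry in flight per worker, and the one being read. Whether this beats the current loop depends on whether the extract already runs inside `spawn_blocking`, which still needs verifying.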
2. Per-path `cas_write_lock` overhead at `pacquet/crates/fs/src/ensure_file.rs:217`
Every CAS write acquires `Arc<Mutex<()>>` from a process-wide `DashMap<PathBuf, Arc<Mutex<()>>>`. On a 1362-package install with ~100 files/package, that's ~136k `DashMap::entry` operations plus the mutex acquire. The map is never pruned.
The lock exists to coordinate writers vs. concurrent verifiers (`check_pkg_files_integrity` may delete the file while a writer is still appending; see the doc comment). For CAFS paths that are written exactly once per install — which is the vast majority of paths — the lock is dead weight.
Investigate: is there a way to skip the lock when we know the path is fresh? A `(stripe, hash[0])` lock array (e.g. 256 mutexes keyed by first byte) might be cheaper than the per-path `DashMap` and just as correct since the verifier only races writers on the same path.
pnpm's `store/cafs/src/writeFile.ts` uses `locker: Map<string, number>` — a refcount, not a mutex per path. Worth modelling here.
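The striped-lock idea could look roughly like this. It is a sketch under the assumption that the content digest is available at the lock call site; `StripedLocks` is a hypothetical name and nothing here exists in `ensure_file.rs` today.

```rust
use std::sync::{Mutex, MutexGuard};

/// Fixed array of 256 mutexes, selected by the first byte of a content hash.
/// Replaces a growing per-path map with a constant-size table.
pub struct StripedLocks {
    stripes: Vec<Mutex<()>>,
}

impl StripedLocks {
    pub fn new() -> Self {
        Self {
            stripes: (0..256).map(|_| Mutex::new(())).collect(),
        }
    }

    /// Stripe index for a digest; an empty digest falls back to stripe 0.
    pub fn stripe_of(digest: &[u8]) -> usize {
        digest.first().copied().unwrap_or(0) as usize
    }

    /// Locks the stripe that covers `digest`. Two paths with the same first
    /// byte share a stripe, so the same path always maps to the same mutex.
    pub fn lock(&self, digest: &[u8]) -> MutexGuard<'_, ()> {
        self.stripes[Self::stripe_of(digest)]
            .lock()
            // A poisoned stripe only guards `()`, so recovering is safe here.
            .unwrap_or_else(|poisoned| poisoned.into_inner())
    }
}
```

The trade-off is false sharing: unrelated paths in the same stripe serialize on each other, which the per-path `DashMap` avoids. In exchange there is no allocation per path and nothing to prune. Callers should hold at most one stripe at a time to rule out lock-order deadlocks.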
3. `StoreIndex::get` SQLite calls per snapshot
`pacquet/crates/store-dir/src/store_index.rs` exposes `get(key)` which the install path currently calls once per snapshot during `run_with_mem_cache`'s store-lookup branch. With `prefetch_cas_paths` running once at the install head (1362 keys batched), the per-snapshot `get` should be a hit-or-miss against the prefetched map and never touch SQLite. But on a cold or partial `store-dir`, snapshots that the prefetch missed still serialize on the shared `Arc<Mutex>`.
Audit `pacquet/crates/tarball/src/lib.rs::run_with_mem_cache` and `load_cached_cas_paths` for the cold-miss path. Are we taking the `Arc<Mutex>` lock per snapshot when we already have a populated `prefetched_cas_paths` for the rest of the install?
Consider exposing `store_index.bulk_get(&[key]) -> HashMap` on the `Arc` directly so the install pass can run a single `SELECT ... WHERE key IN (...)` rather than N round-trips behind the mutex.
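A sketch of the query-building half of a `bulk_get`, assuming a hypothetical table with `key` and `value` columns (the real schema in `store_index.rs` may differ). Keys are chunked because SQLite caps the number of bound parameters per statement (`SQLITE_MAX_VARIABLE_NUMBER`, 999 by default before SQLite 3.32.0).

```rust
/// Builds `SELECT key, value FROM <table> WHERE key IN (?, ?, ...)` with
/// `n` placeholders. Table and column names are illustrative assumptions.
pub fn bulk_select_sql(table: &str, n: usize) -> String {
    let placeholders = vec!["?"; n].join(", ");
    format!("SELECT key, value FROM {table} WHERE key IN ({placeholders})")
}

/// Splits `keys` into chunks of at most `max_params` and pairs each chunk
/// with the statement that should be executed for it.
pub fn bulk_queries<'a>(
    table: &str,
    keys: &'a [String],
    max_params: usize,
) -> Vec<(String, &'a [String])> {
    keys.chunks(max_params.max(1))
        .map(|chunk| (bulk_select_sql(table, chunk.len()), chunk))
        .collect()
}
```

With 1362 keys and a 999-parameter cap this yields two statements instead of 1362 round-trips, and the mutex only needs to be held across those two calls.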
4. `check_file` (warm reinstall): per-file `fs::metadata`
`pacquet/crates/store-dir/src/check_pkg_files_integrity.rs:410` stats every file to compare `mtime` against `checked_at`. On a warm reinstall (full-resolution scenario), this fires once per file per package, roughly 130k stat syscalls. The `verified_files_cache` is supposed to dedup but only at the `(file_path)` level; multiple snapshots referencing the same CAFS blob still re-stat.
Confirm `SharedVerifiedFilesCache` is actually deduping at the path level on this workload (sample says it's still hot, so maybe it isn't). `pacquet/crates/store-dir/src/lib.rs` `SharedVerifiedFilesCache`.
pnpm verifies once per blob per process and caches `{ ino, dev }` so the second consumer of a popular CAFS path (e.g. `react/index.js`) just compares inode numbers. `pacquet/crates/package-manager/src/import_indexed_dir.rs` could grow a similar cache so the `link_file` fast path skips the stat entirely when the source has been verified this install.
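The "verify once per install" cache could be sketched as below. `VerifiedThisInstall` and `check_once` are hypothetical names for illustration; this is not the existing `SharedVerifiedFilesCache`, and the `verify` closure stands in for whatever stat/integrity check the caller would otherwise run.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};
use std::sync::Mutex;

/// Remembers which CAFS paths have already passed verification during this
/// install so later consumers can skip the stat entirely.
pub struct VerifiedThisInstall {
    seen: Mutex<HashSet<PathBuf>>,
}

impl VerifiedThisInstall {
    pub fn new() -> Self {
        Self {
            seen: Mutex::new(HashSet::new()),
        }
    }

    /// Returns `true` without calling `verify` if `path` was verified before.
    /// Otherwise runs `verify` and caches the path only on success.
    /// Two threads racing on a fresh path may both verify it; that costs one
    /// redundant check but cannot produce a wrong answer.
    pub fn check_once(&self, path: &Path, verify: impl FnOnce(&Path) -> bool) -> bool {
        if self.seen.lock().unwrap().contains(path) {
            return true;
        }
        let ok = verify(path);
        if ok {
            self.seen.lock().unwrap().insert(path.to_path_buf());
        }
        ok
    }
}
```

Failed verifications are deliberately not cached, so a blob that gets rewritten later in the same install is checked again.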
5. `load_meta` still appears in samples (~145 in install-phase, ~110 in resolve-phase)
The packument cache fix in `c5562c8d01` collapsed most of the resolve-side hits, but `load_meta` is still in the top 10. Spot-check what's hitting it post-fix — most likely the cold-mirror path on a brand-new registry where the first install per process has to materialize the mirror from network. Less impact than the items above but worth a profile pass after #1-3 land.
Reproducing the bench
```sh
just registry-mock launch # one-time
cd /Volumes/src/pnpm/pnpm/<worktree>
cargo run --release --bin=integrated-benchmark -- \
  --scenario clean-install \
  --registry-port <port-from-launch> \
  --runs 5 --warmup 2 \
  --with-pnpm \
  pacquet@HEAD pacquet@main
```
For full-resolution, swap `--scenario full-resolution`. For per-phase timing, set `TRACE=pacquet=info` and run `./bench-work-env/pacquet@HEAD/pacquet/target/release/pacquet install` directly — the per-phase `elapsed_ms` is emitted to stderr via `tracing::info!(target: "pacquet::install::phase", ...)` at `pacquet/crates/package-manager/src/install_with_fresh_lockfile.rs` (search for "phase complete").
For CPU sampling on macOS with symbols, run `sample(1)` against a release-debug build, as for the trace above.
Gotchas the bench harness will trip you on
`cache_dir` is project-global, not bench-scoped. With `XDG_CACHE_HOME=/.cache` set, pacquet writes mirrors to `/.cache/pnpm/v11/metadata/localhost+/`. The harness only wipes `bench-work-env/pacquet@HEAD/{node_modules,pnpm-lock.yaml,store-dir}` between iterations, so the metadata mirror is warm across runs. Verify with `ls ~/.cache/pnpm/v11/metadata/localhost+ | wc -l` after the first iteration.
The verdaccio mock degrades under sustained load — its RSS grows and request latency creeps up over a long bench session. If you see pacquet's wall time monotonically growing across runs while user-CPU stays flat, restart the mock (`just registry-mock end && just registry-mock launch`). It picks a new port each time, so update `--registry-port` accordingly.
The integrated-benchmark requires running from the repo root so `canonicalize(".")` resolves to the pacquet git repo. The verifier now accepts a `.git` file (linked git worktree) as well as a directory; see the verifier change in #11856.
`pacquet@HEAD` builds inside `bench-work-env/` — it does a fresh `git fetch` + checkout of the SHA, then `cargo build --release`. Two minutes per cold build. Working-tree changes are not picked up until you commit them and the bench reruns the build.
Pointers used in the prior session
Per-phase timing emits: search `tracing::info!(target: "pacquet::install::phase"` in `pacquet/crates/package-manager/src/install_with_fresh_lockfile.rs`.
Suggested attack order
Start with #1 (tarball extract parallelism) — biggest sample share, likely 5-10s of wallclock on clean-install. Re-bench. Then #3 (StoreIndex bulk lookup) if cold-store install is still slow. #2 and #4 are smaller wins but cheap to land. #5 last, after re-sampling.
Written by an agent (Claude Code, claude-opus-4-7).