
perf(pacquet): close the remaining install-phase sys-time gap to pnpm #11857

Description

@zkochan

Continuation tracker for the perf work landed in #11856. The current state and the most promising next steps are listed below. The file:line pointers should make the rest of this work resumable in a fresh session.

Where we are after #11856

On the `alotta-files` fixture with the verdaccio mock (`just registry-mock launch`), 5 runs / 2 warmups, local machine:

| Scenario | pacquet@main | pacquet@HEAD (#11856) | pnpm | Ratio |
|---|---|---|---|---|
| clean-install | 31.7 s | 12.3 s | 7.9 s | 1.56× slower than pnpm |
| full-resolution | 40.9 s | 27.4 s | 11.5 s | 2.39× slower than pnpm |

User CPU is now lower than pnpm (~6 s vs ~7 s on clean-install). The remaining gap is almost entirely `sys` time — file-system syscalls during the install/import phase:

| Scenario | pacquet sys | pnpm sys |
|---|---|---|
| clean-install | 42 s | 20 s |
| full-resolution | 27 s | 10 s |

A `sample(1)` trace during install (release-debug build) shows the hot spots on tokio worker threads, ranked:

  1. `store_dir::write_cas_file` — Sha512 + open(O_CREAT|O_EXCL) + write per file.
  2. `tarball::extract_tarball_entries` — per-entry tar read + buffer + dispatch into `write_cas_file`.
  3. `fs::ensure_file::write_atomic` — atomic CAS write path.
  4. `fs::ensure_file::cas_write_lock` — per-path `DashMap` entry + Mutex acquire.
  5. `store_dir::store_index::StoreIndex::get` — SQLite lookups per snapshot.
  6. `store_dir::check_pkg_files_integrity::check_file` — per-file `fs::metadata` on warm reinstall.

Concrete next steps (ranked by expected impact)

1. Sequential per-tarball file write loop in `extract_tarball_entries`

`pacquet/crates/tarball/src/lib.rs:529-678` walks every tar entry in a single loop and calls `StoreDir::write_cas_file` synchronously for each. Within a tarball, files are written serially; only across tarballs is there parallelism. For a 100-file tarball, that's 100 sequential `open/write/close` syscall triples on one thread.

  • Investigate `tar::Entries` parallelism. The crate's iterator is sequential by construction (each `Entry` borrows the underlying reader's position), so we can't trivially `par_iter`. Two plausible shapes: (a) collect tar entries into an owned `Vec<(path, mode, Vec<u8>)>` first, then run `write_cas_file` in parallel with rayon; (b) keep the loop sequential but move `Sha512::digest` and the write into a channel-fed rayon pipeline so disk I/O and hashing overlap.
  • Memory budget matters: large tarballs (e.g. `@babel/standalone`) can be >10 MB, so collecting all entries up front would spike RSS. A bounded channel that owns the `Vec<u8>` for one entry at a time keeps RSS flat.
  • pnpm's `store/cafs/src/addFilesFromTarball.ts` does the same sequential walk but the JIT + libuv worker pool give it implicit per-file parallelism we don't get from a single tokio worker calling a sync extract loop. Worth checking whether `extract_tarball_entries` actually runs inside `spawn_blocking` today (it should — verify via `pacquet/crates/tarball/src/lib.rs` ≈ `run_without_mem_cache`).
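
To make shape (b) concrete, here is a minimal std-only sketch of a bounded, channel-fed pipeline. It is an illustration under assumptions, not pacquet's code: `OwnedEntry`, `hash_and_store`, and `extract_pipelined` are hypothetical names, plain threads stand in for rayon, and `DefaultHasher` stands in for Sha512 plus the CAS write so the example has no external dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// One owned tar entry: (path inside the tarball, file bytes). Hypothetical type.
type OwnedEntry = (String, Vec<u8>);

/// Stand-in for "Sha512 + write_cas_file". It only hashes the bytes with
/// std's DefaultHasher; a real version would hash and write to the store.
fn hash_and_store(entry: &OwnedEntry) -> (String, u64) {
    let mut hasher = DefaultHasher::new();
    entry.1.hash(&mut hasher);
    (entry.0.clone(), hasher.finish())
}

/// Feed entries from a sequential producer into `workers` threads through a
/// channel holding at most `capacity` entries, so buffered memory is bounded.
fn extract_pipelined<I>(entries: I, workers: usize, capacity: usize) -> Vec<(String, u64)>
where
    I: Iterator<Item = OwnedEntry>,
{
    let (tx, rx) = sync_channel::<OwnedEntry>(capacity);
    let rx = Arc::new(Mutex::new(rx));
    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut done = Vec::new();
                loop {
                    // The guard is dropped at the end of this statement, so
                    // the lock is held while receiving but not while hashing.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(entry) => done.push(hash_and_store(&entry)),
                        Err(_) => break, // sender dropped: no more entries
                    }
                }
                done
            })
        })
        .collect();

    // The producer stays sequential, mirroring the tar iterator. `send`
    // blocks while the channel is full, which is what keeps RSS flat.
    for entry in entries {
        tx.send(entry).expect("workers exited early");
    }
    drop(tx);

    let mut results: Vec<(String, u64)> = handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker panicked"))
        .collect();
    results.sort();
    results
}
```

With `capacity` set to a small number, at most `capacity + workers + 1` entries are alive at once regardless of tarball size. Whether the extra thread hand-off pays for itself on small files is something only a re-bench can answer.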

2. Per-path `cas_write_lock` overhead at `pacquet/crates/fs/src/ensure_file.rs:217`

Every CAS write acquires `Arc<Mutex<()>>` from a process-wide `DashMap<PathBuf, Arc<Mutex<()>>>`. On a 1362-package install with ~100 files/package, that's ~136k `DashMap::entry` operations plus the mutex acquire. The map is never pruned.

  • The lock exists to coordinate writers vs. concurrent verifiers (`check_pkg_files_integrity` may delete the file while a writer is still appending; see the doc comment). For CAFS paths that are written exactly once per install — which is the vast majority of paths — the lock is dead weight.
  • Investigate: is there a way to skip the lock when we know the path is fresh? A `(stripe, hash[0])` lock array (e.g. 256 mutexes keyed by first byte) might be cheaper than the per-path `DashMap` and just as correct since the verifier only races writers on the same path.
  • pnpm's `store/cafs/src/writeFile.ts` uses `locker: Map<string, number>` — a refcount, not a mutex per path. Worth modelling here.
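
As a rough illustration of the striped-lock idea, the sketch below replaces a per-path map with a fixed set of 256 mutexes indexed by the first byte of the content digest. `StripedLocks` and its methods are hypothetical names, and the sketch assumes the caller already has the digest bytes at the point where it would otherwise look up the per-path lock.

```rust
use std::sync::{Mutex, MutexGuard};

/// Fixed set of 256 mutexes, indexed by the first byte of a content digest.
/// Hypothetical replacement shape, not pacquet's current code.
pub struct StripedLocks {
    stripes: Vec<Mutex<()>>,
}

impl StripedLocks {
    pub fn new() -> Self {
        Self {
            stripes: (0..256).map(|_| Mutex::new(())).collect(),
        }
    }

    /// Stripe index for a digest: its first byte (0 for an empty digest).
    pub fn stripe_of(digest: &[u8]) -> usize {
        digest.first().copied().unwrap_or(0) as usize
    }

    /// Lock the stripe covering `digest`. Two writers to the same path always
    /// map to the same stripe, so they still exclude each other; writers to
    /// different paths only contend when their digests share a first byte.
    pub fn lock(&self, digest: &[u8]) -> MutexGuard<'_, ()> {
        self.stripes[Self::stripe_of(digest)]
            .lock()
            // The mutex guards no data, so a poisoned lock is safe to reuse.
            .unwrap_or_else(|poisoned| poisoned.into_inner())
    }
}
```

The trade-off is a fixed memory footprint and no map lookup or allocation per write, at the cost of occasional false contention between unrelated paths on the same stripe.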

3. `StoreIndex::get` SQLite calls per snapshot

`pacquet/crates/store-dir/src/store_index.rs` exposes `get(key)` which the install path currently calls once per snapshot during `run_with_mem_cache`'s store-lookup branch. With `prefetch_cas_paths` running once at the install head (1362 keys batched), the per-snapshot `get` should be a hit-or-miss against the prefetched map and never touch SQLite. But on a cold or partial `store-dir`, snapshots that the prefetch missed still serialize on the shared `Arc<Mutex>`.

  • Audit `pacquet/crates/tarball/src/lib.rs::run_with_mem_cache` and `load_cached_cas_paths` for the cold-miss path. Are we taking the `Arc<Mutex>` lock per snapshot when we already have a populated `prefetched_cas_paths` for the rest of the install?
  • Consider exposing `store_index.bulk_get(&[key]) -> HashMap` on the `Arc` directly so the install pass can run a single `SELECT ... WHERE key IN (...)` rather than N round-trips behind the mutex.
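
A small sketch of how the bulk lookup could be staged: answer what the prefetched map already covers, then build one parameterized statement for the remaining keys. Everything named here is an assumption for illustration: the helper names, the `String` value type, and the table and column names (`package_index`, `key`, `data`) are hypothetical and not taken from `store_index.rs`.

```rust
use std::collections::HashMap;

/// Split `keys` into hits served from the prefetched map and misses that
/// still need the database.
fn split_hits_and_misses<'a>(
    prefetched: &HashMap<String, String>,
    keys: &'a [String],
) -> (HashMap<String, String>, Vec<&'a str>) {
    let mut hits = HashMap::new();
    let mut misses = Vec::new();
    for key in keys {
        match prefetched.get(key) {
            Some(value) => {
                hits.insert(key.clone(), value.clone());
            }
            None => misses.push(key.as_str()),
        }
    }
    (hits, misses)
}

/// Build one parameterized query covering all misses, or `None` when there
/// are none. Table `package_index` and columns `key`, `data` are hypothetical.
fn bulk_get_sql(miss_count: usize) -> Option<String> {
    if miss_count == 0 {
        return None;
    }
    let placeholders = vec!["?"; miss_count].join(", ");
    Some(format!(
        "SELECT key, data FROM package_index WHERE key IN ({placeholders})"
    ))
}
```

One caveat if this is ever wired up: SQLite caps the number of bound parameters per statement (999 by default before 3.32.0, 32766 from 3.32.0 on), so a 1362-key batch may need to be chunked depending on the linked SQLite version.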

4. `check_file` (warm reinstall) — per-file `fs::metadata`

`pacquet/crates/store-dir/src/check_pkg_files_integrity.rs:410` stats every file to compare `mtime` against `checked_at`. On a warm reinstall (full-resolution scenario), this fires once per file per package — ~130k stat syscalls. The `verified_files_cache` is supposed to dedup but only at the `(file_path)` level; multiple snapshots referencing the same CAFS blob still re-stat.

  • Confirm `SharedVerifiedFilesCache` is actually deduping at the path level on this workload (the sample says it's still hot, so maybe it isn't). See `SharedVerifiedFilesCache` in `pacquet/crates/store-dir/src/lib.rs`.
  • pnpm verifies once per blob per process and caches `{ ino, dev }` so the second consumer of a popular CAFS path (e.g. `react/index.js`) just compares inode numbers. `pacquet/crates/package-manager/src/import_indexed_dir.rs` could grow a similar cache so the `link_file` fast path skips the stat entirely when the source has been verified this install.
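
The "verify once per install" idea can be sketched as a shared set of already-verified paths that is consulted before any stat. This is a simplified illustration: `VerifiedThisInstall` is a hypothetical type, the expensive check is passed in as a closure rather than calling the real `check_file`, and it keys on the path instead of the `{ ino, dev }` pair mentioned above.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};
use std::sync::Mutex;

/// Paths that have already passed verification during this install.
/// Hypothetical type, meant to be shared across importer threads.
#[derive(Default)]
pub struct VerifiedThisInstall {
    seen: Mutex<HashSet<PathBuf>>,
}

impl VerifiedThisInstall {
    /// Returns `true` immediately if `path` was verified earlier. Otherwise
    /// runs `verify` (the expensive stat-based check) and remembers a success.
    /// Failures are not cached, so a bad file is re-checked next time.
    pub fn check<F>(&self, path: &Path, verify: F) -> bool
    where
        F: FnOnce(&Path) -> bool,
    {
        if self.seen.lock().unwrap().contains(path) {
            return true;
        }
        // Run the check without holding the lock so other paths can proceed.
        let ok = verify(path);
        if ok {
            self.seen.lock().unwrap().insert(path.to_path_buf());
        }
        ok
    }
}
```

Two threads that reach the same unverified path at the same moment may both run `verify`; that costs a duplicate stat but cannot produce a wrong answer, which seems an acceptable trade for not holding the lock across I/O.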

5. `load_meta` still appears in samples (~145 in install-phase, ~110 in resolve-phase)

The packument cache fix in `c5562c8d01` collapsed most of the resolve-side hits, but `load_meta` is still in the top 10. Spot-check what's hitting it post-fix — most likely the cold-mirror path on a brand-new registry where the first install per process has to materialize the mirror from network. Less impact than the items above but worth a profile pass after #1-3 land.

Reproducing the bench

just registry-mock launch                            # one-time
cd /Volumes/src/pnpm/pnpm/<worktree>
cargo run --release --bin=integrated-benchmark -- \
  --scenario clean-install \
  --registry-port <port-from-launch> \
  --runs 5 --warmup 2 \
  --with-pnpm \
  pacquet@HEAD pacquet@main

For full-resolution, swap in `--scenario full-resolution`. For per-phase timing, set `TRACE=pacquet=info` and run `./bench-work-env/pacquet@HEAD/pacquet/target/release/pacquet install` directly. The per-phase `elapsed_ms` is emitted to stderr via `tracing::info!(target: "pacquet::install::phase", ...)` in `pacquet/crates/package-manager/src/install_with_fresh_lockfile.rs` (search for "phase complete").

For CPU sampling on macOS with symbols:

cargo build --profile release-debug --bin pacquet
cp target/release-debug/pacquet bench-work-env/pacquet@HEAD/pacquet/target/release/pacquet
cd bench-work-env/pacquet@HEAD && rm -rf node_modules pnpm-lock.yaml store-dir
cp .saved-package.json package.json
./pacquet/target/release/pacquet install &
PID=$!; sleep 4; sample $PID 10 -file /tmp/p.txt; wait $PID
awk '/Thread_.*tokio-rt-worker/{flag=1} /Thread_.*:$/{flag=0} flag && /pacquet_/' /tmp/p.txt \
  | grep -v 'tokio::\|scheduler' \
  | awk -F'pacquet_' '{print $NF}' | awk -F'::' '{print $1, $2, $3, $4}' \
  | sort | uniq -c | sort -rn | head -30

Gotchas the bench harness will trip you on

  • `cache_dir` is project-global, not bench-scoped. With `XDG_CACHE_HOME=/.cache` set, pacquet writes mirrors to `/.cache/pnpm/v11/metadata/localhost+/`. The harness only wipes `bench-work-env/pacquet@HEAD/{node_modules,pnpm-lock.yaml,store-dir}` between iterations, so the metadata mirror is warm across runs. Verify with `ls ~/.cache/pnpm/v11/metadata/localhost+ | wc -l` after the first iteration.
  • The verdaccio mock degrades under sustained load — its RSS grows and request latency creeps up over a long bench session. If you see pacquet's wall time monotonically growing across runs while user-CPU stays flat, restart the mock (`just registry-mock end && just registry-mock launch`). It picks a new port each time, so update `--registry-port` accordingly.
  • The integrated-benchmark must be run from the repo root so that `canonicalize(".")` resolves to the pacquet git repo. The verifier now accepts a `.git` file (linked git worktree) as well as a directory; see the verifier change in #11856.
  • `pacquet@HEAD` builds inside `bench-work-env/` — it does a fresh `git fetch` + checkout of the SHA, then `cargo build --release`. Two minutes per cold build. Working-tree changes are not picked up until you commit them and the bench reruns the build.

Pointers used in the prior session

  • Pipelined fetch: `pacquet/crates/package-manager/src/prefetching_resolver.rs` (new file from #11856).
  • Resolve-time meta cache fix: `pacquet/crates/resolving-npm-resolver/src/pick_package.rs:517-555` (version-spec / publishedBy fast paths).
  • Prefetch dedup + read-lock: `pacquet/crates/tarball/src/lib.rs:1825-1855` and `pacquet/crates/package-manager/src/prefetching_resolver.rs:140-170`.
  • Link-file stat trim: `pacquet/crates/package-manager/src/link_file.rs:119` and the helper `try_import` at `:230`.
  • Symlink mkdir trim: `pacquet/crates/package-manager/src/symlink_package.rs:46`.
  • Bench harness: `pacquet/tasks/integrated-benchmark/` (CLI args, work-env construction, hyperfine driver).
  • Per-phase timing emits: search `tracing::info!(target: "pacquet::install::phase"` in `pacquet/crates/package-manager/src/install_with_fresh_lockfile.rs`.

Suggested attack order

Start with #1 (tarball extract parallelism): it has the biggest sample share and is likely worth 5-10 s of wall-clock time on clean-install. Re-bench. Then do #3 (StoreIndex bulk lookup) if cold-store install is still slow. #2 and #4 are smaller wins but cheap to land. Do #5 last, after re-sampling.


Written by an agent (Claude Code, claude-opus-4-7).
