Follow-up to #11832 / #11837.
#11837 closed most of the gap on the alotta-files fixture under warm cache + global virtual store + no lockfile, but a wall-clock gap to the TypeScript pnpm CLI remains and the source isn't obvious from the per-phase trace alone.
Current state
| | time |
| --- | --- |
| pacquet (best of several runs after #11837) | 5.03 s |
| pacquet (typical run) | 5.0 - 5.7 s |
| pnpm CLI | 4.16 s |
| gap (best case) | ~0.87 s (~21%) |
For context, before #11837 pacquet was at 11.83 s on this scenario — so the PR closed ~6.8 s and what remains is the long tail.
Per-phase trace
TRACE=pacquet::install::phase=info pacquet install on a fresh rm -rf pnpm-lock.yaml node_modules:
phase: "resolve_importer" elapsed_ms: 3125 nodes: 1362
phase: "prefetch_cas_paths" elapsed_ms: 69 cache_keys: 1362 hits: 1362
phase: "build_fresh_lockfile" elapsed_ms: 3
phase: "virtual_store_layout_new" elapsed_ms: 11
phase: "install_subtree" elapsed_ms: 1440
Wall = 5.03 s, sum of traced phases ≈ 4.65 s, the remaining ~0.4 s is state init / lockfile write / .modules.yaml / bin linking.
The resolve walk dominates (~3.1 s of ~5 s).
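The bookkeeping behind those numbers can be checked with a small sketch. The helper names below are hypothetical and not part of pacquet; the phase timings are copied from the trace above.

```rust
/// Per-phase timings in milliseconds, copied from the trace above.
pub const PHASES: [(&str, u64); 5] = [
    ("resolve_importer", 3125),
    ("prefetch_cas_paths", 69),
    ("build_fresh_lockfile", 3),
    ("virtual_store_layout_new", 11),
    ("install_subtree", 1440),
];

/// Sums the traced phases (hypothetical helper, for illustration only).
pub fn traced_total_ms(phases: &[(&str, u64)]) -> u64 {
    phases.iter().map(|(_, ms)| ms).sum()
}

/// Wall time that no traced phase accounts for.
pub fn untraced_ms(wall_ms: u64, phases: &[(&str, u64)]) -> u64 {
    wall_ms.saturating_sub(traced_total_ms(phases))
}
```

With the 5.03 s wall time, the traced phases sum to 4648 ms and 382 ms is left untraced, which matches the ~0.4 s quoted above.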
What #11837 already changed
In order, with the per-step phase impact called out:
1. perf(pacquet): pipeline tarball fetches with resolution (PrefetchingResolver); later removed in favour of (2).
2. perf(pacquet/resolving-npm-resolver): dedup concurrent packument fetches (PackumentFetchLocker, port of upstream's metafileOperationLimits + runLimited).
3. fix(pacquet/resolving-npm-resolver): forward etag / modified on the abbreviated → full upgrade fetch.
4. perf(pacquet/resolving-npm-resolver): move mirror disk reads off the tokio worker via spawn_blocking.
5. fix(pacquet/tarball): emit pnpm:progress exactly once per URL (fixed reused: 164746 for 1362 packages).
6. perf(pacquet/package-manager): batched prefetch_cas_paths on the fresh-lockfile install path (port of create_virtual_store's frozen-path shape).
7. perf(pacquet/resolving-npm-resolver): share Package via Arc in the meta cache (avoid deep-cloning packuments on hits).
8. perf(pacquet/resolving-resolver-base): share ResolveResult.manifest via Arc.
9. perf(pacquet/resolving-npm-resolver): dedup picked-manifest serde_json::to_value per (name, version).
10. perf(pacquet/resolving-deps-resolver): swap tokio::sync::Mutex for std::sync::Mutex on TreeCtx.
11. perf(pacquet/resolving-deps-resolver): share ResolveResult itself via Arc on ResolvedPackage / DependenciesGraphNode.
What to investigate next
Best done with a profile attached. Quick path:
cargo install --locked samply
cargo build --profile release-debug -p pacquet-cli
rm -rf pnpm-lock.yaml node_modules
samply record ./target/release-debug/pacquet install
samply record opens a profiler.firefox.com URL with the flamegraph; attach a screenshot or link.
Hypotheses to evaluate against the profile (in roughly decreasing-likelihood order):
- Package.versions: HashMap<String, PackageVersion> is cloned per pick: pick_matching_version_* returns an owned PackageVersion cloned from Package.versions.get(version). Wrapping each version in Arc (HashMap<String, Arc<PackageVersion>>) collapses the pick to an Arc::clone. Likely ~50-200 ms.
- async_recursion::async_recursion on resolve_node allocates a Box<dyn Future> per recursive call (1362+). Worth measuring whether removing the macro and using an explicit Pin<Box<dyn Future>> only at the recursion edge (or a manual stack-managed walker) is cheaper. Probably small.
- Vec<String> next_ancestors is cloned per child edge in resolve_node (one clone per edge in the tree). Switching to an Arc<im::Vector<String>> or a cons-list / persistent structure trades the per-child Vec clone for an Arc::clone. Probably small.
- futures_util::try_join_all polling overhead when the resolver returns synchronously from the meta cache (no actual IO). Consider FuturesUnordered instead, or batching the synchronous-cache-hit branch off the futures path.
- HTTP-layer overhead per packument (304 responses through reqwest). pnpm uses make-fetch-happen; if the per-request setup cost is meaningfully higher in reqwest, this would show in the profile as hyper / rustls frames.
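To make the first hypothesis concrete, here is a minimal sketch of the Arc-wrapped versions map. The struct fields, pick_exact, and sample_package are simplified, hypothetical stand-ins rather than pacquet's actual definitions; only the HashMap<String, Arc<PackageVersion>> shape comes from the hypothesis itself.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, heavily simplified stand-in for a packument version entry.
#[derive(Debug)]
pub struct PackageVersion {
    pub name: String,
    pub version: String,
    pub dependencies: HashMap<String, String>,
}

/// Hypothetical stand-in for the packument; versions sit behind Arc.
pub struct Package {
    pub versions: HashMap<String, Arc<PackageVersion>>,
}

impl Package {
    /// Returns a shared handle to the requested version.
    /// Arc::clone only bumps a reference count; the manifest is not copied.
    pub fn pick_exact(&self, version: &str) -> Option<Arc<PackageVersion>> {
        self.versions.get(version).map(Arc::clone)
    }
}

/// Builds a one-version package for illustration (assumed data).
pub fn sample_package() -> Package {
    let manifest = PackageVersion {
        name: "left-pad".to_string(),
        version: "1.0.0".to_string(),
        dependencies: HashMap::new(),
    };
    let mut versions = HashMap::new();
    versions.insert("1.0.0".to_string(), Arc::new(manifest));
    Package { versions }
}
```

Two picks of the same version then point at the same allocation, so the per-pick cost no longer scales with the size of the manifest. Whether that saves the estimated time still has to be confirmed against the profile.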
If install_subtree variance is significant (we've seen ~1.4-2.3 s across runs), that's also worth pinning down — the file-system work should be deterministic given the same starting state.
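For the next_ancestors hypothesis in the list above, a cons-list could look roughly like the following. This is a sketch under assumptions: the Ancestors type and its methods are hypothetical, and it uses only std rather than the im crate.

```rust
use std::sync::Arc;

/// Hypothetical persistent ancestor chain: each entry points at its parent chain.
pub struct Ancestors {
    name: String,
    parent: Option<Arc<Ancestors>>,
}

impl Ancestors {
    /// Extends `parent` by one entry in O(1).
    /// The existing chain is shared through Arc, not copied.
    pub fn push(parent: Option<&Arc<Ancestors>>, name: &str) -> Arc<Ancestors> {
        Arc::new(Ancestors {
            name: name.to_string(),
            parent: parent.cloned(),
        })
    }

    /// Walks toward the root looking for `name`; O(depth), like Vec::contains.
    pub fn contains(&self, name: &str) -> bool {
        let mut cur = Some(self);
        while let Some(node) = cur {
            if node.name == name {
                return true;
            }
            cur = node.parent.as_deref();
        }
        false
    }
}
```

Each child edge then costs one small allocation plus a reference-count bump instead of a full Vec<String> clone. The trade-off is pointer-chasing on lookups, so this is only worth it if the clone shows up in the profile.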
Reproduction
- macOS, ARM (M-series), warm pnpm store at the default location.
- The alotta-files benchmark fixture (/Volumes/src/pnpm/pnpm.io/benchmarks/fixtures/alotta-files): 110 direct deps, 1362 resolved packages.
- pnpm-workspace.yaml with enableGlobalVirtualStore: true.
- Default minimumReleaseAge: 1440 (24 h).
Out of scope for this issue
Everything #11837 already shipped, including the per-phase tracing the TRACE=pacquet::install::phase=info invocation above relies on.
Written by an agent (Claude Code, claude-opus-4-7).