Skip to content

perf(npm-resolver): normalize non-abbreviated registry metadata before caching#12766

Draft
astegmaier wants to merge 1 commit into
pnpm:mainfrom
astegmaier:ansteg/normalize-registry-metadata-on-ingest
Draft

perf(npm-resolver): normalize non-abbreviated registry metadata before caching#12766
astegmaier wants to merge 1 commit into
pnpm:mainfrom
astegmaier:ansteg/normalize-registry-metadata-on-ingest

Conversation

@astegmaier

@astegmaier astegmaier commented Jul 2, 2026

Copy link
Copy Markdown

Summary

pnpm requests abbreviated package metadata (Accept: application/vnd.npm.install-v1+json) — the slim document with only the fields needed to install. Some registries ignore that header and return the full document (npm's abbreviated format is a vendor convention, not something every registry implements). Azure DevOps Artifacts is one such registry: it always returns the full CouchDB document, including per-version scripts, exports, devDependencies, readme, and arbitrary custom package.json fields.

pnpm stored that response verbatim in the abbreviated metadata cache, so every later resolution re-read and re-parsed all of those unused fields.

This PR normalizes the abbreviated slot down to the abbreviated field set before caching, by reusing the existing clearMeta helper (previously applied only for filterMetadata). The resolver already ignores the dropped fields, so resolution output is byte-identical — the cache just gets much smaller and faster to parse.

The problem, concretely

On a large monorepo at Microsoft whose workspace resolves against an Azure DevOps Artifacts feed, one internal package's abbreviated metadata was 39.7 MB — of which only ~12% is install-relevant:

field (summed over all versions) share of document
scripts 51%
devDependencies 15%
exports 9%
dependencies (needed) 9%
custom fields (_id, readme, dev-tooling config, …) ~13%
dist (needed) 2%

The same package on the public npm registry (which honors the abbreviated header) is 0.52 MB. I confirmed the registry behavior directly: requesting that package from Azure DevOps with Accept: application/vnd.npm.install-v1+json returns byte-for-byte the same response as Accept: application/json — it does no content negotiation. The public npm registry returns 3.5× less for the abbreviated header.

Because dedupe/resolution parses each package's metadata (potentially once per dependent), the extra ~88% of every document dominates wall time and heap.

The fix

clearMeta already reduces a document to the abbreviated version object field set (plus libc, and top-level time/dist-tags/modified). It was only applied on the filterMetadata (full-slot) path. This PR applies it to the abbreviated slot as well, so anything cached there conforms to the abbreviated shape regardless of what the registry actually sent:

  • Scoped to plain abbreviated results — full-metadata requests and minimumReleaseAge-upgraded documents are untouched.
  • clearMeta is made null-safe for packages with no versions (unpublished), which the abbreviated path can now reach.

Why resolution is unchanged

clearMeta keeps every field the resolver reads (dependencies/optionalDependencies/peerDependencies*/dist/engines/ cpu/os/libc/bin/bundleDependencies/deprecated/hasInstallScript/ _npmUser) and the top-level time (so minimumReleaseAge still works). The only registry-manifest read of a dropped field is resolveDependencies.ts reading pkg.scripts?.prepare, which is guarded by resolvedVia === 'git-repository' — never reached for registry packages (git dependencies get their manifest from the cloned repo, not the packument).

Validated end-to-end: pnpm dedupe --offline on the affected monorepo produces a byte-identical pnpm-lock.yaml whether the cached metadata is full or normalized.

Performance

Warm-cache pnpm dedupe --lockfile-only --offline on that monorepo (~5,000 external packages), best of repeated runs:

metadata cache wall peak RSS
full (before) ~138 s 3.59 GB
normalized (after) ~67 s 2.54 GB

~2.1× faster and ~29% less memory, from ~55% smaller abbreviated documents overall (the large full-document packages shrink to ~12%; already-abbreviated public packages are unchanged).

Testing

  • New unit test in metaCache.test.ts: a registry that returns the full document (with scripts, exports, readme, etc.) results in a normalized cached document (unused fields dropped, install fields kept). Fails on main, passes here.
  • @pnpm/resolving.npm-resolver: full suite passes (234 tests), compile + lint clean.
  • Full changed-package closure (--filter=...[origin/main]): passes (one deps.inspection.commands outdated test is environment-flaky and fails identically on main, unrelated to this change).

Notes

  • This only touches how pnpm persists abbreviated metadata; the wire request is unchanged.
  • Registries that already honor the abbreviated header (npmjs.org) are unaffected in content — clearMeta keeps exactly the fields they already send.
  • A natural follow-up is to apply the same normalization to the full-metadata slot for non-filterMetadata resolvers, which would also shrink the metadata-full cache; left out here to keep the change focused.

…e caching

Some registries (e.g. Azure DevOps Artifacts) ignore the abbreviated metadata
Accept header (application/vnd.npm.install-v1+json) and return the full package
document. pnpm stored that response verbatim in the abbreviated metadata cache,
so every later resolution re-read and re-parsed fields the resolver never uses
(scripts, exports, devDependencies of transitive deps, readme, custom fields).
On a large workspace whose feed serves full documents, a single package's
metadata was ~40 MB (of which ~88% was unused) instead of a few MB.

Apply the existing clearMeta normalization to the abbreviated slot too (it was
previously only applied for filterMetadata). clearMeta keeps every abbreviated
field (and top-level time), so resolution output is unchanged; it drops the
fields the resolver never reads. This roughly halves `dedupe --offline` wall
time and lowers peak memory on such a workspace, with a byte-identical lockfile.

Scoped to plain abbreviated results: full-metadata and minimumReleaseAge-upgraded
documents are left untouched. clearMeta is also made null-safe for packages with
no versions (unpublished), which the abbreviated path can now reach.
@welcome

welcome Bot commented Jul 2, 2026

Copy link
Copy Markdown

💖 Thanks for opening this pull request! 💖
Please be patient and we will get back to you as soon as we can.

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: eca47981-a5f7-4654-8cea-f93921691dfb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

zkochan pushed a commit to astegmaier/pnpm that referenced this pull request Jul 4, 2026
…e network layer

pnpm requests abbreviated package metadata (Accept:
application/vnd.npm.install-v1+json). Some registries (e.g. Azure DevOps
Artifacts) ignore that header and return the full CouchDB document, including
per-version fields the installer never reads (scripts, exports,
devDependencies, readme, custom package.json fields). pnpm stored that response
verbatim in the abbreviated metadata cache, so every later resolution re-read
and re-parsed those unused fields — on a large workspace whose feed serves full
documents, a single package's metadata was ~40 MB, ~88% of it unused.

Fix the response at the boundary where it enters pnpm: fetchMetadataFromFromRegistry
inspects the response Content-Type. When an abbreviated request comes back with
anything other than application/vnd.npm.install-v1+json, the registry ignored
the header and served the full document, so the fetch layer strips it to the
abbreviated field set (via the existing clearMeta helper) and re-serializes.
Registries that honor the header (the npm registry echoes the abbreviated
Content-Type) are detected as already-abbreviated and take an early return: no
clearMeta, no re-serialization — the happy path pays nothing.

clearMeta is extracted from pickPackage.ts into its own module so the network
layer can use it without a circular import, and is made null-safe on versions
so it can run on an unpublished package. pickPackage's abbreviated slot no
longer needs any special-casing: the fetch layer now guarantees abbreviated
shape, so the resolver keeps only its pre-existing filterMetadata (full-slot)
narrowing.

This is an alternative to pnpm#12766, which applied the same normalization
one layer up in pickPackage. Both produce a byte-identical on-disk cache and
byte-identical resolution output; this variant additionally skips the work
entirely for spec-compliant registries and removes the special-casing from the
resolver.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant