Skip to content

Prevent cold-batch link serialization (and similar pnpr cold-path regressions) from recurring #12250

Description

@zkochan

Background

#12249 fixed a performance bug where the cold batch in CreateVirtualStore serialized its per-package slot links: each InstallPackageBySnapshot future called the blocking CreateVirtualDirBySnapshot::run (rayon::join) directly inside one cooperative try_join_all task, so the ~1300 links ran one-at-a-time instead of in a single parallel pass like the warm batch already did.

It only manifested on the pnpr client path: pnpr's fast offloaded resolution starts materialization before the prefetcher has finished, so most packages land in the cold batch (measured warm=312 cold=996 on a fresh ~1300-package install via a real pnpr server, vs warm=1308 cold=0 for a local fresh install whose slower resolution warms everything first). The serialization cost it ~14.5s in CreateVirtualStore vs ~5-7s for the warm path; the fix flipped pnpr from ~4s slower than a direct install to ~3s faster.

Three things let it through — the code regressed, the test suite didn't catch it, and the benchmark didn't flag it. This issue tracks guardrails for all three.

Follow-ups

1. Make the bug class structurally unrepresentable (code)

The warm and cold batches now both link via rayon par_iter, but the two passes are duplicated in create_virtual_store.rs. A future edit can re-break one without the other. Extract a single link_slots_parallel(...) helper that both batches call, so there is exactly one slot-linking path and it is parallel by construction.

2. Deterministic regression test (test)

Assert the structural property, not timing. Add an install-scoped counter that CreateVirtualDirBySnapshot bumps on entry / decrements on exit, recording the max concurrent links. Add a test that forces an all-cold batch deterministically (prefetch tarballs into the mem cache but leave the store index empty, so the warm/cold partition routes them all cold without any timing race) and asserts max_concurrent_links >= 2. Serialized linking pins it at 1; the parallel pass gives ~num_cpus. No wall-clock flakiness. The existing pnpr_install tests cover cold-batch correctness but not concurrency.

3. Close the integrated-benchmark blind spots (observability)

The benchmark ran the cold path (it sets --registry-latency-ms=50 --registry-bandwidth-mbps=200 and runs cold-store scenarios) but couldn't surface the cost:

  • Emit warm=N cold=M as a benchmark metric and assert a fresh pnpr install has a non-trivial cold batch. Canary: proves the benchmark is actually exercising the cold path; if a change accidentally warms everything, the gate fires instead of silently testing nothing.
  • Watch the absolute pnpr@HEAD / pacquet@HEAD ratio with the prior that it should be ≤ 1, not just HEAD-vs-main deltas. The serialization was a long-standing constant, so delta-tracking was blind to it; a "pnpr should not be slower than a direct install" threshold would have failed.
  • Give the pnpr server a low registry latency (separate knob, e.g. --pnpr-server-registry-latency-ms ≈ 0) distinct from the client's. The workflow currently charges the pnpr server the same REGISTRY_LATENCY_MS as a remote client, which slows its resolution and shrinks the modeled cold batch below what a production (co-located) server produces — under-representing exactly the cost this bug scaled with.
  • Report CreateVirtualStore time as its own line item so future materialize-phase regressions localize to a phase instead of hiding in a total averaged over the network.

Suggested order

#1 + #2 are the hard gates (block it at cargo test / review); #3 is observability so subtler future regressions show up in the perf report.


Written by an agent (Claude Code, claude-opus-4-8).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions