Skip to content
This repository was archived by the owner on May 14, 2026. It is now read-only.

feat(store): decode pnpm-written msgpackr-records rows in index.db#251

Merged
zkochan merged 6 commits into
mainfrom
msgpackr-records-decoder
Apr 23, 2026
Merged

feat(store): decode pnpm-written msgpackr-records rows in index.db#251
zkochan merged 6 commits into
mainfrom
msgpackr-records-decoder

Conversation

@zkochan

@zkochan zkochan commented Apr 23, 2026

Copy link
Copy Markdown
Member

Summary

Closes the last open item on #244: pacquet can now read index.db rows that pnpm wrote. Before this PR, StoreIndex::get would fail to decode anything pnpm produced (msgpackr records), silently treat it as a cache miss, and trigger a full re-download — so the "shared v11 store" was only shared one direction.

  • New crates/store-dir/src/msgpackr_records.rs — a focused transcoder that walks a msgpackr-records byte stream and emits equivalent plain MessagePack by expanding each record instance into a string-keyed map. Keeping this as a transcoder (not a fresh Deserializer) lets us reuse the existing Deserialize derive on PackageFilesIndex.
  • StoreIndex::get sniffs the leading d4 72 bytes and routes msgpackr-records rows through the transcoder before handing the result to rmp_serde. pacquet-written rows still take the zero-copy fast path.
  • The transcoder narrows integer-valued float 64 / float 32 payloads to uint 64. msgpackr emits JS Number as a double whenever the value exceeds int32 range, so millisecond timestamps like checkedAt = 1_700_000_000_000 arrive as floats even though they're semantically ints, and rmp_serde rejects floats for our size: u64 / checked_at: Option<u128> fields. Non-integer floats (π, etc.) pass through unchanged.
  • Drops #[ignore] on pnpm_compatibility::same_index_file_contents. End-to-end: pnpm install writes rows, pacquet reads them back, and the snapshot of decoded contents now matches what pacquet itself would have written.

Why a transcoder, not a bespoke format

I considered having pacquet use its own index format in a separate db. Tradeoffs ruled against it: the shared-store promise of #244 requires the two tools to agree on at least the format, and bypassing records would regress on encode size (records buys ~30-50% on tarballs with many files) without moving any other dial.

Wire format recap (from the msgpackr README + real bytes)

  • Record def: d4 72 <slot> (fixext1, ext type 0x72 'r', 1-byte payload = slot id in 0x40..=0x7f) + msgpack array of N field names + N raw msgpack values (the first instance, inlined).
  • Record ref (subsequent instances): single byte in 0x40..=0x7f + N raw msgpack values.
  • Everything else is vanilla MessagePack. Despite moreTypes: true, pnpm's payloads encode JS Map as standard fixmap / map16 / map32 — no ext wrapper.

Test plan

  • cargo test -p pacquet-store-dir --lib — 22 passing, 10 new in msgpackr_records::tests using fixture bytes captured from msgpackr 1.11.8 in the pnpm v11 worktree:
    • decodes_one_file_fixture_from_msgpackr / decodes_two_file_fixture_with_record_reuse — round-trip + slot reuse
    • decodes_requires_build_true / decodes_file_without_checked_at / decodes_side_effects — nested records and optional fields
    • passes_through_plain_msgpack_unchanged — guards against the transcoder mangling pacquet-written rows
    • non_integer_floats_pass_through / integer_valued_float32_is_narrowed_to_uint64 — float-narrowing behaviour
    • rejects_reference_to_unknown_slot / rejects_truncated_buffer — error paths
    • store_index::tests::get_decodes_msgpackr_records_rows — end-to-end through StoreIndex::get via a direct SQL insert
  • cargo test -p pacquet-cli --test pnpm_compatibility same_index_file_contents — previously #[ignore]d, now passes. The test has pnpm install a package from the registry-mock, then has pacquet decode pnpm's index.db rows and assert the decoded shape matches what pacquet itself writes.
  • cargo test -p pacquet-tarball --lib — 7 passing, confirms the read-cache path in run_without_mem_cache (which calls index.get) automatically benefits with no further wiring.
  • cargo build --workspace --all-targets
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo fmt --all -- --check

Closes #244.

Finishes the last open item on #244. pnpm packs every PackageFilesIndex
with `new Packr({ useRecords: true, moreTypes: true })`. Records use a
msgpackr-private extension (ext type 0x72, "r") that rewrites the
string keys of repeated same-shape structs into 1-byte slot references,
so the payload `rmp_serde` saw in a pnpm-written index.db row was
effectively garbage — the cache-lookup in `StoreIndex::get` would
error, pacquet would fall through to a fresh download, and the whole
"shared v11 store" promise evaporated for rows pnpm wrote first.

- New `crates/store-dir/src/msgpackr_records.rs`: a focused transcoder
  that walks a msgpackr-records byte stream and emits equivalent plain
  MessagePack (expanding each record instance into a string-keyed map).
  Keeping this as a transcoder — rather than a fresh Deserializer —
  lets us reuse the existing `Deserialize` derive for
  `PackageFilesIndex`. 270 lines of decoder + 10 unit tests using real
  fixture bytes captured from msgpackr 1.11.8 (via
  `store/index/node_modules/msgpackr` in the pnpm v11 worktree).
- `StoreIndex::get` now sniffs the leading `d4 72` bytes and routes
  msgpackr-records rows through the transcoder before handing the
  result to `rmp_serde`. Pacquet-written rows (plain msgpack) still
  take the zero-copy fast path — no transcoding, no extra allocation.
- The transcoder also narrows integer-valued `float 64`/`float 32`
  values to `uint 64`. Msgpackr emits JS Number as float whenever the
  value exceeds int32 range, so millisecond timestamps like
  `checkedAt = 1_700_000_000_000` arrive as a double even though
  they're semantically ints — and `rmp_serde` rejects floats for our
  integer-typed `size: u64` / `checked_at: Option<u128>` fields.
  Non-integer floats (π and friends) pass through unchanged.
- Drops `#[ignore]` on `crates/cli/tests/pnpm_compatibility::same_index_file_contents`.
  End-to-end: `pnpm install` writes rows, pacquet reads them back, and
  the snapshot of decoded contents now matches what pacquet itself
  would have written.

Refs #244.
@zkochan zkochan marked this pull request as ready for review April 23, 2026 19:49
@zkochan zkochan requested a review from Copilot April 23, 2026 19:49
@codecov

codecov Bot commented Apr 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.18386% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.62%. Comparing base (15741af) to head (77867a3).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/store-dir/src/msgpackr_records.rs 81.72% 74 Missing ⚠️
crates/store-dir/src/store_index.rs 97.43% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #251      +/-   ##
==========================================
- Coverage   88.26%   87.62%   -0.64%     
==========================================
  Files          56       57       +1     
  Lines        3179     3621     +442     
==========================================
+ Hits         2806     3173     +367     
- Misses        373      448      +75     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds compatibility for pnpm v11’s index.db rows by decoding msgpackr “records” payloads so pacquet can correctly read pnpm-written package index entries from the shared store.

Changes:

  • Introduces a msgpackr-records → plain MessagePack transcoder (msgpackr_records.rs) and related tests.
  • Updates StoreIndex::get to detect pnpm-written rows (marker d4 72) and transcode before rmp_serde deserialization.
  • Re-enables the pnpm compatibility test for same_index_file_contents now that pnpm-written rows decode successfully.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
crates/store-dir/src/store_index.rs Sniffs msgpackr-records marker and routes pnpm rows through a transcode+decode path; adds end-to-end decoding test.
crates/store-dir/src/msgpackr_records.rs New transcoder that expands msgpackr record instances into string-keyed maps and narrows integer-valued floats; includes unit tests.
crates/store-dir/src/lib.rs Exposes the new transcoder module via re-export.
crates/cli/tests/pnpm_compatibility.rs Drops #[ignore] and updates commentary since pnpm index rows are now readable.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/store-dir/src/msgpackr_records.rs Outdated
Comment thread crates/store-dir/src/msgpackr_records.rs Outdated
@github-actions

github-actions Bot commented Apr 23, 2026

Copy link
Copy Markdown

Micro-Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.00     19.7±0.88ms   219.8 KB/sec    1.00     19.8±0.95ms   219.1 KB/sec

zkochan added 2 commits April 23, 2026 22:02
Symmetric follow-up to the previous commit: with just msgpackr-records
decoding in place, pacquet can read what pnpm wrote, but pnpm could
not reliably read what *pacquet* wrote. Verified empirically by
feeding pacquet's output to `new Packr({useRecords: true, moreTypes:
true})` from the pnpm v11 workspace's msgpackr 1.11.8.

Two issues surfaced, both in how pacquet encoded `CafsFileInfo`:

1. `checked_at: Option<u128>` — `rmp_serde` has no native MessagePack
   integer encoding for `u128` and fell back to writing the value as a
   `bin 16` (16 raw big-endian bytes). msgpackr decoded the field as
   a `Uint8Array`, not a timestamp, so pnpm's integrity check would
   have silently misbehaved. Narrow the type to `u64` — millisecond
   timestamps fit in `u64` until year ~584M, and `u64` maps cleanly
   to MessagePack's integer encodings.
2. Even after the u128 → u64 fix, writing as `uint 64` (`cf`) made
   msgpackr return the value as a JS `BigInt`. pnpm's CAFS integrity
   check does `mtimeMs - (checkedAt ?? 0)`; mixing `Number` and
   `BigInt` throws `TypeError: Cannot mix BigInt and other types` at
   runtime. Pnpm itself sidesteps this because JS Number is a double
   — msgpackr packs large integer-valued Numbers as `float 64` (`cb`).
   Match that: add a `serialize_with` that emits `checked_at` as
   `float 64`. Verified: msgpackr now returns the value as
   `typeof === 'number'`, bit-identical byte encoding to what pnpm
   itself produces.

Supporting read-side changes:

- Add `deserialize_with` on `checked_at` that accepts either integer
  or integer-valued float. Required because plain msgpack reads bypass
  the records transcoder (the transcoder would reinterpret legitimate
  positive fixints in 0x40..=0x7f as record slot references, which is
  correct under records mode but wrong for pacquet-written rows).
- Restore the sniff on `decode_index_value` — only invoke the
  transcoder when the bytes begin with the msgpackr records marker
  `d4 72`. An earlier version of this commit tried "always transcode"
  for simplicity; it broke every pacquet-written row whose `size` or
  `mode` happened to land in the 0x40..=0x7f byte range.

Test updates:

- `round_trips_plain_msgpack_through_transcoder` replaces
  `passes_through_plain_msgpack_unchanged`: the byte-for-byte
  guarantee no longer holds (the transcoder now narrows integer-valued
  floats), but the decoded-struct round-trip still does.
- Add `plain_msgpack_without_floats_passes_through_unchanged` to pin
  down the narrower pass-through guarantee for non-float inputs.

No changes to `tarball/src/lib.rs` beyond the `as_millis() as u64`
cast the u128 → u64 narrowing now requires.
Copilot review round one on #251.

- Track whether the transcoder has seen a record definition yet, and
  only treat `0x40..=0x7f` as slot references once records mode is on.
  Until then those bytes are positive fixints (64..=127), the same as
  vanilla MessagePack. Before this change the transcoder couldn't be
  run on plain-msgpack input without corrupting it — pacquet rows with
  `size: 123` or similar would be mis-read as a reference to an
  undefined slot 0x7b. The old code sidestepped the problem with a
  sniff in `decode_index_value` (transcode only when bytes start with
  `d4 72`); records-mode tracking makes the transcoder safe for both
  encodings, so `decode_index_value` drops the sniff and always
  transcodes (which is necessary anyway for the float-to-uint
  narrowing on pacquet's own `checkedAt` field).
- Replace the `v <= u64::MAX as f64` narrowing bound with a strict
  `v < 2f64.powi(64)`. `u64::MAX as f64` rounds *up* to 2^64 because
  2^64 - 1 isn't exactly representable in `f64`, so the old bound
  accepted a literal 2^64 and silently saturated it to `u64::MAX` on
  cast.
- Add `plain_positive_fixint_in_slot_range_passes_through` and
  `float64_equal_to_2_pow_64_passes_through` as regression guards for
  both fixes.
- Drop the now-redundant custom `deserialize_with` on
  `CafsFileInfo::checked_at` — since all reads transcode first and the
  transcoder narrows integer-valued floats back to `uint 64`, the
  value always arrives as an integer for `rmp_serde` to deserialize.

Refs #244.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/store-dir/src/msgpackr_records.rs
Comment thread crates/store-dir/src/msgpackr_records.rs Outdated
Comment thread crates/store-dir/src/msgpackr_records.rs Outdated
Comment thread crates/store-dir/src/msgpackr_records.rs Outdated
Comment thread crates/store-dir/src/store_index.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/store-dir/src/msgpackr_records.rs
…cro-tweaks

Copilot review round two on #251.

- Split `DecodeError::BadRecordDef` into three precise variants —
  `ExpectedArrayHeader`, `ExpectedStringHeader`, and
  `InvalidFieldNameUtf8`. The old shared variant carried a
  meaningless `count: 0 | 1` indicator that made messages like
  "Expected record-definition header (0 field-name strings in a
  msgpack array)" actively confusing.
- Add `DecodeError::SlotOutOfRange` and reject any record definition
  whose slot byte isn't in `0x40..=0x7f`. msgpackr never emits one
  outside that range; accepting them before meant the slot was
  registered but unreachable, so the first reference-attempt
  elsewhere in the stream would have surfaced as an unrelated
  `UnknownSlot` error.
- Drop three needless `to_vec()` allocations on the array16, array32,
  map16, and map32 header-copy paths — use
  `w.extend_from_slice(r.read_bytes(n)?)` directly.
- Switch the `as_millis() as u64` cast in tarball extraction to
  `u64::try_from(...).ok()` so a hypothetical u128-overflowing
  timestamp drops to `None` instead of silently wrapping.
- Fix a stale doc comment that still referred to an `Option<u128>`
  deserializer (now `Option<u64>`).
- Rewrite the `StoreIndex::get` doc so it matches the current
  always-transcode implementation — no more "sniff the leading two
  bytes" language now that records-mode tracking makes unconditional
  transcoding safe.

Refs #244.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/store-dir/src/store_index.rs
Comment thread crates/store-dir/src/msgpackr_records.rs Outdated
Respond to two Copilot review notes:

- Slot-reference decoding used to `.clone()` the full Vec<String> of
  field names per record instance. A 200-file row allocated 200
  Vec<String>s plus one String per field per clone. Store the slot
  schema under `Rc<[String]>` so the reference path bumps a refcount
  instead. Single-thread transcoder, so `Rc` over `Arc`.

- `StoreIndex::get` had a doc comment claiming a "sniff the leading
  two bytes" fast path that didn't exist — the implementation always
  transcodes. Explain why (pacquet writes `checkedAt` as float64 on
  purpose for msgpackr/pnpm interop, so every real row needs the
  transcoder to narrow it back), so the always-transcode behaviour
  reads as deliberate, not a missing optimisation.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/store-dir/src/msgpackr_records.rs Outdated
Comment thread crates/store-dir/src/msgpackr_records.rs Outdated
@github-actions

Copy link
Copy Markdown

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command Mean [s] Min [s] Max [s] Relative
pacquet@HEAD 3.352 ± 0.036 3.295 3.400 1.00
pacquet@main 3.362 ± 0.027 3.331 3.408 1.00 ± 0.01
BENCHMARK_REPORT.json
{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 3.35226689518,
      "stddev": 0.03633311574723908,
      "median": 3.3458843170800003,
      "user": 0.15669046,
      "system": 0.20156964,
      "min": 3.2949777200800003,
      "max": 3.40018571508,
      "times": [
        3.39682071708,
        3.3266309130800003,
        3.32452622908,
        3.39818581408,
        3.33135331708,
        3.40018571508,
        3.35821989208,
        3.33520803508,
        3.2949777200800003,
        3.3565605990800003
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 3.36213749388,
      "stddev": 0.02738371711332662,
      "median": 3.36748964658,
      "user": 0.16164916,
      "system": 0.19736344,
      "min": 3.33132226808,
      "max": 3.4075607870800004,
      "times": [
        3.3671976630800002,
        3.3331965880800003,
        3.33762074808,
        3.3704619850800004,
        3.33132226808,
        3.3748614100800003,
        3.36778163008,
        3.4075607870800004,
        3.3341940930800003,
        3.39717776608
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    }
  ]
}

Two Copilot nits:

- Diagnostic codes said `pacquet_store_dir::msgpackr::…` even though
  the module is `msgpackr_records`. Match the `store_index::…` shape
  used elsewhere in the crate so codes map 1:1 to module paths.
- `UnexpectedEof` said "msgpackr buffer", but the transcoder also
  runs on plain MessagePack (see the `get_decodes_…` path through
  `StoreIndex::get`). Reword to "MessagePack buffer" so the error
  reads correctly in either case.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@zkochan zkochan merged commit 0ea1252 into main Apr 23, 2026
15 of 18 checks passed
@zkochan zkochan deleted the msgpackr-records-decoder branch April 23, 2026 22:14
zkochan added a commit that referenced this pull request Apr 23, 2026
…252)

## Summary

Follow-up to #251. Pacquet's `StoreIndex::set` was writing `index.db` rows via `rmp_serde::to_vec_named` — plain MessagePack maps. pnpm reads the same file with `Packr({useRecords: true, moreTypes: true}).unpack(…)`, and in that mode **every msgpack map at any nesting level decodes as a JS `Map`**. Records are the escape hatch that tells msgpackr "this one's a plain object".

So pacquet-written rows came back as a top-level `Map`, `pkgIndex.files` (a property access on that `Map`) was `undefined`, and pnpm's `for (const [f, fstat] of pkgIndex.files)` threw `files is not iterable`. That surfaced in CI as `ERR_PNPM_READ_FROM_STORE` the first time a benchmark run picked up a shared pnpm-v11 cache with pacquet-written rows in it — every downstream job's `pnpm install` crashed trying to decode them.

#251 taught pacquet how to **read** pnpm's records; b0bed43 fixed the `checkedAt` encoding (float64 vs uint64-as-BigInt) and called the pnpm-reads-pacquet direction done — but the container shape was still wrong, and nothing in the test suite actually put a pacquet-written row back through msgpackr, so it slipped past.

## Fix

New `encode_package_files_index` in `msgpackr_records.rs` that emits msgpackr-records bytes. Behaviour matches what msgpackr 1.11.8 itself emits for the same logical input (verified against the version pinned in pnpm v11's `catalog:`):

- **Slot allocation is shape-aware.** Slot `0x40` is reserved for the top-level `PackageFilesIndex`. Inner slots in `0x41..=0x7f` are allocated lazily, one per distinct record shape, in first-seen order. A `CafsFileInfo` carrying `checkedAt` and one without it land in separate slots; same-shape instances downstream collapse to a single bare-slot byte, which is the record-compression win records exist for.
- **Optional fields are omitted from the schema when `None`**, not encoded as `nil`. A `SideEffectsDiff { added: Some, deleted: None }` decodes on the pnpm side to `{ added: Map }`, not `{ added: Map, deleted: null }` — the same JS shape msgpackr itself would produce for the same Rust input, so downstream checks like `diff.added != null` / `diff.deleted != null` behave identically regardless of which tool wrote the row.
- **`CafsFileInfo.checkedAt`** is written as `float 64` when `Some` (uint 64 would decode to a JS `BigInt` and crash pnpm's `mtimeMs - (checkedAt ?? 0)`), omitted from the schema when `None`.
- **`manifest: Some(_)` returns `EncodeError::ManifestNotSupported`.** Pacquet doesn't populate it today; encoding `serde_json::Value` as records is follow-up work.

The encoder also **promotes integer values in `0x40..=0x7f` to `uint 8`** instead of positive fixints. Inside a records stream those bytes are slot-reference markers — a bare `0x7b` would be parsed as a ref to slot `0x7b` rather than the integer `123`. msgpackr does the same promotion for the same reason.

State is a tight `[Option<u8>; 2]` for `CafsFileInfo` slots and `[Option<u8>; 4]` for `SideEffectsDiff` slots, indexed by a shape bitmask — 2 and 4 are the respective maxima, so lookup is a cache-local array index with no `HashMap` allocation on the hot per-file path.

`StoreIndex::set` now calls this encoder. `StoreIndex::get` goes through the existing transcoder which already handled records, so the read path needed no behaviour change — just a doc update to enumerate the three row flavours it now tolerates (pnpm-written records, current pacquet-written records, legacy pacquet plain-msgpack from rows that predate this PR).

## Why not a bespoke format?

Considered writing pacquet's rows in a different shape (positional tuple, or keep plain msgpack and switch pnpm's side). Ruled out: the whole point of #244 and #251 is a shared `index.db` the two tools read each other's entries from. Any shape other than "what pnpm's msgpackr emits" breaks that promise.

## Performance

Faster **and** smaller than `rmp_serde::to_vec_named`. Measured on a release build, 10k iterations per row-size, with a realistic `CafsFileInfo` shape (128-hex digest, millisecond timestamp):

**Serialize time per row**

| files | `rmp_serde` (before) | records encoder (after) | speedup |
|---|---:|---:|---:|
| 1 | 304 ns | 164 ns | **1.9×** |
| 10 | 1.23 µs | 838 ns | **1.5×** |
| 50 | 3.71 µs | 2.25 µs | **1.6×** |
| 200 | 11.9 µs | 8.11 µs | **1.5×** |
| 1000 | 55.9 µs | 27.7 µs | **2.0×** |

**Row size on disk**

| files | `rmp_serde` | records encoder | delta |
|---|---:|---:|---:|
| 1 | 214 B | 220 B | +6 B |
| 10 | 1960 B | 1723 B | **−12%** |
| 50 | 9722 B | 8405 B | **−14%** |
| 200 | 38822 B | 33455 B | **−14%** |
| 1000 | 194022 B | 167055 B | **−14%** |

The 6-byte regression at `n=1` is the outer record header (`d4 72 40`) plus a slightly longer field-name array; from `n≥4` records come out smaller because each subsequent `CafsFileInfo` instance collapses to a single slot byte instead of re-emitting `"digest"`, `"mode"`, `"size"`, `"checkedAt"` (~20 bytes of field names). Why it's faster: no serde runtime indirection (no trait dispatch, no generic-over-Serializer code paths), single pre-sized `Vec`, direct byte writes, zero-alloc slot lookup via the fixed-size arrays.

`StoreIndex::get` is unchanged, so read cost is identical to before. Tarball writes call `encode` once per package, so this shows up at ~10–50 µs/package — drowned out by the SQLite `INSERT` (~100–500 µs) either way, but the payload-size shrink compounds across thousands of packages in the shared v11 store.

## Test plan

- [x] `cargo test -p pacquet-store-dir --lib` — **37 passing, 12 new** in `msgpackr_records::tests`:
  - `encode_emits_record_header_for_top_level_struct` — outer `0xd4 0x72 0x40` record-def header is present.
  - `encode_roundtrips_single_file` / `encode_roundtrips_many_files_sharing_one_slot` — single-file round-trip; N uniform-shape `CafsFileInfo`s produce exactly 2 record defs (outer + one inner shape), the rest are bare slot refs.
  - `encode_handles_fixint_in_slot_range_safely` — `size: 0x7b` round-trips without triggering `UnknownSlot`.
  - `encode_omits_checked_at_when_none` — `None` `checkedAt` isn't written as `nil`; the string `"checkedAt"` doesn't appear anywhere in the output.
  - `encode_allocates_separate_slots_for_distinct_cafs_shapes` — mixed `CafsFileInfo` shapes (some with `checkedAt`, some without) land in separate slots; same shape reuses.
  - `encode_requires_build_when_set` / `encode_omits_requires_build_when_none` — outer-record optional-field handling.
  - `encode_side_effects_roundtrip` — `SideEffectsDiff` with both fields populated.
  - `encode_side_effects_with_only_added_omits_deleted_field` — exact shape the reviewer flagged: `deleted: None` must not appear as a field name in the output.
  - `encode_allocates_separate_slots_for_distinct_side_effects_shapes` — two `SideEffectsDiff`s with distinct shapes land in separate slots.
  - `encode_rejects_manifest_some` — error path for the currently-unimplemented branch.
- [x] `cargo test -p pacquet-cli --test pnpm_compatibility` — new `pnpm_reads_pacquet_written_rows` passes. pacquet installs into a store, `node_modules` is wiped, then `pnpm install --ignore-scripts` reinstalls against the same store. If the record encoding is wrong this reproduces the benchmark's `ERR_PNPM_READ_FROM_STORE`. Complements the existing `same_file_structure` / `same_index_file_contents` which clear the store between the pacquet and pnpm halves and therefore don't exercise this path.
- [x] `cargo clippy --workspace --all-targets -- -D warnings`
- [x] `cargo fmt --all -- --check`
- [x] `cargo doc -p pacquet-store-dir --no-deps` — no warnings.
zkochan added a commit that referenced this pull request Apr 24, 2026
Mirror the tracing change from the index-open call sites: when the
blocking cache-lookup task returns `Err(JoinError)`, emit a
`tracing::warn!` with the error and the `cache_key` before degrading
to `None`. Panic / cancellation stays diagnosable; the caller still
falls through to a fresh download.

Also drop the stale "msgpackr-records decoding is a follow-up"
wording from the cache-lookup rationale — `StoreIndex::get` has
decoded pnpm-written msgpackr-records since 0ea1252 (#251), and
`transcode_to_plain_msgpack` has handled it on the read side since
90be4f6 (#252). Simplify to "an undecodable entry … falls through
to the download path" so the comment reflects today's decoder.
zkochan added a commit that referenced this pull request Apr 24, 2026
…261)

* perf(store-dir): share one read-only StoreIndex across cache lookups

`DownloadTarballToStore::load_cached_cas_paths` previously opened a
fresh SQLite `Connection::open_with_flags` + `PRAGMA busy_timeout` on
every snapshot, so a 1352-package frozen-lockfile install paid 1352
connection-opens just to read cache hints — all serialized on tokio's
blocking pool because `spawn_blocking` can't parallelize opens against
one file.

Open a `StoreIndex` once per install, wrap it in `Arc<Mutex<…>>`
(rusqlite `Connection` is `Send` but not `Sync`), and thread the handle
through `CreateVirtualStore`, `InstallWithoutLockfile`, and
`DownloadTarballToStore`. Each lookup now takes the mutex just for the
sub-millisecond `SELECT`, releases it before the per-file validation
pass, and frees the Arc when the install finishes.

Measured on the #260 repro (1352 packages, release build, macOS APFS,
warm CAS store): `rm -rf node_modules && pacquet install --frozen-
lockfile` drops from ~17.4 s to ~13.9 s, with CPU utilization rising
from ~135 % to ~160 % as the serialized-open bottleneck clears.

* address review feedback

* `create_virtual_store.rs`: bind `store_index.as_ref()` to a local and
  use `async move` so the closure's copy of the `Option<&…>` is moved
  into each future, rather than borrowing from the closure's
  per-iteration scope.
* `load_cached_cas_paths`: treat a poisoned index mutex as a cache miss
  rather than propagating the panic. The `SELECT` is stateless, so the
  prior panic can't have left the index in an inconsistent shape, and
  cache lookups are a best-effort hint — falling through to a fresh
  download is strictly safer than crashing every subsequent snapshot.

* park StoreIndex open on the blocking pool

Copilot flagged that `StoreIndex::shared_readonly_in` does synchronous
SQLite work (`Connection::open_with_flags` + `PRAGMA busy_timeout`) and
calling it directly in an `async fn` stalls the tokio reactor thread
for the duration of the open. It's once per install, not a hot spot,
but matches the pattern the store-index writer already uses and costs
us nothing to keep consistent.

Wrap both call sites (frozen-lockfile and no-lockfile install paths)
in `tokio::task::spawn_blocking` and await the result before kicking
off concurrent package work. Also moves `tokio` from `dev-dependencies`
into `dependencies` in package-manager's `Cargo.toml` since it's now a
runtime dep for the library code, not just tests.

* degrade JoinError on StoreIndex open to a cache miss

`spawn_blocking(...).await` returns `Result<_, JoinError>` which fires
on blocking-task panic or on cancellation during runtime shutdown.
Panicking on that turns a transient runtime issue into a hard crash
for an install that could otherwise succeed — cache lookups just miss.

Drop the `.expect(...)` and degrade to `None` via `.ok().flatten()`.
`shared_readonly_in` already yields `None` for a first-time install
against an empty store, and every downstream caller already handles
that shape, so this is the same failure mode the code already copes
with — just reached from a different source.

* use async move in install_dependencies_from_registry map closure

Makes the recursive no-lockfile closure's captures explicit: bind
`&node_modules_path` to a local `node_modules_path_ref` (Copy) and
switch to `async move` so every captured variable is moved (copied,
for all the `&T` and `Option<&T>` captures) into the future instead
of borrowing the closure's per-iteration stack frame.

Matches the pattern already used in `create_virtual_store.rs` and in
the outer no-lockfile closure — keeps all three parallel-map sites
consistent.

* log JoinError before degrading to cache miss on StoreIndex open

The silent `.ok().flatten()` hid two failure modes behind a cache miss:
a panic inside the blocking task, and task cancellation during runtime
shutdown. Neither is a crash-worthy error, but a blocking-task panic
during install really should be visible somewhere.

Promote the two call sites to a `match` and emit a `tracing::warn!` on
the `Err(JoinError)` branch before falling back to `None`. Uses the
`pacquet::install` target in common with the rest of this crate's
tracing. Degradation path is unchanged.

* chore: cargo fmt

* log JoinError in tarball cache lookup, refresh stale comment

Mirror the tracing change from the index-open call sites: when the
blocking cache-lookup task returns `Err(JoinError)`, emit a
`tracing::warn!` with the error and the `cache_key` before degrading
to `None`. Panic / cancellation stays diagnosable; the caller still
falls through to a fresh download.

Also drop the stale "msgpackr-records decoding is a follow-up"
wording from the cache-lookup rationale — `StoreIndex::get` has
decoded pnpm-written msgpackr-records since 0ea1252 (#251), and
`transcode_to_plain_msgpack` has handled it on the read side since
90be4f6 (#252). Simplify to "an undecodable entry … falls through
to the download path" so the comment reflects today's decoder.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adopt pnpm v11 store format (SQLite-backed) so pacquet can share a store with pnpm

2 participants