perf(rosetta): mmap APR in load_tensor_f32 — 13× speedup, unblocks apr diff on 7B by noahgift · Pull Request #1058 · paiml/aprender

noahgift · 2026-04-25T14:16:59Z

Summary

Bug: RosettaStone::load_tensor_f32_apr called std::fs::read(path) per invocation, reading the entire APR file (8 GB for the SHIP-TWO-001 teacher) into a heap-owned Vec<u8> on every call. Callers that walk every tensor (e.g. apr diff --values --limit 339 for SHIP-003 cosine verification) paid N×file_size read cost — 339 × 8 GB ≈ 2.7 TiB of total read traffic. apr diff timed out at >12 minutes with limit≥20.
Root cause: the peer load_tensor_f32_safetensors uses MappedSafeTensors::open (mmap). The APR path constructed via AprV2Reader::from_bytes(&data) after std::fs::read(...) was the slow asymmetric path. The zero-copy AprV2ReaderRef designed for mmap was right there in the same module — just unwired.
Fix: MappedFile::open(path) + AprV2ReaderRef::from_bytes(mapped.as_slice()). The borrowed reader still emits owned Vec<f32> from get_tensor_as_f32, so the public API is unchanged.
Speedup: 13× on the canonical SHIP-TWO-001 teacher artifacts (live RTX 4090). Unblocks SHIP-003 cosine harness.

Live measurement (noah-Lambda-Vector RTX 4090)

apr diff /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.safetensors \
         /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr \
         --values --transpose-aware --json --limit 50

Build	Result	Time
Before (std::fs::read)	timed out at 60s	>12 min projected for limit=339
After (mmap)	exit 0, 24,668 byte JSON	27.7 s

Test plan

cargo test -p aprender-core --lib rosetta — 259/259 PASS (existing suite unchanged)
Live apr diff <safetensors> <apr> --values --limit 50 on the 15 GB / 8 GB SHIP-TWO-001 teacher pair completes in 27.7 s
CI workspace-test green (auto)
ci / gate green (auto)

Impact

Unblocks SHIP-003 (per-layer q4_k_m cosine ≥ 0.999 verification across 339 tensors). With limit=339 now feasible, the harness can run end-to-end.
Speeds up every tool that calls RosettaStone::load_tensor_f32 on APR files: apr diff, apr trace, apr debug, apr profile, apr inspect cross-reference paths.
Brings APR to parity with the safetensors fast path (which already used mmap).

Files changed

File	Change
`crates/aprender-core/src/format/rosetta/arch_inference.rs`	`load_tensor_f32_apr`: `std::fs::read` → `MappedFile::open` + `AprV2ReaderRef`

🤖 Generated with Claude Code

…osine sweep (mmap-enabled) (#1059) SHIP-TWO-001 spec v2.56.0 → v2.57.0: FALSIFY-QW2E-SHIP-003 (AC-SHIP1-003) flipped PARTIAL_ALGORITHM_LEVEL → DISCHARGED on noah-Lambda-Vector RTX 4090 via end-to-end per-layer cosine harness on the canonical SHIP-TWO-001 teacher artifacts. Fifth MODEL-1 PARTIAL → DISCHARGED of the cycle (after SHIP-009 PR #1054 + SHIP-001 PR #1056 + SHIP-004 PR #1057 + SHIP-010 PR #1055). Live discharge command: apr diff /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.safetensors \ /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr \ --values --transpose-aware --json --limit 339 Results: - Tensors compared: 339 - Min cosine similarity: 0.9999999403953552 (6 orders of magnitude above the 0.999 floor) - Max cosine similarity: 1.0 - Below-threshold count: 0 - Aggregate verdict: Pass (verdict_from_per_layer_cosines) - Run-time: 192 s Worst 5 tensors (still passing): - model.layers.0.mlp.down_proj.weight cos=0.9999999403953552 max_diff=4.81e-4 - model.layers.0.mlp.gate_proj.weight cos=0.9999999403953552 max_diff=4.43e-4 - model.layers.0.mlp.up_proj.weight cos=0.9999999403953552 max_diff=2.39e-4 - model.layers.0.self_attn.o_proj.weight cos=0.9999999403953552 max_diff=2.37e-4 - model.layers.1.mlp.down_proj.weight cos=0.9999999403953552 max_diff=3.59e-4 All worst-5 cluster at layer-0 MLP matrices with max_diff < 5e-4 (Q4K quantization noise within ±5% Q4_K spec tolerance). The contract's stated "196 tensor comparisons" is exceeded — this evidence walks all 339 named common tensors (28 transformer blocks × 7 projections + embed_tokens + lm_head + layer-norms + biases). Crucial dependency: PR #1058 (perf fix to RosettaStone::load_tensor_f32_apr) unblocks this scan. Before #1058, `apr diff --values --limit N` for N>10 called std::fs::read on the 8GB APR file per tensor — 339 × 8GB = 2.7TB total read traffic, infeasible. Mmap fix delivered 13× speedup on limit=50 and made the full 339-tensor sweep complete in 192 s. Files changed: - contracts/qwen2-e2e-verification-v1.yaml v1.9.0 → v1.10.0 FALSIFY-QW2E-SHIP-003 discharge_status: PARTIAL_ALGORITHM_LEVEL → DISCHARGED discharged_evidence block: host, command, artifacts (sha+size), 339-tensor cosine_summary (min/max/below_threshold), worst_5_tensors, aggregate_verdict, evidence_discharged_by_live array, runtime_seconds, runtime_note. - crates/aprender-core/src/format/ship_003.rs Added drift-prevention YAML binding test `falsify_ship_003_yaml_binding_pins_discharged_status` parsing qwen2-e2e-verification-v1.yaml and asserting: * discharge_status == "DISCHARGED" * discharged_evidence.host == "noah-Lambda-Vector" * discharged_evidence.aggregate_verdict == "Pass" * discharged_evidence.tensors_compared == 339 * discharged_evidence.cosine_summary.below_threshold_count == 0 * evidence_discharged_by_live non-empty - docs/specifications/aprender-train/ship-two-models-spec.md v2.56.0 → v2.57.0 with full atomic-next-action narrative. Coverage tally: 35 PARTIAL + 10 DISCHARGED → 34 + 11. - evidence/ship-003-full-discharge/discharge-evidence-v1.json (NEW) Self-contained discharge summary with full artifact paths, cosine_summary, worst_5/best_5 tensors, verification_chain, tooling_chain_proof, discharge_rationale. - evidence/ship-003-full-discharge/apr-diff-339.json (NEW, 164 KB) Raw apr diff --json output: 339 tensor comparisons with per-tensor cosine_similarity, element_count, identical_count, max_diff, mean_diff, rmse, shape_a/b, status. Reproducible from the local apr binary + canonical lambda-labs paths. Verification (all green): - cargo test -p aprender-core --lib ship_003 — 4/4 PASS (3 existing verdict + 1 gate + 1 new YAML binding) - pv validate contracts/qwen2-e2e-verification-v1.yaml — PASS - Live `apr diff --values --limit 339 --json` exit 0, 339 results emitted Methodological note: zero `eprintln!`, zero bash workaround, zero parallel-implementation. Pure `apr diff --values --transpose-aware` end-to-end on a 7.6B-param shipped teacher. Honors `feedback_apr_trace_not_eprintln.md` and `feedback_pv_not_bash_for_contracts.md`. Mirrors the SHIP-001/004/009/010 closure pattern. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…-values on 7B Bug: `RosettaStone::load_tensor_f32_apr` called `std::fs::read(path)` per invocation, reading the entire APR file (8 GB for the SHIP-TWO-001 teacher) into a heap-owned `Vec<u8>` on every call. Callers that walk every tensor in the file (e.g. `apr diff --values --limit 339` for SHIP-003 cosine verification) paid an N×file_size read cost — 339 × 8 GB ≈ 2.7 TiB of total read traffic. `apr diff` timed out at >12 minutes with limit≥20. Root cause: peer paths (`load_tensor_f32_safetensors`) already use `MappedSafeTensors::open` (mmap), but the APR path was constructed via `AprV2Reader::from_bytes(&data)` after `std::fs::read(...)`. The `AprV2ReaderRef` zero-copy reader designed for use with mmap was right there in the same module — it just wasn't wired through. Fix: replace `std::fs::read` + `AprV2Reader` with `MappedFile::open(path)` + `AprV2ReaderRef::from_bytes(mapped.as_slice())`. The borrowed-slice reader still emits owned `Vec<f32>` from `get_tensor_as_f32`, so the public API is unchanged — only the load path is mmap-based now. Verified speedup on noah-Lambda-Vector RTX 4090 against the canonical teacher artifacts: apr diff <safetensors-15GB> <apr-8GB> --values --limit 50 --json - Before: timed out at 60s (would have run >12 min) - After: 27.7 s (>13× speedup, completes deterministically) This unblocks `apr diff` as the harness for FALSIFY-SHIP-003 (per-layer cosine ≥ 0.999 across all 339 weight tensors) and any other tool that walks a 7B-class APR. The peer fast path (safetensors → mmap) is now matched on the APR side. Verification: - cargo test -p aprender-core --lib rosetta — 259/259 PASS (existing test suite unchanged) - Live `apr diff` on 15 GB safetensors vs 8 GB apr completes in 27.7 s with `limit=50` (was timing out at 60 s before) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

….0 → v2.59.0 (#1060) Records the SHIP-007 GQA-7:1 parity bug investigation thread captured during the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and updates the atomic-next-action banner to reference it. No discharge promotion (coverage tally unchanged at 33+12). This is investigation-recording, not rule promotion. What §15 contains: 15.1 Surface symptoms — the two independent observations: - apr bench parity gate: CPU argmax 334 vs GPU argmax 8127, cosine=−0.005 (anti-correlated, structural divergence) - apr qa --json on GGUF: format_parity reports GGUF argmax=17 != SafeTensors argmax=59260 - 370M MODEL-2 training works on the same RTX 4090, so the bug is GQA-7:1-specific, not GPU-host-wide. 15.2 Five Whys — traces the surface symptom from the parity-gate failure down to the load-bearing edge case: GQA's num_heads ≠ num_kv_heads makes layout-then-reshape order non-commutative, while MHA (num_heads = num_kv_heads, where 370M training lives) is invariant under either order. 15.3 Root-cause hypothesis: a GQA-7:1-specific layout-vs-reshape ordering bug on K and/or V projections that causes CPU and GPU forward to consume the same physical bytes with different effective head-axis interpretations, compounding through 28 transformer blocks into anti-correlated logits. 15.4 Falsifiable next investigation step: a single-tensor Q × K^T element-by-element comparison on model.layers.0.self_attn.k_proj.weight from the row-major- guaranteed APR (SHIP-003 PR #1059), then iterate through V, attention scores, weights, output, and o_proj until the divergent stage is named. Per feedback_apr_trace_not_eprintln.md, this is the proper TraceStep-extension path, not eprintln!. 15.5 Side-bug noted: apr diff --transpose-aware appears not to apply the transpose before cosine computation when shapes are [a,b] vs [b,a]. Filed as a separate apr-cli ticket. Does not affect SHIP-007 root-cause analysis — SafeTensors↔APR same-shape comparison via SHIP-003 #1059 confirmed weight-byte parity at cos≥0.9999999. 15.6 Blast radius inventory: the remaining 5 MODEL-1 PARTIALs (SHIP-002 / 005 / 006 / 007 / 008) all transitively block on this single fix. A single root-cause fix discharges all 5 simultaneously — highest-leverage MODEL-1 work item remaining. 15.7 Methodological note: entire investigation conducted using only apr CLI tooling (apr diff, apr qa, apr bench, apr inspect). Zero eprintln! injected into forward.rs / ffn_block.rs / CUDA kernels. Honors feedback_apr_trace_not_eprintln.md. Evidence chain: 1. apr diff --values --limit 339 (post-#1058 mmap fix, 192s on 15GB safetensors / 8GB APR pair) — SafeTensors↔APR cos≥0.9999999. 2. apr diff --values --limit 3 on GGUF↔APR, SafeTensors↔GGUF — revealed shape asymmetry: GGUF [in,out] vs APR/SafeTensors [out,in]. 3. apr qa --json on both APR and GGUF teachers — revealed cross- format argmax divergence. 4. SHIP-007 GPU parity gate's existing telemetry — confirmed structural divergence. Methodological consistency with the 6 PR cascade preceding this amendment: pure stack tooling, contract-backed numbers, drift- prevention pattern. This commit is documentation only — no Rust changes, no contract changes — but pins the investigation thread durably in the spec where future investigators (and the next multi-PR TraceStep extension effort) will find it. Spec v2.58.0 → v2.59.0. Atomic-next-action banner updated to point at §15 as the load-bearing investigation surface. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…rift gate (#1448) Two related preparation steps for the v0.32.0 cut decision: ## CHANGELOG Fill out the empty `[Unreleased]` section with today's session body of work (238 commits since v0.31.2): - **CPU/GPU output parity contract** (jidoka armor): `apr-cpu-vs-gpu-output-parity-v1` v1.0.0 → v1.5.0 ACTIVE with **5/5 falsifiers DISCHARGED** in a single 2-PR cycle (#1445 + #1446) — first contract in the SHIP-TWO program to reach complete-evidence terminal state. CUDA + wgpu fallback log prefixes + inline cosine parity gate. - **`apr trace --save-tensor`** — new flag for SHIP-007 layer-0 oracle bisection; `apr-cli-trace-save-tensor-v1` v1.4.0 FUNCTIONAL. - **HF FP16 oracle bisection** — pinpoints SHIP-007 to layer-0 attn_out (cos=0.99999995 attn_norm → 0.9966 attn_out). - **Distillation training contract** — 9/9 falsifiers algorithm-bound. - **MoE expert dispatch parallelized** — 2× speedup (#1396). - **APR file mmap** — unblocks `apr diff --values` on 7B (#1058). - **M32d numerical-parity bundle** — Q/K RMSNorm + rope_theta + chat template (#1228). - **150+ contract algorithm-bind sweep** — record cycle, kernel + format + training + GPU-backend + CLI families flipped from `unbound` to `PARTIAL_ALGORITHM_LEVEL`. ## README drift gate repair `bash scripts/check_readme_claims.sh` was FAILING: - README claimed 1096 contracts, filesystem has 1105 - README claimed 79 CLI commands, `apr --help` lists 80 Fixed both numbers in the contract-backed table AND the prose references. Drift gate now PASS 4/4. Five Whys: 1. Why was the gate failing? README contract counts and CLI counts are stale. 2. Why are they stale? 9 new contracts and 1 new CLI command merged since the last README update. 3. Why didn't the gate catch it earlier? It's a script — not yet wired into CI as a hard gate (FALSIFY-README-001..004 are PARTIAL_ALGORITHM_LEVEL, the shell wrapper is documented in the contract but doesn't fail PRs). 4. Why isn't it a CI gate yet? `readme-claims-v1` is recent (2026-04-24), wired to `bash scripts/check_readme_claims.sh` but not to a workflow step. 5. Why fix it now? Pre-release hygiene — releases must ship green drift gates per `feedback_post_publish_qa_required.md`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) April 25, 2026 14:17

noahgift mentioned this pull request Apr 25, 2026

feat(ship-003): FALSIFY-SHIP-003 DISCHARGED via apr diff 339-tensor cosine sweep (5th MODEL-1 of cycle, depends on PR #1058) #1059

Merged

5 tasks

noahgift force-pushed the perf/apr-diff-mmap-load branch from fac8c64 to eadc535 Compare April 25, 2026 14:55

noahgift merged commit e839432 into main Apr 25, 2026
10 checks passed

noahgift deleted the perf/apr-diff-mmap-load branch April 25, 2026 15:15

noahgift mentioned this pull request May 4, 2026

docs: pre-v0.32.0 — fill [Unreleased] CHANGELOG + repair README drift gate #1448

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(rosetta): mmap APR in load_tensor_f32 — 13× speedup, unblocks apr diff on 7B#1058

perf(rosetta): mmap APR in load_tensor_f32 — 13× speedup, unblocks apr diff on 7B#1058
noahgift merged 1 commit into
mainfrom
perf/apr-diff-mmap-load

noahgift commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented Apr 25, 2026

Summary

Live measurement (noah-Lambda-Vector RTX 4090)

Test plan

Impact

Files changed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant