perf(rosetta): mmap APR in load_tensor_f32 (13× speedup, unblocks apr diff on 7B) #1058
Merged
noahgift added a commit that referenced this pull request on Apr 25, 2026
…osine sweep (mmap-enabled) (#1059)

SHIP-TWO-001 spec v2.56.0 → v2.57.0: FALSIFY-QW2E-SHIP-003 (AC-SHIP1-003) flipped PARTIAL_ALGORITHM_LEVEL → DISCHARGED on noah-Lambda-Vector RTX 4090 via an end-to-end per-layer cosine harness on the canonical SHIP-TWO-001 teacher artifacts. Fifth MODEL-1 PARTIAL → DISCHARGED of the cycle (after SHIP-009 PR #1054 + SHIP-001 PR #1056 + SHIP-004 PR #1057 + SHIP-010 PR #1055).

Live discharge command:

  apr diff /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.safetensors \
    /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr \
    --values --transpose-aware --json --limit 339

Results:
- Tensors compared: 339
- Min cosine similarity: 0.9999999403953552 (a deviation from 1.0 of about 6e-8, roughly four orders of magnitude inside the 1e-3 margin that the 0.999 floor allows)
- Max cosine similarity: 1.0
- Below-threshold count: 0
- Aggregate verdict: Pass (verdict_from_per_layer_cosines)
- Run-time: 192 s

Worst 5 tensors (still passing):
- model.layers.0.mlp.down_proj.weight cos=0.9999999403953552 max_diff=4.81e-4
- model.layers.0.mlp.gate_proj.weight cos=0.9999999403953552 max_diff=4.43e-4
- model.layers.0.mlp.up_proj.weight cos=0.9999999403953552 max_diff=2.39e-4
- model.layers.0.self_attn.o_proj.weight cos=0.9999999403953552 max_diff=2.37e-4
- model.layers.1.mlp.down_proj.weight cos=0.9999999403953552 max_diff=3.59e-4

The worst 5 cluster in layers 0 and 1 (four MLP matrices and one attention o_proj), all with max_diff < 5e-4 (Q4K quantization noise within ±5% Q4_K spec tolerance). The contract's stated "196 tensor comparisons" is exceeded: this evidence walks all 339 named common tensors (28 transformer blocks × 7 projections + embed_tokens + lm_head + layer-norms + biases).

Crucial dependency: PR #1058 (perf fix to RosettaStone::load_tensor_f32_apr) unblocks this scan. Before #1058, `apr diff --values --limit N` for N>10 called std::fs::read on the 8 GB APR file per tensor: 339 × 8 GB ≈ 2.7 TB of total read traffic, which is infeasible. The mmap fix delivered a 13× speedup on limit=50 and made the full 339-tensor sweep complete in 192 s.

Files changed:
- contracts/qwen2-e2e-verification-v1.yaml v1.9.0 → v1.10.0
  FALSIFY-QW2E-SHIP-003 discharge_status: PARTIAL_ALGORITHM_LEVEL → DISCHARGED
  discharged_evidence block: host, command, artifacts (sha+size), 339-tensor cosine_summary (min/max/below_threshold), worst_5_tensors, aggregate_verdict, evidence_discharged_by_live array, runtime_seconds, runtime_note.
- crates/aprender-core/src/format/ship_003.rs
  Added drift-prevention YAML binding test `falsify_ship_003_yaml_binding_pins_discharged_status` parsing qwen2-e2e-verification-v1.yaml and asserting:
  * discharge_status == "DISCHARGED"
  * discharged_evidence.host == "noah-Lambda-Vector"
  * discharged_evidence.aggregate_verdict == "Pass"
  * discharged_evidence.tensors_compared == 339
  * discharged_evidence.cosine_summary.below_threshold_count == 0
  * evidence_discharged_by_live non-empty
- docs/specifications/aprender-train/ship-two-models-spec.md v2.56.0 → v2.57.0 with full atomic-next-action narrative. Coverage tally: 35 PARTIAL + 10 DISCHARGED → 34 + 11.
- evidence/ship-003-full-discharge/discharge-evidence-v1.json (NEW)
  Self-contained discharge summary with full artifact paths, cosine_summary, worst_5/best_5 tensors, verification_chain, tooling_chain_proof, discharge_rationale.
- evidence/ship-003-full-discharge/apr-diff-339.json (NEW, 164 KB)
  Raw apr diff --json output: 339 tensor comparisons with per-tensor cosine_similarity, element_count, identical_count, max_diff, mean_diff, rmse, shape_a/b, status. Reproducible from the local apr binary + canonical lambda-labs paths.

Verification (all green):
- cargo test -p aprender-core --lib ship_003: 4/4 PASS (3 existing verdict + 1 gate + 1 new YAML binding)
- pv validate contracts/qwen2-e2e-verification-v1.yaml: PASS
- Live `apr diff --values --limit 339 --json` exit 0, 339 results emitted

Methodological note: zero `eprintln!`, zero bash workaround, zero parallel-implementation. Pure `apr diff --values --transpose-aware` end-to-end on a 7.6B-param shipped teacher. Honors `feedback_apr_trace_not_eprintln.md` and `feedback_pv_not_bash_for_contracts.md`. Mirrors the SHIP-001/004/009/010 closure pattern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
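The per-layer cosine check described in this commit can be illustrated with a minimal Rust sketch. It is not the crate's `verdict_from_per_layer_cosines`; the names `cosine_similarity`, `Verdict`, and `verdict_from_cosines_sketch` are hypothetical, and the only values taken from the commit message are the 0.999 floor and the reported minimum cosine.

```rust
/// Cosine similarity of two equally sized tensors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero-norm side.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Hypothetical aggregate verdict; the real contract may carry more states.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass,
    Fail,
}

/// Pass only if at least one cosine was measured and none falls below
/// `floor`. A NaN cosine compares false against the floor and so fails.
pub fn verdict_from_cosines_sketch(cosines: &[f64], floor: f64) -> Verdict {
    if !cosines.is_empty() && cosines.iter().all(|&c| c >= floor) {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}
```

Treating an empty input as a failure is a design choice of this sketch: a sweep that compared zero tensors should not count as evidence of parity.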
…-values on 7B
Bug: `RosettaStone::load_tensor_f32_apr` called `std::fs::read(path)` per
invocation, reading the entire APR file (8 GB for the SHIP-TWO-001 teacher)
into a heap-owned `Vec<u8>` on every call. Callers that walk every tensor
in the file (e.g. `apr diff --values --limit 339` for SHIP-003 cosine
verification) paid an N×file_size read cost: 339 × 8 GB ≈ 2.7 TB of
total read traffic. `apr diff` timed out at >12 minutes with limit≥20.
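The read-traffic arithmetic above can be checked with a few lines of Rust. This is only a worked calculation under the assumption of decimal units (1 GB = 10^9 bytes); the helper names are hypothetical and nothing here touches a real APR file.

```rust
/// Decimal gigabyte, assumed for this worked example.
pub const GB: u64 = 1_000_000_000;

/// Bytes read when every one of `calls` tensor loads re-reads the whole file.
pub fn full_read_traffic_bytes(file_size_bytes: u64, calls: u64) -> u64 {
    file_size_bytes.saturating_mul(calls)
}

/// Convert a byte count to decimal terabytes for reporting.
pub fn bytes_to_tb(bytes: u64) -> f64 {
    bytes as f64 / 1e12
}
```

With an 8 GB file and 339 calls this gives 2,712 GB, or about 2.7 TB, which matches the figure quoted for the pre-fix path.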
Root cause: peer paths (`load_tensor_f32_safetensors`) already use
`MappedSafeTensors::open` (mmap), but the APR path was constructed via
`AprV2Reader::from_bytes(&data)` after `std::fs::read(...)`. The
`AprV2ReaderRef` zero-copy reader designed for use with mmap was right
there in the same module — it just wasn't wired through.
Fix: replace `std::fs::read` + `AprV2Reader` with
`MappedFile::open(path)` + `AprV2ReaderRef::from_bytes(mapped.as_slice())`.
The borrowed-slice reader still emits owned `Vec<f32>` from
`get_tensor_as_f32`, so the public API is unchanged — only the load
path is mmap-based now.
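The pattern behind the fix, a reader that borrows a byte slice but still returns an owned `Vec<f32>`, can be sketched with the standard library alone. Everything below is hypothetical: `TensorReaderRef` is not `AprV2ReaderRef`, the layout is raw little-endian f32 rather than the APR format, and a plain `Vec<u8>` stands in for the memory-mapped region that `MappedFile` would provide.

```rust
/// Hypothetical borrowed reader: it holds a slice and owns no file bytes.
/// In the real fix the slice would come from a memory map.
pub struct TensorReaderRef<'a> {
    bytes: &'a [u8],
}

impl<'a> TensorReaderRef<'a> {
    /// Wrap an existing byte slice without copying it.
    pub fn from_bytes(bytes: &'a [u8]) -> Self {
        Self { bytes }
    }

    /// Decode `len` little-endian f32 values starting at byte `offset`.
    /// Only the requested range is touched; the result is an owned Vec,
    /// so callers never see the borrow. Returns None if out of bounds.
    pub fn tensor_as_f32(&self, offset: usize, len: usize) -> Option<Vec<f32>> {
        let byte_len = len.checked_mul(4)?;
        let end = offset.checked_add(byte_len)?;
        let raw = self.bytes.get(offset..end)?;
        Some(
            raw.chunks_exact(4)
                .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }
}

/// Helper for the example: serialize f32 values as little-endian bytes.
pub fn encode_f32_le(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

Because the return type stays an owned `Vec<f32>`, swapping the backing storage from a heap buffer to a mapped slice does not change what callers receive, which is the property the commit relies on to keep the public API unchanged.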
Verified speedup on noah-Lambda-Vector RTX 4090 against the canonical
teacher artifacts:
apr diff <safetensors-15GB> <apr-8GB> --values --limit 50 --json
- Before: timed out at 60s (would have run >12 min)
- After: 27.7 s (>13× speedup, completes deterministically)
This unblocks `apr diff` as the harness for FALSIFY-SHIP-003 (per-layer
cosine ≥ 0.999 across all 339 weight tensors) and any other tool that
walks a 7B-class APR. The peer fast path (safetensors → mmap) is now
matched on the APR side.
Verification:
- cargo test -p aprender-core --lib rosetta — 259/259 PASS
(existing test suite unchanged)
- Live `apr diff` on 15 GB safetensors vs 8 GB apr completes in 27.7 s
with `limit=50` (was timing out at 60 s before)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fac8c64 to eadc535
noahgift added a commit that referenced this pull request on Apr 25, 2026
….0 → v2.59.0 (#1060)

Records the SHIP-007 GQA-7:1 parity bug investigation thread captured during the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and updates the atomic-next-action banner to reference it. No discharge promotion (coverage tally unchanged at 33+12). This is investigation-recording, not rule promotion.

What §15 contains:

15.1 Surface symptoms, the two independent observations:
- apr bench parity gate: CPU argmax 334 vs GPU argmax 8127, cosine=−0.005 (anti-correlated, structural divergence)
- apr qa --json on GGUF: format_parity reports GGUF argmax=17 != SafeTensors argmax=59260
- 370M MODEL-2 training works on the same RTX 4090, so the bug is GQA-7:1-specific, not GPU-host-wide.

15.2 Five Whys: traces the surface symptom from the parity-gate failure down to the load-bearing edge case: GQA's num_heads ≠ num_kv_heads makes layout-then-reshape order non-commutative, while MHA (num_heads = num_kv_heads, where 370M training lives) is invariant under either order.

15.3 Root-cause hypothesis: a GQA-7:1-specific layout-vs-reshape ordering bug on K and/or V projections that causes CPU and GPU forward to consume the same physical bytes with different effective head-axis interpretations, compounding through 28 transformer blocks into anti-correlated logits.

15.4 Falsifiable next investigation step: a single-tensor Q × K^T element-by-element comparison on model.layers.0.self_attn.k_proj.weight from the row-major-guaranteed APR (SHIP-003 PR #1059), then iterate through V, attention scores, weights, output, and o_proj until the divergent stage is named. Per feedback_apr_trace_not_eprintln.md, this is the proper TraceStep-extension path, not eprintln!.

15.5 Side-bug noted: apr diff --transpose-aware appears not to apply the transpose before cosine computation when shapes are [a,b] vs [b,a]. Filed as a separate apr-cli ticket. Does not affect SHIP-007 root-cause analysis: SafeTensors↔APR same-shape comparison via SHIP-003 #1059 confirmed weight-byte parity at cos≥0.9999999.

15.6 Blast radius inventory: the remaining 5 MODEL-1 PARTIALs (SHIP-002 / 005 / 006 / 007 / 008) all transitively block on this single fix. A single root-cause fix discharges all 5 simultaneously, making it the highest-leverage MODEL-1 work item remaining.

15.7 Methodological note: entire investigation conducted using only apr CLI tooling (apr diff, apr qa, apr bench, apr inspect). Zero eprintln! injected into forward.rs / ffn_block.rs / CUDA kernels. Honors feedback_apr_trace_not_eprintln.md.

Evidence chain:
1. apr diff --values --limit 339 (post-#1058 mmap fix, 192s on 15GB safetensors / 8GB APR pair): SafeTensors↔APR cos≥0.9999999.
2. apr diff --values --limit 3 on GGUF↔APR, SafeTensors↔GGUF: revealed shape asymmetry, GGUF [in,out] vs APR/SafeTensors [out,in].
3. apr qa --json on both APR and GGUF teachers: revealed cross-format argmax divergence.
4. SHIP-007 GPU parity gate's existing telemetry: confirmed structural divergence.

Methodological consistency with the 6 PR cascade preceding this amendment: pure stack tooling, contract-backed numbers, drift-prevention pattern. This commit is documentation only (no Rust changes, no contract changes) but pins the investigation thread durably in the spec where future investigators (and the next multi-PR TraceStep extension effort) will find it.

Spec v2.58.0 → v2.59.0. Atomic-next-action banner updated to point at §15 as the load-bearing investigation surface.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
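The side-bug in 15.5 concerns comparing a tensor stored as [a,b] against one stored as [b,a]. The sketch below shows what a transpose-aware comparison is expected to do: bring both sides into the same row-major layout, then compute the cosine. It is an illustration only, not the `apr diff` implementation, and all function names are hypothetical.

```rust
/// Transpose a row-major [rows, cols] matrix into row-major [cols, rows].
pub fn transpose(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(data.len(), rows * cols, "shape does not match data length");
    let mut out = vec![0.0f32; data.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = data[r * cols + c];
        }
    }
    out
}

/// Plain element-wise cosine; None on length mismatch or a zero-norm side.
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Compare two 2-D tensors. If the shapes are swapped ([a,b] vs [b,a]),
/// transpose `b` first. Square matrices are ambiguous and are compared
/// as-is here, which is a limitation of this sketch.
pub fn transpose_aware_cosine(
    a: &[f32],
    shape_a: (usize, usize),
    b: &[f32],
    shape_b: (usize, usize),
) -> Option<f64> {
    if a.len() != shape_a.0 * shape_a.1 || b.len() != shape_b.0 * shape_b.1 {
        return None;
    }
    if shape_a == shape_b {
        cosine(a, b)
    } else if shape_a == (shape_b.1, shape_b.0) {
        let b_t = transpose(b, shape_b.0, shape_b.1);
        cosine(a, &b_t)
    } else {
        None
    }
}
```

Skipping the transpose and comparing the raw buffers element by element yields a cosine well below the 0.999 floor even for identical weights, which is the failure mode the side-bug describes.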
noahgift added a commit that referenced this pull request on May 4, 2026
…rift gate (#1448)

Two related preparation steps for the v0.32.0 cut decision:

## CHANGELOG

Fill out the empty `[Unreleased]` section with today's session body of work (238 commits since v0.31.2):

- **CPU/GPU output parity contract** (jidoka armor): `apr-cpu-vs-gpu-output-parity-v1` v1.0.0 → v1.5.0 ACTIVE with **5/5 falsifiers DISCHARGED** in a single 2-PR cycle (#1445 + #1446); first contract in the SHIP-TWO program to reach complete-evidence terminal state. CUDA + wgpu fallback log prefixes + inline cosine parity gate.
- **`apr trace --save-tensor`**: new flag for SHIP-007 layer-0 oracle bisection; `apr-cli-trace-save-tensor-v1` v1.4.0 FUNCTIONAL.
- **HF FP16 oracle bisection**: pinpoints SHIP-007 to layer-0 attn_out (cos=0.99999995 attn_norm → 0.9966 attn_out).
- **Distillation training contract**: 9/9 falsifiers algorithm-bound.
- **MoE expert dispatch parallelized**: 2× speedup (#1396).
- **APR file mmap**: unblocks `apr diff --values` on 7B (#1058).
- **M32d numerical-parity bundle**: Q/K RMSNorm + rope_theta + chat template (#1228).
- **150+ contract algorithm-bind sweep**: record cycle, kernel + format + training + GPU-backend + CLI families flipped from `unbound` to `PARTIAL_ALGORITHM_LEVEL`.

## README drift gate repair

`bash scripts/check_readme_claims.sh` was FAILING:
- README claimed 1096 contracts, filesystem has 1105
- README claimed 79 CLI commands, `apr --help` lists 80

Fixed both numbers in the contract-backed table AND the prose references. Drift gate now PASS 4/4.

Five Whys:
1. Why was the gate failing? README contract counts and CLI counts are stale.
2. Why are they stale? 9 new contracts and 1 new CLI command merged since the last README update.
3. Why didn't the gate catch it earlier? It's a script, not yet wired into CI as a hard gate (FALSIFY-README-001..004 are PARTIAL_ALGORITHM_LEVEL, the shell wrapper is documented in the contract but doesn't fail PRs).
4. Why isn't it a CI gate yet? `readme-claims-v1` is recent (2026-04-24), wired to `bash scripts/check_readme_claims.sh` but not to a workflow step.
5. Why fix it now? Pre-release hygiene: releases must ship green drift gates per `feedback_post_publish_qa_required.md`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
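The drift gate compares numbers claimed in the README against numbers observed in the repository. The actual gate is the shell script named above; the Rust below is only a sketch of the comparison logic with hypothetical names (`Drift`, `check_claims`), fed with the two stale counts quoted in this commit.

```rust
/// One stale claim: what the README said versus what was observed.
#[derive(Debug, PartialEq)]
pub struct Drift {
    pub claim: String,
    pub claimed: u64,
    pub actual: u64,
}

/// Return every (name, claimed, actual) triple whose numbers disagree.
/// An empty result means the gate passes.
pub fn check_claims(claims: &[(&str, u64, u64)]) -> Vec<Drift> {
    let mut drifts = Vec::new();
    for &(name, claimed, actual) in claims {
        if claimed != actual {
            drifts.push(Drift {
                claim: name.to_string(),
                claimed,
                actual,
            });
        }
    }
    drifts
}
```

Returning the full list of mismatches, rather than stopping at the first, lets a single run report both the contract count and the CLI command count as stale.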
Summary
- `RosettaStone::load_tensor_f32_apr` called `std::fs::read(path)` per invocation, reading the entire APR file (8 GB for the SHIP-TWO-001 teacher) into a heap-owned `Vec<u8>` on every call. Callers that walk every tensor (e.g. `apr diff --values --limit 339` for SHIP-003 cosine verification) paid an N×file_size read cost: 339 × 8 GB ≈ 2.7 TB of total read traffic. `apr diff` timed out at >12 minutes with limit≥20.
- `load_tensor_f32_safetensors` uses `MappedSafeTensors::open` (mmap). The APR path constructed via `AprV2Reader::from_bytes(&data)` after `std::fs::read(...)` was the slow asymmetric path. The zero-copy `AprV2ReaderRef` designed for mmap was right there in the same module, just unwired.
- `MappedFile::open(path)` + `AprV2ReaderRef::from_bytes(mapped.as_slice())`. The borrowed reader still emits owned `Vec<f32>` from `get_tensor_as_f32`, so the public API is unchanged.

Live measurement (noah-Lambda-Vector RTX 4090)

Test plan
- `cargo test -p aprender-core --lib rosetta`: 259/259 PASS (existing suite unchanged)
- `apr diff <safetensors> <apr> --values --limit 50` on the 15 GB / 8 GB SHIP-TWO-001 teacher pair completes in 27.7 s
- ci / gate green (auto)

Impact
`RosettaStone::load_tensor_f32` on APR files: `apr diff`, `apr trace`, `apr debug`, `apr profile`, `apr inspect` cross-reference paths.

Files changed
- `crates/aprender-core/src/format/rosetta/arch_inference.rs`: `load_tensor_f32_apr`: `std::fs::read` → `MappedFile::open` + `AprV2ReaderRef`

🤖 Generated with Claude Code