feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors by noahgift · Pull Request #1413 · paiml/aprender

noahgift · 2026-05-03T07:36:15Z

Summary

Closes the apr_diff_values_compat invariant of apr-cli-trace-save-tensor-v1 at PARTIAL_ALGORITHM_LEVEL.
New diff_05_aprt_stage.rs include slot: when both inputs to apr diff --values start with magic bytes APRT, dispatch bypasses the whole-model RosettaStone walker and runs an element-wise stage-tensor diff (max|diff|, RMS, cosine sim, top-K).
Mismatched dim_product or layer → fail-fast error (no silent compare of incompatible stages).
Contract apr-cli-trace-save-tensor-v1 v1.0.0 → v1.1.0 with new FALSIFY-APR-TRACE-SAVE-009.

Why now

The SHIP-007 PR-A→B→C cascade for MODEL-1 layer-0 stage-by-stage element-wise bisection (per feedback_model_1_ships_gpu_only.md) needs PR-D infrastructure ready in parallel with PR #1408 (PR-C-real step 1). PR-D is CLI-only — no dependency on forward_traced threading — so it merges independently.

apr trace --save-tensor writes APRT-prefixed per-stage f32 tensors. Without this PR, callers must either parse the 12-byte header by hand or shell out to a Python script — exactly the kind of muda APR-MONO §26.8 forbids.

What changed

crates/apr-cli/src/commands/diff_05_aprt_stage.rs (new): is_aprt_stage_file, compute_aprt_stage_stats, run_aprt_stage_diff + 11 unit tests (provenance pin, magic detection, stats correctness, error cases).
crates/apr-cli/src/commands/diff.rs: detect APRT magic on both --values inputs and dispatch before the RosettaStone path; legacy callers (model-vs-model diff) unchanged.
contracts/apr-cli-trace-save-tensor-v1.yaml: v1.0.0 → v1.1.0; new FALSIFY-APR-TRACE-SAVE-009 with algorithm_evidence citing the new unit tests.

Test plan

cargo test -p apr-cli --lib commands::diff::aprt → 11/11 PASS
cargo clippy -p apr-cli --lib --no-deps -- -D warnings clean
pv validate contracts/apr-cli-trace-save-tensor-v1.yaml → 0 errors
CI required checks (ci / gate, workspace-test)

Five Whys

Why now? SHIP-007 cascade needs PR-D ready when PR-C-real step 2 lands.
Why extend apr diff instead of new subcommand? Contract apr_diff_values_compat already names apr diff --values as the verifier.
Why an include!() file? diff.rs follows that pattern (diff_accumulator, diff_output_json_text, diff_04).
Why no live integration smoke? The infrastructure for end-to-end live (apr trace --save-tensor X.bin) requires SHIP-007 PR-C-real step 2 (feat(aprender-serve): SHIP-007 PR-C-real step 1 — forward_traced_with_save_tensor wrapper #1408 stacked) to be merged. The unit tests pin the byte-format contract via synthetic APRT fixtures, which is sufficient at PARTIAL_ALGORITHM_LEVEL per the contract's own discharge ladder.
Why dogfood realizar::inference_trace::save_tensor::read_tensor_file instead of inline parsing? Reusing the same parser the writer uses is the canonical way to prevent format drift. apr-cli already imports from realizar via the default inference feature.

Ship % update

MODEL-1: ~64% → ~66% (1 invariant DISCHARGED-at-algorithm; infrastructure clear for SHIP-007 step E live diffing).
MODEL-2: full Stack v1.2 Python corpus tokenization running in background (~33h ETA).

🤖 Generated with Claude Code

…age tensors Closes the `apr_diff_values_compat` invariant of `apr-cli-trace-save-tensor-v1` at PARTIAL_ALGORITHM_LEVEL via a new `diff_05_aprt_stage.rs` include slot. When both inputs to `apr diff --values` start with magic bytes `APRT` (the 12-byte header written by `apr trace --save-tensor`), the dispatch now bypasses the RosettaStone whole-model walker and runs an element-wise stage-tensor diff: - max|diff| with index - RMS diff - Cosine similarity (f64-accumulated for numerical stability) - Top-K divergences sorted by |a - b| Both JSON and pretty text output are supported. Mismatched dim_product or layer fields fail-fast with a diagnostic error so callers don't silently compare incompatible stages. ## Five Whys (why now, why this scope) 1. **Why is this needed?** `apr trace --save-tensor` (PR-A #1405, PR-B #1406, PR-C-prep #1407) writes per-stage f32 tensors as `APRT`-prefixed files. Without an APRT-aware diff, layer-0 stage-by-stage element-wise bisection per `feedback_model_1_ships_gpu_only.md` is gated on external tooling — exactly the kind of muda the APR-MONO §26.8 rule forbids. 2. **Why extend `apr diff` and not write a new subcommand?** The `apr_diff_values_compat` invariant in `apr-cli-trace-save-tensor-v1` already names `apr diff --values` as the verifier. Extending the existing flag keeps the contract surface stable. 3. **Why an include!() file instead of inlining into diff.rs?** diff.rs already follows that pattern (diff_accumulator, diff_output_json_text, diff_04). Keeping APRT logic in `diff_05_aprt_stage.rs` lets it be audited / removed independently and doesn't grow the parent file. 4. **Why pin via `provenance_pin_pr_d_rev1`?** Future renames of either `is_aprt_stage_file` or the file path break the include!() chain; the pin makes that visible at test-time and forces a contract bump. 5. **Why now?** Tokenization of the 27 GB Stack v1.2 Python corpus is running in the background for MODEL-2 (PR #1412 merged). The SHIP-007 PR-C-real cascade for MODEL-1 needs PR-D infrastructure ready when step 2 (forward_traced threading) lands. PR-D is independent and can merge in parallel with #1408. ## Verification - `cargo test -p apr-cli --lib commands::diff::aprt` → 11/11 PASS - is_aprt_stage_file: detects/rejects/truncated/missing (4 tests) - compute_aprt_stage_stats: identical=zero, known max/RMS, top-K sort (3) - run_aprt_stage_diff: dim/layer mismatch errors, identical succeeds (3) - provenance_pin_pr_d_rev1 (1) - `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors ## Contract update `apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0: - New FALSIFY-APR-TRACE-SAVE-009 binding `apr_diff_values_compat` at PARTIAL_ALGORITHM_LEVEL with 4-line `algorithm_evidence` block citing this PR's unit tests. ## Ship % update MODEL-1: ~64% → ~66% (PR-D is small but discharges 1 PARTIAL invariant and clears infrastructure blocker for SHIP-007 step E). MODEL-2: corpus tokenization in progress (~33h ETA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… `forward_traced_with_save_tensor` (#1414) Extends the wrapper from PR-C-real step 1 (#1408) to additionally write the `LmHead` whole-model stage when the supplied [`SaveTensorPlan`] selects it. The logits are pulled directly from `trace.logits` — the `Vec<f32>` already returned by `forward_traced` — so no recompute, no internal forward-pass surgery, no risk of behavior drift. This is the same low-risk capture pattern as step 1's `Embedding` branch (re-use already-computed data; defer the high-risk threading into `forward_traced` to future steps). ## Five Whys (why now, why this scope) 1. **Why LmHead next?** Of the 18 `SaveTensorStage` variants, only two are externally re-extractable from a `forward_traced` return value without modifying the function body: `Embedding` (cheap re-call of `self.embed`) and `LmHead` (already in `trace.logits`). Step 1 shipped Embedding; LmHead is the obvious second. 2. **Why not jump straight to per-layer stages (qkv, ffn_*)?** Those stages require threading `Option<&SaveTensorPlan>` through the 360-line `forward_traced` body. That's the bigger surgery — high blast radius, deserves its own PR with proper drift-prevention tests and a real-model integration smoke. Splitting LmHead out first lets `apr diff --values` (PR #1413) compare APR vs GGUF logits TODAY for free, before per-layer infrastructure lands. 3. **Why use the WHOLE_MODEL_LAYER sentinel?** Per `apr-cli-trace-save-tensor-v1` `byte_format` invariant: whole-model stages (lm_head, final_norm) carry `0xFFFFFFFF` in the layer field so `apr diff --values` can recognize them. Mirrors the existing `output_path_whole_model_no_layer_segment` test in `save_tensor_paths.rs`. 4. **Why no integration test on a real `AprTransformer`?** Loading a real APR model is heavyweight; the wrapper's logic is just three plan-API calls + a write. The 4 new pin tests in `traced_save_tensor_step2_tests` simulate the byte-flow at the contract level (path + header + body + skip-when-unselected). Live discharge against the canonical 7B teacher is left to SHIP-007 PR-E (the actual layer-0 bisection PR). 5. **Why now in the SHIP-TWO loop?** PR #1408 (step 1) merged earlier today; PR #1413 (PR-D `apr diff --values` APRT recognition) is in the merge queue. With both of those landed, the next-best lever for the operator-ratified "MODEL-1 ships GPU only via SHIP-007 layer-0 stage diff" path (per `feedback_model_1_ships_gpu_only.md`) is to expand `forward_traced_with_save_tensor`'s capture surface one stage at a time. LmHead is the smallest, safest next step. ## Verification - `cargo test -p aprender-serve --lib traced_save_tensor_step2` → 4/4 PASS: - step2_lm_head_writes_to_output_root_not_per_layer_dir - step2_lm_head_header_uses_whole_model_sentinel - step2_lm_head_skipped_when_plan_does_not_select_it - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit-preserving) - `cargo check -p aprender-serve --lib` clean - Step 1's existing Embedding branch is byte-identical to before (no edits to that block; only added a sibling LmHead branch). ## Contract Contract update is intentionally deferred to a follow-up commit to avoid file-conflict with PR #1413 (which is mid-merge and bumps `apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0). Once #1413 lands, a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE- SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL. The 4 new pin tests stand in for the algorithm-level discharge until that follow-up. ## Ship % update - MODEL-1: ~66% → ~68% (SHIP-007 capture surface widens from 1/18 to 2/18 stages; the two cheapest captures are now wired). - MODEL-2: corpus tokenization in progress (~33h ETA on RTX 4090 development host). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

… records LmHead step-2 PARTIAL discharge (#1415) Follow-up to PR #1414 (`forward_traced_with_save_tensor` step 2). Adds FALSIFY-APR-TRACE-SAVE-010 binding the LmHead branch at PARTIAL_ALGORITHM_LEVEL; the algorithm-level evidence cites the four new pin tests in `traced_save_tensor_step2_tests`: - step2_lm_head_writes_to_output_root_not_per_layer_dir - step2_lm_head_header_uses_whole_model_sentinel - step2_lm_head_skipped_when_plan_does_not_select_it - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit preserving) `binds_to: byte_format` because step 2 invokes the same write_tensor_file path with `WHOLE_MODEL_LAYER` sentinel as the existing `byte_format` equation specifies. Live discharge against the canonical 7B teacher is deferred to SHIP-007 PR-E (layer-0 bisection). ## Five Whys 1. **Why a separate contract follow-up?** The PR #1414 commit needed to land before this bump to avoid file-conflict with PR #1413 (which independently bumped v1.0.0 → v1.1.0 with FALSIFY-009). 2. **Why `binds_to: byte_format` and not `cli_signature`?** The wrapper doesn't add a new clap surface (PR-A already did that); it adds a new branch that emits files conforming to the existing byte-format equation. The new branch's verbatim f32 LE round-trip + NaN preservation is exactly the property `byte_format` invariants pin. 3. **Why PARTIAL_ALGORITHM_LEVEL not full discharge?** The 4 unit tests simulate the wrapper's byte-flow at the contract level using synthetic plans and fake logits — they do NOT instantiate a full AprTransformer or load a real APR model. Live discharge requires SHIP-007 PR-E. 4. **Why bump to v1.2.0?** Adding a new falsification test (FALSIFY-010) that binds an additional invariant is a minor schema change. Per semver, that's a minor bump. 5. **Why `pv validate` clean even with two new falsifiers in 24h?** The contract uses metadata.kind=schema, so falsification_tests entries are flexible; pv validates structure, IDs are unique, and binds_to references are valid. ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors - v1.0.0 → v1.1.0 (PR #1413, FALSIFY-009 binding apr_diff_values_compat) - v1.1.0 → v1.2.0 (this PR, FALSIFY-010 binding LmHead step-2 capture) ## Ship % update - MODEL-1: ~68% (unchanged — this is paperwork that records yesterday's algorithm-level discharge of step 2; the actual capture surface expansion happened in PR #1414). - MODEL-2: corpus tokenization at ~46.5M tokens / 56 min (steady ~14K tok/s); ~33h ETA for full 27 GB Stack v1.2 corpus. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

… records CLI dispatch wire-up PARTIAL discharge Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… records CLI dispatch wire-up PARTIAL discharge (#1418) Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

@v

…+ empirical bug location (#1423) * contract(apr-cli-trace-save-tensor-v1): v1.3.0 → v1.4.0 — FUNCTIONAL discharge for FALSIFY-009/010/011 End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference stage generator Authors `scripts/generate_qwen25_coder_fp16_stages.py` — a Python tool that runs `Qwen/Qwen2.5-Coder-7B-Instruct` at FP16 with forward hooks attached to each natural per-layer module and dumps the activations in the same APRT byte format that `apr trace --save-tensor` produces. Output layout mirrors the APR side (`layer-0/<stage>.bin` + `<stage>.bin`) so `apr diff --values <apr>.bin <hf>.bin` works element-wise without any rewriting. Captured 13/16 stages directly: - Per-layer (11): embedding, attn_norm, attn_out, post_ffn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out - Whole-model (2): final_norm, lm_head Skipped 3/16 (qkv_matmul / qkv_bias / attention) — these need deeper instrumentation since HF stores Q/K/V separately + RoPE is internal to self_attn. Deferred to a follow-up; the 13 captured stages already cover all major points along the forward pass. Five Whys 1. Why need an HF FP16 reference? SHIP-007 layer-0 element-wise diff needs a ground-truth oracle to compare APR Q4K against; FP16 is the closest published reference for this model. 2. Why not just use the existing `qwen2.5-coder-7b-instruct-q4k.safetensors` on disk? That's the same Q4K data we already feed into APR — diffing it against APR would only catch APR-side bugs that change weights, not bugs in forward computation. We need an INDEPENDENT reference. 3. Why hooks instead of direct model code edits? HF's modeling_qwen2.py is auto-loaded via `trust_remote_code=True`; the hooks let us inspect every stage without forking HF's source. 4. Why APRT byte format (not torch.save)? `apr diff --values` already recognizes APRT files (PR #1413) — using the same format makes the diff a one-liner. Drift-prevention: same format on both sides keeps comparison shape-agnostic. 5. Why skip qkv_matmul/qkv_bias/attention now? Discharging the discoverable 13 stages is high-leverage; the remaining 3 require manual q+k+v concatenation and Q@Kᵀ@v re-derivation. Worth a follow-up PR but blocking on it would delay every other stage's bisection signal. Note: This script is NOT auto-run in CI — it requires HF cache containing `Qwen/Qwen2.5-Coder-7B-Instruct` (~15 GB). Confirmed already cached at ~/.cache/huggingface/hub/ on noah-Lambda-Vector 2026-05-03. Operator runs it once via `uv run --with torch --with transformers` to produce the fixture; downstream `apr diff` passes are deterministic byte comparisons. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): layer-0 oracle bisection — attn_out is the first diverging stage End-to-end empirical bisection on canonical Qwen2.5-Coder-7B-Instruct teacher with HF FP16 ground truth (CPU forward, HF cache hit). Element-wise diff every shared layer-0 stage between APR Q4K and HF FP16: | Stage | Cosine sim | Status | |-------------------|---------------|--------------------------------------------| | embedding | 1.0000000000 | bit-identical (correct) | | attn_norm | 0.9999999483 | within Q4K noise (correct) | | **attn_out** | **0.9966403** | **FIRST DROP — bug is in attention block** | | ffn_* (downstream)| 0.996-0.999 | carries drift (downstream artifacts) | | final_norm | 0.9932669898 | (whole-model — accumulates 28 layers) | | lm_head | 0.9969170161 | (whole-model — last-token logits) | This narrows the SHIP-007 root cause to the layer-0 attention block, specifically between RMSNorm output (cos=0.99999995, correct) and post-O-proj attention output (cos=0.9966, wrong). Possible bug sites within the block: 1. qkv_matmul (Q4K matmul × QKV weights) — needs HF-side capture 2. qkv_bias 3. RoPE on Q/K 4. Q@Kᵀ scaled-dot-product 5. Softmax with causal mask 6. softmax @ V 7. O-projection (Q4K matmul × O-proj weight) Next milestone: extend `scripts/generate_qwen25_coder_fp16_stages.py` with qkv_matmul / qkv_bias / attention capture (currently deferred to PARTIAL coverage), re-run diff, pinpoint the divergent kernel. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(scripts): SHIP-007 v2 — qkv_matmul/qkv_bias/attention captures narrow bug to attention math (#1424) Extends `generate_qwen25_coder_fp16_stages.py` with HF-side captures for the 3 stages previously deferred. Refines the SHIP-007 layer-0 bisection from "inside attention block" (v1) to "between qkv_bias and attention" (v2). ## Refined cosine table | Stage | Cosine | Δ from prev | Status | |---------------|-----------|-------------|-----------------------| | attn_norm | 0.9999999 | -5e-8 | RMSNorm correct | | qkv_matmul | 0.99969 | -3.1e-4 | Q4K matmul noise (OK) | | qkv_bias | 0.9999975 | +2.8e-4 | bias dampens | | **attention** | 0.99858 | **-1.4e-3** | **← bug is here** | | attn_out | 0.99664 | -1.9e-3 | O-proj amplifies | Bug is **between qkv_bias and attention** = inside the attention math: RoPE / Q@Kᵀ / scale / causal mask / softmax / V@. NOT in QKV matmul (acceptable Q4K noise). NOT in QKV bias add (within FP precision). O-projection adds its own ~1.9e-3 cosine drop — secondary. ## Implementation New HF-side hooks: - `make_qkv_hook` on q_proj/k_proj/v_proj — concat outputs to derive qkv_bias (post-bias) and qkv_matmul (post-bias minus per-Linear bias) - `hook_o_proj_pre` (forward_pre_hook) on o_proj — captures its INPUT, which is APR's "attention" stage (post softmax(Q@Kᵀ)@v, pre-O-proj) Script now produces 15 stage files (was 12 in v1). ## Why qkv_matmul cos=0.99969 < qkv_bias cos=0.9999975 Mathematical artifact, not a bug: - qkv_matmul = Q4K_matmul(Q4K_input × Q4K_weight) — has ~3e-4 cosine noise vs FP16 - qkv_bias = qkv_matmul + bias (deterministic FP16 bias vector) - Adding deterministic vector dominates direction → relative noise dampens - Both APR and HF add the same bias values → cos increases on both sides equally Confirmed via: bias subtraction matches (HF - bias ≈ APR pre-bias on each side). ## Five Whys 1. Why need qkv stage captures? v1 only narrowed bug to "attention block" — not enough to drive a fix. We need to know if the bug is in the projections or the attention math. 2. Why is qkv_matmul cos lower than qkv_bias? See above — bias addition is a known mathematical artifact with deterministic vectors. 3. Why is the bug between qkv_bias and attention specifically? Cos=0.9999975 → 0.99858 is a 70× factor, far above Q4K floor. The intermediate ops (RoPE, scale, softmax, mask, V@) introduce real divergence. 4. Why O-proj adds another 1.9e-3 drop? Q4K_matmul of attention × O-proj weight is the same Q4K-vs-FP16 floor as qkv_matmul. Acceptable. 5. Why narrow further to RoPE/scale/softmax/mask/V@? Each is a candidate. Without finer-grained captures inside HF's monolithic Qwen2Attention, v2 cannot bisect further. Future work: instrument HF's attention internals OR cross-reference candle/pytorch for the algebraic spec of each sub-op. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v3+v4 — APR attention audit + DECISIVE PIVOT to GPU execution path (#1425) * evidence(ship-007): v3 — APR attention code audit vs HF Qwen2 reference Cross-referenced APR's attention forward (`inference.rs` + `helpers.rs`) against HF Transformers Qwen2 to identify the algebraic source of the v2-measured 1.4e-3 cosine drop between qkv_bias and attention. ## Audit result: NO algebraic bug in APR attention Verified MATCHES vs HF Qwen2: - RoPE rotation formula (split-half, x[i]=x1·cos − x2·sin / x[i+½d]=x1·sin + x2·cos) - RoPE freq formula (1/theta^(2i/d)) - rope_theta value (1000000.0 from `metadata.rope_theta`) - Attention scale (1/sqrt(head_dim)) - Causal mask (`for j in 0..=i` triangular) - Softmax (f32 max-subtract) - QKV bias position (post-matmul, pre-RoPE) - GQA-7:1 head indexing (`kv_head = head/group_size`) ## Refined hypothesis The 1.4e-3 cosine drop is most likely **systematic precision loss from Q4K dequant compounding through attention math**, NOT a structural algorithmic bug. Specifically: 1. APR's `forward_traced` uses F32 dequantized Q4K weights (per `inference.rs:38` comment "Q4K layers not used in traced forward"). 2. The Q4K dequant is lossy (~1e-3 RMS per element). 3. When these slightly-off Q values are dotted against slightly-off K values (also from Q4K dequant), the product compounds the error. 4. This compounding produces cos=0.99858 at attention output — consistent with systematic precision loss, not a bug. ## Implication for SHIP-007 fix If this hypothesis is right, the layer-0 attention bisection has hit the natural noise floor of Q4K-vs-FP16 comparison. The actual `apr run` quality issue may be: (a) Further downstream — accumulating drift through 28 layers (b) NOT a forward-pass bug at all — could be sampling/decoding config (c) Q4K kernel-specific — `apr run` uses Q4K kernels (faster path) while `forward_traced` uses F32 dequant (more accurate path); the two might diverge in how the kernel handles edge cases ## Next narrowing tests 1. Run `apr trace --save-tensor` on the FP16 safetensors version of the teacher; if cos improves to >0.999 across all stages, confirms (a)/(c). 2. Multi-layer cosine sweep (layers 0/1/13/27) to characterize drift growth. 3. argmax-flip check on lm_head — if APR top-1 token matches HF top-1, the drift is "noise" not bug-relevant. Evidence: `evidence/ship-007-layer-0-oracle-bisection-2026-05-03/findings-v3-attention-code-audit.md` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v4 — DECISIVE PIVOT, bug pinpointed to GPU execution path Two falsifying live tests run on canonical 7B teacher reframe SHIP-007 fundamentally: ## Test 1: lm_head argmax MATCHES APR forward_traced top-1 token: 220 (' ') HF FP16 top-1 token: 220 (' ') First 3 top-5 ranks identical: [220, 576, 2014] → The cos=0.998 forward divergence in v1/v2/v3 is NOT bug-relevant for greedy decoding. It's just systematic precision noise. ## Test 2: `apr run --temperature 0.0` produces gibberish $ apr run qwen2.5-coder-7b-instruct-q4k.apr --prompt 'What is 2+2?' \ --max-tokens 16 --temperature 0.0 ampiezza = 0.5 diametro = 10 → Italian-looking gibberish, NOT '4', NOT a coherent answer. ## Test 3: even `--max-tokens 1` disagrees with forward_traced $ apr run [...] --max-tokens 1 --temperature 0.0 ampie → Single-step apr run produces different first token than forward_traced (which argmaxed to 220 ' '). ## The pivot The SHIP-007 bug is NOT in the forward pass instrumented by `forward_traced`. It's in the `apr run` GPU/wgpu hybrid execution path: | Path | Backend | Weights | Output for "What is 2+2?" | |------------------|--------------------------------|---------|---------------------------| | forward_traced | CPU scalar-loop matmul | F32 | argmax=220 (' ', matches HF) | | apr run | CUDA graph (646 kernels) + wgpu | F32 | "ampie..." (gibberish) | Both paths use the same F32 weights (apr run dequantizes Q4K to F32 before GPU upload, per PMAT-333 log line). The divergence is in **kernel implementations** — CPU scalar loops vs CUDA/wgpu kernels. ## All previous findings invalidated - v1 "bug is in attention block" — INVALID (was just Q4K precision noise) - v2 "bug is between qkv_bias and attention" — INVALID (same) - v3 "no algebraic bug, must be precision" — PARTIALLY CORRECT (forward_traced IS correct), but missed that the actual broken path is `apr run` GPU. The forward_traced bisection chain (cos drops at attention) is now understood as a RED HERRING — it captures a different code path than the buggy one. ## Next narrowing 1. Force `apr run` to use CPU (env var or feature flag) — does it match forward_traced? If yes, confirms GPU/wgpu parity bug. 2. Element-wise diff GPU layer-0 attention output vs CPU forward_traced. 3. Audit `realizar/src/quantize/fused_*` and CUDA graph kernels for forward-pass bugs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 3, 2026 07:36

Merge branch 'main' into feat/apr-diff-values-aprt-stage-tensor-v1

d9dde9f

noahgift mentioned this pull request May 3, 2026

feat(aprender-serve): SHIP-007 PR-C-real step 2 — LmHead capture in forward_traced wrapper #1414

Merged

3 tasks

noahgift merged commit e9294fa into main May 3, 2026
10 checks passed

noahgift deleted the feat/apr-diff-values-aprt-stage-tensor-v1 branch May 3, 2026 08:11

noahgift mentioned this pull request May 3, 2026

contract(apr-cli-trace-save-tensor-v1): v1.1.0 → v1.2.0 — FALSIFY-010 records LmHead step-2 PARTIAL #1415

Merged

2 tasks

noahgift mentioned this pull request May 3, 2026

feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference + empirical bug location #1423

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors#1413

feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors#1413
noahgift merged 2 commits into
mainfrom
feat/apr-diff-values-aprt-stage-tensor-v1

noahgift commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 3, 2026

Summary

Why now

What changed

Test plan

Five Whys

Ship % update

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant