feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced by noahgift · Pull Request #1421 · paiml/aprender

noahgift · 2026-05-03T11:15:28Z

Summary

SHIP-007 PR-C-real step 3 — threads Option<&SaveTensorPlan> through forward_traced itself for all 16 capture points (14 per-layer + 2 whole-model) in a single forward pass.

Stacked on #1416 (which ships the maybe_save_stage helper).

Live smoke on canonical 7B teacher (RTX 4090 lambda-labs)

apr trace <model.apr> --save-tensor "embedding,attn_norm,qkv_matmul,qkv_bias,\
attention,attn_out,post_attn_residual,ffn_norm,ffn_gate,ffn_up,ffn_silu,\
ffn_swigl,ffn_out,post_ffn_residual,final_norm,lm_head" --save-tensor-dir /tmp/...

Wrote 16 stage tensor file(s):

14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual
2 whole-model (root/*): final_norm, lm_head

Buffer sizes match Qwen2.5-Coder-7B config:

100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header
530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates)
129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim)
608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head)

What changed

AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>) is the new private impl that threads the plan through every natural capture point.
forward_traced(tokens) becomes a 1-line wrapper that calls it with None (zero-overhead — maybe_save_stage early-returns on None).
forward_traced_with_save_tensor is now a pure delegator: no more double-embed or post-loop re-emission of LmHead.
All 6 forward_traced regression tests pass. All 4 traced_save_tensor_step2_tests pass.

Five Whys

Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per apr-cli-trace-save-tensor-v1.yaml.
Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 apr run divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points.
Why single-pass via Option<&SaveTensorPlan>? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY.
Why Option<&SaveTensorPlan> (optional)? Existing forward_traced(tokens) is called from many test sites and the trait TracedForward. Optional preserves the public API.
Why bare emit(Stage, layer, &buf)? calls? Each capture is a one-liner; helpers would mask the natural buffer-name → stage-name pairing and make audit harder.

Test plan

cargo build -p aprender-serve --release --features cuda — clean
cargo test -p aprender-serve --lib forward_traced — 6 passing
cargo test -p aprender-serve --lib traced_save_tensor — 4 passing
Live smoke on 7B teacher (16/16 stages captured, byte sizes verified)
CI green (workspace-test + ci/gate)
Auto-merge on green (after refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading #1416 lands)

🤖 Generated with Claude Code

…rPlan threading through forward_traced End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rPlan threading through forward_traced (#1421) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…7 forward_traced threading (#1416) * refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading Centralizes the boilerplate that `forward_traced_with_save_tensor` inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007 PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through `AprTransformer::forward_traced` itself, every per-layer stage will call this single function instead of repeating the `should_save → ensure_layer_dir → File::create → write_tensor_file → flush` chain at each capture point. ## What changed - New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`: - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage, layer, values) -> io::Result<()>` — the gated entry point. Cheap no-op when plan is None or stage/layer not selected. Forwards to `write_stage_file` when the gate passes. - `pub fn write_stage_file(output_dir, stage, layer, values) -> io::Result<()>` — the unconditional write, exposed separately for tests and any future callers that have already gated. - 7 unit tests pinning: None=no-op, unselected-stage=no-op, per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes out-of-range, NaN-bit-preserving f32 LE round-trip, missing parent dirs auto-created. - `crates/aprender-serve/src/inference_trace/mod.rs`: register the new module. - `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`: - Replace 60-line Embedding+LmHead inline blocks with two calls to `maybe_save_stage`. Net 50-line shrink. - Wrapper's behavior is byte-identical: same API surface, same file layout, same NaN preservation. Existing 4 `traced_save_tensor_step2_tests` tests still PASS. ## Five Whys 1. **Why now?** PR #1414 (step 2) merged earlier today landed the second copy of the inline block. Pre-step-3 is the right time to factor — before 15 more capture points get added inside the 360-line `forward_traced` body. 2. **Why a new module instead of inlining in `apr_transformer/`?** The helper has zero coupling to `AprTransformer` (it takes a plan + stage + values). Living next to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches the existing `inference_trace::save_tensor_*` family pattern. 3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?** Step 3 will thread this through `forward_traced`, which is also called from non-instrumented contexts (HTTP serving, training evals). The `Option` lets a single `forward_traced` body serve both — `maybe_save_stage(None, ...)` is a single discriminant compare in hot paths. 4. **Why expose `write_stage_file` separately from `maybe_save_stage`?** Test ergonomics (the unit tests need to verify the unconditional write path, not just the gated path) and forward-compatibility for a future `forward_traced_inner` that does its own `should_save` filtering inside a hot loop and wants to skip the option indirection. 5. **Why no contract bump in this PR?** This is a pure refactor — no behavior change, no new invariants. The existing `byte_format`, `determinism`, and `apr_diff_values_compat` invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through exactly one function instead of two now, which makes future contract obligations easier to satisfy. PR #1415 (already in flight) bumps the contract for step 2; step 3 will bump again to v1.3.0. ## Test plan - [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` → 7/7 PASS - [x] `cargo test -p aprender-serve --lib traced_save_tensor_step2_tests` → 4/4 PASS (existing tests unchanged behavior verified) - [x] `cargo test -p aprender-serve --test save_tensor_plan_integration` → 6/6 PASS (contract-level integration unchanged) - [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D warnings` clean ## Ship % update - MODEL-1: ~68% (unchanged — pure refactor; no new capture surface). Step 3 (per-layer threading inside `forward_traced`) is now a trivial follow-up: each capture point becomes one line: `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx, &gate)?;`. - MODEL-2: corpus tokenization still running (~83 min elapsed, 46.5M tokens, ~33h ETA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced (#1421) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…discharge for FALSIFY-009/010/011 End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…discharge for FALSIFY-009/010/011 (#1422) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

@v

…+ empirical bug location (#1423) * contract(apr-cli-trace-save-tensor-v1): v1.3.0 → v1.4.0 — FUNCTIONAL discharge for FALSIFY-009/010/011 End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference stage generator Authors `scripts/generate_qwen25_coder_fp16_stages.py` — a Python tool that runs `Qwen/Qwen2.5-Coder-7B-Instruct` at FP16 with forward hooks attached to each natural per-layer module and dumps the activations in the same APRT byte format that `apr trace --save-tensor` produces. Output layout mirrors the APR side (`layer-0/<stage>.bin` + `<stage>.bin`) so `apr diff --values <apr>.bin <hf>.bin` works element-wise without any rewriting. Captured 13/16 stages directly: - Per-layer (11): embedding, attn_norm, attn_out, post_ffn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out - Whole-model (2): final_norm, lm_head Skipped 3/16 (qkv_matmul / qkv_bias / attention) — these need deeper instrumentation since HF stores Q/K/V separately + RoPE is internal to self_attn. Deferred to a follow-up; the 13 captured stages already cover all major points along the forward pass. Five Whys 1. Why need an HF FP16 reference? SHIP-007 layer-0 element-wise diff needs a ground-truth oracle to compare APR Q4K against; FP16 is the closest published reference for this model. 2. Why not just use the existing `qwen2.5-coder-7b-instruct-q4k.safetensors` on disk? That's the same Q4K data we already feed into APR — diffing it against APR would only catch APR-side bugs that change weights, not bugs in forward computation. We need an INDEPENDENT reference. 3. Why hooks instead of direct model code edits? HF's modeling_qwen2.py is auto-loaded via `trust_remote_code=True`; the hooks let us inspect every stage without forking HF's source. 4. Why APRT byte format (not torch.save)? `apr diff --values` already recognizes APRT files (PR #1413) — using the same format makes the diff a one-liner. Drift-prevention: same format on both sides keeps comparison shape-agnostic. 5. Why skip qkv_matmul/qkv_bias/attention now? Discharging the discoverable 13 stages is high-leverage; the remaining 3 require manual q+k+v concatenation and Q@Kᵀ@v re-derivation. Worth a follow-up PR but blocking on it would delay every other stage's bisection signal. Note: This script is NOT auto-run in CI — it requires HF cache containing `Qwen/Qwen2.5-Coder-7B-Instruct` (~15 GB). Confirmed already cached at ~/.cache/huggingface/hub/ on noah-Lambda-Vector 2026-05-03. Operator runs it once via `uv run --with torch --with transformers` to produce the fixture; downstream `apr diff` passes are deterministic byte comparisons. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): layer-0 oracle bisection — attn_out is the first diverging stage End-to-end empirical bisection on canonical Qwen2.5-Coder-7B-Instruct teacher with HF FP16 ground truth (CPU forward, HF cache hit). Element-wise diff every shared layer-0 stage between APR Q4K and HF FP16: | Stage | Cosine sim | Status | |-------------------|---------------|--------------------------------------------| | embedding | 1.0000000000 | bit-identical (correct) | | attn_norm | 0.9999999483 | within Q4K noise (correct) | | **attn_out** | **0.9966403** | **FIRST DROP — bug is in attention block** | | ffn_* (downstream)| 0.996-0.999 | carries drift (downstream artifacts) | | final_norm | 0.9932669898 | (whole-model — accumulates 28 layers) | | lm_head | 0.9969170161 | (whole-model — last-token logits) | This narrows the SHIP-007 root cause to the layer-0 attention block, specifically between RMSNorm output (cos=0.99999995, correct) and post-O-proj attention output (cos=0.9966, wrong). Possible bug sites within the block: 1. qkv_matmul (Q4K matmul × QKV weights) — needs HF-side capture 2. qkv_bias 3. RoPE on Q/K 4. Q@Kᵀ scaled-dot-product 5. Softmax with causal mask 6. softmax @ V 7. O-projection (Q4K matmul × O-proj weight) Next milestone: extend `scripts/generate_qwen25_coder_fp16_stages.py` with qkv_matmul / qkv_bias / attention capture (currently deferred to PARTIAL coverage), re-run diff, pinpoint the divergent kernel. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(scripts): SHIP-007 v2 — qkv_matmul/qkv_bias/attention captures narrow bug to attention math (#1424) Extends `generate_qwen25_coder_fp16_stages.py` with HF-side captures for the 3 stages previously deferred. Refines the SHIP-007 layer-0 bisection from "inside attention block" (v1) to "between qkv_bias and attention" (v2). ## Refined cosine table | Stage | Cosine | Δ from prev | Status | |---------------|-----------|-------------|-----------------------| | attn_norm | 0.9999999 | -5e-8 | RMSNorm correct | | qkv_matmul | 0.99969 | -3.1e-4 | Q4K matmul noise (OK) | | qkv_bias | 0.9999975 | +2.8e-4 | bias dampens | | **attention** | 0.99858 | **-1.4e-3** | **← bug is here** | | attn_out | 0.99664 | -1.9e-3 | O-proj amplifies | Bug is **between qkv_bias and attention** = inside the attention math: RoPE / Q@Kᵀ / scale / causal mask / softmax / V@. NOT in QKV matmul (acceptable Q4K noise). NOT in QKV bias add (within FP precision). O-projection adds its own ~1.9e-3 cosine drop — secondary. ## Implementation New HF-side hooks: - `make_qkv_hook` on q_proj/k_proj/v_proj — concat outputs to derive qkv_bias (post-bias) and qkv_matmul (post-bias minus per-Linear bias) - `hook_o_proj_pre` (forward_pre_hook) on o_proj — captures its INPUT, which is APR's "attention" stage (post softmax(Q@Kᵀ)@v, pre-O-proj) Script now produces 15 stage files (was 12 in v1). ## Why qkv_matmul cos=0.99969 < qkv_bias cos=0.9999975 Mathematical artifact, not a bug: - qkv_matmul = Q4K_matmul(Q4K_input × Q4K_weight) — has ~3e-4 cosine noise vs FP16 - qkv_bias = qkv_matmul + bias (deterministic FP16 bias vector) - Adding deterministic vector dominates direction → relative noise dampens - Both APR and HF add the same bias values → cos increases on both sides equally Confirmed via: bias subtraction matches (HF - bias ≈ APR pre-bias on each side). ## Five Whys 1. Why need qkv stage captures? v1 only narrowed bug to "attention block" — not enough to drive a fix. We need to know if the bug is in the projections or the attention math. 2. Why is qkv_matmul cos lower than qkv_bias? See above — bias addition is a known mathematical artifact with deterministic vectors. 3. Why is the bug between qkv_bias and attention specifically? Cos=0.9999975 → 0.99858 is a 70× factor, far above Q4K floor. The intermediate ops (RoPE, scale, softmax, mask, V@) introduce real divergence. 4. Why O-proj adds another 1.9e-3 drop? Q4K_matmul of attention × O-proj weight is the same Q4K-vs-FP16 floor as qkv_matmul. Acceptable. 5. Why narrow further to RoPE/scale/softmax/mask/V@? Each is a candidate. Without finer-grained captures inside HF's monolithic Qwen2Attention, v2 cannot bisect further. Future work: instrument HF's attention internals OR cross-reference candle/pytorch for the algebraic spec of each sub-op. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v3+v4 — APR attention audit + DECISIVE PIVOT to GPU execution path (#1425) * evidence(ship-007): v3 — APR attention code audit vs HF Qwen2 reference Cross-referenced APR's attention forward (`inference.rs` + `helpers.rs`) against HF Transformers Qwen2 to identify the algebraic source of the v2-measured 1.4e-3 cosine drop between qkv_bias and attention. ## Audit result: NO algebraic bug in APR attention Verified MATCHES vs HF Qwen2: - RoPE rotation formula (split-half, x[i]=x1·cos − x2·sin / x[i+½d]=x1·sin + x2·cos) - RoPE freq formula (1/theta^(2i/d)) - rope_theta value (1000000.0 from `metadata.rope_theta`) - Attention scale (1/sqrt(head_dim)) - Causal mask (`for j in 0..=i` triangular) - Softmax (f32 max-subtract) - QKV bias position (post-matmul, pre-RoPE) - GQA-7:1 head indexing (`kv_head = head/group_size`) ## Refined hypothesis The 1.4e-3 cosine drop is most likely **systematic precision loss from Q4K dequant compounding through attention math**, NOT a structural algorithmic bug. Specifically: 1. APR's `forward_traced` uses F32 dequantized Q4K weights (per `inference.rs:38` comment "Q4K layers not used in traced forward"). 2. The Q4K dequant is lossy (~1e-3 RMS per element). 3. When these slightly-off Q values are dotted against slightly-off K values (also from Q4K dequant), the product compounds the error. 4. This compounding produces cos=0.99858 at attention output — consistent with systematic precision loss, not a bug. ## Implication for SHIP-007 fix If this hypothesis is right, the layer-0 attention bisection has hit the natural noise floor of Q4K-vs-FP16 comparison. The actual `apr run` quality issue may be: (a) Further downstream — accumulating drift through 28 layers (b) NOT a forward-pass bug at all — could be sampling/decoding config (c) Q4K kernel-specific — `apr run` uses Q4K kernels (faster path) while `forward_traced` uses F32 dequant (more accurate path); the two might diverge in how the kernel handles edge cases ## Next narrowing tests 1. Run `apr trace --save-tensor` on the FP16 safetensors version of the teacher; if cos improves to >0.999 across all stages, confirms (a)/(c). 2. Multi-layer cosine sweep (layers 0/1/13/27) to characterize drift growth. 3. argmax-flip check on lm_head — if APR top-1 token matches HF top-1, the drift is "noise" not bug-relevant. Evidence: `evidence/ship-007-layer-0-oracle-bisection-2026-05-03/findings-v3-attention-code-audit.md` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v4 — DECISIVE PIVOT, bug pinpointed to GPU execution path Two falsifying live tests run on canonical 7B teacher reframe SHIP-007 fundamentally: ## Test 1: lm_head argmax MATCHES APR forward_traced top-1 token: 220 (' ') HF FP16 top-1 token: 220 (' ') First 3 top-5 ranks identical: [220, 576, 2014] → The cos=0.998 forward divergence in v1/v2/v3 is NOT bug-relevant for greedy decoding. It's just systematic precision noise. ## Test 2: `apr run --temperature 0.0` produces gibberish $ apr run qwen2.5-coder-7b-instruct-q4k.apr --prompt 'What is 2+2?' \ --max-tokens 16 --temperature 0.0 ampiezza = 0.5 diametro = 10 → Italian-looking gibberish, NOT '4', NOT a coherent answer. ## Test 3: even `--max-tokens 1` disagrees with forward_traced $ apr run [...] --max-tokens 1 --temperature 0.0 ampie → Single-step apr run produces different first token than forward_traced (which argmaxed to 220 ' '). ## The pivot The SHIP-007 bug is NOT in the forward pass instrumented by `forward_traced`. It's in the `apr run` GPU/wgpu hybrid execution path: | Path | Backend | Weights | Output for "What is 2+2?" | |------------------|--------------------------------|---------|---------------------------| | forward_traced | CPU scalar-loop matmul | F32 | argmax=220 (' ', matches HF) | | apr run | CUDA graph (646 kernels) + wgpu | F32 | "ampie..." (gibberish) | Both paths use the same F32 weights (apr run dequantizes Q4K to F32 before GPU upload, per PMAT-333 log line). The divergence is in **kernel implementations** — CPU scalar loops vs CUDA/wgpu kernels. ## All previous findings invalidated - v1 "bug is in attention block" — INVALID (was just Q4K precision noise) - v2 "bug is between qkv_bias and attention" — INVALID (same) - v3 "no algebraic bug, must be precision" — PARTIALLY CORRECT (forward_traced IS correct), but missed that the actual broken path is `apr run` GPU. The forward_traced bisection chain (cos drops at attention) is now understood as a RED HERRING — it captures a different code path than the buggy one. ## Next narrowing 1. Force `apr run` to use CPU (env var or feature flag) — does it match forward_traced? If yes, confirms GPU/wgpu parity bug. 2. Element-wise diff GPU layer-0 attention output vs CPU forward_traced. 3. Audit `realizar/src/quantize/fused_*` and CUDA graph kernels for forward-pass bugs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift merged commit 7b9c6bc into refactor/inference-trace-maybe-save-stage May 3, 2026
1 check passed

noahgift deleted the feat/ship-007-step3-per-layer-threading branch May 3, 2026 11:15

noahgift mentioned this pull request May 3, 2026

contract(apr-cli-trace-save-tensor-v1): v1.3.0 → v1.4.0 — FUNCTIONAL discharge for FALSIFY-009/010/011 #1422

Merged

4 tasks

noahgift mentioned this pull request May 3, 2026

feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference + empirical bug location #1423

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced#1421

feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced#1421
noahgift merged 1 commit into
refactor/inference-trace-maybe-save-stagefrom
feat/ship-007-step3-per-layer-threading

noahgift commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 3, 2026

Summary

Live smoke on canonical 7B teacher (RTX 4090 lambda-labs)

What changed

Five Whys

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant