feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced#1421
Merged
noahgift merged 1 commit intoMay 3, 2026
Conversation
…rPlan threading through forward_traced End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7b9c6bc
into
refactor/inference-trace-maybe-save-stage
1 check passed
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…rPlan threading through forward_traced (#1421) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
4 tasks
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…7 forward_traced threading (#1416) * refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading Centralizes the boilerplate that `forward_traced_with_save_tensor` inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007 PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through `AprTransformer::forward_traced` itself, every per-layer stage will call this single function instead of repeating the `should_save → ensure_layer_dir → File::create → write_tensor_file → flush` chain at each capture point. ## What changed - New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`: - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage, layer, values) -> io::Result<()>` — the gated entry point. Cheap no-op when plan is None or stage/layer not selected. Forwards to `write_stage_file` when the gate passes. - `pub fn write_stage_file(output_dir, stage, layer, values) -> io::Result<()>` — the unconditional write, exposed separately for tests and any future callers that have already gated. - 7 unit tests pinning: None=no-op, unselected-stage=no-op, per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes out-of-range, NaN-bit-preserving f32 LE round-trip, missing parent dirs auto-created. - `crates/aprender-serve/src/inference_trace/mod.rs`: register the new module. - `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`: - Replace 60-line Embedding+LmHead inline blocks with two calls to `maybe_save_stage`. Net 50-line shrink. - Wrapper's behavior is byte-identical: same API surface, same file layout, same NaN preservation. Existing 4 `traced_save_tensor_step2_tests` tests still PASS. ## Five Whys 1. **Why now?** PR #1414 (step 2) merged earlier today landed the second copy of the inline block. Pre-step-3 is the right time to factor — before 15 more capture points get added inside the 360-line `forward_traced` body. 2. **Why a new module instead of inlining in `apr_transformer/`?** The helper has zero coupling to `AprTransformer` (it takes a plan + stage + values). Living next to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches the existing `inference_trace::save_tensor_*` family pattern. 3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?** Step 3 will thread this through `forward_traced`, which is also called from non-instrumented contexts (HTTP serving, training evals). The `Option` lets a single `forward_traced` body serve both — `maybe_save_stage(None, ...)` is a single discriminant compare in hot paths. 4. **Why expose `write_stage_file` separately from `maybe_save_stage`?** Test ergonomics (the unit tests need to verify the unconditional write path, not just the gated path) and forward-compatibility for a future `forward_traced_inner` that does its own `should_save` filtering inside a hot loop and wants to skip the option indirection. 5. **Why no contract bump in this PR?** This is a pure refactor — no behavior change, no new invariants. The existing `byte_format`, `determinism`, and `apr_diff_values_compat` invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through exactly one function instead of two now, which makes future contract obligations easier to satisfy. PR #1415 (already in flight) bumps the contract for step 2; step 3 will bump again to v1.3.0. ## Test plan - [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` → 7/7 PASS - [x] `cargo test -p aprender-serve --lib traced_save_tensor_step2_tests` → 4/4 PASS (existing tests unchanged behavior verified) - [x] `cargo test -p aprender-serve --test save_tensor_plan_integration` → 6/6 PASS (contract-level integration unchanged) - [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D warnings` clean ## Ship % update - MODEL-1: ~68% (unchanged — pure refactor; no new capture surface). Step 3 (per-layer threading inside `forward_traced`) is now a trivial follow-up: each capture point becomes one line: `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx, &gate)?;`. - MODEL-2: corpus tokenization still running (~83 min elapsed, 46.5M tokens, ~33h ETA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced (#1421) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…discharge for FALSIFY-009/010/011 End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Merged
6 tasks
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…discharge for FALSIFY-009/010/011 (#1422) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…+ empirical bug location (#1423) * contract(apr-cli-trace-save-tensor-v1): v1.3.0 → v1.4.0 — FUNCTIONAL discharge for FALSIFY-009/010/011 End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference stage generator Authors `scripts/generate_qwen25_coder_fp16_stages.py` — a Python tool that runs `Qwen/Qwen2.5-Coder-7B-Instruct` at FP16 with forward hooks attached to each natural per-layer module and dumps the activations in the same APRT byte format that `apr trace --save-tensor` produces. Output layout mirrors the APR side (`layer-0/<stage>.bin` + `<stage>.bin`) so `apr diff --values <apr>.bin <hf>.bin` works element-wise without any rewriting. Captured 13/16 stages directly: - Per-layer (11): embedding, attn_norm, attn_out, post_ffn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out - Whole-model (2): final_norm, lm_head Skipped 3/16 (qkv_matmul / qkv_bias / attention) — these need deeper instrumentation since HF stores Q/K/V separately + RoPE is internal to self_attn. Deferred to a follow-up; the 13 captured stages already cover all major points along the forward pass. Five Whys 1. Why need an HF FP16 reference? SHIP-007 layer-0 element-wise diff needs a ground-truth oracle to compare APR Q4K against; FP16 is the closest published reference for this model. 2. Why not just use the existing `qwen2.5-coder-7b-instruct-q4k.safetensors` on disk? That's the same Q4K data we already feed into APR — diffing it against APR would only catch APR-side bugs that change weights, not bugs in forward computation. We need an INDEPENDENT reference. 3. Why hooks instead of direct model code edits? HF's modeling_qwen2.py is auto-loaded via `trust_remote_code=True`; the hooks let us inspect every stage without forking HF's source. 4. Why APRT byte format (not torch.save)? `apr diff --values` already recognizes APRT files (PR #1413) — using the same format makes the diff a one-liner. Drift-prevention: same format on both sides keeps comparison shape-agnostic. 5. Why skip qkv_matmul/qkv_bias/attention now? Discharging the discoverable 13 stages is high-leverage; the remaining 3 require manual q+k+v concatenation and Q@Kᵀ@v re-derivation. Worth a follow-up PR but blocking on it would delay every other stage's bisection signal. Note: This script is NOT auto-run in CI — it requires HF cache containing `Qwen/Qwen2.5-Coder-7B-Instruct` (~15 GB). Confirmed already cached at ~/.cache/huggingface/hub/ on noah-Lambda-Vector 2026-05-03. Operator runs it once via `uv run --with torch --with transformers` to produce the fixture; downstream `apr diff` passes are deterministic byte comparisons. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): layer-0 oracle bisection — attn_out is the first diverging stage End-to-end empirical bisection on canonical Qwen2.5-Coder-7B-Instruct teacher with HF FP16 ground truth (CPU forward, HF cache hit). Element-wise diff every shared layer-0 stage between APR Q4K and HF FP16: | Stage | Cosine sim | Status | |-------------------|---------------|--------------------------------------------| | embedding | 1.0000000000 | bit-identical (correct) | | attn_norm | 0.9999999483 | within Q4K noise (correct) | | **attn_out** | **0.9966403** | **FIRST DROP — bug is in attention block** | | ffn_* (downstream)| 0.996-0.999 | carries drift (downstream artifacts) | | final_norm | 0.9932669898 | (whole-model — accumulates 28 layers) | | lm_head | 0.9969170161 | (whole-model — last-token logits) | This narrows the SHIP-007 root cause to the layer-0 attention block, specifically between RMSNorm output (cos=0.99999995, correct) and post-O-proj attention output (cos=0.9966, wrong). Possible bug sites within the block: 1. qkv_matmul (Q4K matmul × QKV weights) — needs HF-side capture 2. qkv_bias 3. RoPE on Q/K 4. Q@Kᵀ scaled-dot-product 5. Softmax with causal mask 6. softmax @ V 7. O-projection (Q4K matmul × O-proj weight) Next milestone: extend `scripts/generate_qwen25_coder_fp16_stages.py` with qkv_matmul / qkv_bias / attention capture (currently deferred to PARTIAL coverage), re-run diff, pinpoint the divergent kernel. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(scripts): SHIP-007 v2 — qkv_matmul/qkv_bias/attention captures narrow bug to attention math (#1424) Extends `generate_qwen25_coder_fp16_stages.py` with HF-side captures for the 3 stages previously deferred. Refines the SHIP-007 layer-0 bisection from "inside attention block" (v1) to "between qkv_bias and attention" (v2). ## Refined cosine table | Stage | Cosine | Δ from prev | Status | |---------------|-----------|-------------|-----------------------| | attn_norm | 0.9999999 | -5e-8 | RMSNorm correct | | qkv_matmul | 0.99969 | -3.1e-4 | Q4K matmul noise (OK) | | qkv_bias | 0.9999975 | +2.8e-4 | bias dampens | | **attention** | 0.99858 | **-1.4e-3** | **← bug is here** | | attn_out | 0.99664 | -1.9e-3 | O-proj amplifies | Bug is **between qkv_bias and attention** = inside the attention math: RoPE / Q@Kᵀ / scale / causal mask / softmax / V@. NOT in QKV matmul (acceptable Q4K noise). NOT in QKV bias add (within FP precision). O-projection adds its own ~1.9e-3 cosine drop — secondary. ## Implementation New HF-side hooks: - `make_qkv_hook` on q_proj/k_proj/v_proj — concat outputs to derive qkv_bias (post-bias) and qkv_matmul (post-bias minus per-Linear bias) - `hook_o_proj_pre` (forward_pre_hook) on o_proj — captures its INPUT, which is APR's "attention" stage (post softmax(Q@Kᵀ)@v, pre-O-proj) Script now produces 15 stage files (was 12 in v1). ## Why qkv_matmul cos=0.99969 < qkv_bias cos=0.9999975 Mathematical artifact, not a bug: - qkv_matmul = Q4K_matmul(Q4K_input × Q4K_weight) — has ~3e-4 cosine noise vs FP16 - qkv_bias = qkv_matmul + bias (deterministic FP16 bias vector) - Adding deterministic vector dominates direction → relative noise dampens - Both APR and HF add the same bias values → cos increases on both sides equally Confirmed via: bias subtraction matches (HF - bias ≈ APR pre-bias on each side). ## Five Whys 1. Why need qkv stage captures? v1 only narrowed bug to "attention block" — not enough to drive a fix. We need to know if the bug is in the projections or the attention math. 2. Why is qkv_matmul cos lower than qkv_bias? See above — bias addition is a known mathematical artifact with deterministic vectors. 3. Why is the bug between qkv_bias and attention specifically? Cos=0.9999975 → 0.99858 is a 70× factor, far above Q4K floor. The intermediate ops (RoPE, scale, softmax, mask, V@) introduce real divergence. 4. Why O-proj adds another 1.9e-3 drop? Q4K_matmul of attention × O-proj weight is the same Q4K-vs-FP16 floor as qkv_matmul. Acceptable. 5. Why narrow further to RoPE/scale/softmax/mask/V@? Each is a candidate. Without finer-grained captures inside HF's monolithic Qwen2Attention, v2 cannot bisect further. Future work: instrument HF's attention internals OR cross-reference candle/pytorch for the algebraic spec of each sub-op. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v3+v4 — APR attention audit + DECISIVE PIVOT to GPU execution path (#1425) * evidence(ship-007): v3 — APR attention code audit vs HF Qwen2 reference Cross-referenced APR's attention forward (`inference.rs` + `helpers.rs`) against HF Transformers Qwen2 to identify the algebraic source of the v2-measured 1.4e-3 cosine drop between qkv_bias and attention. ## Audit result: NO algebraic bug in APR attention Verified MATCHES vs HF Qwen2: - RoPE rotation formula (split-half, x[i]=x1·cos − x2·sin / x[i+½d]=x1·sin + x2·cos) - RoPE freq formula (1/theta^(2i/d)) - rope_theta value (1000000.0 from `metadata.rope_theta`) - Attention scale (1/sqrt(head_dim)) - Causal mask (`for j in 0..=i` triangular) - Softmax (f32 max-subtract) - QKV bias position (post-matmul, pre-RoPE) - GQA-7:1 head indexing (`kv_head = head/group_size`) ## Refined hypothesis The 1.4e-3 cosine drop is most likely **systematic precision loss from Q4K dequant compounding through attention math**, NOT a structural algorithmic bug. Specifically: 1. APR's `forward_traced` uses F32 dequantized Q4K weights (per `inference.rs:38` comment "Q4K layers not used in traced forward"). 2. The Q4K dequant is lossy (~1e-3 RMS per element). 3. When these slightly-off Q values are dotted against slightly-off K values (also from Q4K dequant), the product compounds the error. 4. This compounding produces cos=0.99858 at attention output — consistent with systematic precision loss, not a bug. ## Implication for SHIP-007 fix If this hypothesis is right, the layer-0 attention bisection has hit the natural noise floor of Q4K-vs-FP16 comparison. The actual `apr run` quality issue may be: (a) Further downstream — accumulating drift through 28 layers (b) NOT a forward-pass bug at all — could be sampling/decoding config (c) Q4K kernel-specific — `apr run` uses Q4K kernels (faster path) while `forward_traced` uses F32 dequant (more accurate path); the two might diverge in how the kernel handles edge cases ## Next narrowing tests 1. Run `apr trace --save-tensor` on the FP16 safetensors version of the teacher; if cos improves to >0.999 across all stages, confirms (a)/(c). 2. Multi-layer cosine sweep (layers 0/1/13/27) to characterize drift growth. 3. argmax-flip check on lm_head — if APR top-1 token matches HF top-1, the drift is "noise" not bug-relevant. Evidence: `evidence/ship-007-layer-0-oracle-bisection-2026-05-03/findings-v3-attention-code-audit.md` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v4 — DECISIVE PIVOT, bug pinpointed to GPU execution path Two falsifying live tests run on canonical 7B teacher reframe SHIP-007 fundamentally: ## Test 1: lm_head argmax MATCHES APR forward_traced top-1 token: 220 (' ') HF FP16 top-1 token: 220 (' ') First 3 top-5 ranks identical: [220, 576, 2014] → The cos=0.998 forward divergence in v1/v2/v3 is NOT bug-relevant for greedy decoding. It's just systematic precision noise. ## Test 2: `apr run --temperature 0.0` produces gibberish $ apr run qwen2.5-coder-7b-instruct-q4k.apr --prompt 'What is 2+2?' \ --max-tokens 16 --temperature 0.0 ampiezza = 0.5 diametro = 10 → Italian-looking gibberish, NOT '4', NOT a coherent answer. ## Test 3: even `--max-tokens 1` disagrees with forward_traced $ apr run [...] --max-tokens 1 --temperature 0.0 ampie → Single-step apr run produces different first token than forward_traced (which argmaxed to 220 ' '). ## The pivot The SHIP-007 bug is NOT in the forward pass instrumented by `forward_traced`. It's in the `apr run` GPU/wgpu hybrid execution path: | Path | Backend | Weights | Output for "What is 2+2?" | |------------------|--------------------------------|---------|---------------------------| | forward_traced | CPU scalar-loop matmul | F32 | argmax=220 (' ', matches HF) | | apr run | CUDA graph (646 kernels) + wgpu | F32 | "ampie..." (gibberish) | Both paths use the same F32 weights (apr run dequantizes Q4K to F32 before GPU upload, per PMAT-333 log line). The divergence is in **kernel implementations** — CPU scalar loops vs CUDA/wgpu kernels. ## All previous findings invalidated - v1 "bug is in attention block" — INVALID (was just Q4K precision noise) - v2 "bug is between qkv_bias and attention" — INVALID (same) - v3 "no algebraic bug, must be precision" — PARTIALLY CORRECT (forward_traced IS correct), but missed that the actual broken path is `apr run` GPU. The forward_traced bisection chain (cos drops at attention) is now understood as a RED HERRING — it captures a different code path than the buggy one. ## Next narrowing 1. Force `apr run` to use CPU (env var or feature flag) — does it match forward_traced? If yes, confirms GPU/wgpu parity bug. 2. Element-wise diff GPU layer-0 attention output vs CPU forward_traced. 3. Audit `realizar/src/quantize/fused_*` and CUDA graph kernels for forward-pass bugs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SHIP-007 PR-C-real step 3 — threads
Option<&SaveTensorPlan>throughforward_traceditself for all 16 capture points (14 per-layer + 2 whole-model) in a single forward pass.Stacked on #1416 (which ships the
maybe_save_stagehelper).Live smoke on canonical 7B teacher (RTX 4090 lambda-labs)
Wrote 16 stage tensor file(s):
Buffer sizes match Qwen2.5-Coder-7B config:
What changed
AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)is the new private impl that threads the plan through every natural capture point.forward_traced(tokens)becomes a 1-line wrapper that calls it withNone(zero-overhead —maybe_save_stageearly-returns onNone).forward_traced_with_save_tensoris now a pure delegator: no more double-embed or post-loop re-emission of LmHead.forward_tracedregression tests pass. All 4traced_save_tensor_step2_testspass.Five Whys
apr-cli-trace-save-tensor-v1.yaml.apr rundivergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points.Option<&SaveTensorPlan>? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY.Option<&SaveTensorPlan>(optional)? Existingforward_traced(tokens)is called from many test sites and the traitTracedForward. Optional preserves the public API.emit(Stage, layer, &buf)?calls? Each capture is a one-liner; helpers would mask the natural buffer-name → stage-name pairing and make audit harder.Test plan
cargo build -p aprender-serve --release --features cuda— cleancargo test -p aprender-serve --lib forward_traced— 6 passingcargo test -p aprender-serve --lib traced_save_tensor— 4 passing🤖 Generated with Claude Code