feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference + empirical bug location by noahgift · Pull Request #1423 · paiml/aprender

noahgift · 2026-05-03T12:13:52Z

Summary

Authors scripts/generate_qwen25_coder_fp16_stages.py AND ships the first empirical SHIP-007 bug-location result.

The script extracts FP16 forward-pass intermediate stages from Qwen/Qwen2.5-Coder-7B-Instruct via HF transformers + forward hooks, dumps them in the same APRT byte format produced by apr trace --save-tensor, and pairs with the existing APR side for apr diff --values element-wise comparison.

Live ran on noah-Lambda-Vector (CPU, HF cache already populated, no download). Combined with the 16-stage APR-side smoke from PR #1419/#1421, this produces the first end-to-end empirical bisection of the SHIP-007 root cause.

Bisection result (the actual ship-progress finding)

Stage	Cosine sim	Status
embedding	1.0000000000	bit-identical
attn_norm	0.9999999483	Q4K noise floor
attn_out	0.9966	← FIRST significant drop — bug is here
ffn_norm	0.9959	downstream
ffn_gate/up/...	0.996-0.999	downstream
final_norm	0.9933	(whole-model, accumulates)
lm_head	0.9969	(whole-model)

The bug is inside the layer-0 attention block between RMSNorm output and the post-O-proj attention output.

Files

scripts/generate_qwen25_coder_fp16_stages.py — 320-line Python script (uv-based, no pip)
evidence/ship-007-layer-0-oracle-bisection-2026-05-03/findings.md — full bisection table + reproducer + interpretation

Five Whys (full detail in commit messages)

Why need this script? SHIP-007 needs an oracle-grade reference to compare APR Q4K against — the 16-stage APR side from PR feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced #1421 has no comparison target without it.
Why HF instead of llama.cpp? llama.cpp Q4K is the same quantization scheme as APR — diffing them would only catch APR-specific bugs that change weights, not forward computation bugs. Need an INDEPENDENT FP16 reference.
Why APRT format on the HF side? apr diff --values (PR feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors #1413) already speaks APRT; using the same format on both sides makes the comparison a one-liner with no rewriting.
Why CPU forward? PyTorch+CUDA version mismatch with the system NVIDIA driver (12.080); CPU FP16 7B forward is feasible and ran in seconds for one pass.
Why skip qkv_matmul/qkv_bias/attention on the HF side? Those need q/k/v tensor concatenation + RoPE re-derivation outside HF's monolithic self_attn module. Deferred to a follow-up; the 13 captured stages already covered enough to pinpoint the bug to the attention block.

Test plan

Script runs end-to-end on canonical 7B teacher in seconds (CPU)
All 12 captured stages have correct APRT byte format (12-byte header + dim_product f32)
embedding diff → cos=1.0000000 confirms tokenizer parity (input pipeline correct)
First diverging stage identified empirically: attn_out
CI green (script alone is markdown + python; doesn't touch Rust/CI)
Auto-merge on green

🤖 Generated with Claude Code

…discharge for FALSIFY-009/010/011 End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@v

…stage generator Authors `scripts/generate_qwen25_coder_fp16_stages.py` — a Python tool that runs `Qwen/Qwen2.5-Coder-7B-Instruct` at FP16 with forward hooks attached to each natural per-layer module and dumps the activations in the same APRT byte format that `apr trace --save-tensor` produces. Output layout mirrors the APR side (`layer-0/<stage>.bin` + `<stage>.bin`) so `apr diff --values <apr>.bin <hf>.bin` works element-wise without any rewriting. Captured 13/16 stages directly: - Per-layer (11): embedding, attn_norm, attn_out, post_ffn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out - Whole-model (2): final_norm, lm_head Skipped 3/16 (qkv_matmul / qkv_bias / attention) — these need deeper instrumentation since HF stores Q/K/V separately + RoPE is internal to self_attn. Deferred to a follow-up; the 13 captured stages already cover all major points along the forward pass. Five Whys 1. Why need an HF FP16 reference? SHIP-007 layer-0 element-wise diff needs a ground-truth oracle to compare APR Q4K against; FP16 is the closest published reference for this model. 2. Why not just use the existing `qwen2.5-coder-7b-instruct-q4k.safetensors` on disk? That's the same Q4K data we already feed into APR — diffing it against APR would only catch APR-side bugs that change weights, not bugs in forward computation. We need an INDEPENDENT reference. 3. Why hooks instead of direct model code edits? HF's modeling_qwen2.py is auto-loaded via `trust_remote_code=True`; the hooks let us inspect every stage without forking HF's source. 4. Why APRT byte format (not torch.save)? `apr diff --values` already recognizes APRT files (PR #1413) — using the same format makes the diff a one-liner. Drift-prevention: same format on both sides keeps comparison shape-agnostic. 5. Why skip qkv_matmul/qkv_bias/attention now? Discharging the discoverable 13 stages is high-leverage; the remaining 3 require manual q+k+v concatenation and Q@Kᵀ@v re-derivation. Worth a follow-up PR but blocking on it would delay every other stage's bisection signal. Note: This script is NOT auto-run in CI — it requires HF cache containing `Qwen/Qwen2.5-Coder-7B-Instruct` (~15 GB). Confirmed already cached at ~/.cache/huggingface/hub/ on noah-Lambda-Vector 2026-05-03. Operator runs it once via `uv run --with torch --with transformers` to produce the fixture; downstream `apr diff` passes are deterministic byte comparisons. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…diverging stage End-to-end empirical bisection on canonical Qwen2.5-Coder-7B-Instruct teacher with HF FP16 ground truth (CPU forward, HF cache hit). Element-wise diff every shared layer-0 stage between APR Q4K and HF FP16: | Stage | Cosine sim | Status | |-------------------|---------------|--------------------------------------------| | embedding | 1.0000000000 | bit-identical (correct) | | attn_norm | 0.9999999483 | within Q4K noise (correct) | | **attn_out** | **0.9966403** | **FIRST DROP — bug is in attention block** | | ffn_* (downstream)| 0.996-0.999 | carries drift (downstream artifacts) | | final_norm | 0.9932669898 | (whole-model — accumulates 28 layers) | | lm_head | 0.9969170161 | (whole-model — last-token logits) | This narrows the SHIP-007 root cause to the layer-0 attention block, specifically between RMSNorm output (cos=0.99999995, correct) and post-O-proj attention output (cos=0.9966, wrong). Possible bug sites within the block: 1. qkv_matmul (Q4K matmul × QKV weights) — needs HF-side capture 2. qkv_bias 3. RoPE on Q/K 4. Q@Kᵀ scaled-dot-product 5. Softmax with causal mask 6. softmax @ V 7. O-projection (Q4K matmul × O-proj weight) Next milestone: extend `scripts/generate_qwen25_coder_fp16_stages.py` with qkv_matmul / qkv_bias / attention capture (currently deferred to PARTIAL coverage), re-run diff, pinpoint the divergent kernel. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@v

…arrow bug to attention math (#1424) Extends `generate_qwen25_coder_fp16_stages.py` with HF-side captures for the 3 stages previously deferred. Refines the SHIP-007 layer-0 bisection from "inside attention block" (v1) to "between qkv_bias and attention" (v2). ## Refined cosine table | Stage | Cosine | Δ from prev | Status | |---------------|-----------|-------------|-----------------------| | attn_norm | 0.9999999 | -5e-8 | RMSNorm correct | | qkv_matmul | 0.99969 | -3.1e-4 | Q4K matmul noise (OK) | | qkv_bias | 0.9999975 | +2.8e-4 | bias dampens | | **attention** | 0.99858 | **-1.4e-3** | **← bug is here** | | attn_out | 0.99664 | -1.9e-3 | O-proj amplifies | Bug is **between qkv_bias and attention** = inside the attention math: RoPE / Q@Kᵀ / scale / causal mask / softmax / V@. NOT in QKV matmul (acceptable Q4K noise). NOT in QKV bias add (within FP precision). O-projection adds its own ~1.9e-3 cosine drop — secondary. ## Implementation New HF-side hooks: - `make_qkv_hook` on q_proj/k_proj/v_proj — concat outputs to derive qkv_bias (post-bias) and qkv_matmul (post-bias minus per-Linear bias) - `hook_o_proj_pre` (forward_pre_hook) on o_proj — captures its INPUT, which is APR's "attention" stage (post softmax(Q@Kᵀ)@v, pre-O-proj) Script now produces 15 stage files (was 12 in v1). ## Why qkv_matmul cos=0.99969 < qkv_bias cos=0.9999975 Mathematical artifact, not a bug: - qkv_matmul = Q4K_matmul(Q4K_input × Q4K_weight) — has ~3e-4 cosine noise vs FP16 - qkv_bias = qkv_matmul + bias (deterministic FP16 bias vector) - Adding deterministic vector dominates direction → relative noise dampens - Both APR and HF add the same bias values → cos increases on both sides equally Confirmed via: bias subtraction matches (HF - bias ≈ APR pre-bias on each side). ## Five Whys 1. Why need qkv stage captures? v1 only narrowed bug to "attention block" — not enough to drive a fix. We need to know if the bug is in the projections or the attention math. 2. Why is qkv_matmul cos lower than qkv_bias? See above — bias addition is a known mathematical artifact with deterministic vectors. 3. Why is the bug between qkv_bias and attention specifically? Cos=0.9999975 → 0.99858 is a 70× factor, far above Q4K floor. The intermediate ops (RoPE, scale, softmax, mask, V@) introduce real divergence. 4. Why O-proj adds another 1.9e-3 drop? Q4K_matmul of attention × O-proj weight is the same Q4K-vs-FP16 floor as qkv_matmul. Acceptable. 5. Why narrow further to RoPE/scale/softmax/mask/V@? Each is a candidate. Without finer-grained captures inside HF's monolithic Qwen2Attention, v2 cannot bisect further. Future work: instrument HF's attention internals OR cross-reference candle/pytorch for the algebraic spec of each sub-op. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…PU execution path (#1425) * evidence(ship-007): v3 — APR attention code audit vs HF Qwen2 reference Cross-referenced APR's attention forward (`inference.rs` + `helpers.rs`) against HF Transformers Qwen2 to identify the algebraic source of the v2-measured 1.4e-3 cosine drop between qkv_bias and attention. ## Audit result: NO algebraic bug in APR attention Verified MATCHES vs HF Qwen2: - RoPE rotation formula (split-half, x[i]=x1·cos − x2·sin / x[i+½d]=x1·sin + x2·cos) - RoPE freq formula (1/theta^(2i/d)) - rope_theta value (1000000.0 from `metadata.rope_theta`) - Attention scale (1/sqrt(head_dim)) - Causal mask (`for j in 0..=i` triangular) - Softmax (f32 max-subtract) - QKV bias position (post-matmul, pre-RoPE) - GQA-7:1 head indexing (`kv_head = head/group_size`) ## Refined hypothesis The 1.4e-3 cosine drop is most likely **systematic precision loss from Q4K dequant compounding through attention math**, NOT a structural algorithmic bug. Specifically: 1. APR's `forward_traced` uses F32 dequantized Q4K weights (per `inference.rs:38` comment "Q4K layers not used in traced forward"). 2. The Q4K dequant is lossy (~1e-3 RMS per element). 3. When these slightly-off Q values are dotted against slightly-off K values (also from Q4K dequant), the product compounds the error. 4. This compounding produces cos=0.99858 at attention output — consistent with systematic precision loss, not a bug. ## Implication for SHIP-007 fix If this hypothesis is right, the layer-0 attention bisection has hit the natural noise floor of Q4K-vs-FP16 comparison. The actual `apr run` quality issue may be: (a) Further downstream — accumulating drift through 28 layers (b) NOT a forward-pass bug at all — could be sampling/decoding config (c) Q4K kernel-specific — `apr run` uses Q4K kernels (faster path) while `forward_traced` uses F32 dequant (more accurate path); the two might diverge in how the kernel handles edge cases ## Next narrowing tests 1. Run `apr trace --save-tensor` on the FP16 safetensors version of the teacher; if cos improves to >0.999 across all stages, confirms (a)/(c). 2. Multi-layer cosine sweep (layers 0/1/13/27) to characterize drift growth. 3. argmax-flip check on lm_head — if APR top-1 token matches HF top-1, the drift is "noise" not bug-relevant. Evidence: `evidence/ship-007-layer-0-oracle-bisection-2026-05-03/findings-v3-attention-code-audit.md` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v4 — DECISIVE PIVOT, bug pinpointed to GPU execution path Two falsifying live tests run on canonical 7B teacher reframe SHIP-007 fundamentally: ## Test 1: lm_head argmax MATCHES APR forward_traced top-1 token: 220 (' ') HF FP16 top-1 token: 220 (' ') First 3 top-5 ranks identical: [220, 576, 2014] → The cos=0.998 forward divergence in v1/v2/v3 is NOT bug-relevant for greedy decoding. It's just systematic precision noise. ## Test 2: `apr run --temperature 0.0` produces gibberish $ apr run qwen2.5-coder-7b-instruct-q4k.apr --prompt 'What is 2+2?' \ --max-tokens 16 --temperature 0.0 ampiezza = 0.5 diametro = 10 → Italian-looking gibberish, NOT '4', NOT a coherent answer. ## Test 3: even `--max-tokens 1` disagrees with forward_traced $ apr run [...] --max-tokens 1 --temperature 0.0 ampie → Single-step apr run produces different first token than forward_traced (which argmaxed to 220 ' '). ## The pivot The SHIP-007 bug is NOT in the forward pass instrumented by `forward_traced`. It's in the `apr run` GPU/wgpu hybrid execution path: | Path | Backend | Weights | Output for "What is 2+2?" | |------------------|--------------------------------|---------|---------------------------| | forward_traced | CPU scalar-loop matmul | F32 | argmax=220 (' ', matches HF) | | apr run | CUDA graph (646 kernels) + wgpu | F32 | "ampie..." (gibberish) | Both paths use the same F32 weights (apr run dequantizes Q4K to F32 before GPU upload, per PMAT-333 log line). The divergence is in **kernel implementations** — CPU scalar loops vs CUDA/wgpu kernels. ## All previous findings invalidated - v1 "bug is in attention block" — INVALID (was just Q4K precision noise) - v2 "bug is between qkv_bias and attention" — INVALID (same) - v3 "no algebraic bug, must be precision" — PARTIALLY CORRECT (forward_traced IS correct), but missed that the actual broken path is `apr run` GPU. The forward_traced bisection chain (cos drops at attention) is now understood as a RED HERRING — it captures a different code path than the buggy one. ## Next narrowing 1. Force `apr run` to use CPU (env var or feature flag) — does it match forward_traced? If yes, confirms GPU/wgpu parity bug. 2. Element-wise diff GPU layer-0 attention output vs CPU forward_traced. 3. Audit `realizar/src/quantize/fused_*` and CUDA graph kernels for forward-pass bugs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…cision (#1449) After §45 landed the 5/5 DISCHARGE milestone for `apr-cpu-vs-gpu-output-parity-v1`, the natural follow-up question is whether the 238 commits accumulated since v0.31.2 (2026-04-19) warrant a `cargo publish` cut today. ## Verdict: HOLD The release-readiness audit found exactly one load-bearing blocker — SHIP-007 layer-0 attention divergence is empirically pinpointed (cos=0.99999995 attn_norm → 0.9966 attn_out per memory `2026-05-03 SHIP-007 finding`) but **not yet fixed**, so cutting v0.32.0 today would crates.io-ship a binary where `apr run` on a 7B GPU teacher still emits gibberish unless the user passes `--no-gpu`. The §41-§45 jidoka armor makes the failure visible + fail-closed (which is shippable behaviour), but a user-facing `## [0.32.0]` headline that reads "5/5 DISCHARGE on apr-cpu-vs-gpu-output-parity-v1" implies the GPU correctness hole is closed when in truth it is only contained. Per `feedback_fix_root_cause_never_route_around.md`: route-around-via-fallback is acceptable as a *temporary jidoka layer*, but it is muda to ship a release whose headline claims a fix that doesn't exist. ## What §46 records | Subsection | Content | |---|---| | 46.1 | What's accumulated since v0.31.2 (8-row headline table) | | 46.2 | Release-readiness gate audit (6-gate verdict table) | | 46.3 | Why SHIP-007 is the load-bearing blocker | | 46.4 | Pre-flight artifacts shipped alongside this decision (PR #1448) | | 46.5 | **Pre-conditions for the v0.32.0 cut** — 5 falsifiable gates | | 46.6 | Five Whys (why hold / why now / why §46 not extend §45) | | 46.7 | Net effects + ranked open follow-ups | | 46.8 | Spec amendment cadence preserved | ## Pre-conditions (§46.5) for the future v0.32.0 cut 1. SHIP-007 layer-0 attention divergence FIXED (cos ≥ 0.999 at every sub-stage) 2. PR #1448 merged (CHANGELOG `[Unreleased]` populated; README drift gate GREEN) 3. Workspace version bumped 0.31.2 → 0.32.0; `## [0.32.0] - <date>` heading 4. Post-publish QA: `cargo install aprender --force` + /dogfood GO verdict (per `feedback_post_publish_qa_required.md` — v0.31.1 was yanked for skipping) 5. Drift gates GREEN: `check_readme_claims.sh` + `pv validate` + `cargo deny` ## Net effects - Spec v2.90.0 → **v2.91.0** - MODEL-1 ship %: unchanged at 91% (this is metadata, not a falsifier flip) - MODEL-2 ship %: unchanged at 57% - Coverage tally: unchanged (no PARTIAL → DISCHARGED in this cycle) - Single highest-leverage next-session deliverable: SHIP-007 layer-0 attention bisection using `apr trace --save-tensor` + HF FP16 oracle from PR #1423 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

@v

…— FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL (#1451) * contract(trace-attn-sub-stages-v1): scaffold layer-0 attention bisection (5 new SaveTensorStage variants) Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED that pre-commits to the schema for extending `SaveTensorStage` with FIVE new intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence can be bisected element-wise against the HF FP16 oracle (PR #1423). ## Why now (per spec §46.7) Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest- leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`: - cos(APR.attn_norm, HF.attn_norm) = 0.99999995 ✓ (correct) - cos(APR.attn_out, HF.attn_out) = 0.9966 ✗ (wrong) The bug is somewhere INSIDE the attention block. The existing `SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` — too coarse to localize. ## What this contract pins 5 new variants, in computation order inside the attention block: | New stage | What it captures | |---|---| | `QPostRope` | Q after RoPE (post Q-projection + RoPE rotate) | | `KPostRope` | K after RoPE (GQA: shared across head groups) | | `AttnScores` | Q·Kᵀ / sqrt(head_dim), pre-softmax | | `AttnSoftmax` | softmax(scores + causal_mask) | | `AttnVOut` | softmax · V (pre output O-projection) | Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut` ## Falsifiers (5) | ID | What it predicts | Status | |---|---|---| | FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT | | FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL | FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must be falsified to actually pinpoint the SHIP-007 sub-stage. Marked BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on RTX 4090. This contract pins the gate; the implementation cascade follows. ## Five Whys 1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?** The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it. Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) — sub-block contracts are siblings of the parent, not amendments. 2. **Why pin the schema before implementation?** Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity → extend the enum behind a contract." Contract-first preserves the audit chain spec § → contract → implementation PRs → live discharge. 3. **Why these 5 stages and not 3 or 7?** The 5 capture points bracket every numerically distinct intermediate inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope, scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj). Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is premature — let the bisection localize first, then refine if needed. 4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?** PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today. ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle extension; today neither exists. BLOCKER honestly classifies the gap; matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443). 5. **Why is this not just SHIP-007's fix itself?** Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract delivers the *measurement instrument* that pinpoints the sub-stage; the fix is the next PR cascade after that pin lands. ## Net effects - New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers. - `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0. - MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips). - MODEL-2 ship %: unchanged at 57%. - Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract is new — they count once it''s wired into the §-amendment chain). - Unblocks the next PR cascade: enum extension + forward_traced threading + apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005 algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * contract(trace-attn-sub-stages-v1): v1.0.0 → v1.1.0 — Toyota Way correction (only 2 new variants needed, not 5) ## What's wrong with v1.0.0 v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants were needed for the SHIP-007 layer-0 attention bisection: QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut. Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs` shows THREE of those five ALREADY EXIST in the parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL: - `QPostRope` — already in enum (line 47) - `KPostRope` — already in enum (line 49) - `Attention` — already in enum (line 51), semantically my "AttnVOut" ("post softmax(Q@Kᵀ)@v, pre O-proj") Only TWO are truly missing: - `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax - `AttnSoftmax` — softmax(scores + causal_mask), pre-V ## Why it happened Per `feedback_no_guessing.md`: should have run `pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I extrapolated from the parent contract description without reading the live enum source. Toyota Way andon — caught on next iteration. Per `feedback_toyota_way_all_defects.md`: all defects are mine. Fixing at the contract level BEFORE any implementation PR depends on the wrong scope is exactly the cost-of-defect minimization the toolchain is designed for. ## What v1.1.0 does - Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL) - Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax - Documents the FULL 9-stage layer-0 bisection chain spanning parent-contract stages + 2 new ones: attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out - Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope - Adds bisection_chain_layer_0 equation pinning the 9-element cosine sequence (with empirical state per memory `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966) - FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF FP16 oracle extension to capture 2 new stages on RTX 4090) ## Five Whys 1. **Why did v1.0.0 claim 5 new variants?** Authored without reading the live save_tensor_stage.rs source. 2. **Why didn't I read the source first?** Skipped the `pmat query SaveTensorStage` step that `feedback_no_guessing.md` mandates. Worked from the parent contract description's prose ("Embedding, AttnNorm, QkvMatmul, AttnOut, ...") which truncated 18 stages to 14. 3. **Why was the parent contract description truncated?** Doc-comment in `forward_traced_with_plan` rust source listed only 14 stages (the per-layer canonical-FFN order, omitting QkvBias + the parent's renamed Attention). My contract reused that prose instead of reading the enum directly. 4. **Why does this matter for SHIP-007 ship %?** It doesn't yet — the contract is still scaffold scope, no implementation PR has shipped against the wrong scope. v1.1.0 correction lands BEFORE the cascade triggers. 5. **Why amend the contract instead of opening a sibling fix-PR?** Same branch (#1450) is the right place. Toyota Way: stop the line, fix the defect at source, then continue. A sibling PR would split the audit story across two commits with no benefit. ## Net effects - Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED** - `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0 - MODEL-1 ship %: unchanged at 91% (this is contract correction) - MODEL-2 ship %: unchanged at 57% - Implementation cascade now correctly scoped to 2 new variants, not 5 — saves an estimated 60% of the enum-extension PR's LOC 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SaveTensorStage gains AttnScores + AttnSoftmax — FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450). Adds the 2 new attention sub-stage variants to `SaveTensorStage`: - `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask - `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply Closes the SHIP-007 layer-0 attention bisection gap inside the Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now: attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out ## What changed | File | Change | |---|---| | `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. `layer_output` alias) | | `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **4 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) | | `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 | ## Test results - `cargo test -p aprender-serve --lib inference_trace` — **167 passed, 0 failed** - 4 new tests: `falsify_attn_sub_001_attn_scores_round_trip`, `falsify_attn_sub_001_attn_softmax_round_trip`, `falsify_attn_sub_001_2_new_stages_in_canonical_order`, `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`, `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain` - `cargo check --workspace --lib` — clean ## Falsifier discharge | ID | Status before | Status after | Why | |---|---|---|---| | FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes | | FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (no change yet — depends on `forward_traced_with_plan` threading, follow-up PR) | Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the contract is still PROPOSED upstream. ## Five Whys 1. **Why this PR before #1450 lands?** Contract+impl can land together — #1450 introduces the contract, this PR provides the first implementation evidence. They reference each other and merge in either order without conflict. 2. **Why only the enum + tests, not `forward_traced_with_plan`?** Enum extension is the smallest atomic ticket per Toyota Way (one mechanism per PR). Threading the new variants through forward capture is the next PR (FALSIFY-ATTN-SUB-002 discharge). 3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention in `ALL`?** That's the canonical computation order pinned by the contract's ordering proof_obligation: `QkvBias → QPostRope → KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`. 4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`, not a separate variant. The enum has 20 distinct variants; `ALL` excludes the alias only at the `FromStr` layer. 5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain` test?** The contract's `bisection_chain_layer_0` equation pins the 9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004. This test pins the parser side of that gate so a future drift in stage names breaks loudly. ## Net effects - 2 new `SaveTensorStage` variants land - 5 new tests pin the variants + ordering + parser - MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007 bisection cascade; ship % moves when a falsifier flips DISCHARGED) - MODEL-2 ship %: unchanged at 57% - Implementation cascade ready to thread variants through `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

@v

…ith_plan — FALSIFY-ATTN-SUB-002 (#1455) * contract(trace-attn-sub-stages-v1): scaffold layer-0 attention bisection (5 new SaveTensorStage variants) Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED that pre-commits to the schema for extending `SaveTensorStage` with FIVE new intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence can be bisected element-wise against the HF FP16 oracle (PR #1423). ## Why now (per spec §46.7) Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest- leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`: - cos(APR.attn_norm, HF.attn_norm) = 0.99999995 ✓ (correct) - cos(APR.attn_out, HF.attn_out) = 0.9966 ✗ (wrong) The bug is somewhere INSIDE the attention block. The existing `SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` — too coarse to localize. ## What this contract pins 5 new variants, in computation order inside the attention block: | New stage | What it captures | |---|---| | `QPostRope` | Q after RoPE (post Q-projection + RoPE rotate) | | `KPostRope` | K after RoPE (GQA: shared across head groups) | | `AttnScores` | Q·Kᵀ / sqrt(head_dim), pre-softmax | | `AttnSoftmax` | softmax(scores + causal_mask) | | `AttnVOut` | softmax · V (pre output O-projection) | Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut` ## Falsifiers (5) | ID | What it predicts | Status | |---|---|---| | FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT | | FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL | FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must be falsified to actually pinpoint the SHIP-007 sub-stage. Marked BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on RTX 4090. This contract pins the gate; the implementation cascade follows. ## Five Whys 1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?** The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it. Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) — sub-block contracts are siblings of the parent, not amendments. 2. **Why pin the schema before implementation?** Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity → extend the enum behind a contract." Contract-first preserves the audit chain spec § → contract → implementation PRs → live discharge. 3. **Why these 5 stages and not 3 or 7?** The 5 capture points bracket every numerically distinct intermediate inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope, scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj). Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is premature — let the bisection localize first, then refine if needed. 4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?** PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today. ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle extension; today neither exists. BLOCKER honestly classifies the gap; matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443). 5. **Why is this not just SHIP-007's fix itself?** Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract delivers the *measurement instrument* that pinpoints the sub-stage; the fix is the next PR cascade after that pin lands. ## Net effects - New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers. - `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0. - MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips). - MODEL-2 ship %: unchanged at 57%. - Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract is new — they count once it''s wired into the §-amendment chain). - Unblocks the next PR cascade: enum extension + forward_traced threading + apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005 algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * contract(trace-attn-sub-stages-v1): v1.0.0 → v1.1.0 — Toyota Way correction (only 2 new variants needed, not 5) ## What's wrong with v1.0.0 v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants were needed for the SHIP-007 layer-0 attention bisection: QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut. Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs` shows THREE of those five ALREADY EXIST in the parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL: - `QPostRope` — already in enum (line 47) - `KPostRope` — already in enum (line 49) - `Attention` — already in enum (line 51), semantically my "AttnVOut" ("post softmax(Q@Kᵀ)@v, pre O-proj") Only TWO are truly missing: - `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax - `AttnSoftmax` — softmax(scores + causal_mask), pre-V ## Why it happened Per `feedback_no_guessing.md`: should have run `pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I extrapolated from the parent contract description without reading the live enum source. Toyota Way andon — caught on next iteration. Per `feedback_toyota_way_all_defects.md`: all defects are mine. Fixing at the contract level BEFORE any implementation PR depends on the wrong scope is exactly the cost-of-defect minimization the toolchain is designed for. ## What v1.1.0 does - Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL) - Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax - Documents the FULL 9-stage layer-0 bisection chain spanning parent-contract stages + 2 new ones: attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out - Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope - Adds bisection_chain_layer_0 equation pinning the 9-element cosine sequence (with empirical state per memory `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966) - FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF FP16 oracle extension to capture 2 new stages on RTX 4090) ## Five Whys 1. **Why did v1.0.0 claim 5 new variants?** Authored without reading the live save_tensor_stage.rs source. 2. **Why didn't I read the source first?** Skipped the `pmat query SaveTensorStage` step that `feedback_no_guessing.md` mandates. Worked from the parent contract description's prose ("Embedding, AttnNorm, QkvMatmul, AttnOut, ...") which truncated 18 stages to 14. 3. **Why was the parent contract description truncated?** Doc-comment in `forward_traced_with_plan` rust source listed only 14 stages (the per-layer canonical-FFN order, omitting QkvBias + the parent's renamed Attention). My contract reused that prose instead of reading the enum directly. 4. **Why does this matter for SHIP-007 ship %?** It doesn't yet — the contract is still scaffold scope, no implementation PR has shipped against the wrong scope. v1.1.0 correction lands BEFORE the cascade triggers. 5. **Why amend the contract instead of opening a sibling fix-PR?** Same branch (#1450) is the right place. Toyota Way: stop the line, fix the defect at source, then continue. A sibling PR would split the audit story across two commits with no benefit. ## Net effects - Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED** - `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0 - MODEL-1 ship %: unchanged at 91% (this is contract correction) - MODEL-2 ship %: unchanged at 57% - Implementation cascade now correctly scoped to 2 new variants, not 5 — saves an estimated 60% of the enum-extension PR's LOC 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SaveTensorStage gains AttnScores + AttnSoftmax — FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450). Adds the 2 new attention sub-stage variants to `SaveTensorStage`: - `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask - `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply Closes the SHIP-007 layer-0 attention bisection gap inside the Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now: attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out ## What changed | File | Change | |---|---| | `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. `layer_output` alias) | | `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **4 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) | | `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 | ## Test results - `cargo test -p aprender-serve --lib inference_trace` — **167 passed, 0 failed** - 4 new tests: `falsify_attn_sub_001_attn_scores_round_trip`, `falsify_attn_sub_001_attn_softmax_round_trip`, `falsify_attn_sub_001_2_new_stages_in_canonical_order`, `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`, `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain` - `cargo check --workspace --lib` — clean ## Falsifier discharge | ID | Status before | Status after | Why | |---|---|---|---| | FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes | | FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (no change yet — depends on `forward_traced_with_plan` threading, follow-up PR) | Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the contract is still PROPOSED upstream. ## Five Whys 1. **Why this PR before #1450 lands?** Contract+impl can land together — #1450 introduces the contract, this PR provides the first implementation evidence. They reference each other and merge in either order without conflict. 2. **Why only the enum + tests, not `forward_traced_with_plan`?** Enum extension is the smallest atomic ticket per Toyota Way (one mechanism per PR). Threading the new variants through forward capture is the next PR (FALSIFY-ATTN-SUB-002 discharge). 3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention in `ALL`?** That's the canonical computation order pinned by the contract's ordering proof_obligation: `QkvBias → QPostRope → KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`. 4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`, not a separate variant. The enum has 20 distinct variants; `ALL` excludes the alias only at the `FromStr` layer. 5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain` test?** The contract's `bisection_chain_layer_0` equation pins the 9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004. This test pins the parser side of that gate so a future drift in stage names breaks loudly. ## Net effects - 2 new `SaveTensorStage` variants land - 5 new tests pin the variants + ordering + parser - MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007 bisection cascade; ship % moves when a falsifier flips DISCHARGED) - MODEL-2 ship %: unchanged at 57% - Implementation cascade ready to thread variants through `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): wire 4 attention sub-stages in forward_traced_with_plan — FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL Stacked on #1451 (which adds the 2 new SaveTensorStage variants). When #1451 merges to main, this PR rebases cleanly and lands as a 4-stage wire fix. ## What this PR wires | Stage | Existed in enum? | emit() existed? | After this PR | |---|---|---|---| | QPostRope | YES | NO | YES (new emit) | | KPostRope | YES | NO | YES (new emit) | | AttnScores | NEW (#1451) | NO | YES (new emit + accumulator) | | AttnSoftmax | NEW (#1451) | NO | YES (new emit + accumulator) | Closes the parent-contract drift discovered in PR #1452 research evidence: QPostRope + KPostRope were in the SaveTensorStage enum but had no emit() calls in forward_traced_with_plan. The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstated coverage for those 2 stages. This PR closes the drift as a side-effect. ## Implementation details **QPostRope/KPostRope** (post line 133): emit q_all/k_all directly after the inner loop populates them. Tensors already exist; this is just 2 emit() calls — zero new allocation. **AttnScores/AttnSoftmax** (inside head loop): allocate accumulator tensors of shape `[num_heads × seq × seq]` ONLY when the plan requests them. Inside the inner softmax loop, populate per (head, i, j) — zero overhead when plan is None or doesn't ask for these stages (FALSIFY-ATTN-SUB-005: additive purity). Memory cost: BOS forward (seq=1) → num_heads * 1 * 1 * 4 bytes = 112 bytes for Qwen2.5-Coder-7B (28 heads). Negligible. For longer seq, allocation scales O(num_heads * seq^2) and is gated by plan. ## Test results - `cargo test -p aprender-serve --lib -- --skip "gpu::"` — **13944 passed, 0 failed, 51 ignored** - `cargo check -p aprender-serve --lib` — clean - inference_trace tests: 167/167 PASS - (gpu:: tests have a pre-existing SIGABRT flake unrelated to this change) ## Falsifier discharge map | ID | Status before | Status after | Why | |---|---|---|---| | FALSIFY-ATTN-SUB-002 (forward threading) | PARTIAL_ALGORITHM_LEVEL | (eligible for FUNCTIONAL once contract YAML on main + this lands) | 4 emit() calls now thread the 4 stages in canonical order | | FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (eligible) | accumulator allocation gated by plan.should_save() | ## Five Whys 1. **Why wire 4 stages, not 2?** QPostRope + KPostRope are pre-existing gaps in the parent contract; the same-file fix is a free side-effect per Toyota Way "all defects are mine". 2. **Why allocate accumulators only when requested?** O(num_heads * seq^2) memory shouldn't be paid on the default forward path. Plan-gating keeps the production inference path zero-overhead. 3. **Why insert capture at lines 133, 152, 160 specifically?** Per `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`: line 133 = post Q/K/V copy (Q/K post-rope), line 152 = scores after scale (pre-softmax), line 160 = post-softmax probs. 4. **Why use scores_all.is_some() check vs always-allocate?** Always-allocate forces O(seq^2 * num_heads * 4) bytes per layer regardless of capture. Some(Vec) idiom plus is_some_and check is the idiomatic Rust pattern for conditional capture. 5. **Why this PR stacked on #1451 rather than off main?** Requires SaveTensorStage::AttnScores + AttnSoftmax variants, which only exist on #1451's branch. When #1451 merges, this rebases to main as a clean 51-line delta. ## Net effects - 4 stages now wired in `forward_traced_with_plan` - MODEL-1 ship %: unchanged at 91% (stays scaffold; ship % moves at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle) - MODEL-2 ship %: unchanged at 57% - Cascade step 4/8 of §47.1 roadmap delivered 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

@v

…on bisection plan (2 new SaveTensorStage variants + 9-stage chain) (#1450) * contract(trace-attn-sub-stages-v1): scaffold layer-0 attention bisection (5 new SaveTensorStage variants) Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED that pre-commits to the schema for extending `SaveTensorStage` with FIVE new intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence can be bisected element-wise against the HF FP16 oracle (PR #1423). ## Why now (per spec §46.7) Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest- leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`: - cos(APR.attn_norm, HF.attn_norm) = 0.99999995 ✓ (correct) - cos(APR.attn_out, HF.attn_out) = 0.9966 ✗ (wrong) The bug is somewhere INSIDE the attention block. The existing `SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` — too coarse to localize. ## What this contract pins 5 new variants, in computation order inside the attention block: | New stage | What it captures | |---|---| | `QPostRope` | Q after RoPE (post Q-projection + RoPE rotate) | | `KPostRope` | K after RoPE (GQA: shared across head groups) | | `AttnScores` | Q·Kᵀ / sqrt(head_dim), pre-softmax | | `AttnSoftmax` | softmax(scores + causal_mask) | | `AttnVOut` | softmax · V (pre output O-projection) | Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut` ## Falsifiers (5) | ID | What it predicts | Status | |---|---|---| | FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT | | FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL | FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must be falsified to actually pinpoint the SHIP-007 sub-stage. Marked BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on RTX 4090. This contract pins the gate; the implementation cascade follows. ## Five Whys 1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?** The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it. Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) — sub-block contracts are siblings of the parent, not amendments. 2. **Why pin the schema before implementation?** Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity → extend the enum behind a contract." Contract-first preserves the audit chain spec § → contract → implementation PRs → live discharge. 3. **Why these 5 stages and not 3 or 7?** The 5 capture points bracket every numerically distinct intermediate inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope, scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj). Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is premature — let the bisection localize first, then refine if needed. 4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?** PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today. ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle extension; today neither exists. BLOCKER honestly classifies the gap; matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443). 5. **Why is this not just SHIP-007's fix itself?** Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract delivers the *measurement instrument* that pinpoints the sub-stage; the fix is the next PR cascade after that pin lands. ## Net effects - New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers. - `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0. - MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips). - MODEL-2 ship %: unchanged at 57%. - Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract is new — they count once it''s wired into the §-amendment chain). - Unblocks the next PR cascade: enum extension + forward_traced threading + apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005 algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * contract(trace-attn-sub-stages-v1): v1.0.0 → v1.1.0 — Toyota Way correction (only 2 new variants needed, not 5) ## What's wrong with v1.0.0 v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants were needed for the SHIP-007 layer-0 attention bisection: QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut. Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs` shows THREE of those five ALREADY EXIST in the parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL: - `QPostRope` — already in enum (line 47) - `KPostRope` — already in enum (line 49) - `Attention` — already in enum (line 51), semantically my "AttnVOut" ("post softmax(Q@Kᵀ)@v, pre O-proj") Only TWO are truly missing: - `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax - `AttnSoftmax` — softmax(scores + causal_mask), pre-V ## Why it happened Per `feedback_no_guessing.md`: should have run `pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I extrapolated from the parent contract description without reading the live enum source. Toyota Way andon — caught on next iteration. Per `feedback_toyota_way_all_defects.md`: all defects are mine. Fixing at the contract level BEFORE any implementation PR depends on the wrong scope is exactly the cost-of-defect minimization the toolchain is designed for. ## What v1.1.0 does - Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL) - Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax - Documents the FULL 9-stage layer-0 bisection chain spanning parent-contract stages + 2 new ones: attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out - Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope - Adds bisection_chain_layer_0 equation pinning the 9-element cosine sequence (with empirical state per memory `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966) - FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF FP16 oracle extension to capture 2 new stages on RTX 4090) ## Five Whys 1. **Why did v1.0.0 claim 5 new variants?** Authored without reading the live save_tensor_stage.rs source. 2. **Why didn't I read the source first?** Skipped the `pmat query SaveTensorStage` step that `feedback_no_guessing.md` mandates. Worked from the parent contract description's prose ("Embedding, AttnNorm, QkvMatmul, AttnOut, ...") which truncated 18 stages to 14. 3. **Why was the parent contract description truncated?** Doc-comment in `forward_traced_with_plan` rust source listed only 14 stages (the per-layer canonical-FFN order, omitting QkvBias + the parent's renamed Attention). My contract reused that prose instead of reading the enum directly. 4. **Why does this matter for SHIP-007 ship %?** It doesn't yet — the contract is still scaffold scope, no implementation PR has shipped against the wrong scope. v1.1.0 correction lands BEFORE the cascade triggers. 5. **Why amend the contract instead of opening a sibling fix-PR?** Same branch (#1450) is the right place. Toyota Way: stop the line, fix the defect at source, then continue. A sibling PR would split the audit story across two commits with no benefit. ## Net effects - Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED** - `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0 - MODEL-1 ship %: unchanged at 91% (this is contract correction) - MODEL-2 ship %: unchanged at 57% - Implementation cascade now correctly scoped to 2 new variants, not 5 — saves an estimated 60% of the enum-extension PR's LOC 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift and others added 3 commits May 3, 2026 14:06

noahgift enabled auto-merge (squash) May 3, 2026 12:13

noahgift mentioned this pull request May 3, 2026

feat(scripts): SHIP-007 v2 — qkv_matmul/qkv_bias/attention captures narrow bug to attention math #1424

Merged

7 tasks

noahgift mentioned this pull request May 3, 2026

evidence(ship-007): v3+v4 — APR attention audit + DECISIVE PIVOT to GPU execution path #1425

Merged

6 tasks

noahgift merged commit f6369d2 into main May 3, 2026
10 checks passed

noahgift deleted the feat/ship-007-hf-fp16-reference-script branch May 3, 2026 13:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference + empirical bug location#1423

feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference + empirical bug location#1423
noahgift merged 5 commits into
mainfrom
feat/ship-007-hf-fp16-reference-script

noahgift commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 3, 2026

Summary

Bisection result (the actual ship-progress finding)

Files

Five Whys (full detail in commit messages)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant