evidence(ship-007): research note — forward_traced_with_plan has 2 pre-existing capture gaps (QPostRope + KPostRope) by noahgift · Pull Request #1452 · paiml/aprender

noahgift · 2026-05-04T01:52:33Z

Summary

Records pre-implementation research for FALSIFY-ATTN-SUB-002 (`trace-attn-sub-stages-v1.yaml` v1.1.0).

Discovery

While researching where to wire AttnScores + AttnSoftmax in `forward_traced_with_plan`, I found the following:

`QPostRope` + `KPostRope` variants exist in the SaveTensorStage enum (lines 47-50) but have no `emit()` call in `forward_traced_with_plan`. The RoPE-rotated tensors `q_all`/`k_all` are computed at lines 130-131 but never captured.

The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstates coverage for these 2 stages.
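A drift of this kind (a variant present in the enum but never emitted) can be caught mechanically by comparing the enum's full stage list against the stages a traced forward pass actually emits. The following is a minimal sketch under stated assumptions: `Stage`, `Stage::ALL`, `unwired_stages`, and `forward_with_gap` are hypothetical stand-ins written for illustration, not the real `SaveTensorStage` API or the real `forward_traced_with_plan` body.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a small subset of `SaveTensorStage`.
/// The real enum has more variants; only the shape of the check matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Stage {
    QkvMatmul,
    QPostRope,
    KPostRope,
    Attention,
    AttnOut,
}

impl Stage {
    /// Every variant, in computation order (assumed ordering for this sketch).
    pub const ALL: [Stage; 5] = [
        Stage::QkvMatmul,
        Stage::QPostRope,
        Stage::KPostRope,
        Stage::Attention,
        Stage::AttnOut,
    ];
}

/// Returns every stage in `all` that never appears in `emitted`,
/// preserving the order of `all`.
pub fn unwired_stages(all: &[Stage], emitted: &[Stage]) -> Vec<Stage> {
    let seen: HashSet<Stage> = emitted.iter().copied().collect();
    all.iter().copied().filter(|s| !seen.contains(s)).collect()
}

/// Hypothetical traced forward pass that omits two emits,
/// mimicking the QPostRope/KPostRope gap described above.
pub fn forward_with_gap() -> Vec<Stage> {
    vec![Stage::QkvMatmul, Stage::Attention, Stage::AttnOut]
}
```

Calling `unwired_stages(&Stage::ALL, &forward_with_gap())` reports the two post-RoPE stages. A check of this shape turns a silent coverage overstatement into a failing test.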

What FALSIFY-ATTN-SUB-002 will wire

| Stage | Source line | Existed in enum? |
|---|---|---|
| QPostRope | post line 133 | YES (gap) |
| KPostRope | post line 133 | YES (gap) |
| AttnScores | line 152 (per head, accumulator) | NEW (#1451) |
| AttnSoftmax | line 160 (per head, accumulator) | NEW (#1451) |

Why an evidence file, not a 5th stacked PR

Four PRs (#1448-#1451) already in flight. A 5th stacked PR would slow CI throughput. Recording the implementation plan here so the next loop iteration can spawn the impl PR off main once #1451 merges.

Five Whys + cross-references

In `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`.

Test plan

No code change in this PR; documentation only

🤖 Generated with Claude Code

…pre-existing capture gaps (QPostRope + KPostRope)

Records pre-implementation research for FALSIFY-ATTN-SUB-002 (`trace-attn-sub-stages-v1.yaml` v1.1.0).

## What this evidence pins

While researching where to wire AttnScores + AttnSoftmax in `forward_traced_with_plan` (per the v1.1.0 contract), discovered that QPostRope + KPostRope variants exist in the SaveTensorStage enum (lines 47-50) but have **no `emit()` call** in `forward_traced_with_plan`. The RoPE-rotated tensors q_all + k_all are computed at lines 130-131 but never captured.

The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstates coverage for these 2 stages.

## What FALSIFY-ATTN-SUB-002 will wire

When #1451 lands, the next PR will wire 4 capture points (not 2):

| Stage | Source line | Existed in enum? |
|---|---|---|
| QPostRope | post line 133 | YES (gap) |
| KPostRope | post line 133 | YES (gap) |
| AttnScores | line 152 (per head, accumulator) | NEW (#1451) |
| AttnSoftmax | line 160 (per head, accumulator) | NEW (#1451) |

## Why an evidence file, not a 5th stacked PR

Four PRs (#1448-#1451) already in flight. A 5th stacked PR would slow CI throughput. Recording the implementation plan here so the next loop iteration can spawn the impl PR off main once #1451 merges.

## Five Whys + cross-references

In `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`:

- Five Whys for scope (4 stages, not 2)
- Wire-plan with insertion points
- Backward-compat test plan
- Next-iteration deliverables checklist

## Net effects

- Evidence file lands; no code change in this PR
- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%
- Unblocks the next loop iteration's atomic PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…tion cascade STARTED (#1454) * contract(qwen3-moe-forward-gpu-v1): v1.0.0 DRAFT — scaffold for P0 GPU MoE forward path Per claude-code-parity-apr POC M49 priority elevation 2026-05-04, the GPU MoE forward path is now P0 / HIGHEST PRIORITY. Per CLAUDE.md "NEVER write code before writing a provable contract" — this is the contract scaffold (M-stage M-GPU-MOE-0 in the contract's implementation_stages). Why P0 ====== - CPU LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M. - Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s. - MoE inference is ~10× slower than dense, making the spec-prescribed default Qwen3-Coder model production-infeasible at ~30 tok/s. - The companion's action-stream parity machinery (CCPA-001..013, all DISCHARGED) cannot be exercised at production cadence — every `apr code` invocation hits the 30 tok/s wall. What this contract specifies ============================ metadata.kind: kernel status: DRAFT scope: crates/aprender-serve/src/gpu/{forward_qwen3_moe_gpu, scheduler/moe_dispatch}.rs + crates/aprender-compute/src/gpu/moe_kernels.rs (TBD) equations: - moe_forward_one_layer_gpu (mirrors v1 CPU equation, +cosine-vs-CPU invariant, +CudaExecutor::new(0).is_ok() precondition) - gpu_throughput_target (≥150 tok/s on RTX 4090 over 128-tok median window, ≥5x CPU baseline) proof_obligations: 7 AC_GPU_MOE_001 cosine ≥ 0.99 vs CPU LAZY-FUSED-MATVEC AC_GPU_MOE_002 router weights sum to 1.0 ± 1e-6 AC_GPU_MOE_003 output dimensions preserved AC_GPU_MOE_004 output finite (no NaN/Inf) AC_GPU_MOE_005 cosine ≥ 0.99 vs HF FP16 (inherits from v1) AC_GPU_MOE_006 ≥150 tok/s on RTX 4090 AC_GPU_MOE_007 VRAM utilization ≤ 95% of 24 GB falsification_tests: 7 FALSIFY-QW3-MOE-GPU-001 baseline (no GPU symbol) FALSIFY-QW3-MOE-GPU-PARITY-001 M-GPU-MOE-1 cosine vs CPU FALSIFY-QW3-MOE-GPU-PARITY-002 M-GPU-MOE-1 cosine vs HF FP16 FALSIFY-QW3-MOE-GPU-INVARIANTS-001 router/shape/finite 
FALSIFY-QW3-MOE-GPU-DETERMINISM-001 byte-identical reruns same seed FALSIFY-QW3-MOE-GPU-THROUGHPUT-001 ≥150 tok/s FALSIFY-QW3-MOE-GPU-MEMORY-001 ≤ 95% VRAM kani_harnesses: 2 KANI-QW3-MOE-GPU-001 router weights sum (AC_GPU_MOE_002) KANI-QW3-MOE-GPU-002 output shape preservation (AC_GPU_MOE_003) qa_gate: F-QW3-MOE-GPU-001 (5 named checks, falsification = swap quantized for EAGER FP32 → guaranteed OOM on 24 GB VRAM) Implementation stages ===================== M-GPU-MOE-0 This contract scaffold SHIPPED M-GPU-MOE-1 CUDA kernel + cosine-vs-CPU parity gate PENDING M-GPU-MOE-2 wgpu fallback (CLAUDE.md backend-agnostic) PENDING M-GPU-MOE-3 Throughput ≥150 tok/s + VRAM ≤ 95% PENDING When all 3 PENDING stages discharge, status flips DRAFT → ACTIVE_RUNTIME (matches qwen3-moe-forward-v1 v1 convention). Verification ============ $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml 0 error(s), 0 warning(s) Contract is valid. Refs claude-code-parity-apr POC M49 (P0 elevation, 2026-05-04) Refs claude-code-parity-apr POC R10 (risk row mirror) Refs qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL (CPU sibling) Refs apr-cpu-vs-gpu-output-parity-v1 (CPU↔GPU parity discipline) Refs arXiv:2305.18398 Dao FlashAttention-2 Refs arXiv:2305.05176 Aminabadi DeepSpeed-MoE Refs arXiv:2101.03961 Fedus Switch Transformers Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * spec(ship-two-models): v2.91.0 → v2.92.0 — §47 SHIP-007 layer-0 attention bisection cascade STARTED After §46 declared the v0.32.0 cut HOLD-gated on SHIP-007 layer-0 attention, the §46.7(a) follow-up cascade kicked off with three PRs in flight (#1450 + #1451 + #1452). 
## What §47 records | Subsection | Content | |---|---| | 47.1 | 8-step cascade roadmap (this amendment captures steps 1-3) | | 47.2 | What landed in PRs #1450 + #1451 + #1452 | | 47.3 | **Toyota Way correction in detail** — v1.0.0 → v1.1.0 mid-cascade | | 47.4 | Pre-existing parent contract drift (QPostRope/KPostRope unwired) | | 47.5 | Net effects (ship %, coverage tally, pending merges) | | 47.6 | Open follow-ups (5-step ranked priority list) | | 47.7 | Five Whys (why amend at 3 PRs, why split §47/§48, etc.) | | 47.8 | Spec amendment cadence preserved (§41 → §47, 7 amendments) | ## Cascade roadmap | # | PR | What | Discharge status | |---|----|------|-------| | 1 | #1450 | Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 PROPOSED | 5 falsifiers algorithm-bound | | 2 | #1451 | Enum extension: 2 new SaveTensorStage variants | FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL | | 3 | #1452 | Research evidence note | No falsifier flip | | 4 | (next) | forward_traced_with_plan wires 4 sub-stages | FALSIFY-ATTN-SUB-002 + drift fix | | 5 | (next) | apr diff --values recognizes new stages | FALSIFY-ATTN-SUB-003 | | 6 | (next) | HF FP16 oracle script extension | unblocks FALSIFY-ATTN-SUB-004 | | 7 | (next) | Live RTX 4090 bisection | FALSIFY-ATTN-SUB-004 → DISCHARGED | | 8 | (next) | SHIP-007 root-cause fix | unblocks MODEL-1 GPU | §47 captures the first 3 (scaffold). §48+ will capture later steps. ## Toyota Way correction (mid-cascade) v1.0.0 of `trace-attn-sub-stages-v1.yaml` was the day's first defect. It claimed 5 new SaveTensorStage variants needed; live source inspection (per `feedback_no_guessing.md`) showed 3 already existed. v1.1.0 corrected to 2 truly-new variants + added the 9-element `bisection_chain_layer_0` equation. Cost-of-defect paid at the contract layer (cheapest place); no code rolled back. 
## Pre-existing parent contract drift Researching the wire-plan for FALSIFY-ATTN-SUB-002 surfaced a drift in `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL): `QPostRope` + `KPostRope` are in the enum but have NO `emit()` calls in `forward_traced_with_plan`. A user passing `--save-tensor q_post_rope` gets a clean exit with no file written — silent failure. Per `feedback_toyota_way_all_defects.md`: all defects are mine. The next-cycle FALSIFY-ATTN-SUB-002 PR will close this drift as a free side-effect by wiring the 2 missing stages alongside the 2 new ones. ## Net effects - Spec v2.91.0 → **v2.92.0** - Coverage tally: unchanged this cycle (5 new PARTIAL_ALGORITHM_LEVEL slots will increment when PR #1450 lands the YAML) - MODEL-1 ship %: unchanged at 91% (cascade is scaffold; ship % moves at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle) - MODEL-2 ship %: unchanged at 57% ## Five Whys (compressed) 1. Why amend at 3 PRs? §41-§46 cadence is "one amendment per ≥3-PR cycle" 2. Why split §47/§48? Toyota Way correction is worth pinning 3. Why pin parent drift here, not amend the parent contract? Drift fix lands in next-cycle implementation PR; §47 just records it 4. Why no FALSIFY-ATTN-SUB-002 in this cycle? Single-piece flow; stacked PRs slow merge throughput 5. Why no parent-contract bump now? Bump requires wire fix landing first (FUNCTIONAL claim) — cleaner to bump in next-cycle PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
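Two of the proof obligations listed in the `qwen3-moe-forward-gpu-v1` contract above (router weights sum to 1.0 ± 1e-6, output finite) are simple enough to sketch as plain predicates. The code below is an illustration only. The routing scheme in `route_top_k` (top-k selection followed by a softmax over the selected logits) is an assumption about a common MoE design, not something the commit message specifies. All function names are hypothetical.

```rust
/// Top-k routing with a softmax over the selected logits.
/// ASSUMPTION: this renormalised top-k scheme is illustrative; the commit
/// message does not state which routing formula the contract pins.
pub fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    // Sort expert indices by descending logit (stable, NaN-safe ordering).
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k.min(logits.len()));
    if idx.is_empty() {
        return Vec::new();
    }
    // Subtract the largest selected logit for numerical stability.
    let max = logits[idx[0]];
    let exps: Vec<f32> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let total: f32 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / total)).collect()
}

/// Predicate in the spirit of AC_GPU_MOE_002: weights sum to 1.0 within `tol`.
/// The sum is accumulated in f64 so the check itself adds little rounding error.
pub fn weights_sum_to_one(weights: &[f32], tol: f64) -> bool {
    let sum: f64 = weights.iter().map(|&w| w as f64).sum();
    (sum - 1.0).abs() <= tol
}

/// Predicate in the spirit of AC_GPU_MOE_004: no NaN or Inf in the output.
pub fn all_finite(xs: &[f32]) -> bool {
    xs.iter().all(|x| x.is_finite())
}
```

For example, routing the logits `[0.0, 2.0, 2.0, -1.0]` with `k = 2` selects experts 1 and 2 with equal weight, and `weights_sum_to_one` accepts the result at a tolerance of `1e-6`.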


…ith_plan — FALSIFY-ATTN-SUB-002 (#1455) * contract(trace-attn-sub-stages-v1): scaffold layer-0 attention bisection (5 new SaveTensorStage variants) Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED that pre-commits to the schema for extending `SaveTensorStage` with FIVE new intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence can be bisected element-wise against the HF FP16 oracle (PR #1423). ## Why now (per spec §46.7) Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest- leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`: - cos(APR.attn_norm, HF.attn_norm) = 0.99999995 ✓ (correct) - cos(APR.attn_out, HF.attn_out) = 0.9966 ✗ (wrong) The bug is somewhere INSIDE the attention block. The existing `SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` — too coarse to localize. ## What this contract pins 5 new variants, in computation order inside the attention block: | New stage | What it captures | |---|---| | `QPostRope` | Q after RoPE (post Q-projection + RoPE rotate) | | `KPostRope` | K after RoPE (GQA: shared across head groups) | | `AttnScores` | Q·Kᵀ / sqrt(head_dim), pre-softmax | | `AttnSoftmax` | softmax(scores + causal_mask) | | `AttnVOut` | softmax · V (pre output O-projection) | Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut` ## Falsifiers (5) | ID | What it predicts | Status | |---|---|---| | FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL | | FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT | | FALSIFY-ATTN-SUB-005 | Capture is purely additive (token 
output byte-identical) | PARTIAL_ALGORITHM_LEVEL | FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must be falsified to actually pinpoint the SHIP-007 sub-stage. Marked BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on RTX 4090. This contract pins the gate; the implementation cascade follows. ## Five Whys 1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?** The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it. Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) — sub-block contracts are siblings of the parent, not amendments. 2. **Why pin the schema before implementation?** Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity → extend the enum behind a contract." Contract-first preserves the audit chain spec § → contract → implementation PRs → live discharge. 3. **Why these 5 stages and not 3 or 7?** The 5 capture points bracket every numerically distinct intermediate inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope, scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj). Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is premature — let the bisection localize first, then refine if needed. 4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?** PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today. ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle extension; today neither exists. BLOCKER honestly classifies the gap; matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443). 5. **Why is this not just SHIP-007's fix itself?** Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract delivers the *measurement instrument* that pinpoints the sub-stage; the fix is the next PR cascade after that pin lands. 
## Net effects - New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers. - `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0. - MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips). - MODEL-2 ship %: unchanged at 57%. - Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract is new — they count once it''s wired into the §-amendment chain). - Unblocks the next PR cascade: enum extension + forward_traced threading + apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005 algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * contract(trace-attn-sub-stages-v1): v1.0.0 → v1.1.0 — Toyota Way correction (only 2 new variants needed, not 5) ## What's wrong with v1.0.0 v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants were needed for the SHIP-007 layer-0 attention bisection: QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut. Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs` shows THREE of those five ALREADY EXIST in the parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL: - `QPostRope` — already in enum (line 47) - `KPostRope` — already in enum (line 49) - `Attention` — already in enum (line 51), semantically my "AttnVOut" ("post softmax(Q@Kᵀ)@v, pre O-proj") Only TWO are truly missing: - `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax - `AttnSoftmax` — softmax(scores + causal_mask), pre-V ## Why it happened Per `feedback_no_guessing.md`: should have run `pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I extrapolated from the parent contract description without reading the live enum source. Toyota Way andon — caught on next iteration. Per `feedback_toyota_way_all_defects.md`: all defects are mine. 
Fixing at the contract level BEFORE any implementation PR depends on the wrong scope is exactly the cost-of-defect minimization the toolchain is designed for. ## What v1.1.0 does - Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL) - Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax - Documents the FULL 9-stage layer-0 bisection chain spanning parent-contract stages + 2 new ones: attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out - Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope - Adds bisection_chain_layer_0 equation pinning the 9-element cosine sequence (with empirical state per memory `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966) - FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF FP16 oracle extension to capture 2 new stages on RTX 4090) ## Five Whys 1. **Why did v1.0.0 claim 5 new variants?** Authored without reading the live save_tensor_stage.rs source. 2. **Why didn't I read the source first?** Skipped the `pmat query SaveTensorStage` step that `feedback_no_guessing.md` mandates. Worked from the parent contract description's prose ("Embedding, AttnNorm, QkvMatmul, AttnOut, ...") which truncated 18 stages to 14. 3. **Why was the parent contract description truncated?** Doc-comment in `forward_traced_with_plan` rust source listed only 14 stages (the per-layer canonical-FFN order, omitting QkvBias + the parent's renamed Attention). My contract reused that prose instead of reading the enum directly. 4. **Why does this matter for SHIP-007 ship %?** It doesn't yet — the contract is still scaffold scope, no implementation PR has shipped against the wrong scope. v1.1.0 correction lands BEFORE the cascade triggers. 5. **Why amend the contract instead of opening a sibling fix-PR?** Same branch (#1450) is the right place. Toyota Way: stop the line, fix the defect at source, then continue. 
A sibling PR would split the audit story across two commits with no benefit. ## Net effects - Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED** - `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0 - MODEL-1 ship %: unchanged at 91% (this is contract correction) - MODEL-2 ship %: unchanged at 57% - Implementation cascade now correctly scoped to 2 new variants, not 5 — saves an estimated 60% of the enum-extension PR's LOC 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SaveTensorStage gains AttnScores + AttnSoftmax — FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450). Adds the 2 new attention sub-stage variants to `SaveTensorStage`: - `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask - `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply Closes the SHIP-007 layer-0 attention bisection gap inside the Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now: attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out ## What changed | File | Change | |---|---| | `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. 
`layer_output` alias) | | `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **4 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) | | `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 | ## Test results - `cargo test -p aprender-serve --lib inference_trace` — **167 passed, 0 failed** - 4 new tests: `falsify_attn_sub_001_attn_scores_round_trip`, `falsify_attn_sub_001_attn_softmax_round_trip`, `falsify_attn_sub_001_2_new_stages_in_canonical_order`, `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`, `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain` - `cargo check --workspace --lib` — clean ## Falsifier discharge | ID | Status before | Status after | Why | |---|---|---|---| | FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes | | FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (no change yet — depends on `forward_traced_with_plan` threading, follow-up PR) | Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the contract is still PROPOSED upstream. ## Five Whys 1. **Why this PR before #1450 lands?** Contract+impl can land together — #1450 introduces the contract, this PR provides the first implementation evidence. They reference each other and merge in either order without conflict. 2. **Why only the enum + tests, not `forward_traced_with_plan`?** Enum extension is the smallest atomic ticket per Toyota Way (one mechanism per PR). 
Threading the new variants through forward capture is the next PR (FALSIFY-ATTN-SUB-002 discharge). 3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention in `ALL`?** That's the canonical computation order pinned by the contract's ordering proof_obligation: `QkvBias → QPostRope → KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`. 4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`, not a separate variant. The enum has 20 distinct variants; `ALL` excludes the alias only at the `FromStr` layer. 5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain` test?** The contract's `bisection_chain_layer_0` equation pins the 9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004. This test pins the parser side of that gate so a future drift in stage names breaks loudly. ## Net effects - 2 new `SaveTensorStage` variants land - 5 new tests pin the variants + ordering + parser - MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007 bisection cascade; ship % moves when a falsifier flips DISCHARGED) - MODEL-2 ship %: unchanged at 57% - Implementation cascade ready to thread variants through `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): wire 4 attention sub-stages in forward_traced_with_plan — FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL Stacked on #1451 (which adds the 2 new SaveTensorStage variants). When #1451 merges to main, this PR rebases cleanly and lands as a 4-stage wire fix. ## What this PR wires | Stage | Existed in enum? | emit() existed? 
| After this PR | |---|---|---|---| | QPostRope | YES | NO | YES (new emit) | | KPostRope | YES | NO | YES (new emit) | | AttnScores | NEW (#1451) | NO | YES (new emit + accumulator) | | AttnSoftmax | NEW (#1451) | NO | YES (new emit + accumulator) | Closes the parent-contract drift discovered in PR #1452 research evidence: QPostRope + KPostRope were in the SaveTensorStage enum but had no emit() calls in forward_traced_with_plan. The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstated coverage for those 2 stages. This PR closes the drift as a side-effect. ## Implementation details **QPostRope/KPostRope** (post line 133): emit q_all/k_all directly after the inner loop populates them. Tensors already exist; this is just 2 emit() calls — zero new allocation. **AttnScores/AttnSoftmax** (inside head loop): allocate accumulator tensors of shape `[num_heads × seq × seq]` ONLY when the plan requests them. Inside the inner softmax loop, populate per (head, i, j) — zero overhead when plan is None or doesn't ask for these stages (FALSIFY-ATTN-SUB-005: additive purity). Memory cost: BOS forward (seq=1) → num_heads * 1 * 1 * 4 bytes = 112 bytes for Qwen2.5-Coder-7B (28 heads). Negligible. For longer seq, allocation scales O(num_heads * seq^2) and is gated by plan. 
## Test results - `cargo test -p aprender-serve --lib -- --skip "gpu::"` — **13944 passed, 0 failed, 51 ignored** - `cargo check -p aprender-serve --lib` — clean - inference_trace tests: 167/167 PASS - (gpu:: tests have a pre-existing SIGABRT flake unrelated to this change) ## Falsifier discharge map | ID | Status before | Status after | Why | |---|---|---|---| | FALSIFY-ATTN-SUB-002 (forward threading) | PARTIAL_ALGORITHM_LEVEL | (eligible for FUNCTIONAL once contract YAML on main + this lands) | 4 emit() calls now thread the 4 stages in canonical order | | FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (eligible) | accumulator allocation gated by plan.should_save() | ## Five Whys 1. **Why wire 4 stages, not 2?** QPostRope + KPostRope are pre-existing gaps in the parent contract; the same-file fix is a free side-effect per Toyota Way "all defects are mine". 2. **Why allocate accumulators only when requested?** O(num_heads * seq^2) memory shouldn't be paid on the default forward path. Plan-gating keeps the production inference path zero-overhead. 3. **Why insert capture at lines 133, 152, 160 specifically?** Per `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`: line 133 = post Q/K/V copy (Q/K post-rope), line 152 = scores after scale (pre-softmax), line 160 = post-softmax probs. 4. **Why use scores_all.is_some() check vs always-allocate?** Always-allocate forces O(seq^2 * num_heads * 4) bytes per layer regardless of capture. Some(Vec) idiom plus is_some_and check is the idiomatic Rust pattern for conditional capture. 5. **Why this PR stacked on #1451 rather than off main?** Requires SaveTensorStage::AttnScores + AttnSoftmax variants, which only exist on #1451's branch. When #1451 merges, this rebases to main as a clean 51-line delta. 
## Net effects - 4 stages now wired in `forward_traced_with_plan` - MODEL-1 ship %: unchanged at 91% (stays scaffold; ship % moves at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle) - MODEL-2 ship %: unchanged at 57% - Cascade step 4/8 of §47.1 roadmap delivered 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
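The plan-gated accumulator pattern described in the commit above (allocate `[num_heads × seq × seq]` buffers only when the plan asks for them) can be sketched as follows. This is a simplified illustration, not the code from #1455. `Plan`, `Captured`, `accumulator_bytes`, and `attention_probs` are hypothetical names, and the tensor layout and the choice to leave causally masked positions at 0.0 are assumptions made for the sketch. The function returns only the captured buffers and omits the softmax·V product.

```rust
/// Hypothetical capture plan: which attention sub-stages the caller requested.
#[derive(Clone, Copy, Default)]
pub struct Plan {
    pub scores: bool,
    pub softmax: bool,
}

/// Captured buffers; `None` means the stage was not requested.
pub struct Captured {
    pub scores: Option<Vec<f32>>,
    pub softmax: Option<Vec<f32>>,
}

/// Bytes needed for one f32 accumulator of shape [num_heads, seq, seq].
pub fn accumulator_bytes(num_heads: usize, seq: usize) -> usize {
    num_heads * seq * seq * std::mem::size_of::<f32>()
}

/// Causal attention probabilities with optional capture.
/// ASSUMPTION: `q` and `k` are flattened as [num_heads][seq][head_dim];
/// masked positions (j > i) are left at 0.0 in both buffers.
pub fn attention_probs(
    q: &[f32],
    k: &[f32],
    num_heads: usize,
    seq: usize,
    head_dim: usize,
    plan: Option<Plan>,
) -> Captured {
    let n = num_heads * seq * seq;
    // Allocate only when the plan requests the stage: no plan, no allocation.
    let mut scores_all = plan.filter(|p| p.scores).map(|_| vec![0.0f32; n]);
    let mut softmax_all = plan.filter(|p| p.softmax).map(|_| vec![0.0f32; n]);
    let scale = 1.0 / (head_dim as f32).sqrt();

    for h in 0..num_heads {
        for i in 0..seq {
            let base = (h * seq + i) * seq;
            // Scaled dot products for the causal window 0..=i.
            let mut row = vec![0.0f32; i + 1];
            for j in 0..=i {
                let qo = (h * seq + i) * head_dim;
                let ko = (h * seq + j) * head_dim;
                let dot: f32 = (0..head_dim).map(|d| q[qo + d] * k[ko + d]).sum();
                row[j] = dot * scale;
            }
            if let Some(buf) = scores_all.as_mut() {
                buf[base..=base + i].copy_from_slice(&row);
            }
            // Numerically stable softmax over the row, in place.
            let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            let mut sum = 0.0f32;
            for v in row.iter_mut() {
                *v = (*v - max).exp();
                sum += *v;
            }
            for v in row.iter_mut() {
                *v /= sum;
            }
            if let Some(buf) = softmax_all.as_mut() {
                buf[base..=base + i].copy_from_slice(&row);
            }
        }
    }
    Captured { scores: scores_all, softmax: softmax_all }
}
```

`accumulator_bytes(28, 1)` reproduces the 112-byte figure quoted for a BOS forward pass with 28 heads. Passing `None` as the plan leaves both buffers unallocated, which is the property the additive-purity falsifier relies on.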

…tion cascade ALGORITHM-LEVEL COMPLETE (#1458)

After §47 recorded the cascade-started milestone (PRs #1450 + #1451 + #1452 scaffolding), the same-day continuation cycle closed §47.1 cascade roadmap steps 4-6 at the algorithm level via PRs #1455, #1456, #1457.

## What landed (§47.1 cascade roadmap)

| Step | PR | Discharge |
|------|----|-----------|
| 4 | #1455 | FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL: wires `QPostRope`+`KPostRope`+`AttnScores`+`AttnSoftmax` in `forward_traced_with_plan`; closes §47.4 parent-contract drift as side effect |
| 5 | #1456 | FALSIFY-ATTN-SUB-003 algorithm-level pinned via 2 drift-prevention tests; 0 LOC production change (loader is genuinely per-stage-agnostic, as spec predicted) |
| 6 | #1457 | FALSIFY-ATTN-SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL on merge; extends `scripts/generate_qwen25_coder_fp16_stages.py` with `--with-attn-substages` (default ON) installing per-instance `Qwen2Attention.forward` monkeypatch under `attn_implementation="eager"` |

## Toyota Way correction during research (PR #1457)

The pre-impl research note estimated **7 missing stages, ~140 LOC**. Live source inspection during PR #1457 found **3 already captured** via existing forward hooks (`make_qkv_hook` derives qkv_matmul/qkv_bias from q_proj/k_proj/v_proj outputs via bias subtraction; `hook_o_proj_pre` captures `attention` as input to o_proj). Net: **4 stages, ~80 LOC monkeypatch**. Per `feedback_no_guessing.md`. Cost-of-defect paid at the implementation layer (cheapest place once the research note had been authored from outdated docstring lines).

## Steps 7-8 require operator action

| Step | Blocker | Workaround |
|------|---------|-----------|
| 7 LIVE | (a) canonical `apr` binary built pre-#1451, rejects `attn_scores` stage. (b) PyTorch/CUDA driver mismatch on host. | (a) `cargo build --release --features cuda --bin apr`. (b) operator updates driver OR `--device cpu` (multi-min). |
| 8 fix | Gated on step 7 bisection finding. | n/a (discovery-driven scope). |

## Net effects

- Spec v2.92.0 → **v2.93.0**.
- §47.1 cascade roadmap: **6/8 steps algorithm-level COMPLETE**; steps 7-8 LIVE/operator-gated.
- Coverage tally: 20+32 → **20+36** (+4 PARTIAL_ALGORITHM_LEVEL from `trace-attn-sub-stages-v1` v1.1.0 falsifiers landing on main when #1450 merged: SUB-001/002/003/005). SUB-004 stays BLOCKER until #1457 ships.
- **MODEL-1 ship %**: unchanged at **91%** (cascade is scaffold; ship % moves at SUB-004 LIVE DISCHARGE in step 7).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
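The step 7 bisection reduces to computing a cosine similarity per stage along the 9-stage layer-0 chain and reporting the first stage that falls below a tolerance. The sketch below illustrates that idea. The stage names come from the chain quoted in the commits above, but `cosine`, `first_divergent`, and the 0.9999 threshold used in the usage note are assumptions for illustration (the recorded attn_out value of 0.9966 is treated as divergent, so the threshold must sit above it). A linear scan is used rather than a binary search because nothing guarantees the cosines decrease monotonically along the chain.

```rust
/// The 9-stage layer-0 chain, in capture order, as listed in the commits above.
pub const CHAIN: [&str; 9] = [
    "attn_norm",
    "qkv_matmul",
    "qkv_bias",
    "q_post_rope",
    "k_post_rope",
    "attn_scores",
    "attn_softmax",
    "attention",
    "attn_out",
];

/// Cosine similarity accumulated in f64.
/// Returns `None` on length mismatch, empty input, or a zero-norm vector.
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// First stage whose cosine falls below `threshold`, scanning in chain order.
/// Returns `None` when every stage passes.
pub fn first_divergent<'a>(
    stages: &[&'a str],
    cosines: &[f64],
    threshold: f64,
) -> Option<&'a str> {
    stages
        .iter()
        .zip(cosines)
        .find(|pair| *pair.1 < threshold)
        .map(|pair| *pair.0)
}
```

With a hypothetical cosine sequence that stays near 1.0 through `k_post_rope` and drops at `attn_scores`, `first_divergent(&CHAIN, &cosines, 0.9999)` returns `Some("attn_scores")`. Only the two endpoint values (0.99999995 and 0.9966) are recorded in the source; any intermediate values are placeholders until the live run produces them.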

noahgift and others added 2 commits May 4, 2026 03:51

Merge branch 'main' into evidence/ship-007-layer0-attn-research

a9fccb0

noahgift mentioned this pull request May 4, 2026

spec(ship-two-models): v2.92.0 — §47 SHIP-007 layer-0 attention bisection cascade STARTED #1454

Merged

1 task

noahgift enabled auto-merge (squash) May 4, 2026 02:27

Merge branch 'main' into evidence/ship-007-layer0-attn-research

bdc3755

noahgift mentioned this pull request May 4, 2026

feat(aprender-serve): wire 4 attention sub-stages in forward_traced_with_plan — FALSIFY-ATTN-SUB-002 #1455

Merged

4 tasks

Merge branch 'main' into evidence/ship-007-layer0-attn-research

7cb4bc6

Merge branch 'main' into evidence/ship-007-layer0-attn-research

397d5cd

Merge branch 'main' into evidence/ship-007-layer0-attn-research

5f15e86

noahgift mentioned this pull request May 4, 2026

test(apr-cli): FALSIFY-ATTN-SUB-003 — apr diff --values per-stage-agnostic for attn_scores + attn_softmax #1456

Merged

4 tasks

Merge branch 'main' into evidence/ship-007-layer0-attn-research

d00d8a7

noahgift mentioned this pull request May 4, 2026

feat(scripts): HF FP16 oracle extension — capture 4 attention sub-stages (q/k_post_rope, attn_scores, attn_softmax) #1457

Merged

6 tasks

noahgift added 2 commits May 4, 2026 07:06

Merge branch 'main' into evidence/ship-007-layer0-attn-research

9d48ee0

Merge branch 'main' into evidence/ship-007-layer0-attn-research

f72e003

noahgift merged commit 9df2dfb into main May 4, 2026
10 checks passed

noahgift deleted the evidence/ship-007-layer0-attn-research branch May 4, 2026 05:54

noahgift merged 9 commits into main from evidence/ship-007-layer0-attn-research
