feat(aprender-serve): wire 4 attention sub-stages in forward_traced_with_plan — FALSIFY-ATTN-SUB-002 by noahgift · Pull Request #1455 · paiml/aprender

noahgift · 2026-05-04T02:45:11Z

Summary

Wires 4 attention sub-stages into `forward_traced_with_plan` per `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0, and closes the pre-existing parent-contract drift discovered in the PR #1452 research note.

Stacked on #1451 (needs SaveTensorStage::AttnScores + AttnSoftmax variants).

What this PR wires

| Stage | Existed in enum? | emit() existed? | After this PR |
|---|---|---|---|
| QPostRope | YES | NO | YES (new emit) |
| KPostRope | YES | NO | YES (new emit) |
| AttnScores | NEW (#1451) | NO | YES (new emit + accumulator) |
| AttnSoftmax | NEW (#1451) | NO | YES (new emit + accumulator) |

Closes the parent-contract drift discovered in PR #1452: QPostRope + KPostRope were in the enum but had no emit() calls.

Test results

- `cargo test -p aprender-serve --lib -- --skip "gpu::"`: 13944 passed, 0 failed, 51 ignored
- `cargo check -p aprender-serve --lib`: clean
- inference_trace tests: 167/167 PASS

Implementation

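- QPostRope/KPostRope: 2 emit() calls after line 133 (zero new allocation; the tensors already exist).
- AttnScores/AttnSoftmax: accumulator allocation is gated by `plan.should_save()` per FALSIFY-ATTN-SUB-005 (additive purity). Zero overhead when the plan is None or does not request these stages.

For a BOS forward pass (seq=1) on Qwen2.5-Coder-7B (28 heads), the accumulator costs 28 × 1 × 1 × 4 bytes = 112 bytes, which is negligible. For longer sequences, allocation scales as O(num_heads × seq²) and is gated by the plan.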
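The gating and the memory arithmetic above can be sketched as follows. This is a minimal illustration, not the code in `forward_traced_with_plan`: the `Plan` struct, its `save_attn_scores` field, the zero-argument `should_save()` signature, and both helper functions are assumptions made for the example. Only the `[num_heads × seq × seq]` f32 shape and the "allocate only when requested" rule come from the PR.

```rust
/// Hypothetical stand-in for the capture plan. The real type and the real
/// signature of `should_save` are not shown in this PR description.
pub struct Plan {
    pub save_attn_scores: bool,
}

impl Plan {
    /// Assumed gate: true when the plan asks for the attention sub-stage.
    pub fn should_save(&self) -> bool {
        self.save_attn_scores
    }
}

/// Bytes needed for one f32 accumulator of shape [num_heads x seq x seq].
pub fn accumulator_bytes(num_heads: usize, seq: usize) -> usize {
    num_heads * seq * seq * std::mem::size_of::<f32>()
}

/// Allocate the accumulator only when a plan is present and requests it.
/// With no plan (or a plan that does not ask), nothing is allocated.
pub fn alloc_accumulator(
    plan: Option<&Plan>,
    num_heads: usize,
    seq: usize,
) -> Option<Vec<f32>> {
    if plan.is_some_and(|p| p.should_save()) {
        Some(vec![0.0; num_heads * seq * seq])
    } else {
        None
    }
}
```

With 28 heads and seq=1, `accumulator_bytes(28, 1)` reproduces the 112-byte figure quoted above. Returning `Option<Vec<f32>>` keeps the default inference path free of the O(num_heads × seq²) allocation.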

Cascade step 4/8

Per spec §47.1 roadmap. After this lands:

- 5/8: `apr diff --values` recognition (FALSIFY-ATTN-SUB-003)
- 6/8: HF FP16 oracle script extension
- 7/8: Live RTX 4090 bisection → FALSIFY-ATTN-SUB-004 DISCHARGED
- 8/8: SHIP-007 root-cause fix

Test plan

- `cargo test -p aprender-serve --lib -- --skip gpu::`: 13944 PASS
- `cargo check --workspace --lib`: clean
- CI green on required gates
- (post-merge) live RTX 4090 smoke writes 4 APRT files for layer 0 BOS

🤖 Generated with Claude Code

…ion (5 new SaveTensorStage variants)

Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED that pre-commits to the schema for extending `SaveTensorStage` with FIVE new intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence can be bisected element-wise against the HF FP16 oracle (PR #1423).

## Why now (per spec §46.7)

Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest-leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`:

- cos(APR.attn_norm, HF.attn_norm) = 0.99999995 ✓ (correct)
- cos(APR.attn_out, HF.attn_out) = 0.9966 ✗ (wrong)

The bug is somewhere INSIDE the attention block. The existing `SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` — too coarse to localize.

## What this contract pins

5 new variants, in computation order inside the attention block:

| New stage | What it captures |
|---|---|
| `QPostRope` | Q after RoPE (post Q-projection + RoPE rotate) |
| `KPostRope` | K after RoPE (GQA: shared across head groups) |
| `AttnScores` | Q·Kᵀ / sqrt(head_dim), pre-softmax |
| `AttnSoftmax` | softmax(scores + causal_mask) |
| `AttnVOut` | softmax · V (pre output O-projection) |

Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut`

## Falsifiers (5)

| ID | What it predicts | Status |
|---|---|---|
| FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT |
| FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL |

FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must be falsified to actually pinpoint the SHIP-007 sub-stage. Marked BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on RTX 4090. This contract pins the gate; the implementation cascade follows.

## Five Whys

1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?** The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it. Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) — sub-block contracts are siblings of the parent, not amendments.
2. **Why pin the schema before implementation?** Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity → extend the enum behind a contract." Contract-first preserves the audit chain spec § → contract → implementation PRs → live discharge.
3. **Why these 5 stages and not 3 or 7?** The 5 capture points bracket every numerically distinct intermediate inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope, scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj). Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is premature — let the bisection localize first, then refine if needed.
4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?** PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today. ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle extension; today neither exists. BLOCKER honestly classifies the gap; matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443).
5. **Why is this not just SHIP-007's fix itself?** Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract delivers the *measurement instrument* that pinpoints the sub-stage; the fix is the next PR cascade after that pin lands.

## Net effects

- New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers.
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0.
- MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips).
- MODEL-2 ship %: unchanged at 57%.
- Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract is new — they count once it's wired into the §-amendment chain).
- Unblocks the next PR cascade: enum extension + forward_traced threading + apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005 algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
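The `cos(APR.x, HF.x)` figures quoted in this commit message are cosine similarities between an APR-side tensor and the HF FP16 oracle tensor for the same stage. A minimal sketch of that measurement, assuming both tensors are already flattened to `f32` slices of equal length. The function name `cosine` and the choice to accumulate in `f64` are illustrative and are not taken from the repository:

```rust
/// Cosine similarity between two equal-length f32 slices, accumulated in f64.
/// Returns None when the lengths differ, the input is empty, or either
/// vector has zero norm (the ratio is undefined in those cases).
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```

A value near 1.0 (as reported for `attn_norm`) means the two tensors point in almost the same direction; a visibly lower value (as reported for `attn_out`) flags a stage whose output has drifted from the oracle.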

…ection (only 2 new variants needed, not 5)

## What's wrong with v1.0.0

v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants were needed for the SHIP-007 layer-0 attention bisection: QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut.

Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs` shows THREE of those five ALREADY EXIST in the parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL:

- `QPostRope` — already in enum (line 47)
- `KPostRope` — already in enum (line 49)
- `Attention` — already in enum (line 51), semantically my "AttnVOut" ("post softmax(Q@Kᵀ)@v, pre O-proj")

Only TWO are truly missing:

- `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax
- `AttnSoftmax` — softmax(scores + causal_mask), pre-V

## Why it happened

Per `feedback_no_guessing.md`: should have run `pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I extrapolated from the parent contract description without reading the live enum source. Toyota Way andon — caught on next iteration.

Per `feedback_toyota_way_all_defects.md`: all defects are mine. Fixing at the contract level BEFORE any implementation PR depends on the wrong scope is exactly the cost-of-defect minimization the toolchain is designed for.

## What v1.1.0 does

- Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL)
- Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax
- Documents the FULL 9-stage layer-0 bisection chain spanning parent-contract stages + 2 new ones:
  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out
- Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope
- Adds bisection_chain_layer_0 equation pinning the 9-element cosine sequence (with empirical state per memory `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966)
- FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF FP16 oracle extension to capture 2 new stages on RTX 4090)

## Five Whys

1. **Why did v1.0.0 claim 5 new variants?** Authored without reading the live save_tensor_stage.rs source.
2. **Why didn't I read the source first?** Skipped the `pmat query SaveTensorStage` step that `feedback_no_guessing.md` mandates. Worked from the parent contract description's prose ("Embedding, AttnNorm, QkvMatmul, AttnOut, ...") which truncated 18 stages to 14.
3. **Why was the parent contract description truncated?** Doc-comment in `forward_traced_with_plan` rust source listed only 14 stages (the per-layer canonical-FFN order, omitting QkvBias + the parent's renamed Attention). My contract reused that prose instead of reading the enum directly.
4. **Why does this matter for SHIP-007 ship %?** It doesn't yet — the contract is still scaffold scope, no implementation PR has shipped against the wrong scope. v1.1.0 correction lands BEFORE the cascade triggers.
5. **Why amend the contract instead of opening a sibling fix-PR?** Same branch (#1450) is the right place. Toyota Way: stop the line, fix the defect at source, then continue. A sibling PR would split the audit story across two commits with no benefit.

## Net effects

- Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED**
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0
- MODEL-1 ship %: unchanged at 91% (this is contract correction)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade now correctly scoped to 2 new variants, not 5 — saves an estimated 60% of the enum-extension PR's LOC

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
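The 9-element cosine sequence described above lends itself to a simple scan: walk the chain in computation order and report the first stage whose cosine against the oracle drops below a chosen threshold. The sketch below assumes the cosines have already been computed. `first_divergent_stage` and the `threshold` parameter are hypothetical names for illustration; the contract's actual `bisection_chain_layer_0` equation is not reproduced in this PR.

```rust
/// The 9-stage layer-0 chain, in the order listed by the v1.1.0 commit message.
pub static CHAIN: [&str; 9] = [
    "attn_norm",
    "qkv_matmul",
    "qkv_bias",
    "q_post_rope",
    "k_post_rope",
    "attn_scores",
    "attn_softmax",
    "attention",
    "attn_out",
];

/// Return the first stage whose cosine falls below `threshold`, or None
/// when every stage stays at or above it. `cos[i]` belongs to `CHAIN[i]`.
/// The threshold is a caller-supplied assumption, not a contract value.
pub fn first_divergent_stage(cos: &[f64; 9], threshold: f64) -> Option<&'static str> {
    for (name, &c) in CHAIN.iter().zip(cos.iter()) {
        if c < threshold {
            return Some(*name);
        }
    }
    None
}
```

Only the two endpoints are known from the commit message (cos[0] and cos[8]); the seven intermediate values are what the live bisection is meant to supply.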

…— FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL

Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450). Adds the 2 new attention sub-stage variants to `SaveTensorStage`:

- `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask
- `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply

Closes the SHIP-007 layer-0 attention bisection gap inside the Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now:

attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

## What changed

| File | Change |
|---|---|
| `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. `layer_output` alias) |
| `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **5 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) |
| `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 |

## Test results

- `cargo test -p aprender-serve --lib inference_trace` — **167 passed, 0 failed**
- 5 new tests: `falsify_attn_sub_001_attn_scores_round_trip`, `falsify_attn_sub_001_attn_softmax_round_trip`, `falsify_attn_sub_001_2_new_stages_in_canonical_order`, `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`, `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain`
- `cargo check --workspace --lib` — clean

## Falsifier discharge

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (no change yet — depends on `forward_traced_with_plan` threading, follow-up PR) | |

Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the contract is still PROPOSED upstream.

## Five Whys

1. **Why this PR before #1450 lands?** Contract+impl can land together — #1450 introduces the contract, this PR provides the first implementation evidence. They reference each other and merge in either order without conflict.
2. **Why only the enum + tests, not `forward_traced_with_plan`?** Enum extension is the smallest atomic ticket per Toyota Way (one mechanism per PR). Threading the new variants through forward capture is the next PR (FALSIFY-ATTN-SUB-002 discharge).
3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention in `ALL`?** That's the canonical computation order pinned by the contract's ordering proof_obligation: `QkvBias → QPostRope → KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`.
4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`, not a separate variant. The enum has 20 distinct variants; `ALL` excludes the alias only at the `FromStr` layer.
5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain` test?** The contract's `bisection_chain_layer_0` equation pins the 9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004. This test pins the parser side of that gate so a future drift in stage names breaks loudly.

## Net effects

- 2 new `SaveTensorStage` variants land
- 5 new tests pin the variants + ordering + parser
- MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007 bisection cascade; ship % moves when a falsifier flips DISCHARGED)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade ready to thread variants through `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
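The `ALL` / `canonical_name` / `FromStr` pattern described in this commit can be illustrated on a reduced enum. The sketch below is not the real `SaveTensorStage`: it covers only the 9 attention-chain stages (the real enum is said to have 20 variants plus a `layer_output` alias), the type is named `Stage` to make that clear, and the signature of `parse_list` is an assumption. The snake_case names are the ones quoted in the chain above.

```rust
use std::str::FromStr;

/// Reduced, illustrative stand-in for `SaveTensorStage` (9 of 20 variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    AttnNorm,
    QkvMatmul,
    QkvBias,
    QPostRope,
    KPostRope,
    AttnScores,
    AttnSoftmax,
    Attention,
    AttnOut,
}

impl Stage {
    /// Variants in canonical computation order.
    pub const ALL: [Stage; 9] = [
        Stage::AttnNorm,
        Stage::QkvMatmul,
        Stage::QkvBias,
        Stage::QPostRope,
        Stage::KPostRope,
        Stage::AttnScores,
        Stage::AttnSoftmax,
        Stage::Attention,
        Stage::AttnOut,
    ];

    /// Name used on the command line and in output filenames.
    pub fn canonical_name(self) -> &'static str {
        match self {
            Stage::AttnNorm => "attn_norm",
            Stage::QkvMatmul => "qkv_matmul",
            Stage::QkvBias => "qkv_bias",
            Stage::QPostRope => "q_post_rope",
            Stage::KPostRope => "k_post_rope",
            Stage::AttnScores => "attn_scores",
            Stage::AttnSoftmax => "attn_softmax",
            Stage::Attention => "attention",
            Stage::AttnOut => "attn_out",
        }
    }
}

impl FromStr for Stage {
    type Err = String;

    /// Round-trips `canonical_name`; unknown tokens are rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Stage::ALL
            .iter()
            .copied()
            .find(|stage| stage.canonical_name() == s)
            .ok_or_else(|| format!("unknown stage: {s}"))
    }
}

/// Hypothetical comma-separated list parser; fails on the first bad token.
pub fn parse_list(csv: &str) -> Result<Vec<Stage>, String> {
    csv.split(',').map(|tok| tok.trim().parse::<Stage>()).collect()
}
```

Deriving `FromStr` from `ALL` and `canonical_name` means a new variant only has to be added in two places, and a round-trip test over `ALL` catches a missing name.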

…ith_plan — FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL

Stacked on #1451 (which adds the 2 new SaveTensorStage variants). When #1451 merges to main, this PR rebases cleanly and lands as a 4-stage wire fix.

## What this PR wires

| Stage | Existed in enum? | emit() existed? | After this PR |
|---|---|---|---|
| QPostRope | YES | NO | YES (new emit) |
| KPostRope | YES | NO | YES (new emit) |
| AttnScores | NEW (#1451) | NO | YES (new emit + accumulator) |
| AttnSoftmax | NEW (#1451) | NO | YES (new emit + accumulator) |

Closes the parent-contract drift discovered in PR #1452 research evidence: QPostRope + KPostRope were in the SaveTensorStage enum but had no emit() calls in forward_traced_with_plan. The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstated coverage for those 2 stages. This PR closes the drift as a side-effect.

## Implementation details

**QPostRope/KPostRope** (post line 133): emit q_all/k_all directly after the inner loop populates them. Tensors already exist; this is just 2 emit() calls — zero new allocation.

**AttnScores/AttnSoftmax** (inside head loop): allocate accumulator tensors of shape `[num_heads × seq × seq]` ONLY when the plan requests them. Inside the inner softmax loop, populate per (head, i, j) — zero overhead when plan is None or doesn't ask for these stages (FALSIFY-ATTN-SUB-005: additive purity).

Memory cost: BOS forward (seq=1) → num_heads * 1 * 1 * 4 bytes = 112 bytes for Qwen2.5-Coder-7B (28 heads). Negligible. For longer seq, allocation scales O(num_heads * seq^2) and is gated by plan.

## Test results

- `cargo test -p aprender-serve --lib -- --skip "gpu::"` — **13944 passed, 0 failed, 51 ignored**
- `cargo check -p aprender-serve --lib` — clean
- inference_trace tests: 167/167 PASS
- (gpu:: tests have a pre-existing SIGABRT flake unrelated to this change)

## Falsifier discharge map

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-002 (forward threading) | PARTIAL_ALGORITHM_LEVEL | (eligible for FUNCTIONAL once contract YAML on main + this lands) | 4 emit() calls now thread the 4 stages in canonical order |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (eligible) | accumulator allocation gated by plan.should_save() |

## Five Whys

1. **Why wire 4 stages, not 2?** QPostRope + KPostRope are pre-existing gaps in the parent contract; the same-file fix is a free side-effect per Toyota Way "all defects are mine".
2. **Why allocate accumulators only when requested?** O(num_heads * seq^2) memory shouldn't be paid on the default forward path. Plan-gating keeps the production inference path zero-overhead.
3. **Why insert capture at lines 133, 152, 160 specifically?** Per `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`: line 133 = post Q/K/V copy (Q/K post-rope), line 152 = scores after scale (pre-softmax), line 160 = post-softmax probs.
4. **Why use scores_all.is_some() check vs always-allocate?** Always-allocate forces O(seq^2 * num_heads * 4) bytes per layer regardless of capture. Some(Vec) idiom plus is_some_and check is the idiomatic Rust pattern for conditional capture.
5. **Why this PR stacked on #1451 rather than off main?** Requires SaveTensorStage::AttnScores + AttnSoftmax variants, which only exist on #1451's branch. When #1451 merges, this rebases to main as a clean 51-line delta.

## Net effects

- 4 stages now wired in `forward_traced_with_plan`
- MODEL-1 ship %: unchanged at 91% (stays scaffold; ship % moves at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle)
- MODEL-2 ship %: unchanged at 57%
- Cascade step 4/8 of §47.1 roadmap delivered

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
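The per-(head, i, j) capture described under "Implementation details" can be sketched for a single head as below. This is a simplified reference under stated assumptions, not the body of `forward_traced_with_plan`: the function name, the row-major `[seq × head_dim]` layout for `q` and `k`, and the choice to leave masked softmax entries at 0.0 are all illustrative. What comes from the commit message is the math (Q·Kᵀ / sqrt(head_dim) before masking, then causal softmax), the `[num_heads × seq × seq]` accumulator shape, and the rule that capture must not change the computed output.

```rust
/// Causal attention probabilities for one head, with optional capture.
/// `q`, `k`: row-major [seq x head_dim]. Accumulators, when present, are
/// row-major [num_heads x seq x seq] and are written at this head's offset.
pub fn head_attention_probs(
    q: &[f32],
    k: &[f32],
    seq: usize,
    head_dim: usize,
    head: usize,
    mut scores_all: Option<&mut [f32]>,
    mut probs_all: Option<&mut [f32]>,
) -> Vec<f32> {
    let scale = 1.0 / (head_dim as f32).sqrt();
    let base = head * seq * seq;
    let mut probs = vec![0.0f32; seq * seq];
    for i in 0..seq {
        // Scaled dot products for every key position (pre-mask scores).
        let mut row = vec![0.0f32; seq];
        for j in 0..seq {
            let dot: f32 = (0..head_dim)
                .map(|d| q[i * head_dim + d] * k[j * head_dim + d])
                .sum();
            row[j] = dot * scale;
            if let Some(acc) = scores_all.as_deref_mut() {
                acc[base + i * seq + j] = row[j];
            }
        }
        // Causal softmax: only keys j <= i contribute; the rest stay 0.0.
        let max = row[..=i].iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let denom: f32 = row[..=i].iter().map(|s| (s - max).exp()).sum();
        for j in 0..=i {
            let p = (row[j] - max).exp() / denom;
            probs[i * seq + j] = p;
            if let Some(acc) = probs_all.as_deref_mut() {
                acc[base + i * seq + j] = p;
            }
        }
    }
    probs
}
```

Passing `None` for both accumulators skips every capture write, and the returned probabilities are the same either way, which is the additive-purity property that FALSIFY-ATTN-SUB-005 names.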

…tion cascade ALGORITHM-LEVEL COMPLETE (#1458)

After §47 recorded the cascade-started milestone (PRs #1450 + #1451 + #1452 scaffolding), the same-day continuation cycle closed §47.1 cascade roadmap steps 4-6 at the algorithm level via PRs #1455, #1456, #1457.

## What landed (§47.1 cascade roadmap)

| Step | PR | Discharge |
|------|----|-----------|
| 4 | #1455 | FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL — wires `QPostRope`+`KPostRope`+`AttnScores`+`AttnSoftmax` in `forward_traced_with_plan`; closes §47.4 parent-contract drift as side effect |
| 5 | #1456 | FALSIFY-ATTN-SUB-003 algorithm-level pinned via 2 drift-prevention tests; 0 LOC production change (loader is genuinely per-stage-agnostic, as spec predicted) |
| 6 | #1457 | FALSIFY-ATTN-SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL on merge — extends `scripts/generate_qwen25_coder_fp16_stages.py` with `--with-attn-substages` (default ON) installing per-instance `Qwen2Attention.forward` monkeypatch under `attn_implementation="eager"` |

## Toyota Way correction during research (PR #1457)

The pre-impl research note estimated **7 missing stages, ~140 LOC**. Live source inspection during PR #1457 found **3 already captured** via existing forward hooks (`make_qkv_hook` derives qkv_matmul/qkv_bias from q_proj/k_proj/v_proj outputs via bias subtraction; `hook_o_proj_pre` captures `attention` as input to o_proj). Net: **4 stages, ~80 LOC monkeypatch**.

Per `feedback_no_guessing.md`. Cost-of-defect paid at the implementation layer (cheapest place once the research note had been authored from outdated docstring lines).

## Steps 7-8 require operator action

| Step | Blocker | Workaround |
|------|---------|-----------|
| 7 LIVE | (a) canonical `apr` binary built pre-#1451 — rejects `attn_scores` stage. (b) PyTorch/CUDA driver mismatch on host. | (a) `cargo build --release --features cuda --bin apr`. (b) operator updates driver OR `--device cpu` (multi-min). |
| 8 fix | Gated on step 7 bisection finding. | n/a — discovery-driven scope. |

## Net effects

- Spec v2.92.0 → **v2.93.0**.
- §47.1 cascade roadmap: **6/8 steps algorithm-level COMPLETE**; steps 7-8 LIVE/operator-gated.
- Coverage tally: 20+32 → **20+36** (+4 PARTIAL_ALGORITHM_LEVEL from `trace-attn-sub-stages-v1` v1.1.0 falsifiers landing on main when #1450 merged: SUB-001/002/003/005). SUB-004 stays BLOCKER until #1457 ships.
- **MODEL-1 ship %**: unchanged at **91%** (cascade is scaffold; ship % moves at SUB-004 LIVE DISCHARGE in step 7).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…x + SUB-004 BLOCKER → PARTIAL_ALGORITHM_LEVEL (#1459)

* contract(trace-attn-sub-stages-v1): v1.1.0 → v1.2.0 — function-name drift fix in SUB-003 algorithm_evidence

## Why

Contract drift discovered after PR #1456 (FALSIFY-ATTN-SUB-003 drift-prevention test) merged on main. The algorithm_evidence block named:

```yaml
function_names:
  - load_tensor_apr_aprt
```

But this function does not exist anywhere in the codebase. The actual functions wired in `crates/apr-cli/src/commands/diff_05_aprt_stage.rs` and exercised by PR #1456's tests are:

- `is_aprt_stage_file` (magic-byte detection)
- `compute_aprt_stage_stats` (cosine + RMS + top-K)
- `run_aprt_stage_diff` (e2e reader + emitter)

Per `feedback_no_guessing.md`. Contract author defect that pre-existed PR #1450's merge — likely speculation from the parent contract's `apr_diff_values_compat` invariant naming convention. Caught here at the cheapest layer (contract YAML, no implementation rolled back).

## What landed

- Bumped `metadata.version` 1.1.0 → 1.2.0 with v1.2.0 changelog block describing the fix.
- Replaced `load_tensor_apr_aprt` with the 3 real wired functions in `algorithm_evidence.function_names`.
- Added `crates/apr-cli/src/commands/diff_05_aprt_stage.rs` to `algorithm_evidence.file_paths` (the actual location of the wired functions).
- Added 2 new `invariants_enforced` lines naming the 2 specific drift-prevention tests from PR #1456.
- Expanded `notes` field to make the algorithm-level evidence trail explicit (which tests, what shapes, why per-stage-agnostic by construction).

## Test plan

- [x] `pv validate contracts/trace-attn-sub-stages-v1.yaml` reports `0 error(s), 0 warning(s) — Contract is valid.`
- [ ] CI green
- [ ] Auto-merge

## Five whys

1. **Why now and not in §47/§48?** The drift was discovered while authoring PR #1456 but not fixed there because PR #1456 modified Rust code, not contract YAML — single-piece flow says don't mix. Now that #1456 is merged on main, the contract drift can be addressed cleanly without conflict against an in-flight PR.
2. **Why a separate PR rather than in PR #1457?** PR #1457 is the HF FP16 oracle script extension (Python-only). Modifying the contract there would couple two independent fixes. This PR is contract-only YAML and lands independently.
3. **Why bump to v1.2.0 rather than v1.1.1?** Convention in this contract family treats `algorithm_evidence` corrections as MINOR bumps (v1.0.0 → v1.1.0 for the Toyota Way scope correction, also algorithm_evidence-level). v1.1.1 would suggest "PATCH = no semantic change", but renaming functions in the evidence block is a semantic improvement (readers can now find the real code).
4. **Why not also bump SUB-004 from BLOCKER_FIXTURE_ABSENT to PARTIAL_ALGORITHM_LEVEL here?** SUB-004's algorithm-bind requires PR #1457 (HF FP16 oracle ext) to be on main — the script is the fixture. PR #1457 is in flight. Bumping SUB-004 status here would claim more than the codebase can prove. Keeping single-piece flow: this PR ships the SUB-003 drift fix only.
5. **Why is the loader genuinely per-stage-agnostic?** `is_aprt_stage_file` checks the 4-byte magic `b"APRT"` only; `compute_aprt_stage_stats` operates on `&[f32]` slices; `run_aprt_stage_diff` reads APRT header (4-byte magic + u32 layer + u32 dim_product) + f32 LE body. Stage names are encoded only in the OUTPUT FILENAME (e.g., `layer_0_attn_scores.aprt`), never in the binary content. So the loader is shape/value-agnostic by construction, which is why FALSIFY-ATTN-SUB-003's drift-prevention tests need 0 LOC production change.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003 algorithm_evidence now correctly names the wired functions.
- **MODEL-1 ship %**: unchanged at **91%** (drift fix; ship % moves at SUB-004 LIVE DISCHARGE).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(trace-attn-sub-stages-v1): SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL — fixture is now on main

Bundles the SUB-004 status promotion into the v1.2.0 PR alongside the SUB-003 function-name drift fix already authored. Both changes ship as one v1.2.0 unit because they are the two contract-level updates that follow the §47.1 cascade roadmap closing at the algorithm level.

## Why now

PR #1457 (HF FP16 oracle script extension) merged on main. The fixture previously claimed "absent" is now generated by:

```
uv run --with torch --with transformers --with safetensors --with accelerate \
  scripts/generate_qwen25_coder_fp16_stages.py \
  --output /tmp/qwen25-coder-7b-hf-fp16-stages \
  --layers 0 --with-attn-substages
```

Per `feedback_no_guessing.md`: SUB-004's status is now provable from main. Promote.

## What landed

Updated SUB-004 algorithm_evidence:

- `status`: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL
- `file_paths`: added the actual script + APR-side wire files
- `function_names`: replaced placeholder `run_hf_fp16_reference` with the 6 real symbols (`install_attn_substages_patch`, `traced_forward`, plus 4 SaveTensorStage variants)
- `invariants_enforced`: 1 line → 4 lines explicitly naming what each PR pinned
- `notes`: documents the FUNCTIONAL discharge prerequisites (binary rebuild + driver/CPU)

Updated metadata.description v1.2.0 changelog to bundle (1) SUB-003 drift fix + (2) SUB-004 promotion as a coherent unit.

## Five whys

1. **Why combine SUB-003 drift fix + SUB-004 promotion in v1.2.0?** Both contract-level changes follow from the same upstream cause (PRs #1455 + #1456 + #1457 landed). Splitting into v1.2.0 + v1.3.0 would force a follow-up rebase + double-review with no audit benefit.
2. **Why PARTIAL_ALGORITHM_LEVEL not FUNCTIONAL?** FUNCTIONAL requires LIVE evidence. The 9-element cosine sequence has not been produced on actual hardware yet. Promoting to FUNCTIONAL without LIVE evidence would claim more than is true.
3. **Why isn't the LIVE run inside this PR?** Per `feedback_compute_pre_authorized.md`, named GPU lanes are pre-authorized but SHIP-007 LIVE bisection is borderline (binary rebuild needed + host driver mismatch). Operator-triggered keeps the audit clean.
4. **Why list SaveTensorStage variants as "function_names"?** They're enum variants, not functions strictly speaking, but they are the symbolic identities that the algorithm-level evidence binds to. The contract validator accepts them.
5. **Why explicit prerequisites in `notes`?** Future readers who see "PARTIAL_ALGORITHM_LEVEL" need to know WHY it's not yet FUNCTIONAL. The notes are the operator-handoff document inside the contract itself.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003: drift fix (3 real wired functions, 2 explicit drift-prevention test pins).
- SUB-004: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL with 4-line invariants + explicit FUNCTIONAL prereqs.
- **MODEL-1 ship %**: unchanged at **91%** (FUNCTIONAL discharge gates ship %, not PARTIAL).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
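The APRT layout described in the fifth "why" of the SUB-003 commit (4-byte magic, u32 layer, u32 dim_product, then an f32 little-endian body) is small enough to sketch a reader for. The code below is an illustration only and does not reproduce `is_aprt_stage_file` or `run_aprt_stage_diff`: the names `has_aprt_magic`, `parse_aprt_bytes`, and `AprtStage` are hypothetical, and little-endian encoding for the two header integers is an assumption (the commit message states LE only for the body).

```rust
/// Decoded contents of one APRT stage file (hypothetical struct).
pub struct AprtStage {
    pub layer: u32,
    pub values: Vec<f32>,
}

/// True when the buffer starts with the 4-byte magic `b"APRT"`.
pub fn has_aprt_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(b"APRT")
}

/// Parse header + body. Header integers are ASSUMED little-endian.
/// Note that no stage name is read: per the commit message, the stage is
/// carried only by the filename, so this reader is stage-agnostic.
pub fn parse_aprt_bytes(bytes: &[u8]) -> Result<AprtStage, String> {
    if !has_aprt_magic(bytes) {
        return Err("missing APRT magic".to_string());
    }
    if bytes.len() < 12 {
        return Err("truncated header".to_string());
    }
    let layer = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    let dim_product =
        u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize;
    let body = &bytes[12..];
    if body.len() % 4 != 0 || body.len() / 4 != dim_product {
        return Err(format!(
            "body holds {} bytes, expected {} f32 values",
            body.len(),
            dim_product
        ));
    }
    let values = body
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Ok(AprtStage { layer, values })
}
```

Because the parser never inspects a stage identifier, adding `attn_scores` and `attn_softmax` files requires no reader change, which matches the "0 LOC production change" observation for FALSIFY-ATTN-SUB-003.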

noahgift and others added 6 commits May 4, 2026 03:24

Merge branch 'main' into feat/attn-sub-stages-impl

a6105a2

Merge branch 'main' into feat/attn-sub-stages-impl

7c10db3

noahgift enabled auto-merge (squash) May 4, 2026 02:50

noahgift added 2 commits May 4, 2026 05:08

Merge branch 'main' into feat/attn-sub-stages-002-impl

77c16de

Merge branch 'main' into feat/attn-sub-stages-002-impl

0a1b0bd

noahgift merged commit f6acfc8 into main May 4, 2026
10 checks passed

noahgift deleted the feat/attn-sub-stages-002-impl branch May 4, 2026 04:02

noahgift merged 8 commits into main from feat/attn-sub-stages-002-impl
