feat(aprender-serve): wire 4 attention sub-stages in forward_traced_with_plan — FALSIFY-ATTN-SUB-002#1455
Merged
…ion (5 new SaveTensorStage variants)

Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED that pre-commits to the schema for extending `SaveTensorStage` with FIVE new intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence can be bisected element-wise against the HF FP16 oracle (PR #1423).

## Why now (per spec §46.7)

Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest-leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`:

- cos(APR.attn_norm, HF.attn_norm) = 0.99999995 ✓ (correct)
- cos(APR.attn_out, HF.attn_out) = 0.9966 ✗ (wrong)

The bug is somewhere INSIDE the attention block. The existing `SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut`, which is too coarse to localize.

## What this contract pins

5 new variants, in computation order inside the attention block:

| New stage | What it captures |
|---|---|
| `QPostRope` | Q after RoPE (post Q-projection + RoPE rotate) |
| `KPostRope` | K after RoPE (GQA: shared across head groups) |
| `AttnScores` | Q·Kᵀ / sqrt(head_dim), pre-softmax |
| `AttnSoftmax` | softmax(scores + causal_mask) |
| `AttnVOut` | softmax · V (pre output O-projection) |

Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut`

## Falsifiers (5)

| ID | What it predicts | Status |
|---|---|---|
| FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT |
| FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL |

FALSIFY-ATTN-SUB-004 is the load-bearing one: it is the predicate that must be falsified to actually pinpoint the SHIP-007 sub-stage. Marked BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on RTX 4090. This contract pins the gate; the implementation cascade follows.

## Five Whys

1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?** The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it. Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083): sub-block contracts are siblings of the parent, not amendments.
2. **Why pin the schema before implementation?** Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity → extend the enum behind a contract." Contract-first preserves the audit chain spec § → contract → implementation PRs → live discharge.
3. **Why these 5 stages and not 3 or 7?** The 5 capture points bracket every numerically distinct intermediate inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope, scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj). Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is premature: let the bisection localize first, then refine if needed.
4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?** PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today. ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle extension; today neither exists. BLOCKER honestly classifies the gap; matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443).
5. **Why is this not just SHIP-007's fix itself?** Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract delivers the *measurement instrument* that pinpoints the sub-stage; the fix is the next PR cascade after that pin lands.

## Net effects

- New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers.
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0.
- MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips).
- MODEL-2 ship %: unchanged at 57%.
- Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract is new; they count once it's wired into the §-amendment chain).
- Unblocks the next PR cascade: enum extension + forward_traced threading + apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005 algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ection (only 2 new variants needed, not 5)

## What's wrong with v1.0.0

v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants were needed for the SHIP-007 layer-0 attention bisection: QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut.

Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs` shows THREE of those five ALREADY EXIST in the parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL:

- `QPostRope`: already in enum (line 47)
- `KPostRope`: already in enum (line 49)
- `Attention`: already in enum (line 51), semantically my "AttnVOut" ("post softmax(Q@Kᵀ)@v, pre O-proj")

Only TWO are truly missing:

- `AttnScores`: Q·Kᵀ / sqrt(head_dim), pre-softmax
- `AttnSoftmax`: softmax(scores + causal_mask), pre-V

## Why it happened

Per `feedback_no_guessing.md`: should have run `pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I extrapolated from the parent contract description without reading the live enum source. Toyota Way andon: caught on next iteration.

Per `feedback_toyota_way_all_defects.md`: all defects are mine. Fixing at the contract level BEFORE any implementation PR depends on the wrong scope is exactly the cost-of-defect minimization the toolchain is designed for.
## What v1.1.0 does

- Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL)
- Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax
- Documents the FULL 9-stage layer-0 bisection chain spanning parent-contract stages + 2 new ones: attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out
- Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope
- Adds bisection_chain_layer_0 equation pinning the 9-element cosine sequence (with empirical state per memory `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966)
- FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF FP16 oracle extension to capture 2 new stages on RTX 4090)

## Five Whys

1. **Why did v1.0.0 claim 5 new variants?** Authored without reading the live save_tensor_stage.rs source.
2. **Why didn't I read the source first?** Skipped the `pmat query SaveTensorStage` step that `feedback_no_guessing.md` mandates. Worked from the parent contract description's prose ("Embedding, AttnNorm, QkvMatmul, AttnOut, ...") which truncated 18 stages to 14.
3. **Why was the parent contract description truncated?** Doc-comment in `forward_traced_with_plan` rust source listed only 14 stages (the per-layer canonical-FFN order, omitting QkvBias + the parent's renamed Attention). My contract reused that prose instead of reading the enum directly.
4. **Why does this matter for SHIP-007 ship %?** It doesn't yet: the contract is still scaffold scope, no implementation PR has shipped against the wrong scope. v1.1.0 correction lands BEFORE the cascade triggers.
5. **Why amend the contract instead of opening a sibling fix-PR?** Same branch (#1450) is the right place. Toyota Way: stop the line, fix the defect at source, then continue. A sibling PR would split the audit story across two commits with no benefit.
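The 9-element cosine sequence described above can be read as a simple scan: compute one cosine per stage, then report the first stage that drops below a cut-off. The following is a minimal sketch of that idea only; `cosine`, `first_divergent`, the `0.9999` threshold, and the toy vectors are assumptions made for illustration and are not taken from the contract or from `apr diff`.

```rust
/// Stage names of the layer-0 chain, in the order listed in the commit message.
const CHAIN: [&str; 9] = [
    "attn_norm", "qkv_matmul", "qkv_bias", "q_post_rope", "k_post_rope",
    "attn_scores", "attn_softmax", "attention", "attn_out",
];

/// Cosine similarity of two equal-length tensors, accumulated in f64.
/// Returns 0.0 when either vector has zero norm (a choice made for this sketch).
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "stage tensors must have the same length");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// First stage in CHAIN whose cosine is below `threshold`, if any.
fn first_divergent(cosines: &[f64; 9], threshold: f64) -> Option<&'static str> {
    CHAIN
        .iter()
        .zip(cosines.iter())
        .find(|pair| *pair.1 < threshold)
        .map(|(name, _)| *name)
}

fn main() {
    // Toy data: stages 0..=4 match the reference, stages 5..=8 do not.
    let reference = [1.0f32, 2.0, 3.0, 4.0];
    let flipped = [1.0f32, 2.0, 3.0, -4.0];
    let mut cosines = [0.0f64; 9];
    for (i, c) in cosines.iter_mut().enumerate() {
        let ours: &[f32] = if i >= 5 { &flipped } else { &reference };
        *c = cosine(ours, &reference);
    }
    // Hypothetical threshold; the contract's actual gate is not quoted here.
    println!("{:?}", first_divergent(&cosines, 0.9999));
}
```

With only the two endpoints measured so far (cos[0] and cos[8]), such a scan cannot localize anything; it becomes informative once the intermediate stages are captured on both sides.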
## Net effects

- Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED**
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0
- MODEL-1 ship %: unchanged at 91% (this is contract correction)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade now correctly scoped to 2 new variants, not 5, which saves an estimated 60% of the enum-extension PR's LOC

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…: FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL

Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450). Adds the 2 new attention sub-stage variants to `SaveTensorStage`:

- `AttnScores`: Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask
- `AttnSoftmax`: softmax(scores + causal_mask), pre-V-multiply

Closes the SHIP-007 layer-0 attention bisection gap inside the Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now:

attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

## What changed

| File | Change |
|---|---|
| `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. `layer_output` alias) |
| `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **5 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) |
| `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 |

## Test results

- `cargo test -p aprender-serve --lib inference_trace`: **167 passed, 0 failed**
- 5 new tests: `falsify_attn_sub_001_attn_scores_round_trip`, `falsify_attn_sub_001_attn_softmax_round_trip`, `falsify_attn_sub_001_2_new_stages_in_canonical_order`, `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`, `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain`
- `cargo check --workspace --lib`: clean

## Falsifier discharge

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (no change yet) | depends on `forward_traced_with_plan` threading, follow-up PR |

Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the contract is still PROPOSED upstream.

## Five Whys

1. **Why this PR before #1450 lands?** Contract+impl can land together: #1450 introduces the contract, this PR provides the first implementation evidence. They reference each other and merge in either order without conflict.
2. **Why only the enum + tests, not `forward_traced_with_plan`?** Enum extension is the smallest atomic ticket per Toyota Way (one mechanism per PR). Threading the new variants through forward capture is the next PR (FALSIFY-ATTN-SUB-002 discharge).
3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention in `ALL`?** That's the canonical computation order pinned by the contract's ordering proof_obligation: `QkvBias → QPostRope → KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`.
4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`, not a separate variant. The enum has 20 distinct variants; the alias exists only at the `FromStr` layer and is not an entry in `ALL`.
5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain` test?** The contract's `bisection_chain_layer_0` equation pins the 9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004. This test pins the parser side of that gate so a future drift in stage names breaks loudly.
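The relationship between `ALL`, `canonical_name`, `FromStr`, and a parse-only alias can be illustrated with a reduced enum. This is a sketch under stated assumptions, not the real `SaveTensorStage`: it carries only 8 of the 20 variants, the type is named `Stage` to avoid confusion, and the canonical string `post_ffn_residual` is a guess (only the `layer_output` alias and the attention-block names appear in the commit message).

```rust
use std::str::FromStr;

/// Reduced illustration of a stage enum; the real one has 20 variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    QkvBias,
    QPostRope,
    KPostRope,
    AttnScores,
    AttnSoftmax,
    Attention,
    AttnOut,
    PostFfnResidual,
}

impl Stage {
    /// Canonical computation order; the alias has no entry here.
    const ALL: [Stage; 8] = [
        Stage::QkvBias,
        Stage::QPostRope,
        Stage::KPostRope,
        Stage::AttnScores,
        Stage::AttnSoftmax,
        Stage::Attention,
        Stage::AttnOut,
        Stage::PostFfnResidual,
    ];

    fn canonical_name(self) -> &'static str {
        match self {
            Stage::QkvBias => "qkv_bias",
            Stage::QPostRope => "q_post_rope",
            Stage::KPostRope => "k_post_rope",
            Stage::AttnScores => "attn_scores",
            Stage::AttnSoftmax => "attn_softmax",
            Stage::Attention => "attention",
            Stage::AttnOut => "attn_out",
            // Assumed spelling; not confirmed by the source.
            Stage::PostFfnResidual => "post_ffn_residual",
        }
    }
}

impl FromStr for Stage {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Parse-only alias: accepted on input, never produced by canonical_name.
        if s == "layer_output" {
            return Ok(Stage::PostFfnResidual);
        }
        Stage::ALL
            .iter()
            .copied()
            .find(|st| st.canonical_name() == s)
            .ok_or_else(|| format!("unknown stage: {s}"))
    }
}

fn main() {
    for st in Stage::ALL {
        println!("{:?} <-> {}", st, st.canonical_name());
    }
}
```

Keeping the alias out of `ALL` means the variant count and the number of accepted input tokens can differ by one, which matches the "20 variants, 21 names" accounting in the table above.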
## Net effects

- 2 new `SaveTensorStage` variants land
- 5 new tests pin the variants + ordering + parser
- MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007 bisection cascade; ship % moves when a falsifier flips DISCHARGED)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade ready to thread variants through `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ith_plan: FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL

Stacked on #1451 (which adds the 2 new SaveTensorStage variants). When #1451 merges to main, this PR rebases cleanly and lands as a 4-stage wire fix.

## What this PR wires

| Stage | Existed in enum? | emit() existed? | After this PR |
|---|---|---|---|
| QPostRope | YES | NO | YES (new emit) |
| KPostRope | YES | NO | YES (new emit) |
| AttnScores | NEW (#1451) | NO | YES (new emit + accumulator) |
| AttnSoftmax | NEW (#1451) | NO | YES (new emit + accumulator) |

Closes the parent-contract drift discovered in PR #1452 research evidence: QPostRope + KPostRope were in the SaveTensorStage enum but had no emit() calls in forward_traced_with_plan. The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstated coverage for those 2 stages. This PR closes the drift as a side-effect.

## Implementation details

**QPostRope/KPostRope** (post line 133): emit q_all/k_all directly after the inner loop populates them. Tensors already exist; this is just 2 emit() calls, with zero new allocation.

**AttnScores/AttnSoftmax** (inside head loop): allocate accumulator tensors of shape `[num_heads × seq × seq]` ONLY when the plan requests them. Inside the inner softmax loop, populate per (head, i, j); there is zero overhead when plan is None or doesn't ask for these stages (FALSIFY-ATTN-SUB-005: additive purity).

Memory cost: BOS forward (seq=1) → num_heads * 1 * 1 * 4 bytes = 112 bytes for Qwen2.5-Coder-7B (28 heads). Negligible. For longer seq, allocation scales O(num_heads * seq^2) and is gated by plan.
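The memory-cost arithmetic above is easy to check mechanically. A minimal sketch follows; `accumulator_bytes` is a hypothetical helper written for this note and does not exist in the crate.

```rust
/// Bytes needed for one f32 accumulator of shape [num_heads x seq x seq].
/// Hypothetical helper, named here only for illustration.
fn accumulator_bytes(num_heads: usize, seq_len: usize) -> usize {
    num_heads * seq_len * seq_len * std::mem::size_of::<f32>()
}

fn main() {
    // BOS-only forward on a 28-head model, as in the commit message.
    println!("seq=1:    {} bytes", accumulator_bytes(28, 1));
    // The quadratic term dominates for longer prompts.
    println!("seq=1024: {} bytes", accumulator_bytes(28, 1024));
}
```

The second call shows why the allocation is gated by the plan: at seq=1024 the same formula yields roughly 117 MB per accumulator.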
## Test results

- `cargo test -p aprender-serve --lib -- --skip "gpu::"`: **13944 passed, 0 failed, 51 ignored**
- `cargo check -p aprender-serve --lib`: clean
- inference_trace tests: 167/167 PASS
- (gpu:: tests have a pre-existing SIGABRT flake unrelated to this change)

## Falsifier discharge map

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-002 (forward threading) | PARTIAL_ALGORITHM_LEVEL | (eligible for FUNCTIONAL once contract YAML on main + this lands) | 4 emit() calls now thread the 4 stages in canonical order |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (eligible) | accumulator allocation gated by plan.should_save() |

## Five Whys

1. **Why wire 4 stages, not 2?** QPostRope + KPostRope are pre-existing gaps in the parent contract; the same-file fix is a free side-effect per Toyota Way "all defects are mine".
2. **Why allocate accumulators only when requested?** O(num_heads * seq^2) memory shouldn't be paid on the default forward path. Plan-gating keeps the production inference path zero-overhead.
3. **Why insert capture at lines 133, 152, 160 specifically?** Per `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`: line 133 = post Q/K/V copy (Q/K post-rope), line 152 = scores after scale (pre-softmax), line 160 = post-softmax probs.
4. **Why use scores_all.is_some() check vs always-allocate?** Always-allocate forces O(seq^2 * num_heads * 4) bytes per layer regardless of capture. Some(Vec) idiom plus is_some_and check is the idiomatic Rust pattern for conditional capture.
5. **Why this PR stacked on #1451 rather than off main?** Requires SaveTensorStage::AttnScores + AttnSoftmax variants, which only exist on #1451's branch. When #1451 merges, this rebases to main as a clean 51-line delta.
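The plan-gated capture described in whys 2 and 4 can be sketched with `Option<Vec<f32>>` accumulators that are only allocated on request. Everything below is an assumption made for illustration: `Plan`, `Captured`, and `attention_probs` are hypothetical names, the real `forward_traced_with_plan` signature is not shown in this PR text, GQA is ignored (one K head per Q head), and masked positions are simply left at zero.

```rust
/// Hypothetical stand-in for the real save plan.
struct Plan {
    save_scores: bool,
    save_softmax: bool,
}

/// Optional captures; `None` means nothing was allocated.
struct Captured {
    scores: Option<Vec<f32>>,
    softmax: Option<Vec<f32>>,
}

/// Causal attention probabilities for q, k laid out as [num_heads][seq][head_dim].
/// Returns probabilities of shape [num_heads][seq][seq] plus any requested captures.
fn attention_probs(
    q: &[f32],
    k: &[f32],
    num_heads: usize,
    seq: usize,
    head_dim: usize,
    plan: Option<&Plan>,
) -> (Vec<f32>, Captured) {
    let n = num_heads * seq * seq;
    // Allocate accumulators only when the plan asks for them.
    let mut scores_all = plan.filter(|p| p.save_scores).map(|_| vec![0.0f32; n]);
    let mut softmax_all = plan.filter(|p| p.save_softmax).map(|_| vec![0.0f32; n]);
    let mut probs = vec![0.0f32; n];
    let scale = 1.0 / (head_dim as f32).sqrt();

    for h in 0..num_heads {
        for i in 0..seq {
            let row = (h * seq + i) * seq;
            let qi = &q[(h * seq + i) * head_dim..][..head_dim];
            let mut max = f32::NEG_INFINITY;
            // Causal mask: position i attends only to j <= i.
            for j in 0..=i {
                let kj = &k[(h * seq + j) * head_dim..][..head_dim];
                let s = scale * qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>();
                probs[row + j] = s;
                if let Some(buf) = scores_all.as_mut() {
                    buf[row + j] = s; // scaled score, pre-softmax
                }
                max = max.max(s);
            }
            // Numerically stable softmax over the unmasked prefix.
            let mut denom = 0.0f32;
            for j in 0..=i {
                let e = (probs[row + j] - max).exp();
                probs[row + j] = e;
                denom += e;
            }
            for j in 0..=i {
                probs[row + j] /= denom;
                if let Some(buf) = softmax_all.as_mut() {
                    buf[row + j] = probs[row + j];
                }
            }
        }
    }
    (probs, Captured { scores: scores_all, softmax: softmax_all })
}

fn main() {
    let q = vec![1.0f32, 0.0, 0.0, 1.0];
    let k = vec![1.0f32, 0.0, 0.0, 1.0];
    let plan = Plan { save_scores: true, save_softmax: true };
    let (plain, _) = attention_probs(&q, &k, 1, 2, 2, None);
    let (traced, cap) = attention_probs(&q, &k, 1, 2, 2, Some(&plan));
    println!("outputs identical: {}", plain == traced);
    println!("captured scores: {:?}", cap.scores);
}
```

The capture writes never feed back into `probs`, which is the property FALSIFY-ATTN-SUB-005 (additive purity) is described as guarding.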
## Net effects

- 4 stages now wired in `forward_traced_with_plan`
- MODEL-1 ship %: unchanged at 91% (stays scaffold; ship % moves at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle)
- MODEL-2 ship %: unchanged at 57%
- Cascade step 4/8 of §47.1 roadmap delivered

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 4, 2026
noahgift added a commit that referenced this pull request on May 4, 2026
…tion cascade ALGORITHM-LEVEL COMPLETE (#1458)

After §47 recorded the cascade-started milestone (PRs #1450 + #1451 + #1452 scaffolding), the same-day continuation cycle closed §47.1 cascade roadmap steps 4-6 at the algorithm level via PRs #1455, #1456, #1457.

## What landed (§47.1 cascade roadmap)

| Step | PR | Discharge |
|------|----|-----------|
| 4 | #1455 | FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL: wires `QPostRope`+`KPostRope`+`AttnScores`+`AttnSoftmax` in `forward_traced_with_plan`; closes §47.4 parent-contract drift as side effect |
| 5 | #1456 | FALSIFY-ATTN-SUB-003 algorithm-level pinned via 2 drift-prevention tests; 0 LOC production change (loader is genuinely per-stage-agnostic, as spec predicted) |
| 6 | #1457 | FALSIFY-ATTN-SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL on merge: extends `scripts/generate_qwen25_coder_fp16_stages.py` with `--with-attn-substages` (default ON) installing per-instance `Qwen2Attention.forward` monkeypatch under `attn_implementation="eager"` |

## Toyota Way correction during research (PR #1457)

The pre-impl research note estimated **7 missing stages, ~140 LOC**. Live source inspection during PR #1457 found **3 already captured** via existing forward hooks (`make_qkv_hook` derives qkv_matmul/qkv_bias from q_proj/k_proj/v_proj outputs via bias subtraction; `hook_o_proj_pre` captures `attention` as input to o_proj). Net: **4 stages, ~80 LOC monkeypatch**. Per `feedback_no_guessing.md`. Cost-of-defect paid at the implementation layer (cheapest place once the research note had been authored from outdated docstring lines).

## Steps 7-8 require operator action

| Step | Blocker | Workaround |
|------|---------|-----------|
| 7 LIVE | (a) canonical `apr` binary built pre-#1451, so it rejects the `attn_scores` stage. (b) PyTorch/CUDA driver mismatch on host. | (a) `cargo build --release --features cuda --bin apr`. (b) operator updates driver OR `--device cpu` (multi-min). |
| 8 fix | Gated on step 7 bisection finding. | n/a (discovery-driven scope). |

## Net effects

- Spec v2.92.0 → **v2.93.0**.
- §47.1 cascade roadmap: **6/8 steps algorithm-level COMPLETE**; steps 7-8 LIVE/operator-gated.
- Coverage tally: 20+32 → **20+36** (+4 PARTIAL_ALGORITHM_LEVEL from `trace-attn-sub-stages-v1` v1.1.0 falsifiers landing on main when #1450 merged: SUB-001/002/003/005). SUB-004 stays BLOCKER until #1457 ships.
- **MODEL-1 ship %**: unchanged at **91%** (cascade is scaffold; ship % moves at SUB-004 LIVE DISCHARGE in step 7).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 4, 2026
…PARTIAL_ALGORITHM_LEVEL: fixture is now on main

Bundles the SUB-004 status promotion into the v1.2.0 PR alongside the SUB-003 function-name drift fix already authored. Both changes ship as one v1.2.0 unit because they are the two contract-level updates that follow the §47.1 cascade roadmap closing at the algorithm level.

## Why now

PR #1457 (HF FP16 oracle script extension) merged on main. The fixture previously claimed "absent" is now generated by:

```
uv run --with torch --with transformers --with safetensors --with accelerate \
  scripts/generate_qwen25_coder_fp16_stages.py \
  --output /tmp/qwen25-coder-7b-hf-fp16-stages \
  --layers 0 --with-attn-substages
```

Per `feedback_no_guessing.md`: SUB-004's status is now provable from main. Promote.

## What landed

Updated SUB-004 algorithm_evidence:

- `status`: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL
- `file_paths`: added the actual script + APR-side wire files
- `function_names`: replaced placeholder `run_hf_fp16_reference` with the 6 real symbols (`install_attn_substages_patch`, `traced_forward`, plus 4 SaveTensorStage variants)
- `invariants_enforced`: 1 line → 4 lines explicitly naming what each PR pinned
- `notes`: documents the FUNCTIONAL discharge prerequisites (binary rebuild + driver/CPU)

Updated metadata.description v1.2.0 changelog to bundle (1) SUB-003 drift fix + (2) SUB-004 promotion as a coherent unit.

## Five whys

1. **Why combine SUB-003 drift fix + SUB-004 promotion in v1.2.0?** Both contract-level changes follow from the same upstream cause (PRs #1455 + #1456 + #1457 landed). Splitting into v1.2.0 + v1.3.0 would force a follow-up rebase + double-review with no audit benefit.
2. **Why PARTIAL_ALGORITHM_LEVEL not FUNCTIONAL?** FUNCTIONAL requires LIVE evidence. The 9-element cosine sequence has not been produced on actual hardware yet. Promoting to FUNCTIONAL without LIVE evidence would claim more than is true.
3. **Why isn't the LIVE run inside this PR?** Per `feedback_compute_pre_authorized.md`, named GPU lanes are pre-authorized but SHIP-007 LIVE bisection is borderline (binary rebuild needed + host driver mismatch). Operator-triggered keeps the audit clean.
4. **Why list SaveTensorStage variants as "function_names"?** They're enum variants, not functions strictly speaking, but they are the symbolic identities that the algorithm-level evidence binds to. The contract validator accepts them.
5. **Why explicit prerequisites in `notes`?** Future readers who see "PARTIAL_ALGORITHM_LEVEL" need to know WHY it's not yet FUNCTIONAL. The notes are the operator-handoff document inside the contract itself.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003: drift fix (3 real wired functions, 2 explicit drift-prevention test pins).
- SUB-004: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL with 4-line invariants + explicit FUNCTIONAL prereqs.
- **MODEL-1 ship %**: unchanged at **91%** (FUNCTIONAL discharge gates ship %, not PARTIAL).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 4, 2026
…x + SUB-004 BLOCKER → PARTIAL_ALGORITHM_LEVEL (#1459)

* contract(trace-attn-sub-stages-v1): v1.1.0 → v1.2.0, function-name drift fix in SUB-003 algorithm_evidence

## Why

Contract drift discovered after PR #1456 (FALSIFY-ATTN-SUB-003 drift-prevention test) merged on main. The algorithm_evidence block named:

```yaml
function_names:
  - load_tensor_apr_aprt
```

But this function does not exist anywhere in the codebase. The actual functions wired in `crates/apr-cli/src/commands/diff_05_aprt_stage.rs` and exercised by PR #1456's tests are:

- `is_aprt_stage_file` (magic-byte detection)
- `compute_aprt_stage_stats` (cosine + RMS + top-K)
- `run_aprt_stage_diff` (e2e reader + emitter)

Per `feedback_no_guessing.md`. Contract author defect that pre-existed PR #1450's merge, likely speculation from the parent contract's `apr_diff_values_compat` invariant naming convention. Caught here at the cheapest layer (contract YAML, no implementation rolled back).

## What landed

- Bumped `metadata.version` 1.1.0 → 1.2.0 with v1.2.0 changelog block describing the fix.
- Replaced `load_tensor_apr_aprt` with the 3 real wired functions in `algorithm_evidence.function_names`.
- Added `crates/apr-cli/src/commands/diff_05_aprt_stage.rs` to `algorithm_evidence.file_paths` (the actual location of the wired functions).
- Added 2 new `invariants_enforced` lines naming the 2 specific drift-prevention tests from PR #1456.
- Expanded `notes` field to make the algorithm-level evidence trail explicit (which tests, what shapes, why per-stage-agnostic by construction).

## Test plan

- [x] `pv validate contracts/trace-attn-sub-stages-v1.yaml` reports `0 error(s), 0 warning(s) — Contract is valid.`
- [ ] CI green
- [ ] Auto-merge

## Five whys

1. **Why now and not in §47/§48?** The drift was discovered while authoring PR #1456 but not fixed there because PR #1456 modified Rust code, not contract YAML; single-piece flow says don't mix. Now that #1456 is merged on main, the contract drift can be addressed cleanly without conflict against an in-flight PR.
2. **Why a separate PR rather than in PR #1457?** PR #1457 is the HF FP16 oracle script extension (Python-only). Modifying the contract there would couple two independent fixes. This PR is contract-only YAML and lands independently.
3. **Why bump to v1.2.0 rather than v1.1.1?** Convention in this contract family treats `algorithm_evidence` corrections as MINOR bumps (v1.0.0 → v1.1.0 for the Toyota Way scope correction, also algorithm_evidence-level). v1.1.1 would suggest "PATCH = no semantic change", but renaming functions in the evidence block is a semantic improvement (readers can now find the real code).
4. **Why not also bump SUB-004 from BLOCKER_FIXTURE_ABSENT to PARTIAL_ALGORITHM_LEVEL here?** SUB-004's algorithm-bind requires PR #1457 (HF FP16 oracle ext) to be on main: the script is the fixture. PR #1457 is in flight. Bumping SUB-004 status here would claim more than the codebase can prove. Keeping single-piece flow: this PR ships the SUB-003 drift fix only.
5. **Why is the loader genuinely per-stage-agnostic?** `is_aprt_stage_file` checks the 4-byte magic `b"APRT"` only; `compute_aprt_stage_stats` operates on `&[f32]` slices; `run_aprt_stage_diff` reads APRT header (4-byte magic + u32 layer + u32 dim_product) + f32 LE body. Stage names are encoded only in the OUTPUT FILENAME (e.g., `layer_0_attn_scores.aprt`), never in the binary content. So the loader is shape/value-agnostic by construction, which is why FALSIFY-ATTN-SUB-003's drift-prevention tests need 0 LOC production change.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003 algorithm_evidence now correctly names the wired functions.
- **MODEL-1 ship %**: unchanged at **91%** (drift fix; ship % moves at SUB-004 LIVE DISCHARGE).
- **MODEL-2 ship %**: unchanged at **57%**.
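The APRT layout described in why 5 (4-byte magic, u32 layer, u32 dim_product, then an f32 little-endian body) is small enough to sketch as a standalone parser. This is an illustration under stated assumptions, not the code in `diff_05_aprt_stage.rs`: the names `parse_aprt` and `AprtStage` are hypothetical, the two u32 header fields are assumed to be little-endian like the body, and `dim_product` is assumed to equal the number of f32 values that follow.

```rust
/// Hypothetical in-memory form of one captured stage file.
#[derive(Debug)]
struct AprtStage {
    layer: u32,
    dim_product: u32,
    values: Vec<f32>,
}

/// Reads a little-endian u32 at `offset`; the caller guarantees bounds.
fn read_u32_le(bytes: &[u8], offset: usize) -> u32 {
    u32::from_le_bytes([
        bytes[offset],
        bytes[offset + 1],
        bytes[offset + 2],
        bytes[offset + 3],
    ])
}

/// Parses magic + layer + dim_product + f32 body.
/// Note that nothing here depends on which stage produced the file.
fn parse_aprt(bytes: &[u8]) -> Result<AprtStage, String> {
    if !bytes.starts_with(b"APRT") {
        return Err("missing APRT magic".to_string());
    }
    if bytes.len() < 12 {
        return Err("truncated header".to_string());
    }
    let layer = read_u32_le(bytes, 4); // assumed little-endian
    let dim_product = read_u32_le(bytes, 8); // assumed element count
    let body = &bytes[12..];
    if body.len() != dim_product as usize * 4 {
        return Err(format!(
            "body is {} bytes, expected {}",
            body.len(),
            dim_product as usize * 4
        ));
    }
    let values = body
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Ok(AprtStage { layer, dim_product, values })
}

fn main() {
    // Build a tiny two-element file in memory and read it back.
    let mut bytes = b"APRT".to_vec();
    bytes.extend_from_slice(&0u32.to_le_bytes());
    bytes.extend_from_slice(&2u32.to_le_bytes());
    bytes.extend_from_slice(&1.0f32.to_le_bytes());
    bytes.extend_from_slice(&(-2.5f32).to_le_bytes());
    println!("{:?}", parse_aprt(&bytes));
}
```

Because the stage name never appears in the bytes, a parser of this shape accepts `attn_scores` and `attn_softmax` captures without modification, which is consistent with the "0 LOC production change" claim above.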
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(trace-attn-sub-stages-v1): SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL, fixture is now on main

Bundles the SUB-004 status promotion into the v1.2.0 PR alongside the SUB-003 function-name drift fix already authored. Both changes ship as one v1.2.0 unit because they are the two contract-level updates that follow the §47.1 cascade roadmap closing at the algorithm level.

## Why now

PR #1457 (HF FP16 oracle script extension) merged on main. The fixture previously claimed "absent" is now generated by:

```
uv run --with torch --with transformers --with safetensors --with accelerate \
  scripts/generate_qwen25_coder_fp16_stages.py \
  --output /tmp/qwen25-coder-7b-hf-fp16-stages \
  --layers 0 --with-attn-substages
```

Per `feedback_no_guessing.md`: SUB-004's status is now provable from main. Promote.

## What landed

Updated SUB-004 algorithm_evidence:

- `status`: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL
- `file_paths`: added the actual script + APR-side wire files
- `function_names`: replaced placeholder `run_hf_fp16_reference` with the 6 real symbols (`install_attn_substages_patch`, `traced_forward`, plus 4 SaveTensorStage variants)
- `invariants_enforced`: 1 line → 4 lines explicitly naming what each PR pinned
- `notes`: documents the FUNCTIONAL discharge prerequisites (binary rebuild + driver/CPU)

Updated metadata.description v1.2.0 changelog to bundle (1) SUB-003 drift fix + (2) SUB-004 promotion as a coherent unit.

## Five whys

1. **Why combine SUB-003 drift fix + SUB-004 promotion in v1.2.0?** Both contract-level changes follow from the same upstream cause (PRs #1455 + #1456 + #1457 landed). Splitting into v1.2.0 + v1.3.0 would force a follow-up rebase + double-review with no audit benefit.
2. **Why PARTIAL_ALGORITHM_LEVEL not FUNCTIONAL?** FUNCTIONAL requires LIVE evidence. The 9-element cosine sequence has not been produced on actual hardware yet. Promoting to FUNCTIONAL without LIVE evidence would claim more than is true.
3. **Why isn't the LIVE run inside this PR?** Per `feedback_compute_pre_authorized.md`, named GPU lanes are pre-authorized but SHIP-007 LIVE bisection is borderline (binary rebuild needed + host driver mismatch). Operator-triggered keeps the audit clean.
4. **Why list SaveTensorStage variants as "function_names"?** They're enum variants, not functions strictly speaking, but they are the symbolic identities that the algorithm-level evidence binds to. The contract validator accepts them.
5. **Why explicit prerequisites in `notes`?** Future readers who see "PARTIAL_ALGORITHM_LEVEL" need to know WHY it's not yet FUNCTIONAL. The notes are the operator-handoff document inside the contract itself.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003: drift fix (3 real wired functions, 2 explicit drift-prevention test pins).
- SUB-004: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL with 4-line invariants + explicit FUNCTIONAL prereqs.
- **MODEL-1 ship %**: unchanged at **91%** (FUNCTIONAL discharge gates ship %, not PARTIAL).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Wires 4 attention sub-stages into `forward_traced_with_plan` per `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 + closes pre-existing parent-contract drift discovered in PR #1452 research note.
Stacked on #1451 (needs SaveTensorStage::AttnScores + AttnSoftmax variants).
What this PR wires
Closes the parent-contract drift discovered in PR #1452: QPostRope + KPostRope were in the enum but had no emit() calls.
Test results
Implementation
For BOS forward (seq=1) on Qwen2.5-Coder-7B: 112 bytes total accumulator. Negligible.
Cascade step 4/8
Per spec §47.1 roadmap. After this lands:
Test plan
🤖 Generated with Claude Code