
feat(aprender-serve): SaveTensorStage gains AttnScores + AttnSoftmax — FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL #1451

Merged
noahgift merged 5 commits into `main` from `feat/attn-sub-stages-impl`
May 4, 2026

Conversation

@noahgift noahgift commented May 4, 2026

Contributor

Summary

Implements contracts/trace-attn-sub-stages-v1.yaml v1.1.0 (PROPOSED, in PR #1450).

Adds the 2 new attention sub-stage variants to `SaveTensorStage`:

  • `AttnScores` — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask
  • `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply
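
To make the two capture points concrete, here is a minimal single-head sketch of the math the bullets describe. It is not the `aprender-serve` implementation: the function names `attn_scores` and `attn_softmax` are hypothetical, and it assumes a square prefill-style score matrix (no KV cache), so row `i` may attend to columns `0..=i`.

```rust
/// Scaled dot-product scores for one head:
/// scores[i][j] = (q_i · k_j) / sqrt(head_dim).
/// No mask is applied, matching the "pre-softmax + pre-causal-mask" capture.
fn attn_scores(q: &[Vec<f32>], k: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let head_dim = q[0].len() as f32;
    let scale = 1.0 / head_dim.sqrt();
    q.iter()
        .map(|qi| {
            k.iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale)
                .collect()
        })
        .collect()
}

/// Row-wise softmax under a causal mask (assumes a square seq x seq matrix).
/// Masked entries (j > i) act as -inf, so they come out as exactly 0.
fn attn_softmax(scores: &[Vec<f32>]) -> Vec<Vec<f32>> {
    scores
        .iter()
        .enumerate()
        .map(|(i, row)| {
            let visible = &row[..=i];
            // Subtract the row max before exponentiating for numerical stability.
            let max = visible.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = visible.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            let mut out = vec![0.0; row.len()];
            for (j, e) in exps.iter().enumerate() {
                out[j] = e / sum;
            }
            out
        })
        .collect()
}
```

Dumping the output of the first function corresponds to `attn_scores`, and the output of the second to `attn_softmax`; multiplying the latter by V gives what the existing `attention` stage captures.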

Closes the SHIP-007 layer-0 attention bisection gap. The 9-stage capture chain is now:

`attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out`

Test results

  • `cargo test -p aprender-serve --lib inference_trace` — 167 passed, 0 failed
  • 5 new FALSIFY-ATTN-SUB-001 tests for round-trip, ordering, parser-list
  • `cargo check --workspace --lib` — clean

Stack pattern

Branched from `contract/trace-attn-sub-stages-v1` (#1450). Either order merges cleanly.

Test plan

  • `cargo test -p aprender-serve --lib inference_trace` 167 PASS locally
  • CI green on required gates

🤖 Generated with Claude Code

noahgift and others added 3 commits May 4, 2026 03:24
…ion (5 new SaveTensorStage variants)

Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED
that pre-commits to the schema for extending `SaveTensorStage` with FIVE new
intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence
can be bisected element-wise against the HF FP16 oracle (PR #1423).

## Why now (per spec §46.7)

Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest-
leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`:

- cos(APR.attn_norm, HF.attn_norm) = 0.99999995  ✓ (correct)
- cos(APR.attn_out,  HF.attn_out)  = 0.9966      ✗ (wrong)

The bug is somewhere INSIDE the attention block. The existing
`SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` —
too coarse to localize.
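
For reference, the `cos(...)` figures above are cosine similarities between two flattened tensors. A minimal sketch follows, assuming both dumps are already loaded as `f32` slices of equal length; the function name `cosine` is hypothetical, and how `apr diff --values` actually computes the metric is not shown in this PR.

```rust
/// Cosine similarity between two flattened tensors.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Accumulate in f64 so that values very close to 1.0 remain distinguishable.
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```

Accumulating in `f64` is a deliberate choice here: a value such as 0.99999995 sits below `f32` resolution near 1.0.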

## What this contract pins

5 new variants, in computation order inside the attention block:

| New stage | What it captures |
|---|---|
| `QPostRope`   | Q after RoPE (post Q-projection + RoPE rotate) |
| `KPostRope`   | K after RoPE (GQA: shared across head groups) |
| `AttnScores`  | Q·Kᵀ / sqrt(head_dim), pre-softmax |
| `AttnSoftmax` | softmax(scores + causal_mask) |
| `AttnVOut`    | softmax · V (pre output O-projection) |

Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut`

## Falsifiers (5)

| ID | What it predicts | Status |
|---|---|---|
| FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT |
| FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL |

FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must
be falsified to actually pinpoint the SHIP-007 sub-stage. Marked
BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages
implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on
RTX 4090. This contract pins the gate; the implementation cascade follows.

## Five Whys

1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?**
   The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it.
   Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) —
   sub-block contracts are siblings of the parent, not amendments.

2. **Why pin the schema before implementation?**
   Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity →
   extend the enum behind a contract." Contract-first preserves the audit
   chain spec § → contract → implementation PRs → live discharge.

3. **Why these 5 stages and not 3 or 7?**
   The 5 capture points bracket every numerically distinct intermediate
   inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope,
   scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj).
   Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is
   premature — let the bisection localize first, then refine if needed.

4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?**
   PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today.
   ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle
   extension; today neither exists. BLOCKER honestly classifies the gap;
   matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443).

5. **Why is this not just SHIP-007's fix itself?**
   Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract
   delivers the *measurement instrument* that pinpoints the sub-stage; the
   fix is the next PR cascade after that pin lands.

## Net effects

- New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers.
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0.
- MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips).
- MODEL-2 ship %: unchanged at 57%.
- Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract
  is new — they count once it's wired into the §-amendment chain).
- Unblocks the next PR cascade: enum extension + forward_traced threading +
  apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005
  algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ection (only 2 new variants needed, not 5)

## What's wrong with v1.0.0

v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants
were needed for the SHIP-007 layer-0 attention bisection:
QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut.

Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs`
shows THREE of those five ALREADY EXIST in the parent contract
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL:

- `QPostRope`  — already in enum (line 47)
- `KPostRope`  — already in enum (line 49)
- `Attention`  — already in enum (line 51), semantically my "AttnVOut"
                 ("post softmax(Q@Kᵀ)@v, pre O-proj")

Only TWO are truly missing:

- `AttnScores`   — Q·Kᵀ / sqrt(head_dim), pre-softmax
- `AttnSoftmax`  — softmax(scores + causal_mask), pre-V

## Why it happened

Per `feedback_no_guessing.md`: should have run
`pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I
extrapolated from the parent contract description without reading
the live enum source. Toyota Way andon — caught on next iteration.

Per `feedback_toyota_way_all_defects.md`: all defects are mine.
Fixing at the contract level BEFORE any implementation PR depends
on the wrong scope is exactly the cost-of-defect minimization
the toolchain is designed for.

## What v1.1.0 does

- Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL)
- Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax
- Documents the FULL 9-stage layer-0 bisection chain spanning
  parent-contract stages + 2 new ones:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

- Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope
- Adds bisection_chain_layer_0 equation pinning the 9-element
  cosine sequence (with empirical state per memory
  `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966)
- FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF
  FP16 oracle extension to capture 2 new stages on RTX 4090)
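
The bisection over the 9-element cosine sequence can be sketched as follows. This is an illustration, not the contract's `bisection_chain_layer_0` equation itself: the function name `bracket` and the pass/fail threshold are assumptions, and unmeasured stages are modelled as `None` so that no cosine values are invented.

```rust
/// The 9-stage layer-0 capture chain, in computation order.
const CHAIN: [&str; 9] = [
    "attn_norm", "qkv_matmul", "qkv_bias", "q_post_rope", "k_post_rope",
    "attn_scores", "attn_softmax", "attention", "attn_out",
];

/// Given per-stage cosines (None = not yet measured), returns
/// (last stage known to match, first stage known to diverge).
/// `threshold` is a hypothetical pass/fail cut-off.
fn bracket(
    cos: &[Option<f64>; 9],
    threshold: f64,
) -> (Option<&'static str>, Option<&'static str>) {
    // First measured stage that falls below the threshold.
    let first_bad = cos
        .iter()
        .position(|c| matches!(c, Some(v) if *v < threshold));
    // Last measured stage before it that still passes.
    let upper = first_bad.unwrap_or(cos.len());
    let last_good = cos[..upper]
        .iter()
        .rposition(|c| matches!(c, Some(v) if *v >= threshold));
    (last_good.map(|i| CHAIN[i]), first_bad.map(|i| CHAIN[i]))
}
```

With only `cos[0]` and `cos[8]` measured, the bracket spans the whole attention block (`attn_norm` to `attn_out`); each additional captured stage narrows it until the two ends are adjacent.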

## Five Whys

1. **Why did v1.0.0 claim 5 new variants?**
   Authored without reading the live save_tensor_stage.rs source.

2. **Why didn't I read the source first?**
   Skipped the `pmat query SaveTensorStage` step that
   `feedback_no_guessing.md` mandates. Worked from the parent
   contract description's prose ("Embedding, AttnNorm, QkvMatmul,
   AttnOut, ...") which truncated 18 stages to 14.

3. **Why was the parent contract description truncated?**
   Doc-comment in `forward_traced_with_plan` rust source listed
   only 14 stages (the per-layer canonical-FFN order, omitting
   QkvBias + the parent's renamed Attention). My contract reused
   that prose instead of reading the enum directly.

4. **Why does this matter for SHIP-007 ship %?**
   It doesn't yet — the contract is still scaffold scope, no
   implementation PR has shipped against the wrong scope. v1.1.0
   correction lands BEFORE the cascade triggers.

5. **Why amend the contract instead of opening a sibling fix-PR?**
   Same branch (#1450) is the right place. Toyota Way: stop the
   line, fix the defect at source, then continue. A sibling PR
   would split the audit story across two commits with no benefit.

## Net effects

- Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED**
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0
- MODEL-1 ship %: unchanged at 91% (this is contract correction)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade now correctly scoped to 2 new variants,
  not 5 — saves an estimated 60% of the enum-extension PR's LOC

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…— FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL

Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450).

Adds the 2 new attention sub-stage variants to `SaveTensorStage`:

- `AttnScores`  — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask
- `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply

Closes the SHIP-007 layer-0 attention bisection gap inside the
Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

## What changed

| File | Change |
|---|---|
| `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. `layer_output` alias) |
| `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **5 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) |
| `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 |

## Test results

- `cargo test -p aprender-serve --lib inference_trace` — **167 passed, 0 failed**
- 5 new tests: `falsify_attn_sub_001_attn_scores_round_trip`,
  `falsify_attn_sub_001_attn_softmax_round_trip`,
  `falsify_attn_sub_001_2_new_stages_in_canonical_order`,
  `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`,
  `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain`
- `cargo check --workspace --lib` — clean

## Falsifier discharge

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | PARTIAL_ALGORITHM_LEVEL (no change yet) | depends on `forward_traced_with_plan` threading, follow-up PR |

Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in
`contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this
PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the
contract is still PROPOSED upstream.

## Five Whys

1. **Why this PR before #1450 lands?** Contract+impl can land
   together — #1450 introduces the contract, this PR provides the
   first implementation evidence. They reference each other and merge
   in either order without conflict.

2. **Why only the enum + tests, not `forward_traced_with_plan`?**
   Enum extension is the smallest atomic ticket per Toyota Way (one
   mechanism per PR). Threading the new variants through forward
   capture is the next PR (FALSIFY-ATTN-SUB-002 discharge).

3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention
   in `ALL`?** That's the canonical computation order pinned by the
   contract's ordering proof_obligation: `QkvBias → QPostRope →
   KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`.

4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias
   exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`,
   not a separate variant. The enum has 20 distinct variants; the alias
   exists only at the `FromStr` layer, so `ALL` does not include it.

5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain`
   test?** The contract's `bisection_chain_layer_0` equation pins the
   9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004.
   This test pins the parser side of that gate so a future drift in
   stage names breaks loudly.
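
The ordering, alias, and parser points above can be sketched with a trimmed enum. This is not the real `SaveTensorStage`: the 5-variant `Stage` type, the `post_ffn_residual` name, and the `parse_list` signature are assumptions made for illustration; only the variant names, the `layer_output` alias, the `all` keyword, and the canonical order come from this PR.

```rust
use std::str::FromStr;

/// Trimmed, hypothetical stand-in for `SaveTensorStage` (the real enum has 20 variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    KPostRope,
    AttnScores,
    AttnSoftmax,
    Attention,
    PostFfnResidual,
}

impl Stage {
    /// Variants in canonical computation order; the alias is not listed here.
    const ALL: [Stage; 5] = [
        Stage::KPostRope,
        Stage::AttnScores,
        Stage::AttnSoftmax,
        Stage::Attention,
        Stage::PostFfnResidual,
    ];

    fn canonical_name(self) -> &'static str {
        match self {
            Stage::KPostRope => "k_post_rope",
            Stage::AttnScores => "attn_scores",
            Stage::AttnSoftmax => "attn_softmax",
            Stage::Attention => "attention",
            // Assumed snake_case spelling; the PR only names the variant.
            Stage::PostFfnResidual => "post_ffn_residual",
        }
    }
}

impl FromStr for Stage {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Parse-only alias: accepted on input, never produced by canonical_name.
        if s == "layer_output" {
            return Ok(Stage::PostFfnResidual);
        }
        Stage::ALL
            .iter()
            .copied()
            .find(|stage| stage.canonical_name() == s)
            .ok_or_else(|| format!("unknown stage: {s}"))
    }
}

/// Comma-separated stage list; `all` (case-insensitive) expands to every variant.
fn parse_list(spec: &str) -> Result<Vec<Stage>, String> {
    if spec.trim().eq_ignore_ascii_case("all") {
        return Ok(Stage::ALL.to_vec());
    }
    spec.split(',').map(|tok| tok.trim().parse::<Stage>()).collect()
}
```

In this sketch the alias adds a 6th accepted name without adding a 6th variant, which mirrors the 20-variant versus 21-name distinction in the real enum.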

## Net effects

- 2 new `SaveTensorStage` variants land
- 5 new tests pin the variants + ordering + parser
- MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007
  bisection cascade; ship % moves when a falsifier flips DISCHARGED)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade ready to thread variants through
  `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit 5eeeeb5 into `main` May 4, 2026
10 checks passed
noahgift deleted the `feat/attn-sub-stages-impl` branch May 4, 2026 03:04
noahgift added a commit that referenced this pull request May 4, 2026
…tion cascade STARTED (#1454)

* contract(qwen3-moe-forward-gpu-v1): v1.0.0 DRAFT — scaffold for P0 GPU MoE forward path

Per claude-code-parity-apr POC M49 priority elevation 2026-05-04, the
GPU MoE forward path is now P0 / HIGHEST PRIORITY. Per CLAUDE.md
"NEVER write code before writing a provable contract" — this is the
contract scaffold (M-stage M-GPU-MOE-0 in the contract's
implementation_stages).

Why P0
======

  - CPU LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s on
    Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
  - Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s.
  - MoE inference is ~10× slower than dense, making the spec-prescribed
    default Qwen3-Coder model production-infeasible at ~30 tok/s.
  - The companion's action-stream parity machinery (CCPA-001..013, all
    DISCHARGED) cannot be exercised at production cadence — every
    `apr code` invocation hits the 30 tok/s wall.

What this contract specifies
============================

  metadata.kind: kernel
  status:        DRAFT
  scope:         crates/aprender-serve/src/gpu/{forward_qwen3_moe_gpu,
                   scheduler/moe_dispatch}.rs +
                 crates/aprender-compute/src/gpu/moe_kernels.rs (TBD)

  equations:
    - moe_forward_one_layer_gpu  (mirrors v1 CPU equation, +cosine-vs-CPU
                                   invariant, +CudaExecutor::new(0).is_ok()
                                   precondition)
    - gpu_throughput_target      (≥150 tok/s on RTX 4090 over 128-tok
                                   median window, ≥5x CPU baseline)

  proof_obligations: 7
    AC_GPU_MOE_001  cosine ≥ 0.99 vs CPU LAZY-FUSED-MATVEC
    AC_GPU_MOE_002  router weights sum to 1.0 ± 1e-6
    AC_GPU_MOE_003  output dimensions preserved
    AC_GPU_MOE_004  output finite (no NaN/Inf)
    AC_GPU_MOE_005  cosine ≥ 0.99 vs HF FP16 (inherits from v1)
    AC_GPU_MOE_006  ≥150 tok/s on RTX 4090
    AC_GPU_MOE_007  VRAM utilization ≤ 95% of 24 GB

  falsification_tests: 7
    FALSIFY-QW3-MOE-GPU-001          baseline (no GPU symbol)
    FALSIFY-QW3-MOE-GPU-PARITY-001   M-GPU-MOE-1 cosine vs CPU
    FALSIFY-QW3-MOE-GPU-PARITY-002   M-GPU-MOE-1 cosine vs HF FP16
    FALSIFY-QW3-MOE-GPU-INVARIANTS-001 router/shape/finite
    FALSIFY-QW3-MOE-GPU-DETERMINISM-001 byte-identical reruns same seed
    FALSIFY-QW3-MOE-GPU-THROUGHPUT-001 ≥150 tok/s
    FALSIFY-QW3-MOE-GPU-MEMORY-001     ≤ 95% VRAM

  kani_harnesses: 2
    KANI-QW3-MOE-GPU-001  router weights sum (AC_GPU_MOE_002)
    KANI-QW3-MOE-GPU-002  output shape preservation (AC_GPU_MOE_003)

  qa_gate: F-QW3-MOE-GPU-001 (5 named checks, falsification = swap
    quantized for EAGER FP32 → guaranteed OOM on 24 GB VRAM)
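
As an illustration of the cheaper invariants listed above (AC_GPU_MOE_002 router-weight sum and AC_GPU_MOE_004 finiteness), a host-side check might look like the sketch below. The function name `check_router_weights` and the use of `f64` are assumptions; the contract's actual harness is not shown in this commit.

```rust
/// Checks one token's selected-expert routing weights:
/// every weight is finite and the weights sum to 1.0 within 1e-6.
fn check_router_weights(weights: &[f64]) -> Result<(), String> {
    if let Some(bad) = weights.iter().find(|w| !w.is_finite()) {
        return Err(format!("non-finite router weight: {bad}"));
    }
    let sum: f64 = weights.iter().sum();
    if (sum - 1.0).abs() > 1e-6 {
        return Err(format!("router weights sum to {sum}, expected 1.0 ± 1e-6"));
    }
    Ok(())
}
```

A check of this shape can run on weights copied back from the device, independently of the throughput and VRAM gates.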

Implementation stages
=====================

  M-GPU-MOE-0  This contract scaffold                        SHIPPED
  M-GPU-MOE-1  CUDA kernel + cosine-vs-CPU parity gate       PENDING
  M-GPU-MOE-2  wgpu fallback (CLAUDE.md backend-agnostic)    PENDING
  M-GPU-MOE-3  Throughput ≥150 tok/s + VRAM ≤ 95%            PENDING

When all 3 PENDING stages discharge, status flips DRAFT →
ACTIVE_RUNTIME (matches qwen3-moe-forward-v1 v1 convention).

Verification
============

  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs claude-code-parity-apr POC M49 (P0 elevation, 2026-05-04)
Refs claude-code-parity-apr POC R10 (risk row mirror)
Refs qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL (CPU sibling)
Refs apr-cpu-vs-gpu-output-parity-v1 (CPU↔GPU parity discipline)
Refs arXiv:2305.18398 Dao FlashAttention-2
Refs arXiv:2305.05176 Aminabadi DeepSpeed-MoE
Refs arXiv:2101.03961 Fedus Switch Transformers

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(ship-two-models): v2.91.0 → v2.92.0 — §47 SHIP-007 layer-0 attention bisection cascade STARTED

After §46 declared the v0.32.0 cut HOLD-gated on SHIP-007 layer-0 attention,
the §46.7(a) follow-up cascade kicked off with three PRs in flight (#1450 +
#1451 + #1452).

## What §47 records

| Subsection | Content |
|---|---|
| 47.1 | 8-step cascade roadmap (this amendment captures steps 1-3) |
| 47.2 | What landed in PRs #1450 + #1451 + #1452 |
| 47.3 | **Toyota Way correction in detail** — v1.0.0 → v1.1.0 mid-cascade |
| 47.4 | Pre-existing parent contract drift (QPostRope/KPostRope unwired) |
| 47.5 | Net effects (ship %, coverage tally, pending merges) |
| 47.6 | Open follow-ups (5-step ranked priority list) |
| 47.7 | Five Whys (why amend at 3 PRs, why split §47/§48, etc.) |
| 47.8 | Spec amendment cadence preserved (§41 → §47, 7 amendments) |

## Cascade roadmap

| # | PR | What | Discharge status |
|---|----|------|-------|
| 1 | #1450 | Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 PROPOSED | 5 falsifiers algorithm-bound |
| 2 | #1451 | Enum extension: 2 new SaveTensorStage variants | FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL |
| 3 | #1452 | Research evidence note | No falsifier flip |
| 4 | (next) | forward_traced_with_plan wires 4 sub-stages | FALSIFY-ATTN-SUB-002 + drift fix |
| 5 | (next) | apr diff --values recognizes new stages | FALSIFY-ATTN-SUB-003 |
| 6 | (next) | HF FP16 oracle script extension | unblocks FALSIFY-ATTN-SUB-004 |
| 7 | (next) | Live RTX 4090 bisection | FALSIFY-ATTN-SUB-004 → DISCHARGED |
| 8 | (next) | SHIP-007 root-cause fix | unblocks MODEL-1 GPU |

§47 captures the first 3 (scaffold). §48+ will capture later steps.

## Toyota Way correction (mid-cascade)

v1.0.0 of `trace-attn-sub-stages-v1.yaml` was the day's first defect. It
claimed 5 new SaveTensorStage variants needed; live source inspection
(per `feedback_no_guessing.md`) showed 3 already existed. v1.1.0 corrected
to 2 truly-new variants + added the 9-element `bisection_chain_layer_0`
equation. Cost-of-defect paid at the contract layer (cheapest place); no
code rolled back.

## Pre-existing parent contract drift

Researching the wire-plan for FALSIFY-ATTN-SUB-002 surfaced a drift in
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL): `QPostRope` +
`KPostRope` are in the enum but have NO `emit()` calls in
`forward_traced_with_plan`. A user passing `--save-tensor q_post_rope`
gets a clean exit with no file written — silent failure.

Per `feedback_toyota_way_all_defects.md`: all defects are mine. The
next-cycle FALSIFY-ATTN-SUB-002 PR will close this drift as a free
side-effect by wiring the 2 missing stages alongside the 2 new ones.
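
A drift of this kind (a stage that parses but is never emitted) can be caught mechanically by diffing the declared stage names against the names actually written during a traced forward pass. The sketch below is hypothetical: `unwired_stages` is not an existing helper, and how the emitted names are collected is left open.

```rust
use std::collections::BTreeSet;

/// Returns the declared stage names that were never emitted,
/// preserving the declared (canonical) order.
fn unwired_stages<'a>(declared: &[&'a str], emitted: &[&'a str]) -> Vec<&'a str> {
    let seen: BTreeSet<&str> = emitted.iter().copied().collect();
    declared
        .iter()
        .copied()
        .filter(|name| !seen.contains(name))
        .collect()
}
```

A test asserting that this list is empty for `--save-tensor all` would turn the silent no-file case into a loud failure.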

## Net effects

- Spec v2.91.0 → **v2.92.0**
- Coverage tally: unchanged this cycle (5 new PARTIAL_ALGORITHM_LEVEL
  slots will increment when PR #1450 lands the YAML)
- MODEL-1 ship %: unchanged at 91% (cascade is scaffold; ship % moves
  at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle)
- MODEL-2 ship %: unchanged at 57%

## Five Whys (compressed)

1. Why amend at 3 PRs? §41-§46 cadence is "one amendment per ≥3-PR cycle"
2. Why split §47/§48? Toyota Way correction is worth pinning
3. Why pin parent drift here, not amend the parent contract? Drift fix
   lands in next-cycle implementation PR; §47 just records it
4. Why no FALSIFY-ATTN-SUB-002 in this cycle? Single-piece flow;
   stacked PRs slow merge throughput
5. Why no parent-contract bump now? Bump requires wire fix landing first
   (FUNCTIONAL claim) — cleaner to bump in next-cycle PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…ith_plan — FALSIFY-ATTN-SUB-002 (#1455)

* contract(trace-attn-sub-stages-v1): scaffold layer-0 attention bisection (5 new SaveTensorStage variants)

Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED
that pre-commits to the schema for extending `SaveTensorStage` with FIVE new
intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence
can be bisected element-wise against the HF FP16 oracle (PR #1423).

## Why now (per spec §46.7)

Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest-
leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`:

- cos(APR.attn_norm, HF.attn_norm) = 0.99999995  ✓ (correct)
- cos(APR.attn_out,  HF.attn_out)  = 0.9966      ✗ (wrong)

The bug is somewhere INSIDE the attention block. The existing
`SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` —
too coarse to localize.

## What this contract pins

5 new variants, in computation order inside the attention block:

| New stage | What it captures |
|---|---|
| `QPostRope`   | Q after RoPE (post Q-projection + RoPE rotate) |
| `KPostRope`   | K after RoPE (GQA: shared across head groups) |
| `AttnScores`  | Q·Kᵀ / sqrt(head_dim), pre-softmax |
| `AttnSoftmax` | softmax(scores + causal_mask) |
| `AttnVOut`    | softmax · V (pre output O-projection) |

Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut`

## Falsifiers (5)

| ID | What it predicts | Status |
|---|---|---|
| FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT |
| FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL |

FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must
be falsified to actually pinpoint the SHIP-007 sub-stage. Marked
BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages
implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on
RTX 4090. This contract pins the gate; the implementation cascade follows.

## Five Whys

1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?**
   The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it.
   Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) —
   sub-block contracts are siblings of the parent, not amendments.

2. **Why pin the schema before implementation?**
   Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity →
   extend the enum behind a contract." Contract-first preserves the audit
   chain spec § → contract → implementation PRs → live discharge.

3. **Why these 5 stages and not 3 or 7?**
   The 5 capture points bracket every numerically distinct intermediate
   inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope,
   scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj).
   Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is
   premature — let the bisection localize first, then refine if needed.

4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?**
   PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today.
   ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle
   extension; today neither exists. BLOCKER honestly classifies the gap;
   matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443).

5. **Why is this not just SHIP-007's fix itself?**
   Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract
   delivers the *measurement instrument* that pinpoints the sub-stage; the
   fix is the next PR cascade after that pin lands.

## Net effects

- New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers.
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0.
- MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips).
- MODEL-2 ship %: unchanged at 57%.
- Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract
  is new — they count once it's wired into the §-amendment chain).
- Unblocks the next PR cascade: enum extension + forward_traced threading +
  apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005
  algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(trace-attn-sub-stages-v1): v1.0.0 → v1.1.0 — Toyota Way correction (only 2 new variants needed, not 5)

## What's wrong with v1.0.0

v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants
were needed for the SHIP-007 layer-0 attention bisection:
QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut.

Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs`
shows THREE of those five ALREADY EXIST in the parent contract
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL:

- `QPostRope`  — already in enum (line 47)
- `KPostRope`  — already in enum (line 49)
- `Attention`  — already in enum (line 51), semantically my "AttnVOut"
                 ("post softmax(Q@Kᵀ)@v, pre O-proj")

Only TWO are truly missing:

- `AttnScores`   — Q·Kᵀ / sqrt(head_dim), pre-softmax
- `AttnSoftmax`  — softmax(scores + causal_mask), pre-V

## Why it happened

Per `feedback_no_guessing.md`: should have run
`pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I
extrapolated from the parent contract description without reading
the live enum source. Toyota Way andon — caught on next iteration.

Per `feedback_toyota_way_all_defects.md`: all defects are mine.
Fixing at the contract level BEFORE any implementation PR depends
on the wrong scope is exactly the cost-of-defect minimization
the toolchain is designed for.

## What v1.1.0 does

- Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL)
- Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax
- Documents the FULL 9-stage layer-0 bisection chain spanning
  parent-contract stages + 2 new ones:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

- Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope
- Adds bisection_chain_layer_0 equation pinning the 9-element
  cosine sequence (with empirical state per memory
  `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966)
- FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF
  FP16 oracle extension to capture 2 new stages on RTX 4090)

## Five Whys

1. **Why did v1.0.0 claim 5 new variants?**
   Authored without reading the live save_tensor_stage.rs source.

2. **Why didn't I read the source first?**
   Skipped the `pmat query SaveTensorStage` step that
   `feedback_no_guessing.md` mandates. Worked from the parent
   contract description's prose ("Embedding, AttnNorm, QkvMatmul,
   AttnOut, ...") which truncated 18 stages to 14.

3. **Why was the parent contract description truncated?**
   Doc-comment in `forward_traced_with_plan` rust source listed
   only 14 stages (the per-layer canonical-FFN order, omitting
   QkvBias + the parent's renamed Attention). My contract reused
   that prose instead of reading the enum directly.

4. **Why does this matter for SHIP-007 ship %?**
   It doesn't yet — the contract is still scaffold scope, no
   implementation PR has shipped against the wrong scope. v1.1.0
   correction lands BEFORE the cascade triggers.

5. **Why amend the contract instead of opening a sibling fix-PR?**
   Same branch (#1450) is the right place. Toyota Way: stop the
   line, fix the defect at source, then continue. A sibling PR
   would split the audit story across two commits with no benefit.

## Net effects

- Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED**
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0
- MODEL-1 ship %: unchanged at 91% (this is contract correction)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade now correctly scoped to 2 new variants,
  not 5 — saves an estimated 60% of the enum-extension PR's LOC

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): SaveTensorStage gains AttnScores + AttnSoftmax — FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL

Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450).

Adds the 2 new attention sub-stage variants to `SaveTensorStage`:

- `AttnScores`  — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask
- `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply

Closes the SHIP-007 layer-0 attention bisection gap inside the
Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

## What changed

| File | Change |
|---|---|
| `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. `layer_output` alias) |
| `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **5 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) |
| `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 |

## Test results

- `cargo test -p aprender-serve --lib inference_trace` — **167 passed, 0 failed**
- 5 new tests: `falsify_attn_sub_001_attn_scores_round_trip`,
  `falsify_attn_sub_001_attn_softmax_round_trip`,
  `falsify_attn_sub_001_2_new_stages_in_canonical_order`,
  `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`,
  `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain`
- `cargo check --workspace --lib` — clean

## Falsifier discharge

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | PARTIAL_ALGORITHM_LEVEL (no change yet) | depends on `forward_traced_with_plan` threading, follow-up PR |

Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in
`contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this
PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the
contract is still PROPOSED upstream.

## Five Whys

1. **Why this PR before #1450 lands?** Contract+impl can land
   together — #1450 introduces the contract, this PR provides the
   first implementation evidence. They reference each other and merge
   in either order without conflict.

2. **Why only the enum + tests, not `forward_traced_with_plan`?**
   Enum extension is the smallest atomic ticket per Toyota Way (one
   mechanism per PR). Threading the new variants through forward
   capture is the next PR (FALSIFY-ATTN-SUB-002 discharge).

3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention
   in `ALL`?** That's the canonical computation order pinned by the
   contract's ordering proof_obligation: `QkvBias → QPostRope →
   KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`.

4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias
   exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`,
   not a separate variant. The enum has 20 distinct variants and `ALL`
   lists exactly those 20; the alias exists only at the `FromStr` layer.

5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain`
   test?** The contract's `bisection_chain_layer_0` equation pins the
   9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004.
   This test pins the parser side of that gate so a future drift in
   stage names breaks loudly.
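
The variant-versus-alias distinction in item 4 can be illustrated with a small sketch. The type name `Stage`, the three-variant subset, the helper `accepted_names`, and the canonical spelling `post_ffn_residual` are all assumptions made for illustration; only the `layer_output` alias and the `PostFfnResidual` variant come from this commit.

```rust
use std::str::FromStr;

/// Three-variant stand-in; the real enum has 20 variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Attention,
    AttnOut,
    PostFfnResidual,
}

impl Stage {
    /// One entry per variant; the alias is deliberately absent.
    pub const ALL: [Stage; 3] = [Stage::Attention, Stage::AttnOut, Stage::PostFfnResidual];

    pub fn canonical_name(self) -> &'static str {
        match self {
            Stage::Attention => "attention",
            Stage::AttnOut => "attn_out",
            // Assumed spelling, not confirmed by the commit message.
            Stage::PostFfnResidual => "post_ffn_residual",
        }
    }
}

impl FromStr for Stage {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // The alias lives only here: it maps onto an existing variant.
        if s == "layer_output" {
            return Ok(Stage::PostFfnResidual);
        }
        Stage::ALL
            .iter()
            .copied()
            .find(|stage| stage.canonical_name() == s)
            .ok_or_else(|| format!("unknown stage: {s}"))
    }
}

/// Every name the parser accepts: one per variant plus the alias.
pub fn accepted_names() -> Vec<&'static str> {
    let mut names: Vec<&'static str> =
        Stage::ALL.iter().map(|stage| stage.canonical_name()).collect();
    names.push("layer_output");
    names
}
```

In this shape the parser accepts one more name than `ALL` has entries, which mirrors the 20 variants versus 21 documented names in the real enum.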

## Net effects

- 2 new `SaveTensorStage` variants land
- 5 new tests pin the variants + ordering + parser
- MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007
  bisection cascade; ship % moves when a falsifier flips DISCHARGED)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade ready to thread variants through
  `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): wire 4 attention sub-stages in forward_traced_with_plan — FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL

Stacked on #1451 (which adds the 2 new SaveTensorStage variants). When #1451
merges to main, this PR rebases cleanly and lands as a 4-stage wire fix.

## What this PR wires

| Stage | Existed in enum? | emit() existed? | After this PR |
|---|---|---|---|
| QPostRope   | YES | NO  | YES (new emit) |
| KPostRope   | YES | NO  | YES (new emit) |
| AttnScores  | NEW (#1451) | NO  | YES (new emit + accumulator) |
| AttnSoftmax | NEW (#1451) | NO  | YES (new emit + accumulator) |

Closes the parent-contract drift discovered in PR #1452 research evidence:
QPostRope + KPostRope were in the SaveTensorStage enum but had no emit()
calls in forward_traced_with_plan. The parent contract
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstated
coverage for those 2 stages. This PR closes the drift as a side-effect.

## Implementation details

**QPostRope/KPostRope** (post line 133): emit q_all/k_all directly after the
inner loop populates them. Tensors already exist; this is just 2 emit()
calls — zero new allocation.

**AttnScores/AttnSoftmax** (inside head loop): allocate accumulator tensors
of shape `[num_heads × seq × seq]` ONLY when the plan requests them. Inside
the inner softmax loop, populate per (head, i, j) — zero overhead when
plan is None or doesn't ask for these stages (FALSIFY-ATTN-SUB-005:
additive purity).
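
A minimal sketch of the plan-gated accumulator pattern follows. It is not the code in `forward_traced_with_plan`: the `Plan` and `Captured` types, the function name, the tensor layout, and the choice to record scores for every `j` before masking are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the capture plan; field names are illustrative.
pub struct Plan {
    pub save_scores: bool,
    pub save_softmax: bool,
}

/// Buffers of shape [num_heads, seq, seq], present only when requested.
pub struct Captured {
    pub scores: Option<Vec<f32>>,
    pub softmax: Option<Vec<f32>>,
}

/// `q` and `k` are laid out as [num_heads, seq, head_dim], row-major.
pub fn capture_attention_rows(
    q: &[f32],
    k: &[f32],
    num_heads: usize,
    seq: usize,
    head_dim: usize,
    plan: Option<&Plan>,
) -> Captured {
    let n = num_heads * seq * seq;
    // Allocate only when a plan exists and asks for the stage.
    let mut scores_all = plan.filter(|p| p.save_scores).map(|_| vec![0.0f32; n]);
    let mut softmax_all = plan.filter(|p| p.save_softmax).map(|_| vec![0.0f32; n]);

    let scale = 1.0 / (head_dim as f32).sqrt();
    let mut row = vec![0.0f32; seq];

    for h in 0..num_heads {
        for i in 0..seq {
            // Scaled dot product of query i against every key (pre-mask).
            let qi = &q[(h * seq + i) * head_dim..][..head_dim];
            for j in 0..seq {
                let kj = &k[(h * seq + j) * head_dim..][..head_dim];
                row[j] = qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale;
            }
            let out = (h * seq + i) * seq;
            if let Some(buf) = scores_all.as_mut() {
                buf[out..out + seq].copy_from_slice(&row);
            }

            // Causal softmax: positions j > i get probability 0.
            let max = row[..=i].iter().copied().fold(f32::NEG_INFINITY, f32::max);
            let mut sum = 0.0f32;
            for j in 0..=i {
                row[j] = (row[j] - max).exp();
                sum += row[j];
            }
            for j in 0..seq {
                row[j] = if j <= i { row[j] / sum } else { 0.0 };
            }
            if let Some(buf) = softmax_all.as_mut() {
                buf[out..out + seq].copy_from_slice(&row);
            }
        }
    }

    Captured { scores: scores_all, softmax: softmax_all }
}
```

With `plan == None` neither buffer is allocated and the only extra work per row is two `Option` checks, which is the property FALSIFY-ATTN-SUB-005 is meant to protect.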

Memory cost: BOS forward (seq=1) → num_heads * 1 * 1 * 4 bytes = 112 bytes
for Qwen2.5-Coder-7B (28 heads). Negligible. For longer seq, allocation
scales O(num_heads * seq^2) and is gated by plan.
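
The arithmetic above can be checked with a one-line helper. The function name is hypothetical, and the figure is per accumulator, assuming `f32` elements.

```rust
/// Bytes needed for one [num_heads, seq, seq] accumulator of f32 values.
pub fn accumulator_bytes(num_heads: usize, seq: usize) -> usize {
    num_heads * seq * seq * std::mem::size_of::<f32>()
}
```

For 28 heads this gives 112 bytes at seq = 1 and 117,440,512 bytes (112 MiB) at seq = 1024, which is why the allocation is gated by the plan.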

## Test results

- `cargo test -p aprender-serve --lib -- --skip "gpu::"` — **13944 passed,
  0 failed, 51 ignored**
- `cargo check -p aprender-serve --lib` — clean
- inference_trace tests: 167/167 PASS
- (gpu:: tests have a pre-existing SIGABRT flake unrelated to this change)

## Falsifier discharge map

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-002 (forward threading) | PARTIAL_ALGORITHM_LEVEL | (eligible for FUNCTIONAL once contract YAML on main + this lands) | 4 emit() calls now thread the 4 stages in canonical order |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (eligible) | accumulator allocation gated by plan.should_save() |

## Five Whys

1. **Why wire 4 stages, not 2?** QPostRope + KPostRope are pre-existing
   gaps in the parent contract; the same-file fix is a free side-effect
   per Toyota Way "all defects are mine".

2. **Why allocate accumulators only when requested?** O(num_heads * seq^2)
   memory shouldn't be paid on the default forward path. Plan-gating
   keeps the production inference path zero-overhead.

3. **Why insert capture at lines 133, 152, 160 specifically?**
   Per `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`:
   line 133 = post Q/K/V copy (Q/K post-rope), line 152 = scores after
   scale (pre-softmax), line 160 = post-softmax probs.

4. **Why use a `scores_all.is_some()` check vs always-allocate?**
   Always-allocate forces num_heads * seq^2 * 4 bytes per accumulator
   per layer regardless of capture. The `Some(Vec)` idiom plus an
   `is_some_and` check is the idiomatic Rust pattern for conditional capture.

5. **Why this PR stacked on #1451 rather than off main?**
   Requires SaveTensorStage::AttnScores + AttnSoftmax variants, which only
   exist on #1451's branch. When #1451 merges, this rebases to main as a
   clean 51-line delta.

## Net effects

- 4 stages now wired in `forward_traced_with_plan`
- MODEL-1 ship %: unchanged at 91% (stays scaffold; ship % moves at
  FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle)
- MODEL-2 ship %: unchanged at 57%
- Cascade step 4/8 of §47.1 roadmap delivered

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…pre-existing capture gaps (QPostRope + KPostRope) (#1452)

Records pre-implementation research for FALSIFY-ATTN-SUB-002 (`trace-attn-sub-stages-v1.yaml` v1.1.0).

## What this evidence pins

While researching where to wire AttnScores + AttnSoftmax in
`forward_traced_with_plan` (per the v1.1.0 contract), discovered that
QPostRope + KPostRope variants exist in the SaveTensorStage enum
(lines 47-50) but have **no `emit()` call** in `forward_traced_with_plan`.

The RoPE-rotated tensors q_all + k_all are computed at lines 130-131
but never captured. The parent contract `apr-cli-trace-save-tensor-v1.yaml`
v1.4.0 (FUNCTIONAL) silently overstates coverage for these 2 stages.

## What FALSIFY-ATTN-SUB-002 will wire

When #1451 lands, the next PR will wire 4 capture points (not 2):

| Stage | Source line | Existed in enum? |
|---|---|---|
| QPostRope   | post line 133 | YES (gap) |
| KPostRope   | post line 133 | YES (gap) |
| AttnScores  | line 152 (per head, accumulator) | NEW (#1451) |
| AttnSoftmax | line 160 (per head, accumulator) | NEW (#1451) |

## Why an evidence file, not a 5th stacked PR

Four PRs (#1448-#1451) already in flight. A 5th stacked PR would
slow CI throughput. Recording the implementation plan here so the
next loop iteration can spawn the impl PR off main once #1451 merges.

## Five Whys + cross-references

In `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`:

- Five Whys for scope (4 stages, not 2)
- Wire-plan with insertion points
- Backward-compat test plan
- Next-iteration deliverables checklist

## Net effects

- Evidence file lands; no code change in this PR
- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%
- Unblocks the next loop iteration's atomic PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…tion cascade ALGORITHM-LEVEL COMPLETE (#1458)

After §47 recorded the cascade-started milestone (PRs #1450 + #1451 + #1452
scaffolding), the same-day continuation cycle closed §47.1 cascade roadmap
steps 4-6 at the algorithm level via PRs #1455, #1456, #1457.

## What landed (§47.1 cascade roadmap)

| Step | PR | Discharge |
|------|----|-----------|
| 4 | #1455 | FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL — wires `QPostRope`+`KPostRope`+`AttnScores`+`AttnSoftmax` in `forward_traced_with_plan`; closes §47.4 parent-contract drift as side effect |
| 5 | #1456 | FALSIFY-ATTN-SUB-003 algorithm-level pinned via 2 drift-prevention tests; 0 LOC production change (loader is genuinely per-stage-agnostic, as spec predicted) |
| 6 | #1457 | FALSIFY-ATTN-SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL on merge — extends `scripts/generate_qwen25_coder_fp16_stages.py` with `--with-attn-substages` (default ON) installing per-instance `Qwen2Attention.forward` monkeypatch under `attn_implementation="eager"` |

## Toyota Way correction during research (PR #1457)

The pre-impl research note estimated **7 missing stages, ~140 LOC**. Live source inspection during PR #1457 found **3 already captured** via existing forward hooks (`make_qkv_hook` derives qkv_matmul/qkv_bias from q_proj/k_proj/v_proj outputs via bias subtraction; `hook_o_proj_pre` captures `attention` as input to o_proj). Net: **4 stages, ~80 LOC monkeypatch**.

Per `feedback_no_guessing.md`, the cost of the defect was paid at the implementation layer, the cheapest place to catch it once the research note had been authored from outdated docstring lines.

## Steps 7-8 require operator action

| Step | Blocker | Workaround |
|------|---------|-----------|
| 7 LIVE | (a) canonical `apr` binary built pre-#1451 — rejects `attn_scores` stage. (b) PyTorch/CUDA driver mismatch on host. | (a) `cargo build --release --features cuda --bin apr`. (b) operator updates driver OR `--device cpu` (multi-min). |
| 8 fix | Gated on step 7 bisection finding. | n/a — discovery-driven scope. |

## Net effects

- Spec v2.92.0 → **v2.93.0**.
- §47.1 cascade roadmap: **6/8 steps algorithm-level COMPLETE**; steps 7-8 LIVE/operator-gated.
- Coverage tally: 20+32 → **20+36** (+4 PARTIAL_ALGORITHM_LEVEL from `trace-attn-sub-stages-v1` v1.1.0 falsifiers landing on main when #1450 merged: SUB-001/002/003/005). SUB-004 stays BLOCKER until #1457 ships.
- **MODEL-1 ship %**: unchanged at **91%** (cascade is scaffold; ship % moves at SUB-004 LIVE DISCHARGE in step 7).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>