feat(scripts): HF FP16 oracle extension — capture 4 attention sub-stages (q/k_post_rope, attn_scores, attn_softmax)#1457
Merged
Conversation
…ges via Qwen2Attention.forward monkeypatch

Per `docs/specifications/aprender-train/ship-two-models-spec.md` §47.6 step 6 of the SHIP-007 layer-0 attention bisection cascade. Extends `scripts/generate_qwen25_coder_fp16_stages.py` to emit HF FP16 reference tensors for the 4 stages currently missing from the 9-element bisection chain `[attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attn_scores, attn_softmax, attention, attn_out]`.

## What landed

| New stage | Where it's captured |
|-----------|---------------------|
| `q_post_rope` | After `apply_rotary_pos_emb`, pre-`repeat_kv` |
| `k_post_rope` | After `apply_rotary_pos_emb`, pre-`repeat_kv` |
| `attn_scores` | `Q · Kᵀ * scaling`, pre-mask, pre-softmax |
| `attn_softmax` | `softmax(scores + mask)`, pre-V multiply |

## Toyota Way correction during research

The pre-implementation research note (`evidence/ship-007-layer0-attn-bisection-2026-05-04/hf-oracle-extension-research.md`) estimated 7 missing stages and ~140 LOC. **Live source inspection of the existing script during this PR found that 3 of those 7 stages (`qkv_matmul`, `qkv_bias`, `attention`) were already captured via existing forward hooks** (`make_qkv_hook` for the first two, `hook_o_proj_pre` for the third). Net new work: 4 stages, ~80 LOC of monkeypatch.

Per `feedback_no_guessing.md`: "use pmat query / apr trace / contracts, not speculation". The research note was authored from the docstring's outdated "stages NOT captured" comment without verifying the implementation. Cost-of-defect paid here at the implementation layer (cheapest), no contract bumped.

## How the patch works

1. Force `attn_implementation="eager"` at model load. The sdpa/flash-attn fast paths don't expose pre-softmax scores or post-softmax weights as capturable intermediates. Only the eager path is patchable.
2. For each target layer, replace `self_attn.forward` with `traced_forward` that:
   - Mirrors `transformers.models.qwen2.modeling_qwen2.Qwen2Attention.forward`
   - Inlines `eager_attention_forward` so the 4 captures land at the right semantic points
   - Closes over the shared `captured` dict to write `(layer_idx, stage_name) → np.ndarray fp32`
3. Non-target layers retain the original `Qwen2Attention.forward` (per-instance, not class-level monkeypatch).

## Provenance + CLI

- `--with-attn-substages` (default ON): capture the 4 new stages
- `--no-attn-substages`: legacy 13-stage capture only
- PROVENANCE writer now lists 17 captured stages (13 base + 4 substages)

## Five whys

1. **Why default ON?** The bisection chain in `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 SUB-004 invariant is the load-bearing predicate for FALSIFY-ATTN-SUB-004 LIVE on RTX 4090. Default OFF means an operator running the script for SHIP-007 has to remember to opt in. Default ON means the bisection chain is always emit-able.
2. **Why eager attention rather than sdpa/flash-attn?** Pre-softmax scores and post-softmax weights are intermediate tensors INSIDE the attention kernel. Fast-path implementations (sdpa, flash-attn) fuse the entire attention kernel; their internals are not exposed. Eager (Python) attention is the only path where we can intercept these intermediates.
3. **Why per-instance monkeypatch rather than class-level?** Class-level patch would affect all `Qwen2Attention` instances globally, including the non-target layers. Per-instance patch via `types.MethodType` keeps the blast radius minimal: only target layers use the traced forward, others get untouched original behavior.
4. **Why not also capture the `attn_post_v` output (softmax @ V before contiguous() + reshape)?** The existing `attention` stage (captured via `o_proj` forward_pre_hook) IS the post-V output (`attn_output = attn_output.transpose(1, 2).contiguous().reshape(...)` then o_proj's input). The chain is complete with the 4 new stages: scores → softmax → attention (existing).
5. **Why not run this live now to verify shapes match APR side?** This PR is algorithm-level (the monkeypatch implementation). Live verification on RTX 4090 + canonical 7B teacher is cascade step 7 (FALSIFY-ATTN-SUB-004), which produces the bisection finding. Splitting impl from live discharge keeps the audit story clean.

## Net effects

- **Coverage**: 13 → 17 captured per-layer stages with `--with-attn-substages` ON.
- **Falsifier**: Pre-condition for FALSIFY-ATTN-SUB-004 LIVE met; algorithm-level evidence pinned for SUB-001 + SUB-002 + SUB-003 + SUB-004.
- **MODEL-1 ship %**: unchanged at 91% (cascade step 6 lands the oracle, not the live discharge).
- **MODEL-2 ship %**: unchanged at 57%.

## Test plan

- [x] Script imports cleanly (`uv run --with torch --with transformers --with safetensors --with accelerate python3 -c '...'`)
- [x] CLI `--help` shows the new `--with-attn-substages` / `--no-attn-substages` flags
- [ ] Live RTX 4090 run is cascade step 7 (FALSIFY-ATTN-SUB-004 DISCHARGE), gated on operator confirmation per `feedback_compute_pre_authorized.md` named-lane policy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
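The per-instance patching described in this commit message can be illustrated with a toy module. This is only a sketch under stated assumptions: `ToyAttention`, `install_toy_patch`, and the `toy_stage` key are hypothetical stand-ins, not the script's `Qwen2Attention` patch, and the real `traced_forward` re-implements the attention body rather than wrapping the original call.

```python
import types


class ToyAttention:
    """Hypothetical stand-in for an attention module (not Qwen2Attention)."""

    def __init__(self, layer_idx):
        self.layer_idx = layer_idx

    def forward(self, x):
        return [v * 2 for v in x]


def install_toy_patch(layers, target_layers, captured):
    """Patch only the listed instances; the class itself is left untouched."""
    for idx in target_layers:
        module = layers[idx]
        original_forward = module.forward  # bound method, saved before patching

        def traced_forward(self, x, _orig=original_forward):
            out = _orig(x)
            # Key layout follows the (layer_idx, stage_name) convention.
            captured[(self.layer_idx, "toy_stage")] = list(out)
            return out

        # Binding to the instance shadows the class attribute for this object only.
        module.forward = types.MethodType(traced_forward, module)


layers = [ToyAttention(i) for i in range(3)]
captured = {}
install_toy_patch(layers, target_layers=[0], captured=captured)
outputs = [layer.forward([1.0, 2.0]) for layer in layers]
```

Only `layers[0]` gains an instance-level `forward`. The other two instances still resolve `forward` through the class, which is the "minimal blast radius" property the commit message argues for.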
noahgift
added a commit
that referenced
this pull request
May 4, 2026
…tion cascade ALGORITHM-LEVEL COMPLETE (#1458)

After §47 recorded the cascade-started milestone (PRs #1450 + #1451 + #1452 scaffolding), the same-day continuation cycle closed §47.1 cascade roadmap steps 4-6 at the algorithm level via PRs #1455, #1456, #1457.

## What landed (§47.1 cascade roadmap)

| Step | PR | Discharge |
|------|----|-----------|
| 4 | #1455 | FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL: wires `QPostRope`+`KPostRope`+`AttnScores`+`AttnSoftmax` in `forward_traced_with_plan`; closes §47.4 parent-contract drift as side effect |
| 5 | #1456 | FALSIFY-ATTN-SUB-003 algorithm-level pinned via 2 drift-prevention tests; 0 LOC production change (loader is genuinely per-stage-agnostic, as spec predicted) |
| 6 | #1457 | FALSIFY-ATTN-SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL on merge: extends `scripts/generate_qwen25_coder_fp16_stages.py` with `--with-attn-substages` (default ON) installing per-instance `Qwen2Attention.forward` monkeypatch under `attn_implementation="eager"` |

## Toyota Way correction during research (PR #1457)

The pre-impl research note estimated **7 missing stages, ~140 LOC**. Live source inspection during PR #1457 found **3 already captured** via existing forward hooks (`make_qkv_hook` derives qkv_matmul/qkv_bias from q_proj/k_proj/v_proj outputs via bias subtraction; `hook_o_proj_pre` captures `attention` as input to o_proj). Net: **4 stages, ~80 LOC monkeypatch**.

Per `feedback_no_guessing.md`. Cost-of-defect paid at the implementation layer (cheapest place once the research note had been authored from outdated docstring lines).

## Steps 7-8 require operator action

| Step | Blocker | Workaround |
|------|---------|------------|
| 7 LIVE | (a) canonical `apr` binary built pre-#1451; it rejects the `attn_scores` stage. (b) PyTorch/CUDA driver mismatch on host. | (a) `cargo build --release --features cuda --bin apr`. (b) operator updates driver OR `--device cpu` (multi-min). |
| 8 fix | Gated on step 7 bisection finding. | n/a (discovery-driven scope). |

## Net effects

- Spec v2.92.0 → **v2.93.0**.
- §47.1 cascade roadmap: **6/8 steps algorithm-level COMPLETE**; steps 7-8 LIVE/operator-gated.
- Coverage tally: 20+32 → **20+36** (+4 PARTIAL_ALGORITHM_LEVEL from `trace-attn-sub-stages-v1` v1.1.0 falsifiers landing on main when #1450 merged: SUB-001/002/003/005). SUB-004 stays BLOCKER until #1457 ships.
- **MODEL-1 ship %**: unchanged at **91%** (cascade is scaffold; ship % moves at SUB-004 LIVE DISCHARGE in step 7).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
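The bias-subtraction idea attributed to `make_qkv_hook` above can be sketched with a toy linear layer. Everything here is an assumption for illustration: `ToyLinear`, `make_toy_qkv_hook`, and the `q_matmul` / `q_bias` keys are hypothetical, and the hook signature merely imitates PyTorch's `hook(module, args, output)` forward-hook convention. The script's actual hook may differ.

```python
class ToyLinear:
    """Minimal y = x @ W + b layer with forward hooks (not torch.nn.Linear)."""

    def __init__(self, weight, bias):
        self.weight = weight  # weight[i][j] maps input i to output j
        self.bias = bias
        self.hooks = []

    def __call__(self, x):
        out = [
            sum(xi * row[j] for xi, row in zip(x, self.weight)) + self.bias[j]
            for j in range(len(self.bias))
        ]
        for hook in self.hooks:
            hook(self, (x,), out)
        return out


def make_toy_qkv_hook(captured, layer_idx, name):
    """Record the post-bias output and recover the pre-bias matmul from it."""

    def hook(module, args, output):
        captured[(layer_idx, f"{name}_bias")] = list(output)
        # The hook only sees the post-bias output, so subtract the bias back out.
        captured[(layer_idx, f"{name}_matmul")] = [
            o - b for o, b in zip(output, module.bias)
        ]

    return hook


captured = {}
q_proj = ToyLinear(weight=[[0.5, 1.0], [1.5, -0.5]], bias=[0.25, -0.75])
q_proj.hooks.append(make_toy_qkv_hook(captured, layer_idx=0, name="q"))
result = q_proj([1.0, 2.0])
```

One hook on the projection output therefore yields two stages, which is consistent with the note that `qkv_matmul` and `qkv_bias` needed no new capture code.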
noahgift
added a commit
that referenced
this pull request
May 4, 2026
…x + SUB-004 BLOCKER → PARTIAL_ALGORITHM_LEVEL (#1459)

* contract(trace-attn-sub-stages-v1): v1.1.0 → v1.2.0 — function-name drift fix in SUB-003 algorithm_evidence

## Why

Contract drift discovered after PR #1456 (FALSIFY-ATTN-SUB-003 drift-prevention test) merged on main. The algorithm_evidence block named:

```yaml
function_names:
  - load_tensor_apr_aprt
```

But this function does not exist anywhere in the codebase. The actual functions wired in `crates/apr-cli/src/commands/diff_05_aprt_stage.rs` and exercised by PR #1456's tests are:

- `is_aprt_stage_file` (magic-byte detection)
- `compute_aprt_stage_stats` (cosine + RMS + top-K)
- `run_aprt_stage_diff` (e2e reader + emitter)

Per `feedback_no_guessing.md`. Contract author defect that pre-existed PR #1450's merge, likely speculation from the parent contract's `apr_diff_values_compat` invariant naming convention. Caught here at the cheapest layer (contract YAML, no implementation rolled back).

## What landed

- Bumped `metadata.version` 1.1.0 → 1.2.0 with v1.2.0 changelog block describing the fix.
- Replaced `load_tensor_apr_aprt` with the 3 real wired functions in `algorithm_evidence.function_names`.
- Added `crates/apr-cli/src/commands/diff_05_aprt_stage.rs` to `algorithm_evidence.file_paths` (the actual location of the wired functions).
- Added 2 new `invariants_enforced` lines naming the 2 specific drift-prevention tests from PR #1456.
- Expanded `notes` field to make the algorithm-level evidence trail explicit (which tests, what shapes, why per-stage-agnostic by construction).

## Test plan

- [x] `pv validate contracts/trace-attn-sub-stages-v1.yaml` reports `0 error(s), 0 warning(s) — Contract is valid.`
- [ ] CI green
- [ ] Auto-merge

## Five whys

1. **Why now and not in §47/§48?** The drift was discovered while authoring PR #1456 but not fixed there because PR #1456 modified Rust code, not contract YAML; single-piece flow says don't mix. Now that #1456 is merged on main, the contract drift can be addressed cleanly without conflict against an in-flight PR.
2. **Why a separate PR rather than in PR #1457?** PR #1457 is the HF FP16 oracle script extension (Python-only). Modifying the contract there would couple two independent fixes. This PR is contract-only YAML and lands independently.
3. **Why bump to v1.2.0 rather than v1.1.1?** Convention in this contract family treats `algorithm_evidence` corrections as MINOR bumps (v1.0.0 → v1.1.0 for the Toyota Way scope correction, also algorithm_evidence-level). v1.1.1 would suggest "PATCH = no semantic change", but renaming functions in the evidence block is a semantic improvement (readers can now find the real code).
4. **Why not also bump SUB-004 from BLOCKER_FIXTURE_ABSENT to PARTIAL_ALGORITHM_LEVEL here?** SUB-004's algorithm-bind requires PR #1457 (HF FP16 oracle ext) to be on main, because the script is the fixture. PR #1457 is in flight. Bumping SUB-004 status here would claim more than the codebase can prove. Keeping single-piece flow: this PR ships the SUB-003 drift fix only.
5. **Why is the loader genuinely per-stage-agnostic?** `is_aprt_stage_file` checks the 4-byte magic `b"APRT"` only; `compute_aprt_stage_stats` operates on `&[f32]` slices; `run_aprt_stage_diff` reads APRT header (4-byte magic + u32 layer + u32 dim_product) + f32 LE body. Stage names are encoded only in the OUTPUT FILENAME (e.g., `layer_0_attn_scores.aprt`), never in the binary content. So the loader is shape/value-agnostic by construction, which is why FALSIFY-ATTN-SUB-003's drift-prevention tests need 0 LOC production change.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003 algorithm_evidence now correctly names the wired functions.
- **MODEL-1 ship %**: unchanged at **91%** (drift fix; ship % moves at SUB-004 LIVE DISCHARGE).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(trace-attn-sub-stages-v1): SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL — fixture is now on main

Bundles the SUB-004 status promotion into the v1.2.0 PR alongside the SUB-003 function-name drift fix already authored. Both changes ship as one v1.2.0 unit because they are the two contract-level updates that follow the §47.1 cascade roadmap closing at the algorithm level.

## Why now

PR #1457 (HF FP16 oracle script extension) merged on main. The fixture previously claimed "absent" is now generated by:

```
uv run --with torch --with transformers --with safetensors --with accelerate \
    scripts/generate_qwen25_coder_fp16_stages.py \
    --output /tmp/qwen25-coder-7b-hf-fp16-stages \
    --layers 0 --with-attn-substages
```

Per `feedback_no_guessing.md`: SUB-004's status is now provable from main. Promote.

## What landed

Updated SUB-004 algorithm_evidence:

- `status`: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL
- `file_paths`: added the actual script + APR-side wire files
- `function_names`: replaced placeholder `run_hf_fp16_reference` with the 6 real symbols (`install_attn_substages_patch`, `traced_forward`, plus 4 SaveTensorStage variants)
- `invariants_enforced`: 1 line → 4 lines explicitly naming what each PR pinned
- `notes`: documents the FUNCTIONAL discharge prerequisites (binary rebuild + driver/CPU)

Updated metadata.description v1.2.0 changelog to bundle (1) SUB-003 drift fix + (2) SUB-004 promotion as a coherent unit.

## Five whys

1. **Why combine SUB-003 drift fix + SUB-004 promotion in v1.2.0?** Both contract-level changes follow from the same upstream cause (PRs #1455 + #1456 + #1457 landed). Splitting into v1.2.0 + v1.3.0 would force a follow-up rebase + double-review with no audit benefit.
2. **Why PARTIAL_ALGORITHM_LEVEL not FUNCTIONAL?** FUNCTIONAL requires LIVE evidence. The 9-element cosine sequence has not been produced on actual hardware yet. Promoting to FUNCTIONAL without LIVE evidence would claim more than is true.
3. **Why isn't the LIVE run inside this PR?** Per `feedback_compute_pre_authorized.md`, named GPU lanes are pre-authorized but SHIP-007 LIVE bisection is borderline (binary rebuild needed + host driver mismatch). Operator-triggered keeps the audit clean.
4. **Why list SaveTensorStage variants as "function_names"?** They're enum variants, not functions strictly speaking, but they are the symbolic identities that the algorithm-level evidence binds to. The contract validator accepts them.
5. **Why explicit prerequisites in `notes`?** Future readers who see "PARTIAL_ALGORITHM_LEVEL" need to know WHY it's not yet FUNCTIONAL. The notes are the operator-handoff document inside the contract itself.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003: drift fix (3 real wired functions, 2 explicit drift-prevention test pins).
- SUB-004: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL with 4-line invariants + explicit FUNCTIONAL prereqs.
- **MODEL-1 ship %**: unchanged at **91%** (FUNCTIONAL discharge gates ship %, not PARTIAL).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
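The APRT layout quoted in the commit message (4-byte magic, u32 layer, u32 dim_product, f32 LE body) is simple enough to sketch in Python. This is a hedged illustration, not the Rust code in `diff_05_aprt_stage.rs`: the function names are hypothetical, and the little-endian encoding of the two u32 header fields is an assumption, since the source states endianness only for the body.

```python
import math
import struct

MAGIC = b"APRT"
# Assumption: header integers are little-endian, matching the stated f32 LE body.
HEADER = struct.Struct("<4sII")  # magic, layer, dim_product


def encode_stage(layer, values):
    """Serialize one stage tensor as header + flat f32 little-endian body."""
    body = struct.pack(f"<{len(values)}f", *values)
    return HEADER.pack(MAGIC, layer, len(values)) + body


def looks_like_stage(blob):
    """Magic-byte check only; the stage name is never stored in the bytes."""
    return blob[:4] == MAGIC


def decode_stage(blob):
    """Return (layer, values) or raise if the magic bytes are missing."""
    if not looks_like_stage(blob):
        raise ValueError("missing APRT magic")
    _, layer, dim_product = HEADER.unpack_from(blob, 0)
    values = struct.unpack_from(f"<{dim_product}f", blob, HEADER.size)
    return layer, list(values)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


blob = encode_stage(0, [1.0, -2.0, 0.5])
layer, values = decode_stage(blob)
similarity = cosine(values, [1.0, -2.0, 0.5])
```

Nothing in the encoded bytes identifies the stage, so the same reader handles `attn_scores` or any other stage. That is the "per-stage-agnostic by construction" property the fifth why relies on.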
## Summary

- Cascade step 6 in the `docs/specifications/aprender-train/ship-two-models-spec.md` §47.6 ranked-leverage order: pre-condition for FALSIFY-ATTN-SUB-004 LIVE bisection (step 7).
- Extends `scripts/generate_qwen25_coder_fp16_stages.py` to emit 4 new HF FP16 reference stages via per-instance `Qwen2Attention.forward` monkeypatch.

## What landed

| New stage | Where it's captured |
|-----------|---------------------|
| `q_post_rope` | After `apply_rotary_pos_emb`, pre-`repeat_kv` |
| `k_post_rope` | After `apply_rotary_pos_emb`, pre-`repeat_kv` |
| `attn_scores` | `Q · Kᵀ * scaling`, pre-mask, pre-softmax |
| `attn_softmax` | `softmax(scores + mask)`, pre-V multiply |

Total per-layer captures: 13 → 17 with `--with-attn-substages` (default ON).

## Toyota Way correction during research

Pre-implementation research note (`evidence/ship-007-layer0-attn-bisection-2026-05-04/hf-oracle-extension-research.md`, uncommitted) estimated 7 missing stages, ~140 LOC. Live source inspection during this PR found 3 of those 7 (`qkv_matmul`, `qkv_bias`, `attention`) were already captured via existing forward hooks. Net new work: 4 stages, ~80 LOC monkeypatch + ~40 LOC docstring/CLI/PROVENANCE updates.

Per `feedback_no_guessing.md`. Cost-of-defect paid at the implementation layer (cheapest place after the research note had ALREADY been authored from outdated docstring lines).

## How the patch works

1. Force `attn_implementation="eager"` at model load when `--with-attn-substages` is ON. Sdpa/flash-attn fast paths fuse the entire attention kernel; pre-softmax scores + post-softmax weights are not exposed. Eager (Python) is the only patchable path.
2. Per-instance monkeypatch (not class-level): for each target layer index, `model.model.layers[idx].self_attn.forward = types.MethodType(traced_forward, attn_module)`. Non-target layers retain their original forward.
3. `traced_forward` mirrors `Qwen2Attention.forward` but inlines `eager_attention_forward` so the 4 captures land at the right semantic points. Closes over the shared `captured` dict to write `(layer_idx, stage_name) → np.ndarray fp32`.

## CLI

- `--with-attn-substages` (default ON): capture the 4 new stages
- `--no-attn-substages`: legacy 13-stage capture only

## Bisection chain after this PR

§47 spec `bisection_chain_layer_0` 9-stage cosine sequence: all 9 stages are now capturable from the HF FP16 reference: `[attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attn_scores, attn_softmax, attention, attn_out]`.

## Cascade context (§47.1 roadmap)

## Test plan

- [x] Script imports cleanly (`uv run --with torch --with transformers --with safetensors --with accelerate python3 -c '...'`)
- [x] CLI `--help` shows new `--with-attn-substages` / `--no-attn-substages` flags
- [ ] Live RTX 4090 run is cascade step 7 (FALSIFY-ATTN-SUB-004 DISCHARGE), gated on operator confirmation per `feedback_compute_pre_authorized.md`

## Five whys
1. **Why default `--with-attn-substages` ON?** The 9-element bisection chain in `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 SUB-004 invariant is the load-bearing predicate for FALSIFY-ATTN-SUB-004 LIVE. Default OFF means every operator invocation has to remember the flag, which adds friction. Default ON means the chain is always emit-able.
2. **Why eager attention?** Sdpa/flash-attn fuse attention into a single CUDA kernel; their internals (pre-softmax scores, post-softmax weights) are not exposed. Eager (Python) is the only path with patchable intermediates.
3. **Why per-instance, not class-level monkeypatch?** Class-level affects all `Qwen2Attention` instances globally. Per-instance via `types.MethodType` keeps blast radius minimal: only target layers use traced forward.
4. **Why not capture `attn_post_v` (softmax @ V before contiguous + reshape)?** The existing `attention` stage already captures this via `o_proj` forward_pre_hook (the input to o_proj IS softmax @ V after contiguous reshape). Adding another capture point would duplicate.
5. **Why not run live RTX 4090 in this PR?** Splitting impl from live discharge: this PR is the algorithm-level monkeypatch (FALSIFY-ATTN-SUB-004 algorithm-bind). The live RTX 4090 run (cascade step 7) discharges SUB-004 to FUNCTIONAL and produces the bisection finding. Operator-triggered to keep the audit clean.
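The second why turns on the eager path exposing intermediates that a fused kernel hides. As a rough single-head sketch, assuming plain Python lists in place of tensors and a hypothetical helper `toy_eager_attention`, the two score-level capture points sit as follows (the RoPE stages are omitted, and this is not the script's `traced_forward`):

```python
import math


def toy_eager_attention(q, k, v, mask, scaling, captured, layer_idx):
    """Single-head eager attention that records pre- and post-softmax tensors."""
    # attn_scores: Q . K^T * scaling, recorded before the mask is added.
    scores = [
        [scaling * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        for q_row in q
    ]
    captured[(layer_idx, "attn_scores")] = [row[:] for row in scores]

    # attn_softmax: softmax(scores + mask), recorded before multiplying by V.
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(masked)  # subtract the row max for numerical stability
        exps = [math.exp(x - peak) for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    captured[(layer_idx, "attn_softmax")] = [row[:] for row in weights]

    # softmax @ V: the value the existing `attention` stage observes downstream.
    width = len(v[0])
    return [
        [sum(w * v_row[d] for w, v_row in zip(w_row, v)) for d in range(width)]
        for w_row in weights
    ]


neg_inf = float("-inf")
captured = {}
out = toy_eager_attention(
    q=[[1.0, 0.0], [0.0, 1.0]],
    k=[[1.0, 0.0], [0.0, 1.0]],
    v=[[1.0, 2.0], [3.0, 4.0]],
    mask=[[0.0, neg_inf], [0.0, 0.0]],  # causal mask for 2 positions
    scaling=1.0 / math.sqrt(2.0),
    captured=captured,
    layer_idx=0,
)
```

In a fused kernel, `scores` and `weights` never exist as separate tensors, so there is nothing to copy out. In the eager formulation each is an ordinary value that can be stored in `captured` at the point it is computed.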
Plain ship % (unchanged this cycle)
🤖 Generated with Claude Code