feat(scripts): HF FP16 oracle extension — capture 4 attention sub-stages (q/k_post_rope, attn_scores, attn_softmax) by noahgift · Pull Request #1457 · paiml/aprender

noahgift · 2026-05-04T04:44:49Z

Summary

Cascade step 6 of 8 in the §47.6 ranked-leverage order of docs/specifications/aprender-train/ship-two-models-spec.md, and a precondition for the FALSIFY-ATTN-SUB-004 LIVE bisection (step 7).
Extends scripts/generate_qwen25_coder_fp16_stages.py to emit 4 new HF FP16 reference stages via per-instance Qwen2Attention.forward monkeypatch.
Pure Python script change; no Rust crate or cargo config is touched, so it falls outside the SHIP-007 Rust cascade.

What landed

New stage	Where it's captured
`q_post_rope`	After `apply_rotary_pos_emb`, pre-`repeat_kv`
`k_post_rope`	After `apply_rotary_pos_emb`, pre-`repeat_kv`
`attn_scores`	`Q · Kᵀ * scaling`, pre-mask, pre-softmax
`attn_softmax`	`softmax(scores + mask)`, pre-V multiply
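For orientation, the four capture points can be sketched in a few lines of pure Python. This is a minimal single-head illustration, not the script's `traced_forward`: it assumes the rotate-half RoPE variant used by HF Llama-family attention (the base is a parameter here; real models read it from their config), a causal mask, and no GQA, so `repeat_kv` is a no-op. The function names are hypothetical.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate-half RoPE for one position (assumed variant; base is illustrative)."""
    dim = len(vec)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = pos / (base ** (2 * i / dim))
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


def softmax(row):
    m = max(row)  # finite: the diagonal is never masked
    exps = [math.exp(x - m) for x in row]  # exp(-inf) == 0.0 for masked slots
    total = sum(exps)
    return [e / total for e in exps]


def eager_attention_with_captures(q, k, v, captured, layer_idx=0):
    """Single-head eager attention; q, k, v are seq_len x head_dim lists."""
    seq_len, head_dim = len(q), len(q[0])
    scaling = head_dim ** -0.5

    # Captures 1-2: Q and K after RoPE (pre-repeat_kv in the real GQA model).
    q_rot = [rope(q[t], t) for t in range(seq_len)]
    k_rot = [rope(k[t], t) for t in range(seq_len)]
    captured[(layer_idx, "q_post_rope")] = q_rot
    captured[(layer_idx, "k_post_rope")] = k_rot

    # Capture 3: Q . K^T * scaling, before the mask and before softmax.
    scores = [[sum(a * b for a, b in zip(q_rot[i], k_rot[j])) * scaling
               for j in range(seq_len)] for i in range(seq_len)]
    captured[(layer_idx, "attn_scores")] = scores

    # Capture 4: softmax(scores + causal mask), before the V multiply.
    masked = [[s if j <= i else float("-inf") for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    weights = [softmax(row) for row in masked]
    captured[(layer_idx, "attn_softmax")] = weights

    # softmax @ V: the per-head input to what the existing `attention` stage records.
    return [[sum(weights[i][j] * v[j][d] for j in range(seq_len))
             for d in range(head_dim)] for i in range(seq_len)]


captured = {}
q = [[1.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0]]
k = [[0.2, 0.1, 0.0, 0.3], [0.0, 0.4, 0.1, 0.0], [0.3, 0.0, 0.2, 0.1]]
v = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0], [2.0, 0.0, 2.0, 0.0]]
eager_attention_with_captures(q, k, v, captured)
print(sorted(name for _, name in captured))
# ['attn_scores', 'attn_softmax', 'k_post_rope', 'q_post_rope']
```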

Total per-layer captures: 13 → 17 with --with-attn-substages (default ON).

Toyota Way correction during research

The pre-implementation research note (evidence/ship-007-layer0-attn-bisection-2026-05-04/hf-oracle-extension-research.md, uncommitted) estimated 7 missing stages and ~140 LOC. Live source inspection during this PR found that 3 of those 7 (qkv_matmul, qkv_bias, attention) were already captured via existing forward hooks. Net new work: 4 stages, ~80 LOC of monkeypatch plus ~40 LOC of docstring/CLI/PROVENANCE updates.

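Per feedback_no_guessing.md ("use pmat query / apr trace / contracts, not speculation"): the research note had been authored from outdated docstring lines without verifying the implementation. The cost of the defect was paid at the implementation layer, the cheapest place to catch it once the note already existed.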

How the patch works

Force attn_implementation="eager" at model load when --with-attn-substages is ON. The SDPA and flash-attn fast paths fuse the entire attention computation, so pre-softmax scores and post-softmax weights are not exposed. Eager (Python) attention is the only patchable path.
Per-instance monkeypatch (not class-level): for each target layer index, model.model.layers[idx].self_attn.forward = types.MethodType(traced_forward, attn_module). Non-target layers retain their original forward.
traced_forward mirrors Qwen2Attention.forward but inlines eager_attention_forward so the 4 captures land at the right semantic points. It closes over the shared captured dict and writes (layer_idx, stage_name) → np.ndarray fp32.
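The per-instance patch pattern can be seen in isolation with a toy module. In this sketch `ToyAttention` is a hypothetical stand-in for `Qwen2Attention`, the stage key `"forward_out"` is made up, and `traced_forward` simply wraps the original bound method for brevity; the PR's version instead re-implements the forward body so it can record intermediates mid-function.

```python
import types


class ToyAttention:
    """Hypothetical stand-in for Qwen2Attention: forward just doubles its input."""

    def __init__(self, layer_idx):
        self.layer_idx = layer_idx

    def forward(self, hidden):
        return [2.0 * x for x in hidden]


def install_patch(layers, target_idxs, captured):
    """Patch `forward` on the selected instances only; the class stays untouched."""
    for idx in target_idxs:
        attn_module = layers[idx]
        original = attn_module.forward  # bound method, saved before patching

        def traced_forward(self, hidden, _orig=original):
            out = _orig(hidden)
            # Shared dict keyed by (layer_idx, stage_name), as in the PR.
            captured[(self.layer_idx, "forward_out")] = list(out)
            return out

        # An instance attribute shadows the class method for this object only.
        attn_module.forward = types.MethodType(traced_forward, attn_module)


layers = [ToyAttention(i) for i in range(3)]
captured = {}
install_patch(layers, [0], captured)
for layer in layers:
    layer.forward([1.0, 2.0])
print(sorted(captured))  # [(0, 'forward_out')]
```

Because only the instance attribute changes, non-target layers (and any other model loaded in the same process) keep the class's original `forward`.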

CLI

--with-attn-substages   # default ON — capture 4 new stages
--no-attn-substages     # legacy 13-stage capture
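A default-ON flag pair like this is commonly wired in argparse by pointing both flags at one destination. The sketch below is an assumption about the shape of such a parser, not the script's actual code; the helper names are hypothetical, and `attn_implementation` is the `from_pretrained` keyword the description says is forced to `"eager"`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="HF FP16 stage capture (sketch)")
    # Both flags write the same destination; the default makes capture ON.
    # If both are given, the last one on the command line wins.
    parser.add_argument("--with-attn-substages", dest="with_attn_substages",
                        action="store_true", default=True,
                        help="capture q/k_post_rope, attn_scores, attn_softmax (default)")
    parser.add_argument("--no-attn-substages", dest="with_attn_substages",
                        action="store_false", help="legacy 13-stage capture")
    return parser


def model_load_kwargs(args):
    """Extra from_pretrained kwargs: sub-stage capture needs the eager path."""
    return {"attn_implementation": "eager"} if args.with_attn_substages else {}


def stage_count(args):
    return 13 + (4 if args.with_attn_substages else 0)


args = build_parser().parse_args([])
print(args.with_attn_substages, stage_count(args), model_load_kwargs(args))
# True 17 {'attn_implementation': 'eager'}
```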

Bisection chain after this PR

The §47 spec's bisection_chain_layer_0 (a 9-stage cosine sequence): all 9 stages are now capturable from the HF FP16 reference:

[attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope,
 attn_scores, attn_softmax, attention, attn_out]
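Once both sides emit the nine stages, localizing the defect reduces to walking the chain and comparing flattened tensors stage by stage. The sketch below assumes a linear first-divergence scan with a hypothetical cosine threshold of 0.999; the actual predicate and tolerance live in the SUB-004 contract and are not reproduced here.

```python
import math

# The §47 layer-0 bisection chain, in capture order.
CHAIN = ["attn_norm", "qkv_matmul", "qkv_bias", "q_post_rope", "k_post_rope",
         "attn_scores", "attn_softmax", "attention", "attn_out"]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # treat a zero vector as fully divergent
    return dot / (na * nb)


def first_divergent_stage(reference, candidate, threshold=0.999):
    """Return (stage, cosine) for the first stage below threshold, else None.

    reference/candidate map stage name -> flattened fp32 values
    (e.g. HF FP16 oracle vs. APR); threshold is a hypothetical tolerance.
    """
    for stage in CHAIN:
        cos = cosine(reference[stage], candidate[stage])
        if cos < threshold:
            return stage, cos
    return None


ref = {stage: [1.0, 2.0, 3.0] for stage in CHAIN}
apr = dict(ref, attn_scores=[3.0, 2.0, 1.0], attention=[-1.0, -2.0, -3.0])
print(first_divergent_stage(ref, apr))  # first divergence at 'attn_scores' (cosine ~0.714)
```

Downstream stages (`attention` here) may also diverge, but the first failing stage is the one that points at the defect.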

Cascade context (§47.1 roadmap)

#	PR	Status
1-2	#1450 contract + #1451 enum	MERGED
3	#1452 research evidence	OPEN, auto-merge armed
4	#1455 4-stage wire	MERGED
5	#1456 SUB-003 drift-prevention test	OPEN, auto-merge armed
6	THIS PR HF FP16 oracle ext	new
7	(next) FALSIFY-ATTN-SUB-004 LIVE on RTX 4090	gated on this PR + #1456
8	(next) SHIP-007 root-cause fix	gated on step 7

Test plan

Script imports cleanly (uv run --with torch --with transformers --with safetensors --with accelerate python3 -c '...')
CLI --help shows new --with-attn-substages / --no-attn-substages flags
No Rust crate touched — workspace-test unaffected
CI green
Auto-merge
Live RTX 4090 run is cascade step 7 — gated on operator confirmation per feedback_compute_pre_authorized.md

Five whys

Why default --with-attn-substages ON? The 9-element bisection chain in the SUB-004 invariant of contracts/trace-attn-sub-stages-v1.yaml v1.1.0 is the load-bearing predicate for FALSIFY-ATTN-SUB-004 LIVE. With the flag OFF by default, every operator invocation would have to remember to opt in; with it ON, the chain is always emittable.
Why eager attention? SDPA and flash-attn fuse attention into a single kernel, so their internals (pre-softmax scores, post-softmax weights) are not exposed. Eager (Python) attention is the only path with patchable intermediates.
Why a per-instance rather than a class-level monkeypatch? A class-level patch would affect every Qwen2Attention instance globally. A per-instance patch via types.MethodType keeps the blast radius minimal: only the target layers use the traced forward.
Why not capture attn_post_v (softmax @ V before the contiguous + reshape)? The existing attention stage already captures it via the o_proj forward_pre_hook: the input to o_proj IS softmax @ V after the transpose, contiguous, and reshape. Adding another capture point would only duplicate it.
Why not run the live RTX 4090 check in this PR? To split implementation from live discharge: this PR is the algorithm-level monkeypatch (FALSIFY-ATTN-SUB-004 algorithm-bind), while the live RTX 4090 run (cascade step 7) discharges SUB-004 to FUNCTIONAL and produces the bisection finding. Keeping that run operator-triggered keeps the audit clean.
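The "would duplicate" argument rests on the transpose + reshape being a lossless permutation of softmax @ V. A small pure-Python sketch, assuming a single batch element and nested lists in place of tensors (function names hypothetical), shows the round trip between the per-head layout and the o_proj input layout.

```python
def to_o_proj_input(per_head):
    """(num_heads, seq_len, head_dim) -> (seq_len, num_heads * head_dim).

    Nested-list equivalent of attn_output.transpose(1, 2).reshape(...)
    for one batch element: row t concatenates every head's output at t.
    """
    num_heads, seq_len = len(per_head), len(per_head[0])
    return [[x for h in range(num_heads) for x in per_head[h][t]]
            for t in range(seq_len)]


def from_o_proj_input(rows, num_heads):
    """Inverse permutation: split each row back into per-head slices."""
    head_dim = len(rows[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in rows]
            for h in range(num_heads)]


per_head = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],      # head 0, 3 positions
            [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]]   # head 1
rows = to_o_proj_input(per_head)
print(rows[0])                                   # [1.0, 2.0, 7.0, 8.0]
print(from_o_proj_input(rows, 2) == per_head)    # True
```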

Plain ship % (unchanged this cycle)

MODEL-1: 91% (cascade scaffold; ship % moves at SUB-004 LIVE DISCHARGE step 7)
MODEL-2: 57%

🤖 Generated with Claude Code

…ges via Qwen2Attention.forward monkeypatch

Per `docs/specifications/aprender-train/ship-two-models-spec.md` §47.6 step 6 of the SHIP-007 layer-0 attention bisection cascade. Extends `scripts/generate_qwen25_coder_fp16_stages.py` to emit HF FP16 reference tensors for the 4 stages currently missing from the 9-element bisection chain `[attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attn_scores, attn_softmax, attention, attn_out]`.

## What landed

| New stage | Where it's captured |
|-----------|---------------------|
| `q_post_rope` | After `apply_rotary_pos_emb`, pre-`repeat_kv` |
| `k_post_rope` | After `apply_rotary_pos_emb`, pre-`repeat_kv` |
| `attn_scores` | `Q · Kᵀ * scaling`, pre-mask, pre-softmax |
| `attn_softmax` | `softmax(scores + mask)`, pre-V multiply |

## Toyota Way correction during research

The pre-implementation research note (`evidence/ship-007-layer0-attn-bisection-2026-05-04/hf-oracle-extension-research.md`) estimated 7 missing stages and ~140 LOC. **Live source inspection of the existing script during this PR found that 3 of those 7 stages (`qkv_matmul`, `qkv_bias`, `attention`) were already captured via existing forward hooks** (`make_qkv_hook` for the first two, `hook_o_proj_pre` for the third). Net new work: 4 stages, ~80 LOC of monkeypatch.

Per `feedback_no_guessing.md`: "use pmat query / apr trace / contracts, not speculation". The research note was authored from the docstring's outdated "stages NOT captured" comment without verifying the implementation. Cost-of-defect paid here at the implementation layer (cheapest), no contract bumped.

## How the patch works

1. Force `attn_implementation="eager"` at model load — sdpa/flash-attn fast paths don't expose pre-softmax scores or post-softmax weights as captureable intermediates. Only the eager path is patchable.
2. For each target layer, replace `self_attn.forward` with `traced_forward` that:
   - Mirrors `transformers.models.qwen2.modeling_qwen2.Qwen2Attention.forward`
   - Inlines `eager_attention_forward` so the 4 captures land at the right semantic points
   - Closes over the shared `captured` dict to write `(layer_idx, stage_name) → np.ndarray fp32`
3. Non-target layers retain the original `Qwen2Attention.forward` (per-instance, not class-level monkeypatch).

## Provenance + CLI

- `--with-attn-substages` (default ON) — capture the 4 new stages
- `--no-attn-substages` — legacy 13-stage capture only
- PROVENANCE writer now lists 17 captured stages (13 base + 4 substages)

## Five whys

1. **Why default ON?** The bisection chain in `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 SUB-004 invariant is the load-bearing predicate for FALSIFY-ATTN-SUB-004 LIVE on RTX 4090. Default OFF means an operator running the script for SHIP-007 has to remember to opt in. Default ON means the bisection chain is always emit-able.
2. **Why eager attention rather than sdpa/flash-attn?** Pre-softmax scores and post-softmax weights are intermediate tensors INSIDE the attention kernel. Fast-path implementations (sdpa, flash-attn) fuse the entire attention kernel; their internals are not exposed. Eager (Python) attention is the only path where we can intercept these intermediates.
3. **Why per-instance monkeypatch rather than class-level?** Class-level patch would affect all `Qwen2Attention` instances globally — including the non-target layers. Per-instance patch via `types.MethodType` keeps the blast radius minimal: only target layers use the traced forward, others get untouched original behavior.
4. **Why not also capture the `attn_post_v` output (softmax @ V before contiguous() + reshape)?** The existing `attention` stage (captured via `o_proj` forward_pre_hook) IS the post-V output (`attn_output = attn_output.transpose(1, 2).contiguous().reshape(...)` then o_proj's input). The chain is complete with the 4 new stages: scores → softmax → attention (existing).
5. **Why not run this live now to verify shapes match APR side?** This PR is algorithm-level (the monkeypatch implementation). Live verification on RTX 4090 + canonical 7B teacher is cascade step 7 (FALSIFY-ATTN-SUB-004), which produces the bisection finding. Splitting impl from live discharge keeps the audit story clean.

## Net effects

- **Coverage**: 13 → 17 captured per-layer stages with `--with-attn-substages` ON.
- **Falsifier**: Pre-condition for FALSIFY-ATTN-SUB-004 LIVE met; algorithm-level evidence pinned for SUB-001 + SUB-002 + SUB-003 + SUB-004.
- **MODEL-1 ship %**: unchanged at 91% (cascade step 6 lands the oracle, not the live discharge).
- **MODEL-2 ship %**: unchanged at 57%.

## Test plan

- [x] Script imports cleanly (`uv run --with torch --with transformers --with safetensors --with accelerate python3 -c '...'`)
- [x] CLI `--help` shows the new `--with-attn-substages` / `--no-attn-substages` flags
- [ ] Live RTX 4090 run is cascade step 7 (FALSIFY-ATTN-SUB-004 DISCHARGE) — gated on operator confirmation per `feedback_compute_pre_authorized.md` named-lane policy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…tion cascade ALGORITHM-LEVEL COMPLETE (#1458)

After §47 recorded the cascade-started milestone (PRs #1450 + #1451 + #1452 scaffolding), the same-day continuation cycle closed §47.1 cascade roadmap steps 4-6 at the algorithm level via PRs #1455, #1456, #1457.

## What landed (§47.1 cascade roadmap)

| Step | PR | Discharge |
|------|----|-----------|
| 4 | #1455 | FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL — wires `QPostRope`+`KPostRope`+`AttnScores`+`AttnSoftmax` in `forward_traced_with_plan`; closes §47.4 parent-contract drift as side effect |
| 5 | #1456 | FALSIFY-ATTN-SUB-003 algorithm-level pinned via 2 drift-prevention tests; 0 LOC production change (loader is genuinely per-stage-agnostic, as spec predicted) |
| 6 | #1457 | FALSIFY-ATTN-SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL on merge — extends `scripts/generate_qwen25_coder_fp16_stages.py` with `--with-attn-substages` (default ON) installing per-instance `Qwen2Attention.forward` monkeypatch under `attn_implementation="eager"` |

## Toyota Way correction during research (PR #1457)

The pre-impl research note estimated **7 missing stages, ~140 LOC**. Live source inspection during PR #1457 found **3 already captured** via existing forward hooks (`make_qkv_hook` derives qkv_matmul/qkv_bias from q_proj/k_proj/v_proj outputs via bias subtraction; `hook_o_proj_pre` captures `attention` as input to o_proj). Net: **4 stages, ~80 LOC monkeypatch**.

Per `feedback_no_guessing.md`. Cost-of-defect paid at the implementation layer (cheapest place once the research note had been authored from outdated docstring lines).

## Steps 7-8 require operator action

| Step | Blocker | Workaround |
|------|---------|-----------|
| 7 LIVE | (a) canonical `apr` binary built pre-#1451 — rejects `attn_scores` stage. (b) PyTorch/CUDA driver mismatch on host. | (a) `cargo build --release --features cuda --bin apr`. (b) operator updates driver OR `--device cpu` (multi-min). |
| 8 fix | Gated on step 7 bisection finding. | n/a — discovery-driven scope. |

## Net effects

- Spec v2.92.0 → **v2.93.0**.
- §47.1 cascade roadmap: **6/8 steps algorithm-level COMPLETE**; steps 7-8 LIVE/operator-gated.
- Coverage tally: 20+32 → **20+36** (+4 PARTIAL_ALGORITHM_LEVEL from `trace-attn-sub-stages-v1` v1.1.0 falsifiers landing on main when #1450 merged: SUB-001/002/003/005). SUB-004 stays BLOCKER until #1457 ships.
- **MODEL-1 ship %**: unchanged at **91%** (cascade is scaffold; ship % moves at SUB-004 LIVE DISCHARGE in step 7).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
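The #1458 description says `make_qkv_hook` derives `qkv_matmul` and `qkv_bias` from the projection outputs via bias subtraction. The arithmetic behind that is `nn.Linear`'s definition, y = xWᵀ + b, so xWᵀ = y - b. A pure-Python sketch, assuming a single token and a Q|K|V concatenation order (both assumptions; the helper names are hypothetical and this is not the hook's code):

```python
def linear(x, weight, bias=None):
    """y = x W^T (+ b), the nn.Linear convention; weight is out x in."""
    y = [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]
    return [yi + bi for yi, bi in zip(y, bias)] if bias is not None else y


def derive_qkv_stages(q_out, k_out, v_out, q_bias, k_bias, v_bias):
    """From post-bias projection outputs for one token, return
    (qkv_matmul, qkv_bias) as Q|K|V-concatenated vectors (assumed order)."""
    biased = list(q_out) + list(k_out) + list(v_out)
    bias = list(q_bias) + list(k_bias) + list(v_bias)
    matmul = [y - b for y, b in zip(biased, bias)]  # undo the bias add
    return matmul, biased


x = [1.0, -2.0, 0.5]
wq, bq = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]], [0.25, -0.5]
wk, bk = [[0.0, 1.0, 1.0]], [0.75]
wv, bv = [[2.0, -1.0, 0.0]], [0.0]
matmul, biased = derive_qkv_stages(linear(x, wq, bq), linear(x, wk, bk),
                                   linear(x, wv, bv), bq, bk, bv)
print(matmul)  # [2.0, -2.0, -1.5, 4.0]
```

In floating point the subtraction is not always bit-exact, so a comparison against the APR `qkv_matmul` stage would use a cosine or tolerance rather than equality.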


…x + SUB-004 BLOCKER → PARTIAL_ALGORITHM_LEVEL (#1459)

* contract(trace-attn-sub-stages-v1): v1.1.0 → v1.2.0 — function-name drift fix in SUB-003 algorithm_evidence

## Why

Contract drift discovered after PR #1456 (FALSIFY-ATTN-SUB-003 drift-prevention test) merged on main. The algorithm_evidence block named:

```yaml
function_names:
  - load_tensor_apr_aprt
```

But this function does not exist anywhere in the codebase. The actual functions wired in `crates/apr-cli/src/commands/diff_05_aprt_stage.rs` and exercised by PR #1456's tests are:

- `is_aprt_stage_file` (magic-byte detection)
- `compute_aprt_stage_stats` (cosine + RMS + top-K)
- `run_aprt_stage_diff` (e2e reader + emitter)

Per `feedback_no_guessing.md`. Contract author defect that pre-existed PR #1450's merge — likely speculation from the parent contract's `apr_diff_values_compat` invariant naming convention. Caught here at the cheapest layer (contract YAML, no implementation rolled back).

## What landed

- Bumped `metadata.version` 1.1.0 → 1.2.0 with v1.2.0 changelog block describing the fix.
- Replaced `load_tensor_apr_aprt` with the 3 real wired functions in `algorithm_evidence.function_names`.
- Added `crates/apr-cli/src/commands/diff_05_aprt_stage.rs` to `algorithm_evidence.file_paths` (the actual location of the wired functions).
- Added 2 new `invariants_enforced` lines naming the 2 specific drift-prevention tests from PR #1456.
- Expanded `notes` field to make the algorithm-level evidence trail explicit (which tests, what shapes, why per-stage-agnostic by construction).

## Test plan

- [x] `pv validate contracts/trace-attn-sub-stages-v1.yaml` reports `0 error(s), 0 warning(s) — Contract is valid.`
- [ ] CI green
- [ ] Auto-merge

## Five whys

1. **Why now and not in §47/§48?** The drift was discovered while authoring PR #1456 but not fixed there because PR #1456 modified Rust code, not contract YAML — single-piece flow says don't mix. Now that #1456 is merged on main, the contract drift can be addressed cleanly without conflict against an in-flight PR.
2. **Why a separate PR rather than in PR #1457?** PR #1457 is the HF FP16 oracle script extension (Python-only). Modifying the contract there would couple two independent fixes. This PR is contract-only YAML and lands independently.
3. **Why bump to v1.2.0 rather than v1.1.1?** Convention in this contract family treats `algorithm_evidence` corrections as MINOR bumps (v1.0.0 → v1.1.0 for the Toyota Way scope correction, also algorithm_evidence-level). v1.1.1 would suggest "PATCH = no semantic change", but renaming functions in the evidence block is a semantic improvement (readers can now find the real code).
4. **Why not also bump SUB-004 from BLOCKER_FIXTURE_ABSENT to PARTIAL_ALGORITHM_LEVEL here?** SUB-004's algorithm-bind requires PR #1457 (HF FP16 oracle ext) to be on main — the script is the fixture. PR #1457 is in flight. Bumping SUB-004 status here would claim more than the codebase can prove. Keeping single-piece flow: this PR ships the SUB-003 drift fix only.
5. **Why is the loader genuinely per-stage-agnostic?** `is_aprt_stage_file` checks the 4-byte magic `b"APRT"` only; `compute_aprt_stage_stats` operates on `&[f32]` slices; `run_aprt_stage_diff` reads APRT header (4-byte magic + u32 layer + u32 dim_product) + f32 LE body. Stage names are encoded only in the OUTPUT FILENAME (e.g., `layer_0_attn_scores.aprt`), never in the binary content. So the loader is shape/value-agnostic by construction, which is why FALSIFY-ATTN-SUB-003's drift-prevention tests need 0 LOC production change.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003 algorithm_evidence now correctly names the wired functions.
- **MODEL-1 ship %**: unchanged at **91%** (drift fix; ship % moves at SUB-004 LIVE DISCHARGE).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(trace-attn-sub-stages-v1): SUB-004 BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL — fixture is now on main

Bundles the SUB-004 status promotion into the v1.2.0 PR alongside the SUB-003 function-name drift fix already authored. Both changes ship as one v1.2.0 unit because they are the two contract-level updates that follow the §47.1 cascade roadmap closing at the algorithm level.

## Why now

PR #1457 (HF FP16 oracle script extension) merged on main. The fixture previously claimed "absent" is now generated by:

```
uv run --with torch --with transformers --with safetensors --with accelerate \
  scripts/generate_qwen25_coder_fp16_stages.py \
  --output /tmp/qwen25-coder-7b-hf-fp16-stages \
  --layers 0 --with-attn-substages
```

Per `feedback_no_guessing.md`: SUB-004's status is now provable from main. Promote.

## What landed

Updated SUB-004 algorithm_evidence:

- `status`: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL
- `file_paths`: added the actual script + APR-side wire files
- `function_names`: replaced placeholder `run_hf_fp16_reference` with the 6 real symbols (`install_attn_substages_patch`, `traced_forward`, plus 4 SaveTensorStage variants)
- `invariants_enforced`: 1 line → 4 lines explicitly naming what each PR pinned
- `notes`: documents the FUNCTIONAL discharge prerequisites (binary rebuild + driver/CPU)

Updated metadata.description v1.2.0 changelog to bundle (1) SUB-003 drift fix + (2) SUB-004 promotion as a coherent unit.

## Five whys

1. **Why combine SUB-003 drift fix + SUB-004 promotion in v1.2.0?** Both contract-level changes follow from the same upstream cause (PRs #1455 + #1456 + #1457 landed). Splitting into v1.2.0 + v1.3.0 would force a follow-up rebase + double-review with no audit benefit.
2. **Why PARTIAL_ALGORITHM_LEVEL not FUNCTIONAL?** FUNCTIONAL requires LIVE evidence. The 9-element cosine sequence has not been produced on actual hardware yet. Promoting to FUNCTIONAL without LIVE evidence would claim more than is true.
3. **Why isn't the LIVE run inside this PR?** Per `feedback_compute_pre_authorized.md`, named GPU lanes are pre-authorized but SHIP-007 LIVE bisection is borderline (binary rebuild needed + host driver mismatch). Operator-triggered keeps the audit clean.
4. **Why list SaveTensorStage variants as "function_names"?** They're enum variants, not functions strictly speaking, but they are the symbolic identities that the algorithm-level evidence binds to. The contract validator accepts them.
5. **Why explicit prerequisites in `notes`?** Future readers who see "PARTIAL_ALGORITHM_LEVEL" need to know WHY it's not yet FUNCTIONAL. The notes are the operator-handoff document inside the contract itself.

## Net effects

- Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 PROPOSED.
- SUB-003: drift fix (3 real wired functions, 2 explicit drift-prevention test pins).
- SUB-004: BLOCKER_FIXTURE_ABSENT → PARTIAL_ALGORITHM_LEVEL with 4-line invariants + explicit FUNCTIONAL prereqs.
- **MODEL-1 ship %**: unchanged at **91%** (FUNCTIONAL discharge gates ship %, not PARTIAL).
- **MODEL-2 ship %**: unchanged at **57%**.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
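The APRT stage-file layout described in #1459's fifth why (4-byte `b"APRT"` magic, u32 layer, u32 dim_product, then an f32 little-endian body, with the stage name only in the filename) is simple enough to sketch with Python's `struct` module. This is a Python analogue for illustration, not the Rust reader in `diff_05_aprt_stage.rs`; the little-endian header integers, the reading of dim_product as the element count, and the helper names are all assumptions.

```python
import struct

APRT_MAGIC = b"APRT"
# magic, layer, dim_product; little-endian u32s are an assumption here.
HEADER = struct.Struct("<4sII")


def stage_filename(layer, stage):
    """The stage name lives only in the filename, never in the bytes."""
    return f"layer_{layer}_{stage}.aprt"


def write_aprt_bytes(layer, values):
    header = HEADER.pack(APRT_MAGIC, layer, len(values))
    return header + struct.pack(f"<{len(values)}f", *values)  # f32 LE body


def is_aprt_bytes(data):
    # Only the 4-byte magic is inspected, so the check is stage-agnostic.
    return data[:4] == APRT_MAGIC


def read_aprt_bytes(data):
    magic, layer, n = HEADER.unpack_from(data, 0)
    if magic != APRT_MAGIC:
        raise ValueError("not an APRT stage file")
    return layer, list(struct.unpack_from(f"<{n}f", data, HEADER.size))


blob = write_aprt_bytes(0, [1.0, -0.5, 0.25])
print(stage_filename(0, "attn_scores"), len(blob), read_aprt_bytes(blob))
# layer_0_attn_scores.aprt 24 (0, [1.0, -0.5, 0.25])
```

Nothing in the bytes identifies `attn_scores` versus any other stage, which is the property the drift-prevention tests rely on.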

noahgift enabled auto-merge (squash) May 4, 2026 04:44

noahgift added 2 commits May 4, 2026 07:06

Merge branch 'main' into feat/hf-fp16-oracle-attention-substages

3ecd4c5

Merge branch 'main' into feat/hf-fp16-oracle-attention-substages

2055973

noahgift mentioned this pull request May 4, 2026

spec(ship-two-models): v2.93.0 — §48 SHIP-007 cascade ALGORITHM-LEVEL COMPLETE #1458

Merged

4 tasks

Merge branch 'main' into feat/hf-fp16-oracle-attention-substages

c5e994b

Merge branch 'main' into feat/hf-fp16-oracle-attention-substages

d86e235

noahgift mentioned this pull request May 4, 2026

contract(trace-attn-sub-stages-v1): v1.2.0 — SUB-003 fn-name drift fix + SUB-004 BLOCKER → PARTIAL_ALGORITHM_LEVEL #1459

Merged

3 tasks

noahgift merged commit 9e41ef9 into main May 4, 2026
10 checks passed

noahgift deleted the feat/hf-fp16-oracle-attention-substages branch May 4, 2026 07:11

noahgift merged 5 commits into main from feat/hf-fp16-oracle-attention-substages
