feat(p3-prc): wire apr trace --payload <gguf> to call forward_traced — emits per-layer LayerActivation telemetry #1083

Merged: noahgift merged 1 commit into main from feat/p3-prc-trace-cli-wiring on Apr 28, 2026

@noahgift

Contributor

Summary

P3 PR C — completes the SHIP-007 §26.4 P3 chain by wiring the new forward_traced method (PR A scaffold + PR B sub-FFN populate) into the apr-cli trace dispatch.

Stacked on PR #1082 (PR B), which is stacked on PR #1081 (PR A). Will retarget to main once both merge.

What this PR does

Without this, apr trace --payload <model.gguf> only does generation+garbage-detection — it does NOT emit per-layer telemetry needed for the §23 layer-3 ffn_swigl APR-vs-GGUF bisection.

After this PR:

$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.gguf
...
FORWARD PASS (with layer tracing):

EMBEDDING:
  ...

LAYER-BY-LAYER ACTIVATIONS:
  Layer  3/28 [OK]
    attn_norm: ...
    qkv      : ...
    ...
    ffn_gate : ...
    ffn_up   : ...
    ffn_silu : ...
    ffn_swigl: ...     ← NEW — the §23 17×-anomaly site, now observable on GGUF
    ffn_out  : ...
    output   : ...
  ...

FINAL LAYER NORM: ...

Files changed

| File | Change |
|---|---|
| `crates/apr-cli/src/commands/trace.rs::run_traced_inference_gguf` | Calls `model.forward_traced(&test_tokens)` BEFORE generation, prints layer activations + summary |
| `crates/apr-cli/src/commands/vector_stats.rs` | 4 helpers flipped to `pub(crate)` for reuse from `trace.rs` (already used by APR side) |

The 4 helpers made pub(crate):

  • print_layer_activations
  • print_logit_predictions
  • print_trace_summary
  • print_activation_stats / print_activation_stats_colored

Output format matches APR side exactly — apr trace --payload <file>.apr and <file>.gguf produce side-by-side comparable per-layer stat blocks.

§26.4 binding criterion now runnable

Once this PR + #1081 + #1082 all merge:

$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr  | grep -A1 "Layer  3"
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.gguf | grep -A1 "Layer  3"

| Outcome | Verdict |
|---|---|
| ratio ≥10× | SHIP-007 bug is APR-side at `apr_transformer/inference.rs:160-164` |
| ratio <2× | 17× spike is normal Qwen2.5 trained behavior; bug elsewhere |

Either discharges 5 MODEL-1 PARTIALs at once per §17.5 (SHIP-002/005/006/007/008).
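
The decision rule in the table can be sketched in Rust. This is only an illustration, not code from this PR: `Verdict`, `std_ratio`, and `classify` are hypothetical names, and the `Inconclusive` arm is an assumption, since the table does not say what a ratio between 2× and 10× means.

```rust
/// Hypothetical verdict type for the §26.4 binding criterion.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// ratio >= 10x: the anomaly is on the APR side.
    AprSide,
    /// ratio < 2x: the spike is normal trained behavior; bug elsewhere.
    NormalBehavior,
    /// Assumption: the source gives no verdict for 2x <= ratio < 10x.
    Inconclusive,
}

/// Ratio of the APR std to the GGUF std for the same stat slot.
pub fn std_ratio(apr_std: f64, gguf_std: f64) -> f64 {
    apr_std / gguf_std
}

/// Applies the two thresholds from the Outcome/Verdict table.
pub fn classify(ratio: f64) -> Verdict {
    if ratio >= 10.0 {
        Verdict::AprSide
    } else if ratio < 2.0 {
        Verdict::NormalBehavior
    } else {
        Verdict::Inconclusive
    }
}
```

Feeding it the two `ffn_swigl` std values read off the layer-3 lines of each trace would give the verdict directly, for example `classify(std_ratio(apr_std, gguf_std))`.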

Validated

$ cargo check -p apr-cli --features inference
Finished `dev` profile

$ cargo clippy -p apr-cli --features inference -- -D warnings
Finished `dev` profile

Spec references

Test plan

  • CI workspace-test passes
  • CI gate passes
  • No regression in existing GGUF generation tests
  • apr trace --payload <gguf> smoke-test shows new telemetry block (live test post-merge)

🤖 Generated with Claude Code

noahgift added a commit that referenced this pull request Apr 27, 2026
…oard + critical-path map — spec v2.73.0 → v2.74.0 (#1087)

Session-end snapshot consolidating today's 10-PR cascade into a
single source-of-truth for next session.

The goal: ship two models to HF, both built end-to-end on the
in-tree Sovereign AI Stack.

Coverage scoreboard EOD 2026-04-27:
| Category    | DISCHARGED | PARTIAL | Total | %D  |
|-------------|-----------:|--------:|------:|----:|
| MODEL-1     |          5 |       5 |    10 | 50% |
| MODEL-2     |          3 |       9 |    12 | 25% |
| GPUTRAIN    |          7 |       0 |     7 |100% |
| Ship Gates  |          - |      12 |    12 |  0% |
| Falsifiers  |          - |       7 |     7 |  0% |
| Sum         |         15 |      33 |    48 | 31% |

Critical path — MODEL-1: PR E (replace helpers::f32_matmul with
Q4K-fused dispatch) discharges 5 PARTIALs at one fix site.
~150-300 LOC.

Critical path — MODEL-2: P1.1 (apr pull dataset extension) →
P1.4 (corpus pull) → P2 (100K-step training) discharges 9
PARTIALs.

10-PR session cascade (6 merged, 4 open + this):
- #1076-#1080: spec + contract foundation (MERGED)
- #1081: P3 PR A scaffold (MERGED)
- #1082-#1083: P3 PR B+C wiring (OPEN, stacked)
- #1084-#1085: §27/§28 binding criterion + root cause (OPEN)
- #1086: PR D forward-parity contract (OPEN)

Falsification chain (complete, root-reached):
§15.4 → §16 → §17 → §23 → §27 → §28 → PR D contract → PR E (next)
"forward path" → ... → "APR F32 vs GGUF Q4K matmul precision"
                            → "binding criterion as durable spec"
                            → "fix at mod_apr_transformer.rs:138-140"

Methodology preserved: zero eprintln!, zero route-arounds, apr
canonical, contract-first, lambda-labs pre-authorized, 5-whys
reaches root.

Next session: PR E first (5 ACs), then P1.1 + P1.4 + P2
(9 ACs).

Spec v2.73.0 → v2.74.0. No coverage flip at amendment — §29 is
a scoreboard, not a discharge.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…irmed APR-side at inference.rs:160-164 — spec v2.71.0 → v2.72.0 (#1084)

Live evidence on noah-Lambda-Vector RTX 4090 2026-04-27.
Built apr from PR #1083 branch (commits 77c016b + c657968
+ f249464 from PR A+B+C cascade). Ran `apr trace --payload`
on canonical 7B teacher in BOTH formats with identical prompt
+ tokenizer.

Result:
| Layer | APR ffn_swigl std | GGUF ffn_swigl std | Ratio |
|------:|------------------:|-------------------:|------:|
| 3     | 1.2216            | 0.0670             | 18.23x |

§26.4 binding criterion threshold: ≥10x → APR-side bug.
**Observed 18.23x, about 1.8 times the 10x threshold: decisive verdict.**

The investigation chain that started in §15.4 (GPU GQA
elimination) has reached its conclusion at §27:

§15.4 → §16 → §17 → §23 → §27 (this)
"Whole forward path" → "GPU eliminated" → "(layer=3, FFN sub-block)"
→ "(layer=3, ffn_swigl)" → "**APR-side at inference.rs:160-164**"

Cascade-damping signature confirmed:
- Layers 0-2: ratio ~1.1x (normal)
- Layer 3: 18.23x (anomaly)
- Layers 4-5: 3.3-4.5x (cascade)
- Layer 6+: ~1x (recovered)

This is consistent with a localized perturbation (off-by-one,
buffer aliasing, or F32-vs-Q4K dequant defect at layer-3-
specifically) rather than persistent residual-stream corruption.

Per §17.5, SHIP-007 fix discharges 5 MODEL-1 PARTIALs at once
(SHIP-002/005/006/007/008). §26.5 expected coverage flip: 33+12
→ 28+17 when fix lands.

§27 does NOT discharge by itself — it locates the bug for fixing.
Next investigation reads `inference.rs:160-164` and tests 4 hypotheses:
1. Off-by-one slice indexing
2. Buffer aliasing (scratch reuse pattern)
3. F32-vs-Q4K dequant defect at layer-3 input range
4. Activation overflow (SiLU saturation amplifies multiply)

Methodology held throughout: zero eprintln!, zero route-arounds,
apr is canonical (§26.8), all instrumentation via `apr trace
--payload`. Lambda-labs lane pre-authorized.

Evidence persisted to evidence/ship-007-apr-vs-gguf-2026-04-27/:
- apr-trace.txt (13.5 KB)
- gguf-trace.txt (13.7 KB)
- binding-criterion-summary.json

Note: §27 reproduction requires PR #1081 + #1082 + #1083
cascade to merge first (the apr trace --payload <gguf> wiring
is in PR C). Evidence was generated with a local build of PR
#1083 branch.

Spec v2.71.0 → v2.72.0. Coverage flip pending fix.

Spec: SPEC-SHIP-TWO-001 §26.4 P3 verdict
References:
- §15.4 (PR #1062) — GPU GQA eliminated
- §16 (PR #1063) — APR CPU isolated
- §17 (PR #1064) — layer-3 FFN sub-block
- §23 (PR #1075) — layer-3 ffn_swigl named
- §26.8 (PR #1079) — apr-is-canonical methodology rule
- PR #1081 (P3 PR A scaffold)
- PR #1082 (P3 PR B sub-FFN populate)
- PR #1083 (P3 PR C CLI wiring)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/p3-prb-gguf-forward-traced-subffn branch from c657968 to deee405 Compare April 28, 2026 06:40
Base automatically changed from feat/p3-prb-gguf-forward-traced-subffn to main April 28, 2026 07:04
…— emits per-layer LayerActivation telemetry

P3 PR C — completes the SHIP-007 §26.4 P3 chain by wiring the new
forward_traced method (PR A scaffold + PR B sub-FFN populate) into
the apr-cli trace dispatch. Without this, `apr trace --payload
<model.gguf>` only does generation+garbage-detection — it does NOT
emit per-layer telemetry needed for the §23 layer-3 ffn_swigl
APR-vs-GGUF bisection.

Changes:

1. crates/apr-cli/src/commands/trace.rs::run_traced_inference_gguf
   Now calls model.forward_traced(&test_tokens) BEFORE generation,
   prints embed/per-layer/final-norm/logit/summary stats via the
   existing vector_stats helpers. Falls back gracefully on Err
   (e.g., encoder-decoder models from PR A's guard).

2. crates/apr-cli/src/commands/vector_stats.rs
   4 helpers flipped from private to pub(crate) so trace.rs
   GGUF dispatch can reuse them (they were already used by the APR
   dispatch in run_traced_inference_apr):
   - print_layer_activations
   - print_logit_predictions
   - print_trace_summary
   - print_activation_stats / print_activation_stats_colored

Output format matches the APR side exactly, so `apr trace --payload
<file>.apr` and `apr trace --payload <file>.gguf` produce
side-by-side comparable per-layer stat blocks. The §23 layer-3
ffn_swigl line emits as `ffn_swigl: ...` between ffn_silu and
ffn_out (already handled by print_layer_activations:137-142
suppression-when-zero pattern from PR #1066).

After this PR + PR A + PR B all merge, the §26.4 binding criterion
becomes runnable on noah-Lambda-Vector RTX 4090:

```
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr  | grep -A1 "Layer  3"
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.gguf | grep -A1 "Layer  3"
```

Outcome:
- ratio ≥10× → SHIP-007 bug is APR-side at apr_transformer/inference.rs:160-164
- ratio  <2× → 17× spike is normal Qwen2.5 trained behavior

Either discharges 5 MODEL-1 PARTIALs at once per §17.5
(SHIP-002/005/006/007/008).

Stacked on PR #1082 (PR B), which is stacked on PR #1081 (PR A).
Will retarget to main once both merge.

Validated:
- `cargo check -p apr-cli --features inference` exits 0
- `cargo clippy -p apr-cli --features inference -- -D warnings` exits 0

Spec: SPEC-SHIP-TWO-001 §26.4 P3 final wiring step
References:
- PR #1081 (P3 PR A: GGUF forward_traced scaffold)
- PR #1082 (P3 PR B: sub-FFN populate)
- §23 (layer-3 ffn_swigl is the first 17× anomaly site, APR side)
- project_ship_007_gguf_forward_traced_plan.md (CLI wiring step)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/p3-prc-trace-cli-wiring branch from f249464 to f5623ec Compare April 28, 2026 07:07
@noahgift noahgift enabled auto-merge (squash) April 28, 2026 07:07
@noahgift noahgift merged commit dbef003 into main Apr 28, 2026
10 checks passed
@noahgift noahgift deleted the feat/p3-prc-trace-cli-wiring branch April 28, 2026 07:34
noahgift added a commit that referenced this pull request Apr 28, 2026
… — v2.80 → v2.81

Landmark section in plain prose for readers who don't want to chase the
§15→§35 hypothesis chain. Each model is blocked by a single concrete
problem.

MODEL-1: numerical bug at layer 3 of FFN. 18× std anomaly vs GGUF reference.
Three theories tested+refuted today (matmul kernel via §30, qkv_bias via §32,
layer-3 weight bytes via #1082 byte-compare). Actual bug is cumulative F32
precision drift through residuals. Fix path: with PR #1082 merged + PR #1083
in flight, run apr trace --payload on canonical 7B teacher in both formats
and bisect layer-by-layer.

MODEL-2: trained end-to-end today. val_loss=9.38 (spec target 3.0). 370M
from-scratch has converged — 4x more steps yielded same outcome (§34).
Capacity is the binding, not corpus or compute. Path forward: distillation
from shipped MODEL-1 7B teacher. apr distill is currently a stub (§35);
contract authored as #1097, impl is multi-day Rust task.

Both blockers are fixable with code, not training time:
- MODEL-1: bisect with new sub-FFN telemetry, then fix at root
- MODEL-2: implement apr distill --stage train, then run 2-4h distillation

Today's session: 11 PRs landed (6 spec amendments + 4 contracts + 1 impl
+ 2 SHIP-007 sub-FFN telemetry PRs) plus full P1.0→P2 pipeline executed
end-to-end with zero muda.

Header v2.80.0 → v2.81.0. No coverage flip — landmark only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 28, 2026
… — v2.80 → v2.81 (#1098)
noahgift added a commit that referenced this pull request Apr 28, 2026
…gate matmul output

After PR #1082 (sub-FFN populate) and #1083 (CLI wiring) merged today,
ran `apr trace --payload` on canonical 7B teacher in both APR and GGUF
formats. First time we have side-by-side per-layer sub-FFN stats.

Layer-3 result (1.36× ratio at ffn_gate, amplifies to 60× at ffn_out):

| Stat | APR | GGUF | Ratio |
|------|----:|-----:|------:|
| ffn_norm (input) | 0.995 | 1.035 | 0.96× |
| ffn_gate (post-matmul) | 1.924 | 1.413 | 1.36× ← divergence |
| ffn_up | 1.335 | 1.456 | 0.92× |
| ffn_silu | 0.168 | 0.037 | 4.59× silu amp |
| ffn_swigl | 1.222 | 0.067 | 18.23× compound |
| ffn_out | 11.459 | 0.191 | 60.0× cascade |

Layer-3 ffn_gate is the FIRST sub-FFN site where APR and GGUF aggregate
stats diverge significantly. Yet:
- Layer-3 ffn_gate weights byte-identical APR ≡ GGUF (verified earlier
  via diag_compare_layer3_ffn.rs)
- ffn_norm inputs agree within 5% on aggregate stats

The remaining hypothesis: per-element values of ffn_norm input differ
(despite similar std), produced by cumulative F32 precision drift
through layers 0-2 residual connections. Per-element diff at this
specific stage is the next investigation step.
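
The "first site where aggregate stats diverge" reading of the table can be sketched as a small scan. This is an illustrative sketch only: `first_divergence` and `LAYER3` are hypothetical names, and the 0.25 tolerance used in the usage note is an assumption, not a threshold from the spec. The std values are copied from the table above.

```rust
/// Returns the first stage whose APR/GGUF std ratio deviates from 1.0
/// by more than `tol`, together with that ratio, or None if all agree.
pub fn first_divergence<'a>(
    stages: &[(&'a str, f64, f64)],
    tol: f64,
) -> Option<(&'a str, f64)> {
    stages
        .iter()
        .map(|&(name, apr, gguf)| (name, apr / gguf))
        .find(|&(_, ratio)| (ratio - 1.0).abs() > tol)
}

/// Layer-3 std values from the table: (stage, APR std, GGUF std).
pub const LAYER3: [(&str, f64, f64); 6] = [
    ("ffn_norm", 0.995, 1.035),
    ("ffn_gate", 1.924, 1.413),
    ("ffn_up", 1.335, 1.456),
    ("ffn_silu", 0.168, 0.037),
    ("ffn_swigl", 1.222, 0.067),
    ("ffn_out", 11.459, 0.191),
];
```

With a tolerance of 0.25, `first_divergence(&LAYER3, 0.25)` selects `ffn_gate`, matching the row marked as the divergence site.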

## Why this matters for shipping MODEL-1

paiml/qwen2.5-coder-7b-apache-q4k-v1 is published to HuggingFace but
its APR backend produces wrong outputs. SHIP-002/005/006/007/008
(5 PARTIALs) all depend on this fix. With this bisection:

- Bug surface narrowed from "(layer 3, FFN sub-block)" (§17) to
  "(layer 3, ffn_gate matmul output)" — first statistical divergence
- Weights agree → fix not in converter
- Aggregate input stats agree → fix in per-element behavior of
  ffn_norm input or matmul nondeterminism
- Once per-element source identified and fixed, the 5 PARTIALs
  promote to DISCHARGED and MODEL-1 ships cleanly through both
  APR and GGUF backends

Files:
- evidence/ship-007-layer3-bisection-2026-04-28/findings.md
- evidence/ship-007-layer3-bisection-2026-04-28/apr-trace.txt
- evidence/ship-007-layer3-bisection-2026-04-28/gguf-trace.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 28, 2026
…gate matmul (first statistical site) (#1099)

* docs(ship-007): layer-3 sub-FFN bisection — divergence STARTS at ffn_gate matmul output

* docs(ship-007): per-layer drift accumulation analysis — testable hypothesis for the fix

Parsed apr-trace.txt and gguf-trace.txt to compute APR/GGUF std ratio
across all sub-stages of layers 0-6. Result: drift accumulates gradually
in layers 0-2 (output ratio 1.12 → 1.39 → 1.30) then EXPLODES at layer 3
(output ratio 18.57x).

Layer-3 ffn_gate matmul (byte-identical weights) produces 36% wider
output distribution than GGUF, despite ffn_norm input agreeing within
5% on aggregate stats. Silu's saturated regime at gate values near -6
amplifies the 36% to 4.6x ffn_silu, then 18.2x ffn_swigl, then 60x ffn_out.

The bug is CUMULATIVE per-element F32 precision drift through layers
0-2 residual connections.

## Concrete next investigation step

Hypothesis: APR's matmul reduction is parallel (rayon) producing
non-deterministic ordering of f32 accumulations. GGUF's may be serial
or have fixed deterministic order. F32 accumulation is non-associative;
different orders → different per-element results.

Test: run APR forward twice with same input, element-wise compare
layer-3 ffn_swigl. If non-deterministic across runs, parallel reduction
is the source.
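
A minimal Rust sketch of why accumulation order matters, plus the element-wise comparison the test calls for. It does not reproduce the APR matmul or rayon; `sum_in_order` and `max_abs_diff` are hypothetical helpers, and the inputs are chosen only to make the f32 rounding visible.

```rust
/// Sums f32 values strictly left to right.
pub fn sum_in_order(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0_f32, |acc, &x| acc + x)
}

/// Largest absolute element-wise difference between two runs.
/// A non-zero result for two runs on the same input would indicate
/// run-to-run nondeterminism.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "runs must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0_f32, f32::max)
}
```

Near 1e8 the spacing between adjacent f32 values is 8, so adding 1.0 to 1e8 is lost to rounding: `sum_in_order(&[1.0e8, 1.0, -1.0e8])` yields 0.0, while the reordered `sum_in_order(&[1.0e8, -1.0e8, 1.0])` yields 1.0. A parallel reduction that changes the grouping between runs can produce the same kind of per-element difference.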

## Path to shipping MODEL-1

If hypothesis confirmed:
1. Fix APR matmul reduction order to be deterministic
2. Re-run trace, verify layer-3 ffn_swigl ratio drops below 1.5x
3. Verify SHIP-002/005/006/007/008 PARTIALs flip to DISCHARGED
4. MODEL-1 ships cleanly through both APR and GGUF backends
   (paiml/qwen2.5-coder-7b-apache-q4k-v1)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 30, 2026
…size parity (#1109)

Implements `Option<LastTokenStats>` field on `LayerActivation` per
SPEC-SHIP-TWO-001 §37.5 Option B + FALSIFY-APR-GGUF-PARITY-007
(contracts/apr-vs-gguf-forward-parity-v1.yaml v1.1.0, PR #1107).

What changes:
- New `LastTokenStats` struct mirroring 10 ActivationStats slots,
  computed only over last token's slice (hidden_dim or
  intermediate_dim elements per slot).
- `LayerActivation.last_token: Option<LastTokenStats>` field, default
  None for backwards-compat.
- `AprTransformer::forward_traced` populates last_token via
  `&hidden[(seq_len - 1) * dim..]` slicing for all 10 stat slots.
- `OwnedQuantizedModel::forward_traced` populates last_token by
  cloning existing single-token stats (GGUF already traces only
  the last token).
- 2 new unit tests pin schema invariants (default-None backwards-
  compat + populated-count == hidden_dim or intermediate_dim).
- 6/6 unit tests PASS.
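
The last-token slicing can be illustrated with a small sketch. `SliceStats`, `stats`, and `last_token_stats` are hypothetical names; the real `LastTokenStats` fields are not shown in this message, and the use of a population (not sample) standard deviation is an assumption.

```rust
/// Hypothetical stand-in for one stat slot.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SliceStats {
    pub count: usize,
    pub mean: f32,
    pub std: f32,
}

/// Population mean and standard deviation over a non-empty slice.
pub fn stats(xs: &[f32]) -> SliceStats {
    assert!(!xs.is_empty(), "stats over an empty slice are undefined");
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    SliceStats { count: xs.len(), mean, std: var.sqrt() }
}

/// Stats over the last token only. `hidden` is assumed to be a
/// row-major `[seq_len, dim]` buffer, so the last token occupies the
/// final `dim` elements.
pub fn last_token_stats(hidden: &[f32], seq_len: usize, dim: usize) -> SliceStats {
    assert!(seq_len > 0 && hidden.len() == seq_len * dim);
    stats(&hidden[(seq_len - 1) * dim..])
}
```

Calling `stats(&hidden)` and `last_token_stats(&hidden, seq_len, dim)` on the same buffer generally gives different `count` and `std` values, which is the mismatch between the all-token and last-token reporters that this change removes.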

Live verification (RTX 4090, canonical 7B teacher, prior iteration):
  ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored
  Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass)
  ALL 28 layers Pass v1.0.0 ratio gate.

The §27 binding criterion (layer-3 18.23× ratio) was ALMOST ENTIRELY
a sample-size artifact — see §38 (PR #1108) for full analysis.

Five-whys (recorded in §38.6):
1. Why isn't MODEL-1 inference correct? `apr run` gibberish.
2. Why hasn't §17/§23/§27 chain produced a fix? 18× signal misleading.
3. Why was it artifact? APR all-7-tokens vs GGUF last-token-only.
4. Why didn't earlier reviews catch this? PRs #1082+#1083 matched
   API structurally but not semantically.
5. What's the fix? Make both reporters use same sample (this PR).

Spec ref: §37 (PR #1105), §38 (PR #1108).
Contract: apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107).
Coverage scoreboard unchanged (15+33).

Authored in isolated worktree to avoid git-environment race
condition that prevented commit in prior iteration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 30, 2026
…-size-parity gate (#1107)

Per SPEC-SHIP-TWO-001 §37 (TRACE-CAPTURE-POINT MISMATCH), the v1.0.0
ratio gates assume APR and GGUF forward_traced compute stats over the
SAME tensor sample. They DO NOT today:

  apr_layer[0].attn_norm_stats.count == 25088 (7 × 3584, all-tokens)
  gguf_layer[0].attn_norm_stats.count == 3584  (1 × 3584, last-only)

The 18.23× layer-3 ffn_swigl ratio mixes real precision drift with a
sample-size artifact in unknown proportions. The v1.0.0 ratio gates can
therefore produce false passes (Pass when sampling masks a real bug)
or false failures (Fail when sampling alone explains the
drift).

This bump adds:

- New equation `trace_sample_size_parity` documenting the count-equality
  precondition with both fix-surface options listed (§37.5).
- New falsification test FALSIFY-APR-GGUF-PARITY-007 enforcing
  apr_layer[i].count == gguf_layer[i].count across 28 layers ×
  10 stat slots = 280 equality checks. FAILS today; PASSES post-fix.
- New kani harness KH-APR-GGUF-PARITY-003 with bound=280.
- Two new proof_obligations (invariant + soundness) tying ratio-gate
  credibility to count-parity restoration.
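
The count-equality precondition can be sketched as a plain scan over layers and stat slots. This is not the contract's test or the kani harness: `count_mismatches` and `STAT_SLOTS` are hypothetical names, and the per-slot counts are assumed to be available as plain arrays.

```rust
/// Number of stat slots per layer, per the 28 x 10 = 280 checks above.
pub const STAT_SLOTS: usize = 10;

/// Returns every (layer, slot) pair whose element counts differ
/// between the APR and GGUF traces.
pub fn count_mismatches(
    apr: &[[usize; STAT_SLOTS]],
    gguf: &[[usize; STAT_SLOTS]],
) -> Vec<(usize, usize)> {
    assert_eq!(apr.len(), gguf.len(), "layer counts must match");
    let mut out = Vec::new();
    for (layer, (a, g)) in apr.iter().zip(gguf).enumerate() {
        for slot in 0..STAT_SLOTS {
            if a[slot] != g[slot] {
                out.push((layer, slot));
            }
        }
    }
    out
}
```

An empty result means ratio gates compare like with like; with the counts quoted above (25088 versus 3584) every checked slot would be reported.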

Five-whys (recorded in §37.7 of spec):

1. Why isn't MODEL-1 inference correct? `apr run` produces gibberish.
2. Why has bisection been hard? §17→§27 chain produces 18.23× signal,
   but downstream investigations keep finding "byte-identical" results.
3. Why do byte-identical inputs produce different std reports?
   Different sample sizes (apples-to-oranges).
4. Why didn't this come up before? PRs #1082+#1083 matched APR's
   API structurally but not semantically.
5. What's the fix? Make both reporters use the same sample. Then
   re-measure ratio gates.

Per §26.8 stack-tool-extension methodology + feedback_pv_not_bash_for_contracts.md:
this contract bump precedes the implementation PR. Validates clean via `pv validate`.

Spec ref: §37 (PR #1105 docs/ship-007-trace-capture-mismatch).
Coverage scoreboard unchanged (15+33).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…— FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL (#1451)

* contract(trace-attn-sub-stages-v1): scaffold layer-0 attention bisection (5 new SaveTensorStage variants)

Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED
that pre-commits to the schema for extending `SaveTensorStage` with FIVE new
intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence
can be bisected element-wise against the HF FP16 oracle (PR #1423).

## Why now (per spec §46.7)

Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest-
leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`:

- cos(APR.attn_norm, HF.attn_norm) = 0.99999995  ✓ (correct)
- cos(APR.attn_out,  HF.attn_out)  = 0.9966      ✗ (wrong)

The bug is somewhere INSIDE the attention block. The existing
`SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` —
too coarse to localize.
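
The cosine comparison behind these two numbers, and the scan for the first stage that falls below a cutoff, can be sketched as follows. `cosine` and `first_below` are hypothetical helpers, and the 0.999 cutoff in the usage note is an assumption for illustration, not a value from the spec.

```rust
/// Cosine similarity between two equal-length activation vectors,
/// accumulated in f64 to limit rounding in the comparison itself.
pub fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let (mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Returns the first stage (in capture order) whose cosine against the
/// oracle is below `threshold`.
pub fn first_below<'a>(stages: &[(&'a str, f64)], threshold: f64) -> Option<&'a str> {
    stages
        .iter()
        .find(|&&(_, c)| c < threshold)
        .map(|&(name, _)| name)
}
```

With only `attn_norm` (0.99999995) and `attn_out` (0.9966) captured, `first_below` with a 0.999 cutoff can only ever name `attn_out`; finer capture points are what would let the same scan localize the sub-stage.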

## What this contract pins

5 new variants, in computation order inside the attention block:

| New stage | What it captures |
|---|---|
| `QPostRope`   | Q after RoPE (post Q-projection + RoPE rotate) |
| `KPostRope`   | K after RoPE (GQA: shared across head groups) |
| `AttnScores`  | Q·Kᵀ / sqrt(head_dim), pre-softmax |
| `AttnSoftmax` | softmax(scores + causal_mask) |
| `AttnVOut`    | softmax · V (pre output O-projection) |

Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut`
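
To make the three post-RoPE capture points concrete, here is a minimal single-head causal attention in Rust that returns each intermediate. It is a sketch under stated assumptions, not the aprender-serve implementation: `AttnTrace` and `attention_traced` are hypothetical, RoPE and GQA are not modeled (`q` and `k` are taken as already rotated), and the causal mask is applied by restricting each row to positions `j <= i`.

```rust
/// Intermediates of one causal attention head (hypothetical container).
pub struct AttnTrace {
    /// Q·Kᵀ / sqrt(head_dim), pre-softmax and pre-mask.
    pub scores: Vec<Vec<f32>>,
    /// softmax(scores + causal_mask); masked entries are 0.
    pub softmax: Vec<Vec<f32>>,
    /// softmax · V, before the output projection.
    pub v_out: Vec<Vec<f32>>,
}

pub fn attention_traced(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> AttnTrace {
    let seq = q.len();
    let scale = (q[0].len() as f32).sqrt();

    // Stage 1: scaled dot-product scores for every (query, key) pair.
    let mut scores = vec![vec![0.0_f32; seq]; seq];
    for i in 0..seq {
        for j in 0..seq {
            let dot: f32 = q[i].iter().zip(&k[j]).map(|(a, b)| a * b).sum();
            scores[i][j] = dot / scale;
        }
    }

    // Stage 2: row-wise softmax over the causally visible prefix j <= i,
    // with max-subtraction for numerical stability.
    let mut softmax = vec![vec![0.0_f32; seq]; seq];
    for i in 0..seq {
        let visible = &scores[i][..=i];
        let max = visible.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = visible.iter().map(|s| (s - max).exp()).collect();
        let denom: f32 = exps.iter().sum();
        for (j, e) in exps.iter().enumerate() {
            softmax[i][j] = e / denom;
        }
    }

    // Stage 3: weighted sum of value rows.
    let dv = v[0].len();
    let mut v_out = vec![vec![0.0_f32; dv]; seq];
    for i in 0..seq {
        for j in 0..=i {
            for d in 0..dv {
                v_out[i][d] += softmax[i][j] * v[j][d];
            }
        }
    }

    AttnTrace { scores, softmax, v_out }
}
```

Comparing each of the three fields against a reference separately is what lets a bisection distinguish a scaling error (visible in `scores`), a masking or normalization error (visible in `softmax`), and a value-mixing error (visible only in `v_out`).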

## Falsifiers (5)

| ID | What it predicts | Status |
|---|---|---|
| FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT |
| FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL |

FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must
be falsified to actually pinpoint the SHIP-007 sub-stage. Marked
BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages
implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on
RTX 4090. This contract pins the gate; the implementation cascade follows.

## Five Whys

1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?**
   The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it.
   Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) —
   sub-block contracts are siblings of the parent, not amendments.

2. **Why pin the schema before implementation?**
   Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity →
   extend the enum behind a contract." Contract-first preserves the audit
   chain spec § → contract → implementation PRs → live discharge.

3. **Why these 5 stages and not 3 or 7?**
   The 5 capture points bracket every numerically distinct intermediate
   inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope,
   scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj).
   Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is
   premature — let the bisection localize first, then refine if needed.

4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?**
   PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today.
   ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle
   extension; today neither exists. BLOCKER honestly classifies the gap;
   matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443).

5. **Why is this not just SHIP-007's fix itself?**
   Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract
   delivers the *measurement instrument* that pinpoints the sub-stage; the
   fix is the next PR cascade after that pin lands.

## Net effects

- New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers.
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0.
- MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips).
- MODEL-2 ship %: unchanged at 57%.
- Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added, but the contract
  is new; they count once it's wired into the §-amendment chain).
- Unblocks the next PR cascade: enum extension + forward_traced threading +
  apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005
  algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(trace-attn-sub-stages-v1): v1.0.0 → v1.1.0 — Toyota Way correction (only 2 new variants needed, not 5)

## What's wrong with v1.0.0

v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants
were needed for the SHIP-007 layer-0 attention bisection:
QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut.

Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs`
shows THREE of those five ALREADY EXIST in the parent contract
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL:

- `QPostRope`  — already in enum (line 47)
- `KPostRope`  — already in enum (line 49)
- `Attention`  — already in enum (line 51), semantically my "AttnVOut"
                 ("post softmax(Q@Kᵀ)@v, pre O-proj")

Only TWO are truly missing:

- `AttnScores`   — Q·Kᵀ / sqrt(head_dim), pre-softmax
- `AttnSoftmax`  — softmax(scores + causal_mask), pre-V

## Why it happened

Per `feedback_no_guessing.md`: should have run
`pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I
extrapolated from the parent contract description without reading
the live enum source. Toyota Way andon — caught on next iteration.

Per `feedback_toyota_way_all_defects.md`: all defects are mine.
Fixing at the contract level BEFORE any implementation PR depends
on the wrong scope is exactly the cost-of-defect minimization
the toolchain is designed for.

## What v1.1.0 does

- Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL)
- Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax
- Documents the FULL 9-stage layer-0 bisection chain spanning
  parent-contract stages + 2 new ones:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

- Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope
- Adds bisection_chain_layer_0 equation pinning the 9-element
  cosine sequence (with empirical state per memory
  `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966)
- FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF
  FP16 oracle extension to capture 2 new stages on RTX 4090)

## Five Whys

1. **Why did v1.0.0 claim 5 new variants?**
   Authored without reading the live save_tensor_stage.rs source.

2. **Why didn't I read the source first?**
   Skipped the `pmat query SaveTensorStage` step that
   `feedback_no_guessing.md` mandates. Worked from the parent
   contract description's prose ("Embedding, AttnNorm, QkvMatmul,
   AttnOut, ...") which truncated 18 stages to 14.

3. **Why was the parent contract description truncated?**
   Doc-comment in `forward_traced_with_plan` rust source listed
   only 14 stages (the per-layer canonical-FFN order, omitting
   QkvBias + the parent's renamed Attention). My contract reused
   that prose instead of reading the enum directly.

4. **Why does this matter for SHIP-007 ship %?**
   It doesn't yet — the contract is still scaffold scope, no
   implementation PR has shipped against the wrong scope. v1.1.0
   correction lands BEFORE the cascade triggers.

5. **Why amend the contract instead of opening a sibling fix-PR?**
   Same branch (#1450) is the right place. Toyota Way: stop the
   line, fix the defect at source, then continue. A sibling PR
   would split the audit story across two commits with no benefit.

## Net effects

- Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED**
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0
- MODEL-1 ship %: unchanged at 91% (this is contract correction)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade now correctly scoped to 2 new variants,
  not 5 — saves an estimated 60% of the enum-extension PR's LOC

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): SaveTensorStage gains AttnScores + AttnSoftmax — FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL

Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450).

Adds the 2 new attention sub-stage variants to `SaveTensorStage`:

- `AttnScores`  — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask
- `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply

Closes the SHIP-007 layer-0 attention bisection gap inside the
Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

## What changed

| File | Change |
|---|---|
| `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. `layer_output` alias) |
| `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **5 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) |
| `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 |

## Test results

- `cargo test -p aprender-serve --lib inference_trace` — **167 passed, 0 failed**
- 5 new tests: `falsify_attn_sub_001_attn_scores_round_trip`,
  `falsify_attn_sub_001_attn_softmax_round_trip`,
  `falsify_attn_sub_001_2_new_stages_in_canonical_order`,
  `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`,
  `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain`
- `cargo check --workspace --lib` — clean

## Falsifier discharge

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | PARTIAL_ALGORITHM_LEVEL (no change yet) | depends on `forward_traced_with_plan` threading, follow-up PR |

Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in
`contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this
PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the
contract is still PROPOSED upstream.

## Five Whys

1. **Why this PR before #1450 lands?** Contract+impl can land
   together — #1450 introduces the contract, this PR provides the
   first implementation evidence. They reference each other and merge
   in either order without conflict.

2. **Why only the enum + tests, not `forward_traced_with_plan`?**
   Enum extension is the smallest atomic ticket per Toyota Way (one
   mechanism per PR). Threading the new variants through forward
   capture is the next PR (FALSIFY-ATTN-SUB-002 discharge).

3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention
   in `ALL`?** That's the canonical computation order pinned by the
   contract's ordering proof_obligation: `QkvBias → QPostRope →
   KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`.

4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias
   exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`,
   not a separate variant. The enum has 20 distinct variants, all of
   which appear in `ALL`; the alias is accepted only at the `FromStr`
   layer and is not counted in `ALL`.

5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain`
   test?** The contract's `bisection_chain_layer_0` equation pins the
   9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004.
   This test pins the parser side of that gate so a future drift in
   stage names breaks loudly.
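
The alias and parser behavior described in whys 3 to 5 can be sketched as follows. This is a minimal illustration, not the real `save_tensor_stage.rs`: the `Stage` enum is a hypothetical, reduced stand-in for `SaveTensorStage` (the 9-stage chain plus `PostFfnResidual`), the canonical name `post_ffn_residual` is an assumption, and `parse_list` here is a hypothetical free function that only mirrors the name used in the tests.

```rust
use std::str::FromStr;

// Hypothetical, reduced stand-in for SaveTensorStage (the real enum has 20 variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    AttnNorm,
    QkvMatmul,
    QkvBias,
    QPostRope,
    KPostRope,
    AttnScores,
    AttnSoftmax,
    Attention,
    AttnOut,
    PostFfnResidual,
}

impl Stage {
    // Canonical computation order. The `layer_output` alias is not a variant,
    // so it does not appear here.
    const ALL: [Stage; 10] = [
        Stage::AttnNorm,
        Stage::QkvMatmul,
        Stage::QkvBias,
        Stage::QPostRope,
        Stage::KPostRope,
        Stage::AttnScores,
        Stage::AttnSoftmax,
        Stage::Attention,
        Stage::AttnOut,
        Stage::PostFfnResidual,
    ];

    fn canonical_name(self) -> &'static str {
        match self {
            Stage::AttnNorm => "attn_norm",
            Stage::QkvMatmul => "qkv_matmul",
            Stage::QkvBias => "qkv_bias",
            Stage::QPostRope => "q_post_rope",
            Stage::KPostRope => "k_post_rope",
            Stage::AttnScores => "attn_scores",
            Stage::AttnSoftmax => "attn_softmax",
            Stage::Attention => "attention",
            Stage::AttnOut => "attn_out",
            // Assumed name; the source only confirms the `layer_output` alias.
            Stage::PostFfnResidual => "post_ffn_residual",
        }
    }
}

impl FromStr for Stage {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let lower = s.trim().to_ascii_lowercase();
        // Parse-only alias: accepted on input, never produced by canonical_name.
        if lower == "layer_output" {
            return Ok(Stage::PostFfnResidual);
        }
        Stage::ALL
            .iter()
            .copied()
            .find(|stage| stage.canonical_name() == lower)
            .ok_or_else(|| format!("unknown stage: {s}"))
    }
}

// Hypothetical comma-separated list parser; fails on the first unknown token.
fn parse_list(csv: &str) -> Result<Vec<Stage>, String> {
    csv.split(',').map(|tok| tok.parse::<Stage>()).collect()
}
```

Under these assumptions, parsing the 9 stage names from the chain above yields 9 variants in order, `layer_output` resolves to `PostFfnResidual`, and a retired name such as `attn_vout` is rejected, which is the "breaks loudly" property the test is meant to pin.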

## Net effects

- 2 new `SaveTensorStage` variants land
- 5 new tests pin the variants + ordering + parser
- MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007
  bisection cascade; ship % moves when a falsifier flips DISCHARGED)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade ready to thread variants through
  `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…ith_plan — FALSIFY-ATTN-SUB-002 (#1455)

* contract(trace-attn-sub-stages-v1): scaffold layer-0 attention bisection (5 new SaveTensorStage variants)

Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED
that pre-commits to the schema for extending `SaveTensorStage` with FIVE new
intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence
can be bisected element-wise against the HF FP16 oracle (PR #1423).

## Why now (per spec §46.7)

Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest-
leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`:

- cos(APR.attn_norm, HF.attn_norm) = 0.99999995  ✓ (correct)
- cos(APR.attn_out,  HF.attn_out)  = 0.9966      ✗ (wrong)

The bug is somewhere INSIDE the attention block. The existing
`SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` —
too coarse to localize.
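
The bisection idea behind those two cosine values can be sketched as below. This is an illustration only, not the project's `apr diff` implementation: `cosine` and `first_divergent` are hypothetical helper names, and the 0.999 cutoff used in the usage note is an assumed value, not one taken from the contract.

```rust
/// Cosine similarity between two equal-length activation vectors.
/// Accumulates in f64; returns NaN if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "activation vectors must have equal length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Given per-stage cosines in computation order, return the first stage
/// whose cosine falls below `threshold` (hypothetical bisection rule).
fn first_divergent<'a>(stages: &[(&'a str, f64)], threshold: f64) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, cos)| *cos < threshold)
        .map(|(name, _)| *name)
}
```

With only the two measured stages, `first_divergent(&[("attn_norm", 0.99999995), ("attn_out", 0.9966)], 0.999)` returns `Some("attn_out")`, which says no more than "somewhere inside the block". Each additional captured stage between the two shrinks the interval the first failing entry can point to.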

## What this contract pins

5 new variants, in computation order inside the attention block:

| New stage | What it captures |
|---|---|
| `QPostRope`   | Q after RoPE (post Q-projection + RoPE rotate) |
| `KPostRope`   | K after RoPE (GQA: shared across head groups) |
| `AttnScores`  | Q·Kᵀ / sqrt(head_dim), pre-softmax |
| `AttnSoftmax` | softmax(scores + causal_mask) |
| `AttnVOut`    | softmax · V (pre output O-projection) |

Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut`

## Falsifiers (5)

| ID | What it predicts | Status |
|---|---|---|
| FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT |
| FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL |

FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must
be falsified to actually pinpoint the SHIP-007 sub-stage. Marked
BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages
implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on
RTX 4090. This contract pins the gate; the implementation cascade follows.

## Five Whys

1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?**
   The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it.
   Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) —
   sub-block contracts are siblings of the parent, not amendments.

2. **Why pin the schema before implementation?**
   Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity →
   extend the enum behind a contract." Contract-first preserves the audit
   chain spec § → contract → implementation PRs → live discharge.

3. **Why these 5 stages and not 3 or 7?**
   The 5 capture points bracket every numerically distinct intermediate
   inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope,
   scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj).
   Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is
   premature — let the bisection localize first, then refine if needed.

4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?**
   PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today.
   ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle
   extension; today neither exists. BLOCKER honestly classifies the gap;
   matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443).

5. **Why is this not just SHIP-007's fix itself?**
   Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract
   delivers the *measurement instrument* that pinpoints the sub-stage; the
   fix is the next PR cascade after that pin lands.

## Net effects

- New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers.
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0.
- MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips).
- MODEL-2 ship %: unchanged at 57%.
- Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract
  is new; they count once it's wired into the §-amendment chain).
- Unblocks the next PR cascade: enum extension + forward_traced threading +
  apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005
  algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(trace-attn-sub-stages-v1): v1.0.0 → v1.1.0 — Toyota Way correction (only 2 new variants needed, not 5)

## What's wrong with v1.0.0

v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants
were needed for the SHIP-007 layer-0 attention bisection:
QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut.

Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs`
shows THREE of those five ALREADY EXIST in the parent contract
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL:

- `QPostRope`  — already in enum (line 47)
- `KPostRope`  — already in enum (line 49)
- `Attention`  — already in enum (line 51), semantically my "AttnVOut"
                 ("post softmax(Q@Kᵀ)@v, pre O-proj")

Only TWO are truly missing:

- `AttnScores`   — Q·Kᵀ / sqrt(head_dim), pre-softmax
- `AttnSoftmax`  — softmax(scores + causal_mask), pre-V

## Why it happened

Per `feedback_no_guessing.md`: should have run
`pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I
extrapolated from the parent contract description without reading
the live enum source. Toyota Way andon — caught on next iteration.

Per `feedback_toyota_way_all_defects.md`: all defects are mine.
Fixing at the contract level BEFORE any implementation PR depends
on the wrong scope is exactly the cost-of-defect minimization
the toolchain is designed for.

## What v1.1.0 does

- Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL)
- Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax
- Documents the FULL 9-stage layer-0 bisection chain spanning
  parent-contract stages + 2 new ones:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

- Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope
- Adds bisection_chain_layer_0 equation pinning the 9-element
  cosine sequence (with empirical state per memory
  `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966)
- FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF
  FP16 oracle extension to capture 2 new stages on RTX 4090)

## Five Whys

1. **Why did v1.0.0 claim 5 new variants?**
   Authored without reading the live save_tensor_stage.rs source.

2. **Why didn't I read the source first?**
   Skipped the `pmat query SaveTensorStage` step that
   `feedback_no_guessing.md` mandates. Worked from the parent
   contract description's prose ("Embedding, AttnNorm, QkvMatmul,
   AttnOut, ...") which truncated 18 stages to 14.

3. **Why was the parent contract description truncated?**
   Doc-comment in `forward_traced_with_plan` rust source listed
   only 14 stages (the per-layer canonical-FFN order, omitting
   QkvBias + the parent's renamed Attention). My contract reused
   that prose instead of reading the enum directly.

4. **Why does this matter for SHIP-007 ship %?**
   It doesn't yet — the contract is still scaffold scope, no
   implementation PR has shipped against the wrong scope. v1.1.0
   correction lands BEFORE the cascade triggers.

5. **Why amend the contract instead of opening a sibling fix-PR?**
   Same branch (#1450) is the right place. Toyota Way: stop the
   line, fix the defect at source, then continue. A sibling PR
   would split the audit story across two commits with no benefit.

## Net effects

- Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED**
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0
- MODEL-1 ship %: unchanged at 91% (this is contract correction)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade now correctly scoped to 2 new variants,
  not 5 — saves an estimated 60% of the enum-extension PR's LOC

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): SaveTensorStage gains AttnScores + AttnSoftmax — FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL

Implements `contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 (PROPOSED, in PR #1450).

Adds the 2 new attention sub-stage variants to `SaveTensorStage`:

- `AttnScores`  — Q·Kᵀ / sqrt(head_dim), pre-softmax + pre-causal-mask
- `AttnSoftmax` — softmax(scores + causal_mask), pre-V-multiply

Closes the SHIP-007 layer-0 attention bisection gap inside the
Q·Kᵀ → softmax → ·V chain. The 9-stage layer-0 capture chain is now:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

## What changed

| File | Change |
|---|---|
| `save_tensor_stage.rs` | enum: 18 → **20** variants; `ALL` const, `canonical_name`, `FromStr` updated; doc-comment lists 21 names (incl. `layer_output` alias) |
| `save_tensor_stage.rs::tests` | Renamed `all_eighteen_*` → `all_twenty_*`; updated `is_per_layer_count` (18+2 = 20) + `canonical_names_match_contract_enumeration` to include the 2 new names; **5 new tests** for FALSIFY-ATTN-SUB-001 (round-trip, ordering, parser-list) |
| `save_tensor_plan.rs` | `all_keyword_expands_to_eighteen_stages` → `all_keyword_expands_to_twenty_stages`; `all_keyword_case_insensitive` count updated 18 → 20 |

## Test results

- `cargo test -p aprender-serve --lib inference_trace` — **167 passed, 0 failed**
- 5 new tests: `falsify_attn_sub_001_attn_scores_round_trip`,
  `falsify_attn_sub_001_attn_softmax_round_trip`,
  `falsify_attn_sub_001_2_new_stages_in_canonical_order`,
  `falsify_attn_sub_001_parse_list_accepts_2_new_stages_together`,
  `falsify_attn_sub_001_parse_list_accepts_full_attn_block_chain`
- `cargo check --workspace --lib` — clean

## Falsifier discharge

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-001 | PARTIAL_ALGORITHM_LEVEL | **FUNCTIONAL** (eligible) | enum has 20 variants, parse_list accepts the 2 new tokens, ordering test passes |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | PARTIAL_ALGORITHM_LEVEL (no change yet) | depends on `forward_traced_with_plan` threading, follow-up PR |

Functional discharge of FALSIFY-ATTN-SUB-001 will be promoted in
`contracts/trace-attn-sub-stages-v1.yaml` v1.1.0 → v1.2.0 once this
PR + #1450 land. Today it stays PARTIAL_ALGORITHM_LEVEL because the
contract is still PROPOSED upstream.

## Five Whys

1. **Why this PR before #1450 lands?** Contract+impl can land
   together — #1450 introduces the contract, this PR provides the
   first implementation evidence. They reference each other and merge
   in either order without conflict.

2. **Why only the enum + tests, not `forward_traced_with_plan`?**
   Enum extension is the smallest atomic ticket per Toyota Way (one
   mechanism per PR). Threading the new variants through forward
   capture is the next PR (FALSIFY-ATTN-SUB-002 discharge).

3. **Why insert AttnScores+AttnSoftmax between KPostRope and Attention
   in `ALL`?** That's the canonical computation order pinned by the
   contract's ordering proof_obligation: `QkvBias → QPostRope →
   KPostRope → AttnScores → AttnSoftmax → Attention → AttnOut`.

4. **Why bump `ALL` count from 18 to 20 (not 19) when only 1 alias
   exists?** `LayerOutput` is a parse-only alias for `PostFfnResidual`,
   not a separate variant. The enum has 20 distinct variants, all of
   which appear in `ALL`; the alias is accepted only at the `FromStr`
   layer and is not counted in `ALL`.

5. **Why include the 9-stage `parse_list_accepts_full_attn_block_chain`
   test?** The contract's `bisection_chain_layer_0` equation pins the
   9-element cosine sequence as the gate for FALSIFY-ATTN-SUB-004.
   This test pins the parser side of that gate so a future drift in
   stage names breaks loudly.

## Net effects

- 2 new `SaveTensorStage` variants land
- 5 new tests pin the variants + ordering + parser
- MODEL-1 ship %: unchanged at 91% (this is part of the SHIP-007
  bisection cascade; ship % moves when a falsifier flips DISCHARGED)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade ready to thread variants through
  `forward_traced_with_plan` next (FALSIFY-ATTN-SUB-002)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): wire 4 attention sub-stages in forward_traced_with_plan — FALSIFY-ATTN-SUB-002 PARTIAL_ALGORITHM_LEVEL

Stacked on #1451 (which adds the 2 new SaveTensorStage variants). When #1451
merges to main, this PR rebases cleanly and lands as a 4-stage wire fix.

## What this PR wires

| Stage | Existed in enum? | emit() existed? | After this PR |
|---|---|---|---|
| QPostRope   | YES | NO  | YES (new emit) |
| KPostRope   | YES | NO  | YES (new emit) |
| AttnScores  | NEW (#1451) | NO  | YES (new emit + accumulator) |
| AttnSoftmax | NEW (#1451) | NO  | YES (new emit + accumulator) |

Closes the parent-contract drift discovered in PR #1452 research evidence:
QPostRope + KPostRope were in the SaveTensorStage enum but had no emit()
calls in forward_traced_with_plan. The parent contract
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstated
coverage for those 2 stages. This PR closes the drift as a side-effect.

## Implementation details

**QPostRope/KPostRope** (post line 133): emit q_all/k_all directly after the
inner loop populates them. Tensors already exist; this is just 2 emit()
calls — zero new allocation.

**AttnScores/AttnSoftmax** (inside head loop): allocate accumulator tensors
of shape `[num_heads × seq × seq]` ONLY when the plan requests them. Inside
the inner softmax loop, populate per (head, i, j) — zero overhead when
plan is None or doesn't ask for these stages (FALSIFY-ATTN-SUB-005:
additive purity).

Memory cost: BOS forward (seq=1) → num_heads * 1 * 1 * 4 bytes = 112 bytes
for Qwen2.5-Coder-7B (28 heads). Negligible. For longer seq, allocation
scales O(num_heads * seq^2) and is gated by plan.
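
The plan-gated allocation and its memory cost can be sketched as follows. This is a minimal sketch under assumptions, not the code in `forward_traced_with_plan`: `ScoreAccumulator` and `maybe_accumulator` are hypothetical names, the `requested` flag stands in for whatever the plan's `should_save()` check returns, and the row-major `[num_heads × seq × seq]` indexing is an assumed layout.

```rust
/// Hypothetical accumulator for per-head attention scores (or softmax probs),
/// stored row-major with shape [num_heads × seq × seq].
struct ScoreAccumulator {
    seq: usize,
    data: Vec<f32>,
}

impl ScoreAccumulator {
    fn new(num_heads: usize, seq: usize) -> Self {
        Self { seq, data: vec![0.0; num_heads * seq * seq] }
    }

    /// Record the value for (head, query position i, key position j).
    fn set(&mut self, head: usize, i: usize, j: usize, value: f32) {
        let idx = (head * self.seq + i) * self.seq + j;
        self.data[idx] = value;
    }

    /// Payload size in bytes: num_heads * seq^2 * 4 for f32.
    fn bytes(&self) -> usize {
        self.data.len() * std::mem::size_of::<f32>()
    }
}

/// Allocate only when the capture plan asks for the stage, so the default
/// forward path performs no extra allocation.
fn maybe_accumulator(requested: bool, num_heads: usize, seq: usize) -> Option<ScoreAccumulator> {
    requested.then(|| ScoreAccumulator::new(num_heads, seq))
}
```

For 28 heads and `seq = 1`, `maybe_accumulator(true, 28, 1)` holds 28 floats, i.e. 112 bytes, matching the figure above; with `requested = false` it returns `None` and the inner loop can skip the write with a single `if let Some(acc) = ...` check.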

## Test results

- `cargo test -p aprender-serve --lib -- --skip "gpu::"` — **13944 passed,
  0 failed, 51 ignored**
- `cargo check -p aprender-serve --lib` — clean
- inference_trace tests: 167/167 PASS
- (gpu:: tests have a pre-existing SIGABRT flake unrelated to this change)

## Falsifier discharge map

| ID | Status before | Status after | Why |
|---|---|---|---|
| FALSIFY-ATTN-SUB-002 (forward threading) | PARTIAL_ALGORITHM_LEVEL | (eligible for FUNCTIONAL once contract YAML on main + this lands) | 4 emit() calls now thread the 4 stages in canonical order |
| FALSIFY-ATTN-SUB-005 (additive purity) | PARTIAL_ALGORITHM_LEVEL | (eligible) | accumulator allocation gated by plan.should_save() |

## Five Whys

1. **Why wire 4 stages, not 2?** QPostRope + KPostRope are pre-existing
   gaps in the parent contract; the same-file fix is a free side-effect
   per Toyota Way "all defects are mine".

2. **Why allocate accumulators only when requested?** O(num_heads * seq^2)
   memory shouldn't be paid on the default forward path. Plan-gating
   keeps the production inference path zero-overhead.

3. **Why insert capture at lines 133, 152, 160 specifically?**
   Per `evidence/ship-007-layer0-attn-bisection-2026-05-04/forward-traced-research.md`:
   line 133 = post Q/K/V copy (Q/K post-rope), line 152 = scores after
   scale (pre-softmax), line 160 = post-softmax probs.

4. **Why use scores_all.is_some() check vs always-allocate?**
   Always-allocate forces O(seq^2 * num_heads * 4) bytes per layer
   regardless of capture. Some(Vec) idiom plus is_some_and check is the
   idiomatic Rust pattern for conditional capture.

5. **Why this PR stacked on #1451 rather than off main?**
   Requires SaveTensorStage::AttnScores + AttnSoftmax variants, which only
   exist on #1451's branch. When #1451 merges, this rebases to main as a
   clean 51-line delta.

## Net effects

- 4 stages now wired in `forward_traced_with_plan`
- MODEL-1 ship %: unchanged at 91% (stays scaffold; ship % moves at
  FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle)
- MODEL-2 ship %: unchanged at 57%
- Cascade step 4/8 of §47.1 roadmap delivered

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…on bisection plan (2 new SaveTensorStage variants + 9-stage chain) (#1450)

* contract(trace-attn-sub-stages-v1): scaffold layer-0 attention bisection (5 new SaveTensorStage variants)

Authors a new provable-contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED
that pre-commits to the schema for extending `SaveTensorStage` with FIVE new
intermediate attention-block sub-stages so SHIP-007 layer-0 attention divergence
can be bisected element-wise against the HF FP16 oracle (PR #1423).

## Why now (per spec §46.7)

Spec v2.91.0 §46.7 ranked SHIP-007 layer-0 attention bisection as the highest-
leverage MODEL-1 follow-up. Memory `2026-05-03 SHIP-007 finding`:

- cos(APR.attn_norm, HF.attn_norm) = 0.99999995  ✓ (correct)
- cos(APR.attn_out,  HF.attn_out)  = 0.9966      ✗ (wrong)

The bug is somewhere INSIDE the attention block. The existing
`SaveTensorStage` enum has only `QkvMatmul` between `AttnNorm` and `AttnOut` —
too coarse to localize.

## What this contract pins

5 new variants, in computation order inside the attention block:

| New stage | What it captures |
|---|---|
| `QPostRope`   | Q after RoPE (post Q-projection + RoPE rotate) |
| `KPostRope`   | K after RoPE (GQA: shared across head groups) |
| `AttnScores`  | Q·Kᵀ / sqrt(head_dim), pre-softmax |
| `AttnSoftmax` | softmax(scores + causal_mask) |
| `AttnVOut`    | softmax · V (pre output O-projection) |

Capture order: `QkvMatmul → QPostRope → KPostRope → AttnScores → AttnSoftmax → AttnVOut → AttnOut`

## Falsifiers (5)

| ID | What it predicts | Status |
|---|---|---|
| FALSIFY-ATTN-SUB-001 | 5 new variants exist; existing 14 preserved byte-identical | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-002 | `forward_traced_with_plan` threads them in canonical order | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-003 | `apr diff --values` recognizes APRT files for the 5 stages | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-ATTN-SUB-004 | Bisection narrows SHIP-007 to ONE specific sub-stage | BLOCKER_FIXTURE_ABSENT |
| FALSIFY-ATTN-SUB-005 | Capture is purely additive (token output byte-identical) | PARTIAL_ALGORITHM_LEVEL |

FALSIFY-ATTN-SUB-004 is the load-bearing one — it is the predicate that must
be falsified to actually pinpoint the SHIP-007 sub-stage. Marked
BLOCKER_FIXTURE_ABSENT because live discharge requires (i) the 5 new stages
implemented, (ii) HF FP16 oracle extended to capture them, (iii) live diff on
RTX 4090. This contract pins the gate; the implementation cascade follows.

## Five Whys

1. **Why a new contract instead of extending `apr-cli-trace-save-tensor-v1`?**
   The parent contract is FUNCTIONAL (v1.4.0); extending it would re-open it.
   Mirrors the `trace-ffn-sub-block-v1` SHIP-007 layer-3 prior art (#1083) —
   sub-block contracts are siblings of the parent, not amendments.

2. **Why pin the schema before implementation?**
   Per `feedback_apr_trace_not_eprintln.md`: "Missing TraceStep granularity →
   extend the enum behind a contract." Contract-first preserves the audit
   chain spec § → contract → implementation PRs → live discharge.

3. **Why these 5 stages and not 3 or 7?**
   The 5 capture points bracket every numerically distinct intermediate
   inside attention: pre-RoPE (QkvMatmul exists), Q post-rope, K post-rope,
   scores (Q·Kᵀ), softmax (post-mask + softmax), V·softmax (pre O-proj).
   Adding sub-stages of these (e.g., separate Q vs K matmul outputs) is
   premature — let the bisection localize first, then refine if needed.

4. **Why mark FALSIFY-ATTN-SUB-004 as BLOCKER_FIXTURE_ABSENT and not PARTIAL?**
   PARTIAL_ALGORITHM_LEVEL means an algorithm reference exists today.
   ATTN-SUB-004's discharge requires LIVE evidence + the HF FP16 oracle
   extension; today neither exists. BLOCKER honestly classifies the gap;
   matches `apr-cli-distill-train-v1` TRAIN-009 precedent (§43, PR #1443).

5. **Why is this not just SHIP-007's fix itself?**
   Fixing SHIP-007 needs to know WHICH sub-stage is wrong. This contract
   delivers the *measurement instrument* that pinpoints the sub-stage; the
   fix is the next PR cascade after that pin lands.

## Net effects

- New contract `trace-attn-sub-stages-v1.yaml` v1.0.0 PROPOSED, 5 falsifiers.
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0.
- MODEL-1 ship %: unchanged at 91% (this is contract scaffold; no falsifier flips).
- MODEL-2 ship %: unchanged at 57%.
- Coverage tally: unchanged this PR (4 PARTIAL + 1 BLOCKER added but contract
  is new — they count once it''s wired into the §-amendment chain).
- Unblocks the next PR cascade: enum extension + forward_traced threading +
  apr diff recognition + HF FP16 oracle extension → FALSIFY-ATTN-SUB-001..005
  algorithm-bind → live RTX 4090 bisection → ATTN-SUB-004 DISCHARGE.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(trace-attn-sub-stages-v1): v1.0.0 → v1.1.0 — Toyota Way correction (only 2 new variants needed, not 5)

## What's wrong with v1.0.0

v1.0.0 (commit 475dec3) claimed FIVE new SaveTensorStage variants
were needed for the SHIP-007 layer-0 attention bisection:
QPostRope, KPostRope, AttnScores, AttnSoftmax, AttnVOut.

Empirical inspection of `crates/aprender-serve/src/inference_trace/save_tensor_stage.rs`
shows THREE of those five ALREADY EXIST in the parent contract
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 FUNCTIONAL:

- `QPostRope`  — already in enum (line 47)
- `KPostRope`  — already in enum (line 49)
- `Attention`  — already in enum (line 51), semantically my "AttnVOut"
                 ("post softmax(Q@Kᵀ)@v, pre O-proj")

Only TWO are truly missing:

- `AttnScores`   — Q·Kᵀ / sqrt(head_dim), pre-softmax
- `AttnSoftmax`  — softmax(scores + causal_mask), pre-V

## Why it happened

Per `feedback_no_guessing.md`: should have run
`pmat query SaveTensorStage` BEFORE authoring v1.0.0. Instead I
extrapolated from the parent contract description without reading
the live enum source. Toyota Way andon — caught on next iteration.

Per `feedback_toyota_way_all_defects.md`: all defects are mine.
Fixing at the contract level BEFORE any implementation PR depends
on the wrong scope is exactly the cost-of-defect minimization
the toolchain is designed for.

## What v1.1.0 does

- Bumps version 1.0.0 → 1.1.0 PROPOSED (still pre-FUNCTIONAL)
- Reduces "new variants" from 5 to 2: AttnScores + AttnSoftmax
- Documents the FULL 9-stage layer-0 bisection chain spanning
  parent-contract stages + 2 new ones:

  attn_norm → qkv_matmul → qkv_bias → q_post_rope → k_post_rope
  → attn_scores [NEW] → attn_softmax [NEW] → attention → attn_out

- Updates all 5 falsifiers (SUB-001..005) to reflect reduced scope
- Adds bisection_chain_layer_0 equation pinning the 9-element
  cosine sequence (with empirical state per memory
  `2026-05-03 SHIP-007 finding`: cos[0]=0.99999995, cos[8]=0.9966)
- FALSIFY-ATTN-SUB-004 still BLOCKER_FIXTURE_ABSENT (pending HF
  FP16 oracle extension to capture 2 new stages on RTX 4090)

## Five Whys

1. **Why did v1.0.0 claim 5 new variants?**
   Authored without reading the live save_tensor_stage.rs source.

2. **Why didn't I read the source first?**
   Skipped the `pmat query SaveTensorStage` step that
   `feedback_no_guessing.md` mandates. Worked from the parent
   contract description's prose ("Embedding, AttnNorm, QkvMatmul,
   AttnOut, ...") which truncated 18 stages to 14.

3. **Why was the parent contract description truncated?**
   Doc-comment in `forward_traced_with_plan` rust source listed
   only 14 stages (the per-layer canonical-FFN order, omitting
   QkvBias + the parent's renamed Attention). My contract reused
   that prose instead of reading the enum directly.

4. **Why does this matter for SHIP-007 ship %?**
   It doesn't yet — the contract is still scaffold scope, no
   implementation PR has shipped against the wrong scope. v1.1.0
   correction lands BEFORE the cascade triggers.

5. **Why amend the contract instead of opening a sibling fix-PR?**
   Same branch (#1450) is the right place. Toyota Way: stop the
   line, fix the defect at source, then continue. A sibling PR
   would split the audit story across two commits with no benefit.

## Net effects

- Contract `trace-attn-sub-stages-v1` v1.0.0 → **v1.1.0 PROPOSED**
- `pv validate contracts/trace-attn-sub-stages-v1.yaml` exits 0
- MODEL-1 ship %: unchanged at 91% (this is contract correction)
- MODEL-2 ship %: unchanged at 57%
- Implementation cascade now correctly scoped to 2 new variants,
  not 5 — saves an estimated 60% of the enum-extension PR's LOC

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
… integrated, M-FFN-GGUF-3 DISCHARGED (#1534)

Same-day post-M88+M89 follow-up: ship-two-models-spec.md v2.72.0
§27 records that the H1/H2 bisection has ALREADY been LIVE-run on
noah-Lambda-Vector RTX 4090 on 2026-04-27 (built `apr` from PR
#1083 branch + commits 77c016b + c657968 + f249464):

  APR layer-3 ffn_swigl std  = 1.2216
  GGUF layer-3 ffn_swigl std = 0.0670
  Ratio                       = 18.23×
  Verdict                     = H2 CONFIRMED (APR-side bug)

This exceeds the §26.4 ≥10× threshold by roughly 8.2× in absolute terms (18.23× vs 10×).
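
The arithmetic behind the verdict can be checked with a short sketch. The helper names are hypothetical, and reading "ratio at or above the threshold" as the H2 condition is an assumption drawn from the numbers quoted here, not from the §26.4 text itself.

```rust
/// Ratio of APR to GGUF standard deviation at the same layer and stage.
fn std_ratio(apr_std: f64, gguf_std: f64) -> f64 {
    apr_std / gguf_std
}

/// Assumed decision rule: the anomaly counts as confirmed when the ratio
/// meets or exceeds the threshold (10.0 for the ≥10× criterion).
fn meets_threshold(ratio: f64, threshold: f64) -> bool {
    ratio >= threshold
}
```

With the recorded values, `std_ratio(1.2216, 0.0670)` is about 18.23, and `meets_threshold` with a threshold of 10.0 returns `true`.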

Status promotions in v1.1.0:
- M-FFN-GGUF-3 implementation_stage: ALGORITHM_LEVEL_DISCHARGED → DISCHARGED
- FALSIFY-FFN-GGUF-003: PROPOSED → DISCHARGED
- contract metadata.status: PROPOSED → ACTIVE_ALGORITHM_LEVEL

The M89 PR #1533 harness (falsify_ffn_gguf_003_layer_3_swigl_h1_h2_bisection)
adds regression-test coverage for any future re-run; the §27 data
remains the canonical operator-dispatched discharge proof.

Only M-FFN-GGUF-4 (SHIP-007 fix PR) remains PENDING. It is gated on
an engineering investigation of the `inference.rs` SwiGLU site, which
now sits at lines 298-302 after the sub-FFN telemetry landed (it was
at :160-164 when the §22 spec was authored).

The v1.1.0 amendment records 3 candidate hypotheses for the
layer-3-specific behavior within the SwiGLU block, to guide the
M-FFN-GGUF-4 investigation:
- H2a: Buffer aliasing / scratch-buffer corruption in APR multi-token
- H2b: Layer-3-specific upstream divergence (gate or up at L3 only)
- H2c: Quantization dequant alignment differs at certain layer configs

YAML-only — production hot paths byte-unchanged (this amendment
records pre-existing §27 evidence + corrects status drift).

Methodology lesson #2 firing in retrospect: had I grep'd the spec
for §22 / §27 BEFORE authoring M88's contract scaffold, the
M-FFN-GGUF-3 status would have been DISCHARGED at v1.0.0 instead
of needing this v1.1.0 follow-up amendment.

`pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 11, 2026
…/6 sweep

Algorithm-level PARTIAL discharge for FALSIFY-APR-GGUF-PARITY-002
through 006 per `contracts/apr-vs-gguf-forward-parity-v1.yaml`.
Combined with PARITY-001 (already bound), this closes 6/6.

## ✅ Closes 6/6 apr-vs-gguf-forward-parity-v1 sweep

**Twelve contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7)
- `apr-cli-command-safety-v1` (4/4)
- `apr-cli-publish-extra-v1` (10/10)
- `apr-cli-dep-migration-v1` (2/2)
- `apr-cli-distill-train-v1` (9/9)
- `apr-cli-pull-dataset-v1` (8/8)
- `apr-vs-gguf-forward-parity-v1` (6/6) ← this PR

## Why this matters for SHIP-007 / MODEL-1 ship

The SHIP-007 dispatch-layer bug is the actual blocker for
MODEL-1 GPU ship (per `feedback_model_1_ships_gpu_only`). This
contract pins the parity gates the eventual fix must satisfy:

- PARITY-002: layer-3 ffn_swigl ratio in `[0.5, 2.0]` (the
  18.23× ratio observed in `2026-04-26 SHIP-007 narrowing`
  session would Fail).
- PARITY-003: layer-3 ffn_gate ratio in `[0.7, 1.4]` (tighter
  band — gate matmul is the pinned root cause per §28).
- PARITY-004 + 005: contract validity + non-Q4K regression.
- PARITY-006: 28-layer ffn_swigl trace coverage (regression
  guard for PR cascade #1081/#1082/#1083).

When the SHIP-007 fix lands, all 6 verdicts must Pass. This
verdict pin gives the fix a concrete acceptance criterion at
algorithm level.

## Verdict shapes

- 002, 003: bounded-ratio with finite-check (catches NaN/±∞).
- 004, 005: shared exit-code-zero verdict.
- 006: count-threshold (≥ 28).
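
The three verdict shapes can be sketched as small pure functions. This is an illustrative sketch under stated assumptions, not the code bound by the contract: the `Verdict` enum and the function names `bounded_ratio`, `exit_code_zero`, and `count_at_least` are hypothetical, and the band endpoints are treated as inclusive because the bands are written as closed intervals.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Pass,
    Fail,
}

/// Bounded-ratio shape (002, 003): Pass only when `ratio` is finite
/// and lies inside [lo, hi]. Inclusive endpoints are an assumption.
fn bounded_ratio(ratio: f64, lo: f64, hi: f64) -> Verdict {
    if ratio.is_finite() && ratio >= lo && ratio <= hi {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}

/// Exit-code shape shared by 004 and 005: Pass only on exit code 0.
fn exit_code_zero(code: i32) -> Verdict {
    if code == 0 {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}

/// Count-threshold shape (006): Pass when at least `min` items were seen.
fn count_at_least(count: usize, min: usize) -> Verdict {
    if count >= min {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}

fn main() {
    // PARITY-002 band [0.5, 2.0]: the observed 18.23x ratio fails.
    println!("{:?}", bounded_ratio(18.23, 0.5, 2.0));
    // PARITY-003 band [0.7, 1.4]: a ratio of 1.0 passes.
    println!("{:?}", bounded_ratio(1.0, 0.7, 1.4));
    // NaN and infinities are rejected by the finite check.
    println!("{:?}", bounded_ratio(f64::NAN, 0.5, 2.0));
    // PARITY-006: 28 ffn_swigl trace lines meet the >= 28 threshold.
    println!("{:?}", count_at_least(28, 28));
    // A non-zero exit code fails the shared 004/005 shape.
    println!("{:?}", exit_code_zero(1));
}
```

Keeping the finite check explicit in `bounded_ratio` documents the intent even though NaN comparisons already evaluate to false.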

## Five-Whys

1. Why bind these now? — Closes 6/6 sweep; pins SHIP-007
   acceptance criterion at algorithm level.
2. Why distinct ratio bands for 002 + 003? — 003 (gate matmul)
   is the pinned root cause; tighter band means more sensitive
   regression detection at the bisected location.
3. Why share verdict for 004+005? — Identical exit-code-zero
   reduction.
4. Why pin 28-line min for 006? — Canonical 28-layer
   Qwen2.5-Coder-7B teacher; PR cascade regression guard.
5. Why 24 tests across 4 verdict sections? — Pass band +
   boundary + below/above + NaN/Inf + provenance per ratio
   verdict; minimal exit-code coverage; min-line boundary.

## Cross-reference

PARITY-002's `p002_fail_18_23x_ship_007_baseline` test
explicitly captures the observed regression value from
`2026-04-26 session SHIP-007 narrowing` memory — provides a
named regression-class sentinel for any future SHIP-007 work.

## Tests

24 unit tests, all green.