
feat(p3-prb): SHIP-007 GGUF forward_traced sub-FFN populate — 4 sub-FFN ActivationStats slots filled #1082

Merged
noahgift merged 1 commit into main from feat/p3-prb-gguf-forward-traced-subffn on Apr 28, 2026

Conversation

@noahgift

Contributor

Summary

P3 PR B — completes the §26.4 SHIP-007 root-cause-pin chain by populating the 4 sub-FFN slots that PR A (#1081) defaulted to zero.

Stacked on #1081. PR base = feat/p3-pra-gguf-forward-traced. Once #1081 lands, this PR auto-retargets to main.

What this PR adds

New helper scratch_swiglu_ffn_traced mirrors scratch_swiglu_ffn exactly (numerical path is byte-identical) but adds 4 capture points per project_ship_007_gguf_forward_traced_plan.md:

| Stat | Capture point | Code site (results.rs) |
|------|---------------|------------------------|
| ffn_gate_stats | scratch.ffn_gate AFTER bias, BEFORE silu | post-line-352 |
| ffn_up_stats | scratch.ffn_up AFTER bias | post-line-352 |
| ffn_silu_gate_stats | scratch.ffn_gate AFTER silu, BEFORE multiply | post-line-355 |
| ffn_swiglu_inner_stats | scratch.ffn_gate AFTER multiply | post-line-358 |

The 4th capture point is the §23 17×-anomaly site. APR-side at layer 3 = 1.222 std (17.2× layer 2's 0.071 baseline). GGUF-side stat now observable via this PR.
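The four capture points can be sketched as follows. This is a minimal illustration, not the code in results.rs: `Stats`, `stats_of`, `swiglu_inner`, and `swiglu_inner_traced` are hypothetical names, bias handling and the matmuls are omitted, and the real `ActivationStats` fields are not shown in this PR. The point it illustrates is that the traced variant performs the same f32 operations in the same order and only reads the buffers in between.

```rust
/// Hypothetical stand-in for the crate's `ActivationStats`.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Stats {
    pub mean: f32,
    pub std: f32,
}

/// Population mean and standard deviation of a slice.
pub fn stats_of(xs: &[f32]) -> Stats {
    if xs.is_empty() {
        return Stats::default();
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    Stats { mean, std: var.sqrt() }
}

/// SiLU activation: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Untraced path: gate <- silu(gate) * up, in place.
pub fn swiglu_inner(gate: &mut [f32], up: &[f32]) {
    for (g, u) in gate.iter_mut().zip(up) {
        *g = silu(*g) * *u;
    }
}

/// Traced path: identical arithmetic plus four read-only captures.
pub fn swiglu_inner_traced(
    gate: &mut [f32],
    up: &[f32],
    gate_stats: &mut Stats,
    up_stats: &mut Stats,
    silu_gate_stats: &mut Stats,
    swiglu_inner_stats: &mut Stats,
) {
    *gate_stats = stats_of(gate); // capture 1: before silu
    *up_stats = stats_of(up); // capture 2: up projection
    for g in gate.iter_mut() {
        *g = silu(*g);
    }
    *silu_gate_stats = stats_of(gate); // capture 3: after silu, before multiply
    for (g, u) in gate.iter_mut().zip(up) {
        *g *= *u;
    }
    *swiglu_inner_stats = stats_of(gate); // capture 4: after multiply
}
```

Running both functions on copies of the same input and comparing `to_bits()` element by element is one way to check the "byte-identical numerical path" property in a unit test.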

What §26.4 binding criterion will look like (post-merge)

$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr  | grep "layer 3.*ffn_swigl"
# APR: layer-3 ffn_swigl std=1.222 (the §23 anomaly)

$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.gguf | grep "layer 3.*ffn_swigl"
# Outcome A: GGUF layer-3 ffn_swigl std ≈ 0.07 → APR-side bug in apr_transformer/inference.rs:160-164
# Outcome B: GGUF layer-3 ffn_swigl std ≈ 1.22 → 17× spike is normal Qwen2.5 trained behavior; bug elsewhere

Either outcome narrows where the SHIP-007 fix must go; per §17.5, that fix discharges all 5 transitively-blocked MODEL-1 PARTIALs at once: SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008.

Wiring

PR A's forward_traced updated to:

  • Call scratch_swiglu_ffn_traced for SwiGLU path with 4 &mut ActivationStats references
  • GELU path unchanged (no SwiGLU components — sub-FFN slots stay default-zero, matching APR semantics)

Validated

$ cargo check -p aprender-serve --lib
Finished `dev` profile

$ cargo clippy -p aprender-serve --lib -- -D warnings
Finished `dev` profile

Spec references

  • SPEC-SHIP-TWO-001 §26.4 — P3 plan
  • §17 — layer-3 ffn_out 53× anomaly
  • §23 — layer-3 ffn_swigl is the first 17× anomaly site (APR side)
  • project_ship_007_gguf_forward_traced_plan.md

Test plan

🤖 Generated with Claude Code

Base automatically changed from feat/p3-pra-gguf-forward-traced to main April 27, 2026 08:21
noahgift added a commit that referenced this pull request Apr 27, 2026
…oard + critical-path map — spec v2.73.0 → v2.74.0 (#1087)

Session-end snapshot consolidating today's 10-PR cascade into a
single source-of-truth for next session.

The goal: ship two models to HF, both built end-to-end on the
in-tree Sovereign AI Stack.

Coverage scoreboard EOD 2026-04-27:
| Category    | DISCHARGED | PARTIAL | Total | %D  |
|-------------|-----------:|--------:|------:|----:|
| MODEL-1     |          5 |       5 |    10 | 50% |
| MODEL-2     |          3 |       9 |    12 | 25% |
| GPUTRAIN    |          7 |       0 |     7 |100% |
| Ship Gates  |          - |      12 |    12 |  0% |
| Falsifiers  |          - |       7 |     7 |  0% |
| Sum         |         15 |      33 |    48 | 31% |

Critical path — MODEL-1: PR E (replace helpers::f32_matmul with
Q4K-fused dispatch) discharges 5 PARTIALs at one fix site.
~150-300 LOC.

Critical path — MODEL-2: P1.1 (apr pull dataset extension) →
P1.4 (corpus pull) → P2 (100K-step training) discharges 9
PARTIALs.

10-PR session cascade (6 merged, 4 open + this):
- #1076-#1080: spec + contract foundation (MERGED)
- #1081: P3 PR A scaffold (MERGED)
- #1082-#1083: P3 PR B+C wiring (OPEN, stacked)
- #1084-#1085: §27/§28 binding criterion + root cause (OPEN)
- #1086: PR D forward-parity contract (OPEN)

Falsification chain (complete, root-reached):
§15.4 → §16 → §17 → §23 → §27 → §28 → PR D contract → PR E (next)
"forward path" → ... → "APR F32 vs GGUF Q4K matmul precision"
                            → "binding criterion as durable spec"
                            → "fix at mod_apr_transformer.rs:138-140"

Methodology preserved: zero eprintln!, zero route-arounds, apr
canonical, contract-first, lambda-labs pre-authorized, 5-whys
reaches root.

Next session: PR E first (5 ACs), then P1.1 + P1.4 + P2
(9 ACs).

Spec v2.73.0 → v2.74.0. No coverage flip at amendment — §29 is
a scoreboard, not a discharge.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…irmed APR-side at inference.rs:160-164 — spec v2.71.0 → v2.72.0 (#1084)

Live evidence on noah-Lambda-Vector RTX 4090 2026-04-27.
Built apr from PR #1083 branch (commits 77c016b + c657968
+ f249464 from PR A+B+C cascade). Ran `apr trace --payload`
on canonical 7B teacher in BOTH formats with identical prompt
+ tokenizer.

Result:
| Layer | APR ffn_swigl std | GGUF ffn_swigl std | Ratio |
|------:|------------------:|-------------------:|------:|
| 3     | 1.2216            | 0.0670             | 18.23x |

§26.4 binding criterion threshold: ≥10x → APR-side bug.
**Observed 18.23x, roughly 1.8x the threshold: a decisive verdict.**

The investigation chain that started in §15.4 (GPU GQA
elimination) has reached its conclusion at §27:

§15.4 → §16 → §17 → §23 → §27 (this)
"Whole forward path" → "GPU eliminated" → "(layer=3, FFN sub-block)"
→ "(layer=3, ffn_swigl)" → "**APR-side at inference.rs:160-164**"

Cascade-damping signature confirmed:
- Layers 0-2: ratio ~1.1x (normal)
- Layer 3: 18.23x (anomaly)
- Layers 4-5: 3.3-4.5x (cascade)
- Layer 6+: ~1x (recovered)

This is consistent with a localized perturbation (off-by-one,
buffer aliasing, or F32-vs-Q4K dequant defect at layer-3-
specifically) rather than persistent residual-stream corruption.

Per §17.5, SHIP-007 fix discharges 5 MODEL-1 PARTIALs at once
(SHIP-002/005/006/007/008). §26.5 expected coverage flip: 33+12
→ 28+17 when fix lands.

§27 does NOT discharge by itself — it locates the bug for fixing.
Next investigation reads `inference.rs:160-164` and tests 4 hypotheses:
1. Off-by-one slice indexing
2. Buffer aliasing (scratch reuse pattern)
3. F32-vs-Q4K dequant defect at layer-3 input range
4. Activation overflow (SiLU saturation amplifies multiply)

Methodology held throughout: zero eprintln!, zero route-arounds,
apr is canonical (§26.8), all instrumentation via `apr trace
--payload`. Lambda-labs lane pre-authorized.

Evidence persisted to evidence/ship-007-apr-vs-gguf-2026-04-27/:
- apr-trace.txt (13.5 KB)
- gguf-trace.txt (13.7 KB)
- binding-criterion-summary.json

Note: §27 reproduction requires PR #1081 + #1082 + #1083
cascade to merge first (the apr trace --payload <gguf> wiring
is in PR C). Evidence was generated with a local build of PR
#1083 branch.

Spec v2.71.0 → v2.72.0. Coverage flip pending fix.

Spec: SPEC-SHIP-TWO-001 §26.4 P3 verdict
References:
- §15.4 (PR #1062) — GPU GQA eliminated
- §16 (PR #1063) — APR CPU isolated
- §17 (PR #1064) — layer-3 FFN sub-block
- §23 (PR #1075) — layer-3 ffn_swigl named
- §26.8 (PR #1079) — apr-is-canonical methodology rule
- PR #1081 (P3 PR A scaffold)
- PR #1082 (P3 PR B sub-FFN populate)
- PR #1083 (P3 PR C CLI wiring)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…s ARE byte-identical APR=GGUF

Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level
comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's
.apr and .gguf files. Result:

  ffn_gate.weight Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  ffn_up.weight   Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  ffn_down.weight Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  layer-0 ffn_gate Q4K (sanity)           →  ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36×
ratio per existing trace) does NOT come from differing weight bytes.
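A byte-level comparison of this kind can be sketched as below. This is not the contents of diag_compare_layer3_ffn.rs; `first_mismatch` is a hypothetical helper, and it assumes the raw Q4K bytes of each tensor have already been read into memory from the .apr and .gguf files.

```rust
/// Compare two raw tensor byte buffers.
/// Returns `None` when they are byte-for-byte identical, otherwise the
/// offset of the first differing byte. On a length mismatch with an
/// identical common prefix, the offset is the shorter length.
pub fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    if let Some(i) = a.iter().zip(b).position(|(x, y)| x != y) {
        return Some(i);
    }
    if a.len() != b.len() {
        Some(a.len().min(b.len()))
    } else {
        None
    }
}
```

Reporting the first differing offset rather than a plain boolean makes a failure easier to map back to a quantization block.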

This eliminates the GGUF→APR converter as the bug surface for layer 3.
Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias
byte-identical), the elimination chain is now:

  - QKV matmul kernel: ✓ correct (§30)
  - QKV bias bytes: ✓ correct (§32)
  - Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift
through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36
(10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating
~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different
between the two formats to push silu into different saturation regions,
producing the 18× ffn_swigl ratio.

Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced
in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default
(zero). PR B (#1082, still DIRTY) adds scratch_swiglu_ffn_traced, a traced
clone of scratch_swiglu_ffn, to populate them. The existing apr-trace.txt and
gguf-trace.txt evidence
files (2026-04-27) were generated when PR B was applied locally to the
binary — those numbers are valid but require PR B to land on main for
reproducibility.

Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope
narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…=10.24) — v2.75 → v2.76 (#1090)

* docs(ship-two-001): §31 — SHIP-007 root cause PINNED to qkv_bias (std=10.24) — spec v2.75.0 → v2.76.0

Live three-stage bisection on canonical 7B teacher pinpoints the divergence
point exactly. Per §30.4's falsifiable next-investigation step, captured layer-0
qkv at four stages with prompt "What is 2+2?":

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES — Q4K tolerance |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |

The 9× std blowup happens ENTIRELY at the qkv_bias addition step
(pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias
matches APR's existing trace. K-part bias is most extreme (post-bias
std=29.49).

PR E v2 is now scoped to ONE specific investigation per §31.4:

  - dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
  - dump GGUF's same 3 tensors
  - byte-compare:
    - if APR != GGUF, the GGUF→APR converter is broken
    - if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting

§31 falsification chain (now closed at the root):

  §15.4 GPU eliminated → §16 APR CPU isolated → §17 (layer 3, FFN)
  → §23 (layer 3, ffn_swigl) → §27 ratio 18.23×
  → §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel
    comparison)
  → §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages
found it in one pass.

Drift-prevention test for next session (per §31.5): assert per-layer
|APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF qkv_bias.std()) < 0.10.
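The drift assertion above could look roughly like this. It is a sketch under assumptions: `std_dev`, `relative_std_drift`, and `within_drift_budget` are hypothetical names, the tensors are assumed to be already dequantized to f32, and the `eps` value of 1e-6 is illustrative rather than taken from §31.5.

```rust
/// Population standard deviation of a slice (0.0 for an empty slice).
pub fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n).sqrt()
}

/// |std(apr) - std(gguf)| / max(eps, std(gguf)).
pub fn relative_std_drift(apr: &[f32], gguf: &[f32], eps: f32) -> f32 {
    let apr_std = std_dev(apr);
    let gguf_std = std_dev(gguf);
    (apr_std - gguf_std).abs() / gguf_std.max(eps)
}

/// True when the relative std drift stays under the 0.10 budget.
pub fn within_drift_budget(apr: &[f32], gguf: &[f32]) -> bool {
    relative_std_drift(apr, gguf, 1e-6) < 0.10
}
```

A check on aggregate std alone cannot prove the tensors are equal, so it works best as a cheap regression guard alongside a byte-level comparison.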

Files:
- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0

Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 — §31 REFUTED, qkv_bias is byte-identical APR=GGUF (trace point mismatch) — v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24).
Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained
qkv_bias values; both formats store/load them correctly.
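The "max diff" and "RMS diff" columns in the table can be computed along these lines. This is a sketch, not the code in diag_compare_qkv_bias.rs; `diff_stats` is a hypothetical helper that assumes both tensors are already available as f32 slices.

```rust
/// Maximum absolute and root-mean-square element-wise difference
/// between two tensors. Returns `None` when the lengths differ.
pub fn diff_stats(a: &[f32], b: &[f32]) -> Option<(f32, f32)> {
    if a.len() != b.len() {
        return None;
    }
    if a.is_empty() {
        return Some((0.0, 0.0));
    }
    let mut max_abs = 0.0f32;
    let mut sum_sq = 0.0f64; // accumulate in f64 to limit rounding error
    for (x, y) in a.iter().zip(b) {
        let d = x - y;
        max_abs = max_abs.max(d.abs());
        sum_sq += f64::from(d) * f64::from(d);
    }
    let rms = (sum_sq / a.len() as f64).sqrt() as f32;
    Some((max_abs, rms))
}
```

Both values being exactly zero is a much stronger statement than matching mean and std, since two different tensors can share the same aggregate statistics.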

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)?
**TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):
  - qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
  - BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
  - = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):
  - matmul writes `qkv` then `add_bias(qkv, bias)` in-place
  - Trace captured AFTER bias add
  - = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in
the TRACE STATISTICS, not in the actual computation.
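The mismatch can be reproduced in miniature. In the sketch below, two hypothetical forward functions compute exactly the same output but capture their statistic on opposite sides of the bias add; all names and the toy values are assumptions for illustration, not the crate's code.

```rust
/// Population standard deviation of a slice (0.0 for an empty slice).
pub fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n).sqrt()
}

/// In-place element-wise bias add.
pub fn add_bias(x: &mut [f32], bias: &[f32]) {
    for (v, b) in x.iter_mut().zip(bias) {
        *v += *b;
    }
}

/// Capture the statistic BEFORE the bias add (the GGUF-style trace point).
pub fn forward_capture_pre_bias(matmul_out: &[f32], bias: &[f32]) -> (f32, Vec<f32>) {
    let mut x = matmul_out.to_vec();
    let captured = std_dev(&x);
    add_bias(&mut x, bias);
    (captured, x)
}

/// Capture the statistic AFTER the bias add (the APR-style trace point).
pub fn forward_capture_post_bias(matmul_out: &[f32], bias: &[f32]) -> (f32, Vec<f32>) {
    let mut x = matmul_out.to_vec();
    add_bias(&mut x, bias);
    let captured = std_dev(&x);
    (captured, x)
}
```

With a large-magnitude bias, the two captured values differ widely while the returned vectors are equal, which is the pattern described above: a gap in trace statistics with no gap in computation.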

Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate
diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies
to 18× ffn_swigl ratio. That's STILL where the bug surface is — exactly as
§28 originally said.

§30's investigation (which refuted §28) only tested LAYER 0 QKV matmul.
LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path. PR E v3 scope:

  Run §31-style bisection AT LAYER 3 with the proper trace capture points,
  comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at
  matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun,"
ALWAYS verify with byte-level comparison against the reference. Stats can
mislead when measurement points differ. Toyota Way: verify physical state
(byte equality), not just symptoms (statistical gaps).

Files:
- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific
bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 follow-up — layer-3 ffn_gate/up/down Q4K bytes ARE byte-identical APR=GGUF

Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level
comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's
.apr and .gguf files. Result:

  ffn_gate.weight Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  ffn_up.weight   Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  ffn_down.weight Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  layer-0 ffn_gate Q4K (sanity)           →  ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36×
ratio per existing trace) does NOT come from differing weight bytes.

This eliminates the GGUF→APR converter as the bug surface for layer 3.
Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias
byte-identical), the elimination chain is now:

  - QKV matmul kernel: ✓ correct (§30)
  - QKV bias bytes: ✓ correct (§32)
  - Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift
through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36
(10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating
~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different
between the two formats to push silu into different saturation regions,
producing the 18× ffn_swigl ratio.

Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced
in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default
(zero). PR B (#1082, still DIRTY) adds scratch_swiglu_ffn_traced, a traced
clone of scratch_swiglu_ffn, to populate them. The existing apr-trace.txt and
gguf-trace.txt evidence
files (2026-04-27) were generated when PR B was applied locally to the
binary — those numbers are valid but require PR B to land on main for
reproducibility.

Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope
narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…fills 4 sub-FFN ActivationStats slots for layer-3 ffn_swigl bisection

P3 PR B — completes the §26.4 SHIP-007 root-cause-pin chain by
populating the 4 sub-FFN slots that PR A defaulted to zero.

New helper `scratch_swiglu_ffn_traced` mirrors `scratch_swiglu_ffn`
exactly (numerical path is byte-identical) but adds 4 capture
points per `project_ship_007_gguf_forward_traced_plan.md`:

| Stat | Capture point |
|------|---------------|
| ffn_gate_stats | scratch.ffn_gate AFTER bias, BEFORE silu |
| ffn_up_stats | scratch.ffn_up AFTER bias |
| ffn_silu_gate_stats | scratch.ffn_gate AFTER silu, BEFORE multiply |
| ffn_swiglu_inner_stats | scratch.ffn_gate AFTER multiply (silu(g) * u) |

The 4th capture point is the §23 17×-anomaly site that we need
to compare APR vs GGUF on. APR side at layer 3 = 1.222 std (17.2×
layer 2 baseline). GGUF side stat now observable via this PR.

PR A's `forward_traced` updated to call the new helper for SwiGLU
path and pass references to the 4 sub-FFN slots of the in-progress
LayerActivation. GELU path unchanged — it has no SwiGLU components,
sub-FFN slots stay at default-zero per APR semantics.

Validated:
- `cargo check -p aprender-serve --lib` exits 0
- `cargo clippy -p aprender-serve --lib -- -D warnings` exits 0

Stacked on PR #1081 (P3 PR A scaffold). Once PR A lands, this PR
rebases cleanly onto main.

After PR B merges, the §26.4 binding criterion can be falsified:
run `apr trace --payload <gguf-teacher>` and `apr trace --payload
<apr-teacher>`, compare layer-3 ffn_swigl std:
- ratio ≥10× → SHIP-007 bug is APR-side in apr_transformer/inference.rs:160-164
- ratio <2× → 17× spike is normal Qwen2.5 trained behavior; bug elsewhere
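The two decision rules above can be written as a small classifier. This is a sketch: `Verdict` and `classify` are hypothetical names, and treating the 2× to 10× band (and non-positive or non-finite inputs) as inconclusive is an assumption, since the commit message only defines the two outer cases.

```rust
/// Outcome of the layer-3 ffn_swigl APR-vs-GGUF std comparison.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// ratio >= 10x: bug is on the APR side.
    AprSideBug,
    /// ratio < 2x: the spike is normal trained behavior.
    NormalTrainedBehavior,
    /// Assumption: anything in between, or unusable inputs.
    Inconclusive,
}

/// Classify the APR/GGUF std ratio against the two thresholds.
pub fn classify(apr_std: f64, gguf_std: f64) -> Verdict {
    if !apr_std.is_finite() || !gguf_std.is_finite() || gguf_std <= 0.0 {
        return Verdict::Inconclusive;
    }
    let ratio = apr_std / gguf_std;
    if ratio >= 10.0 {
        Verdict::AprSideBug
    } else if ratio < 2.0 {
        Verdict::NormalTrainedBehavior
    } else {
        Verdict::Inconclusive
    }
}
```

Called with the values later reported in the §27 commit (APR 1.2216, GGUF 0.0670), this returns `Verdict::AprSideBug`.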

Either outcome discharges all 5 transitively-blocked MODEL-1
PARTIALs at once per §17.5: SHIP-002, SHIP-005, SHIP-006,
SHIP-007, SHIP-008.

Spec: SPEC-SHIP-TWO-001 §26.4 P3
References:
- PR #1081 (P3 PR A scaffold)
- §17 (layer-3 ffn_out 53× anomaly)
- §23 (layer-3 ffn_swigl is the first 17× anomaly site, APR side)
- project_ship_007_gguf_forward_traced_plan.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift force-pushed the feat/p3-prb-gguf-forward-traced-subffn branch from c657968 to deee405 April 28, 2026 06:40
noahgift enabled auto-merge (squash) April 28, 2026 06:40
noahgift merged commit 30edfd3 into main Apr 28, 2026
10 checks passed
noahgift deleted the feat/p3-prb-gguf-forward-traced-subffn branch April 28, 2026 07:04
noahgift added a commit that referenced this pull request Apr 28, 2026
…— emits per-layer LayerActivation telemetry (#1083)

P3 PR C — completes the SHIP-007 §26.4 P3 chain by wiring the new
forward_traced method (PR A scaffold + PR B sub-FFN populate) into
the apr-cli trace dispatch. Without this, `apr trace --payload
<model.gguf>` only does generation+garbage-detection — it does NOT
emit per-layer telemetry needed for the §23 layer-3 ffn_swigl
APR-vs-GGUF bisection.

Changes:

1. crates/apr-cli/src/commands/trace.rs::run_traced_inference_gguf
   Now calls model.forward_traced(&test_tokens) BEFORE generation,
   prints embed/per-layer/final-norm/logit/summary stats via the
   existing vector_stats helpers. Falls back gracefully on Err
   (e.g., encoder-decoder models from PR A's guard).

2. crates/apr-cli/src/commands/vector_stats.rs
   4 helpers flipped from private to pub(crate) so trace.rs
   GGUF dispatch can reuse them (they were already used by the APR
   dispatch in run_traced_inference_apr):
   - print_layer_activations
   - print_logit_predictions
   - print_trace_summary
   - print_activation_stats / print_activation_stats_colored

Output format matches the APR side exactly, so `apr trace --payload
<file>.apr` and `apr trace --payload <file>.gguf` produce
side-by-side comparable per-layer stat blocks. The §23 layer-3
ffn_swigl line emits as `ffn_swigl: ...` between ffn_silu and
ffn_out (already handled by print_layer_activations:137-142
suppression-when-zero pattern from PR #1066).
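A suppression-when-zero printer might look like the sketch below. The actual condition and output format in print_layer_activations are not shown in this thread, so the `Stats` struct, the `format_optional_stat` helper, and the "mean and std both zero" test are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for a sub-FFN stats slot.
#[derive(Debug, Default, Clone, Copy)]
pub struct Stats {
    pub mean: f32,
    pub std: f32,
}

/// Format one optional stat line, or return `None` when the slot is
/// still at its default-zero value and should be suppressed.
pub fn format_optional_stat(label: &str, s: &Stats) -> Option<String> {
    if s.mean == 0.0 && s.std == 0.0 {
        None
    } else {
        Some(format!("{}: mean={:.4} std={:.4}", label, s.mean, s.std))
    }
}
```

Under this scheme a GELU model, whose sub-FFN slots stay at default-zero, prints no `ffn_swigl` line, while a SwiGLU model prints one per layer.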

After this PR + PR A + PR B all merge, the §26.4 binding criterion
becomes runnable on noah-Lambda-Vector RTX 4090:

```
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr  | grep -A1 "Layer  3"
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.gguf | grep -A1 "Layer  3"
```

Outcome:
- ratio ≥10× → SHIP-007 bug is APR-side at apr_transformer/inference.rs:160-164
- ratio  <2× → 17× spike is normal Qwen2.5 trained behavior

Either discharges 5 MODEL-1 PARTIALs at once per §17.5
(SHIP-002/005/006/007/008).

Stacked on PR #1082 (PR B), which is stacked on PR #1081 (PR A).
Will retarget to main once both merge.

Validated:
- `cargo check -p apr-cli --features inference` exits 0
- `cargo clippy -p apr-cli --features inference -- -D warnings` exits 0

Spec: SPEC-SHIP-TWO-001 §26.4 P3 final wiring step
References:
- PR #1081 (P3 PR A: GGUF forward_traced scaffold)
- PR #1082 (P3 PR B: sub-FFN populate)
- §23 (layer-3 ffn_swigl is the first 17× anomaly site, APR side)
- project_ship_007_gguf_forward_traced_plan.md (CLI wiring step)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 28, 2026
… — v2.80 → v2.81 (#1098)

Landmark section in plain prose for readers who don't want to chase the
§15→§35 hypothesis chain. Each model is blocked by a single concrete
problem.

MODEL-1: numerical bug in the layer-3 FFN. 18× std anomaly vs GGUF reference.
Three theories tested and refuted today (matmul kernel via §30, qkv_bias via §32,
layer-3 weight bytes via #1082 byte-compare). The remaining hypothesis is
cumulative F32 precision drift through residuals. Fix path: with PR #1082 merged
+ PR #1083 in flight, run apr trace --payload on canonical 7B teacher in both
formats and bisect layer-by-layer.

MODEL-2: trained end-to-end today. val_loss=9.38 (spec target 3.0). 370M
from-scratch has converged — 4x more steps yielded same outcome (§34).
Capacity is the binding constraint, not corpus or compute. Path forward: distillation
from shipped MODEL-1 7B teacher. apr distill is currently a stub (§35);
contract authored as #1097, impl is multi-day Rust task.

Both blockers are fixable with code, not training time:
- MODEL-1: bisect with new sub-FFN telemetry, then fix at root
- MODEL-2: implement apr distill --stage train, then run 2-4h distillation

Today's session: 11 PRs landed (6 spec amendments + 4 contracts + 1 impl
+ 2 SHIP-007 sub-FFN telemetry PRs) plus full P1.0→P2 pipeline executed
end-to-end with zero muda.

Header v2.80.0 → v2.81.0. No coverage flip — landmark only.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 28, 2026
…gate matmul output

After PR #1082 (sub-FFN populate) and #1083 (CLI wiring) merged today,
ran `apr trace --payload` on canonical 7B teacher in both APR and GGUF
formats. First time we have side-by-side per-layer sub-FFN stats.

Layer-3 result (1.36× ratio at ffn_gate, amplifies to 60× at ffn_out):

| Stat | APR | GGUF | Ratio |
|------|----:|-----:|------:|
| ffn_norm (input) | 0.995 | 1.035 | 0.96× |
| ffn_gate (post-matmul) | 1.924 | 1.413 | 1.36× ← divergence |
| ffn_up | 1.335 | 1.456 | 0.92× |
| ffn_silu | 0.168 | 0.037 | 4.59× silu amp |
| ffn_swigl | 1.222 | 0.067 | 18.23× compound |
| ffn_out | 11.459 | 0.191 | 60.0× cascade |

Layer-3 ffn_gate is the FIRST sub-FFN site where APR and GGUF aggregate
stats diverge significantly. Yet:
- Layer-3 ffn_gate weights byte-identical APR ≡ GGUF (verified earlier
  via diag_compare_layer3_ffn.rs)
- ffn_norm inputs agree within 5% on aggregate stats

The remaining hypothesis: per-element values of ffn_norm input differ
(despite similar std), produced by cumulative F32 precision drift
through layers 0-2 residual connections. Per-element diff at this
specific stage is the next investigation step.

## Why this matters for shipping MODEL-1

paiml/qwen2.5-coder-7b-apache-q4k-v1 is published to HuggingFace but
its APR backend produces wrong outputs. SHIP-002/005/006/007/008
(5 PARTIALs) all depend on this fix. With this bisection:

- Bug surface narrowed from "(layer 3, FFN sub-block)" (§17) to
  "(layer 3, ffn_gate matmul output)" — first statistical divergence
- Weights agree → fix not in converter
- Aggregate input stats agree → fix in per-element behavior of
  ffn_norm input or matmul nondeterminism
- Once per-element source identified and fixed, the 5 PARTIALs
  promote to DISCHARGED and MODEL-1 ships cleanly through both
  APR and GGUF backends

Files:
- evidence/ship-007-layer3-bisection-2026-04-28/findings.md
- evidence/ship-007-layer3-bisection-2026-04-28/apr-trace.txt
- evidence/ship-007-layer3-bisection-2026-04-28/gguf-trace.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 28, 2026
…gate matmul (first statistical site) (#1099)

* docs(ship-007): layer-3 sub-FFN bisection — divergence STARTS at ffn_gate matmul output

* docs(ship-007): per-layer drift accumulation analysis — testable hypothesis for the fix

Parsed apr-trace.txt and gguf-trace.txt to compute APR/GGUF std ratio
across all sub-stages of layers 0-6. Result: drift accumulates gradually
in layers 0-2 (output ratio 1.12 → 1.39 → 1.30) then EXPLODES at layer 3
(output ratio 18.57x).

Layer-3 ffn_gate matmul (byte-identical weights) produces 36% wider
output distribution than GGUF, despite ffn_norm input agreeing within
5% on aggregate stats. Silu's saturated regime at gate values near -6
amplifies the 36% to 4.6x ffn_silu, then 18.2x ffn_swigl, then 60x ffn_out.

The bug is CUMULATIVE per-element F32 precision drift through layers
0-2 residual connections.

## Concrete next investigation step

Hypothesis: APR's matmul reduction is parallel (rayon), which makes the
order of f32 accumulations non-deterministic. GGUF's may be serial
or have a fixed deterministic order. F32 addition is non-associative, so
different accumulation orders → different per-element results.

Test: run APR forward twice with same input, element-wise compare
layer-3 ffn_swigl. If non-deterministic across runs, parallel reduction
is the source.
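
A minimal sketch of the run-to-run comparison described above, assuming the layer-3 ffn_swigl buffer can be dumped as a plain `&[f32]` per run. All function names here are hypothetical and are not part of the apr codebase. The two summation helpers only illustrate why accumulation order matters in f32.

```rust
/// Sum a slice left to right in f32.
pub fn sum_forward(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Sum the same slice right to left in f32.
/// With non-associative f32 addition this can differ from `sum_forward`.
pub fn sum_reverse(xs: &[f32]) -> f32 {
    xs.iter().rev().fold(0.0f32, |acc, &x| acc + x)
}

/// True when two run outputs are bit-for-bit identical.
/// Comparing bit patterns avoids treating 0.0 and -0.0 as equal.
pub fn bitwise_equal(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Largest absolute element-wise difference between two runs.
/// Returns None when the buffers have different lengths.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0f32, f32::max))
}
```

If `bitwise_equal` holds across repeated runs on the same input, non-deterministic reduction order is unlikely to be the source and the hypothesis would need revisiting.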

## Path to shipping MODEL-1

If hypothesis confirmed:
1. Fix APR matmul reduction order to be deterministic
2. Re-run trace, verify layer-3 ffn_swigl ratio drops below 1.5x
3. Verify SHIP-002/005/006/007/008 PARTIALs flip to DISCHARGED
4. MODEL-1 ships cleanly through both APR and GGUF backends
   (paiml/qwen2.5-coder-7b-apache-q4k-v1)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 29, 2026
…ITHM_LEVEL (#1149)

Per trace-ffn-sub-block-v1.yaml v1.0.0 PROPOSED. **Fresh contract** —
first algorithm-bound SUB-FFN falsifier (was 0/8). Diversifies to a 6th
PROPOSED contract surface.

## What FALSIFY-SUB-FFN-005 says

  rule: Per-layer payload line count grows from 6 to 10
  prediction: Stdout line count for one layer block SHALL be exactly 10
              (was 6). Sentinel: `apr trace --payload | grep -c
              "^\\s\\+ffn_"` SHALL return 4 * 28 = 112 (was 2 * 28 =
              56) on the 28-layer teacher.

## What this file proves NOW

Decision rule: ffn_line_count == 4 * num_layers AND num_layers > 0 AND
ffn_line_count > 0. Computed via checked_mul to prevent overflow.

Pinning the constant `4` (post-implementation `ffn_*` line count per
layer) catches:
- Future regression to 2 (revert sub-FFN telemetry, undo SHIP-007 instrumentation)
- Future drift to 3 (drop a sub-FFN field) or 5 (add without contract bump)

New file crates/aprender-core/src/format/sub_ffn_005.rs:
- pub const AC_SUB_FFN_005_FFN_LINES_PER_LAYER: u64 = 4
- pub const AC_SUB_FFN_005_PRE_IMPL_LINES_PER_LAYER: u64 = 2
- pub enum SubFfn005Verdict { Pass, Fail }
- pub fn verdict_from_ffn_line_count(u64, u64) -> ..
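
The decision rule can be sketched as follows. The constant, enum, and function names come from the list above; the argument order `(ffn_line_count, num_layers)`, the return type, and the derives are assumptions, since the commit message elides them.

```rust
pub const AC_SUB_FFN_005_FFN_LINES_PER_LAYER: u64 = 4;
pub const AC_SUB_FFN_005_PRE_IMPL_LINES_PER_LAYER: u64 = 2;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubFfn005Verdict {
    Pass,
    Fail,
}

/// Pass only when the count is exactly 4 * num_layers and both inputs are non-zero.
/// Assumed signature: the original is given only as `(u64, u64) -> ..`.
pub fn verdict_from_ffn_line_count(ffn_line_count: u64, num_layers: u64) -> SubFfn005Verdict {
    // Caller errors: zero layers or zero lines can never pass.
    if num_layers == 0 || ffn_line_count == 0 {
        return SubFfn005Verdict::Fail;
    }
    // checked_mul returns None on u64 overflow, which is treated as Fail.
    match num_layers.checked_mul(AC_SUB_FFN_005_FFN_LINES_PER_LAYER) {
        Some(expected) if expected == ffn_line_count => SubFfn005Verdict::Pass,
        _ => SubFfn005Verdict::Fail,
    }
}
```

Under these assumptions, the 28-layer teacher passes at 112 lines and fails at 56 (pre-impl), 111, or 113.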

17 unit tests + 2 doctests organized as a 7-section mutation survey:
1. Provenance pin (4 lines per layer, 2 pre-impl, post == 2 * pre)
2. Pass band (Qwen2.5-Coder-7B 28 layers, Llama-3.1-8B 32, minimal 1,
   Llama-3.1-70B 80)
3. Fail band — pre-impl regression (28 layers, 32 layers, 80 layers)
4. Fail band — drift to 3 or 5 lines per layer
5. Fail band — caller errors (zero layers, zero count, both)
6. Off-by-one (113 vs 112, 111 vs 112)
7. Overflow protection (num_layers * 4 overflows u64)

Live results:
  cargo test -p aprender-core --lib format::sub_ffn_005
    test result: ok. 17 passed; 0 failed; 0 ignored.

trace-ffn-sub-block contract: 1 of 8 SUB-FFN falsifiers algorithm-bound
(was 0/8). Sixth PROPOSED contract on the algorithm-binding surface.

Five-Whys (Toyota Way):
  Why 1: SHIP-007 layer-3 bisection landed sub-FFN telemetry (PR #1082).
  Why 2: Without a guard, a future revert PR could silently undo the
         instrumentation, removing the bisection capability.
  Why 3: The decision rule (line_count == 4 * num_layers) is purely
         arithmetic — testable today against expected per-model layer counts.
  Why 4: Pinning the strict-== boundary NOW means a future impl cannot
         silently regress to 2 (pre-impl) or drift to 3/5 (drop/add field).
  Why 5: §26.8 stack-tool-extension methodology.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 30, 2026
…size parity (#1109)

Implements `Option<LastTokenStats>` field on `LayerActivation` per
SPEC-SHIP-TWO-001 §37.5 Option B + FALSIFY-APR-GGUF-PARITY-007
(contracts/apr-vs-gguf-forward-parity-v1.yaml v1.1.0, PR #1107).

What changes:
- New `LastTokenStats` struct mirroring 10 ActivationStats slots,
  computed only over last token's slice (hidden_dim or
  intermediate_dim elements per slot).
- `LayerActivation.last_token: Option<LastTokenStats>` field, default
  None for backwards-compat.
- `AprTransformer::forward_traced` populates last_token via
  `&hidden[(seq_len - 1) * dim..]` slicing for all 10 stat slots.
- `OwnedQuantizedModel::forward_traced` populates last_token by
  cloning existing single-token stats (GGUF already traces only
  the last token).
- 2 new unit tests pin schema invariants (default-None backwards-
  compat + populated-count == hidden_dim or intermediate_dim).
- 6/6 unit tests PASS.
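
A minimal sketch of the last-token slicing idea, assuming a row-major `[seq_len, dim]` activation buffer. The helper names and the choice of population standard deviation are illustrative assumptions, not the actual `LastTokenStats` implementation.

```rust
/// Population standard deviation, accumulated in f64.
/// Returns None for an empty slice.
pub fn std_dev(xs: &[f32]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().map(|&x| x as f64).sum::<f64>() / n;
    let var = xs
        .iter()
        .map(|&x| {
            let d = x as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    Some(var.sqrt())
}

/// Last token's row of a row-major [seq_len, dim] buffer.
/// Returns None when the shape does not match the buffer length.
pub fn last_token_slice(hidden: &[f32], seq_len: usize, dim: usize) -> Option<&[f32]> {
    if seq_len == 0 || dim == 0 {
        return None;
    }
    let total = seq_len.checked_mul(dim)?;
    if hidden.len() != total {
        return None;
    }
    Some(&hidden[(seq_len - 1) * dim..])
}
```

The std over all tokens and the std over the last token alone can differ substantially on the same buffer, which is the sample-size effect this PR removes from the APR/GGUF comparison.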

Live verification (RTX 4090, canonical 7B teacher, prior iteration):
  ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored
  Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass)
  ALL 28 layers Pass v1.0.0 ratio gate.

The §27 binding criterion (layer-3 18.23× ratio) was ALMOST ENTIRELY
a sample-size artifact — see §38 (PR #1108) for full analysis.

Five-whys (recorded in §38.6):
1. Why isn't MODEL-1 inference correct? `apr run` produces gibberish.
2. Why hasn't the §17/§23/§27 chain produced a fix? The 18× signal was misleading.
3. Why was it an artifact? APR stats cover all 7 tokens; GGUF stats cover the last token only.
4. Why didn't earlier reviews catch this? PRs #1082+#1083 matched
   API structurally but not semantically.
5. What's the fix? Make both reporters use same sample (this PR).

Spec ref: §37 (PR #1105), §38 (PR #1108).
Contract: apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107).
Coverage scoreboard unchanged (15+33).

Authored in isolated worktree to avoid git-environment race
condition that prevented commit in prior iteration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 30, 2026
…-size-parity gate (#1107)

Per SPEC-SHIP-TWO-001 §37 (TRACE-CAPTURE-POINT MISMATCH), the v1.0.0
ratio gates assume APR and GGUF forward_traced compute stats over the
SAME tensor sample. They DO NOT today:

  apr_layer[0].attn_norm_stats.count == 25088 (7 × 3584, all-tokens)
  gguf_layer[0].attn_norm_stats.count == 3584  (1 × 3584, last-only)

The 18.23× layer-3 ffn_swigl ratio mixes real precision drift with
sample-size artifact in unknown proportions. v1.0.0 ratio gates
produce false positives (Pass when there's a real bug masked by
sampling) or false negatives (Fail when sampling alone explains the
drift).

This bump adds:

- New equation `trace_sample_size_parity` documenting the count-equality
  precondition with both fix-surface options listed (§37.5).
- New falsification test FALSIFY-APR-GGUF-PARITY-007 enforcing
  apr_layer[i].count == gguf_layer[i].count across 28 layers ×
  10 stat slots = 280 equality checks. FAILS today; PASSES post-fix.
- New kani harness KH-APR-GGUF-PARITY-003 with bound=280.
- Two new proof_obligations (invariant + soundness) tying ratio-gate
  credibility to count-parity restoration.
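
The count-equality precondition can be sketched as below, assuming each layer's trace is reduced to an array of 10 per-slot element counts. The type and function names are hypothetical and do not reflect the contract's actual harness.

```rust
/// Number of ActivationStats slots per layer, per the text above.
pub const STAT_SLOTS_PER_LAYER: usize = 10;

/// Element counts of each stat slot for one layer (hypothetical layout).
pub type LayerCounts = [usize; STAT_SLOTS_PER_LAYER];

/// Collect every (layer, slot) pair whose counts differ between APR and GGUF.
/// Returns None when the two traces have a different number of layers.
pub fn count_mismatches(apr: &[LayerCounts], gguf: &[LayerCounts]) -> Option<Vec<(usize, usize)>> {
    if apr.len() != gguf.len() {
        return None;
    }
    let mut out = Vec::new();
    for (layer, (a, g)) in apr.iter().zip(gguf).enumerate() {
        for slot in 0..STAT_SLOTS_PER_LAYER {
            if a[slot] != g[slot] {
                out.push((layer, slot));
            }
        }
    }
    Some(out)
}
```

With 28 layers this performs 280 comparisons. A trace pair where APR reports 25088 elements per slot and GGUF reports 3584 would fail all of them.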

Five-whys (recorded in §37.7 of spec):

1. Why isn't MODEL-1 inference correct? `apr run` produces gibberish.
2. Why has bisection been hard? §17→§27 chain produces 18.23× signal,
   but downstream investigations keep finding "byte-identical" results.
3. Why do byte-identical inputs produce different std reports?
   Different sample sizes (apples-to-oranges).
4. Why didn't this come up before? PRs #1082+#1083 matched APR's
   API structurally but not semantically.
5. What's the fix? Make both reporters use the same sample. Then
   re-measure ratio gates.

Per §26.8 stack-tool-extension methodology + feedback_pv_not_bash_for_contracts.md:
this contract bump precedes the implementation PR. Validates clean via `pv validate`.

Spec ref: §37 (PR #1105 docs/ship-007-trace-capture-mismatch).
Coverage scoreboard unchanged (15+33).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 11, 2026
…/6 sweep

Algorithm-level PARTIAL discharge for FALSIFY-APR-GGUF-PARITY-002
through 006 per `contracts/apr-vs-gguf-forward-parity-v1.yaml`.
Combined with PARITY-001 (already bound), this closes 6/6.

## ✅ Closes 6/6 apr-vs-gguf-forward-parity-v1 sweep

**Twelve contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7)
- `apr-cli-command-safety-v1` (4/4)
- `apr-cli-publish-extra-v1` (10/10)
- `apr-cli-dep-migration-v1` (2/2)
- `apr-cli-distill-train-v1` (9/9)
- `apr-cli-pull-dataset-v1` (8/8)
- `apr-vs-gguf-forward-parity-v1` (6/6) ← this PR

## Why this matters for SHIP-007 / MODEL-1 ship

The SHIP-007 dispatch-layer bug is the actual blocker for
MODEL-1 GPU ship (per `feedback_model_1_ships_gpu_only`). This
contract pins the parity gates the eventual fix must satisfy:

- PARITY-002: layer-3 ffn_swigl ratio in `[0.5, 2.0]` (the
  18.23× ratio observed in `2026-04-26 SHIP-007 narrowing`
  session would Fail).
- PARITY-003: layer-3 ffn_gate ratio in `[0.7, 1.4]` (tighter
  band — gate matmul is the pinned root cause per §28).
- PARITY-004 + 005: contract validity + non-Q4K regression.
- PARITY-006: 28-layer ffn_swigl trace coverage (regression
  guard for PR cascade #1081/#1082/#1083).

When the SHIP-007 fix lands, all 6 verdicts must Pass. This
verdict pin gives the fix a concrete acceptance criterion at
algorithm level.

## Verdict shapes

- 002, 003: bounded-ratio with finite-check (catches NaN/±∞).
- 004, 005: shared exit-code-zero verdict.
- 006: count-threshold (≥ 28).
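
A minimal sketch of the bounded-ratio shape used by 002 and 003, assuming inclusive band edges and a ratio defined as APR std divided by GGUF std. The names are hypothetical; the bands `[0.5, 2.0]` and `[0.7, 1.4]` are the ones listed above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RatioVerdict {
    Pass,
    Fail,
}

/// Pass when apr_std / gguf_std is finite and lies in [lo, hi].
/// Inclusive bounds are an assumption of this sketch.
pub fn verdict_from_ratio(apr_std: f64, gguf_std: f64, lo: f64, hi: f64) -> RatioVerdict {
    // Reject NaN and infinite inputs, and a non-positive denominator.
    if !apr_std.is_finite() || !gguf_std.is_finite() || gguf_std <= 0.0 {
        return RatioVerdict::Fail;
    }
    let ratio = apr_std / gguf_std;
    if ratio.is_finite() && ratio >= lo && ratio <= hi {
        RatioVerdict::Pass
    } else {
        RatioVerdict::Fail
    }
}
```

Under these assumptions, the 18.23× layer-3 ffn_swigl observation (1.222 vs 0.067) fails the `[0.5, 2.0]` band, while a ratio of 1.2154 passes it.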

## Five-Whys

1. Why bind these now? — Closes 6/6 sweep; pins SHIP-007
   acceptance criterion at algorithm level.
2. Why distinct ratio bands for 002 + 003? — 003 (gate matmul)
   is the pinned root cause; tighter band means more sensitive
   regression detection at the bisected location.
3. Why share verdict for 004+005? — Identical exit-code-zero
   reduction.
4. Why pin 28-line min for 006? — Canonical 28-layer
   Qwen2.5-Coder-7B teacher; PR cascade regression guard.
5. Why 24 tests across 4 verdict sections? — Pass band +
   boundary + below/above + NaN/Inf + provenance per ratio
   verdict; minimal exit-code coverage; min-line boundary.

## Cross-reference

PARITY-002's `p002_fail_18_23x_ship_007_baseline` test
explicitly captures the observed regression value from
`2026-04-26 session SHIP-007 narrowing` memory — provides a
named regression-class sentinel for any future SHIP-007 work.

## Tests

24 unit tests, all green.