feat(p3-pra): SHIP-007 GGUF forward_traced scaffold - 6 non-FFN LayerActivation fields per layer #1081
Merged
…N LayerActivation fields populated, 4 sub-FFN default-zero

P3 PR A of the SHIP-TWO-001 §26.4 root-cause-pin chain. Implements
`OwnedQuantizedModel::forward_traced(&[u32]) -> Result<ForwardTrace>`,
mirroring `AprTransformer::forward_traced`.

Per project_ship_007_gguf_forward_traced_plan.md Option A: full clone of
the orchestrator (forward_single_with_scratch at results.rs:447-587) with
inline ActivationStats::from_slice() at each layer boundary. Hot-path safety
is preserved: production forward_single_with_scratch is unchanged.

PR A populates the 6 non-FFN LayerActivation fields per layer:
- attn_norm_stats (post attn_norm)
- qkv_stats (post QKV projection, qkv_dim slice)
- attn_out_stats (post attn_proj)
- ffn_norm_stats (post FFN norm)
- ffn_out_stats (post ffn_down)
- output_stats (post FFN residual)

The 4 sub-FFN fields default to zero (PR B will fill them via a cloned
scratch_swiglu_ffn_traced):
- ffn_gate_stats
- ffn_up_stats
- ffn_silu_gate_stats
- ffn_swiglu_inner_stats

Phase 1: prefill all tokens except the last via the existing
forward_single_with_scratch (matches production semantics, fills the KV cache).
Phase 2: process the LAST token through an inlined orchestrator that captures
stats. APR-side semantics: one LayerActivation per layer, taken from the last
token's layer states.

Encoder-decoder paths (T5/Whisper) explicitly Err-out; they are out of scope
for PR A per plan memory.

Visibility changes: 3 helpers in results.rs flipped from private to
pub(crate) so the new sibling module can call them:
- scratch_attention_block
- scratch_swiglu_ffn
- scratch_gelu_ffn

Validated:
- `cargo check -p aprender-serve --lib` exits 0
- `cargo clippy -p aprender-serve --lib -- -D warnings` exits 0
- `TracedForward` trait impl follows AprTransformer pattern
  (delegate &mut self → immutable inherent method)

Per the spec §26.4 binding criterion, this PR alone does NOT discharge P3.
That requires PR B (sub-FFN populate) plus a comparison run yielding the APR
vs GGUF layer-3 ffn_swigl ratio. PR A just adds the scaffold so PR B has
somewhere to put the sub-FFN stats.

Spec: SPEC-SHIP-TWO-001 §26.4 P3

References:
- project_ship_007_gguf_forward_traced_plan.md (Plan agent design 2026-04-26)
- §17 (layer-3 ffn_out anomaly named)
- §23 (layer-3 ffn_swigl is the first 17× anomaly site, APR side)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
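The scaffold idea above (summarise a buffer at each layer boundary, leave the sub-FFN slots at their zero default) can be sketched as follows. This is a minimal illustration, not the project's code: the `ActivationStats` and `LayerActivation` definitions here are hypothetical stand-ins, and only the field names are taken from the commit message.

```rust
// Hypothetical stand-ins: the real `ActivationStats` / `LayerActivation`
// definitions are not shown in this PR; only the field names come from it.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct ActivationStats {
    pub mean: f32,
    pub std: f32,
    pub min: f32,
    pub max: f32,
}

impl ActivationStats {
    /// Summarise a buffer (population standard deviation, f64 accumulation).
    pub fn from_slice(xs: &[f32]) -> Self {
        if xs.is_empty() {
            return Self::default();
        }
        let n = xs.len() as f64;
        let mean = xs.iter().map(|&x| f64::from(x)).sum::<f64>() / n;
        let var = xs.iter().map(|&x| (f64::from(x) - mean).powi(2)).sum::<f64>() / n;
        let min = xs.iter().copied().fold(f32::INFINITY, f32::min);
        let max = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        Self { mean: mean as f32, std: var.sqrt() as f32, min, max }
    }
}

#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct LayerActivation {
    // The 6 fields PR A populates.
    pub attn_norm_stats: ActivationStats,
    pub qkv_stats: ActivationStats,
    pub attn_out_stats: ActivationStats,
    pub ffn_norm_stats: ActivationStats,
    pub ffn_out_stats: ActivationStats,
    pub output_stats: ActivationStats,
    // The 4 sub-FFN fields PR A leaves at their default (zero).
    pub ffn_gate_stats: ActivationStats,
    pub ffn_up_stats: ActivationStats,
    pub ffn_silu_gate_stats: ActivationStats,
    pub ffn_swiglu_inner_stats: ActivationStats,
}

fn main() {
    // Pretend this is the hidden state after the FFN residual of one layer.
    let hidden = [1.0_f32, 2.0, 3.0, 4.0];
    let layer = LayerActivation {
        output_stats: ActivationStats::from_slice(&hidden),
        ..Default::default() // sub-FFN slots stay zero, as in PR A
    };
    println!("{:?}", layer.output_stats);
    println!("{:?}", layer.ffn_swiglu_inner_stats);
}
```

Deriving `Default` is what makes "default-zero" cheap in this sketch: any slot that a later PR has not wired up yet is simply left out of the struct literal.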
noahgift added a commit that referenced this pull request Apr 27, 2026

…oard + critical-path map, spec v2.73.0 → v2.74.0 (#1087)

Session-end snapshot consolidating today's 10-PR cascade into a single
source of truth for the next session. The goal: ship two models to HF, both
built end-to-end on the in-tree Sovereign AI Stack.

Coverage scoreboard, EOD 2026-04-27:

| Category    | DISCHARGED | PARTIAL | Total |   %D |
|-------------|-----------:|--------:|------:|-----:|
| MODEL-1     |          5 |       5 |    10 |  50% |
| MODEL-2     |          3 |       9 |    12 |  25% |
| GPUTRAIN    |          7 |       0 |     7 | 100% |
| Ship Gates  |          - |      12 |    12 |   0% |
| Falsifiers  |          - |       7 |     7 |   0% |
| Sum         |         15 |      33 |    48 |  31% |

Critical path, MODEL-1: PR E (replace helpers::f32_matmul with Q4K-fused
dispatch) discharges 5 PARTIALs at one fix site. ~150-300 LOC.

Critical path, MODEL-2: P1.1 (apr pull dataset extension) → P1.4 (corpus
pull) → P2 (100K-step training) discharges 9 PARTIALs.

10-PR session cascade (6 merged, 4 open + this):
- #1076-#1080: spec + contract foundation (MERGED)
- #1081: P3 PR A scaffold (MERGED)
- #1082-#1083: P3 PR B+C wiring (OPEN, stacked)
- #1084-#1085: §27/§28 binding criterion + root cause (OPEN)
- #1086: PR D forward-parity contract (OPEN)

Falsification chain (complete, root-reached):
§15.4 → §16 → §17 → §23 → §27 → §28 → PR D contract → PR E (next)
"forward path" → ... → "APR F32 vs GGUF Q4K matmul precision" →
"binding criterion as durable spec" → "fix at mod_apr_transformer.rs:138-140"

Methodology preserved: zero eprintln!, zero route-arounds, apr canonical,
contract-first, lambda-labs pre-authorized, 5-whys reaches root.

Next session: PR E first (5 ACs), then P1.1 + P1.4 + P2 (9 ACs).

Spec v2.73.0 → v2.74.0. No coverage flip at amendment: §29 is a scoreboard,
not a discharge.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026

…irmed APR-side at inference.rs:160-164, spec v2.71.0 → v2.72.0 (#1084)

Live evidence on noah-Lambda-Vector RTX 4090, 2026-04-27. Built apr from the
PR #1083 branch (commits 77c016b + c657968 + f249464 from the PR A+B+C
cascade). Ran `apr trace --payload` on the canonical 7B teacher in BOTH
formats with identical prompt + tokenizer. Result:

| Layer | APR ffn_swigl std | GGUF ffn_swigl std |  Ratio |
|------:|------------------:|-------------------:|-------:|
|     3 |            1.2216 |             0.0670 | 18.23x |

§26.4 binding criterion threshold: ≥10x → APR-side bug.
**Observed 18.23x, which is 8.23x above the 10x threshold: a decisive verdict.**

The investigation chain that started in §15.4 (GPU GQA elimination) has
reached its conclusion at §27:

§15.4 → §16 → §17 → §23 → §27 (this)
"Whole forward path" → "GPU eliminated" → "(layer=3, FFN sub-block)" →
"(layer=3, ffn_swigl)" → "**APR-side at inference.rs:160-164**"

Cascade-damping signature confirmed:
- Layers 0-2: ratio ~1.1x (normal)
- Layer 3: 18.23x (anomaly)
- Layers 4-5: 3.3-4.5x (cascade)
- Layer 6+: ~1x (recovered)

This is consistent with a localized perturbation (off-by-one, buffer
aliasing, or an F32-vs-Q4K dequant defect at layer 3 specifically) rather
than persistent residual-stream corruption.

Per §17.5, the SHIP-007 fix discharges 5 MODEL-1 PARTIALs at once
(SHIP-002/005/006/007/008). §26.5 expected coverage flip: 33+12 → 28+17 when
the fix lands. §27 does NOT discharge by itself; it locates the bug for fixing.

The next investigation reads `inference.rs:160-164` and tests 4 hypotheses:
1. Off-by-one slice indexing
2. Buffer aliasing (scratch reuse pattern)
3. F32-vs-Q4K dequant defect at layer-3 input range
4. Activation overflow (SiLU saturation amplifies multiply)

Methodology held throughout: zero eprintln!, zero route-arounds, apr is
canonical (§26.8), all instrumentation via `apr trace --payload`. Lambda-labs
lane pre-authorized.

Evidence persisted to evidence/ship-007-apr-vs-gguf-2026-04-27/:
- apr-trace.txt (13.5 KB)
- gguf-trace.txt (13.7 KB)
- binding-criterion-summary.json

Note: §27 reproduction requires the PR #1081 + #1082 + #1083 cascade to merge
first (the apr trace --payload <gguf> wiring is in PR C). Evidence was
generated with a local build of the PR #1083 branch.

Spec v2.71.0 → v2.72.0. Coverage flip pending fix.

Spec: SPEC-SHIP-TWO-001 §26.4 P3 verdict

References:
- §15.4 (PR #1062): GPU GQA eliminated
- §16 (PR #1063): APR CPU isolated
- §17 (PR #1064): layer-3 FFN sub-block
- §23 (PR #1075): layer-3 ffn_swigl named
- §26.8 (PR #1079): apr-is-canonical methodology rule
- PR #1081 (P3 PR A scaffold)
- PR #1082 (P3 PR B sub-FFN populate)
- PR #1083 (P3 PR C CLI wiring)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
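The binding-criterion arithmetic in this commit can be checked with a few lines of Rust. This is a hedged sketch: the function and enum names are hypothetical, and the `Inconclusive` band for ratios between 2x and 10x is an assumption, since the commit only names the ≥10x and <2x outcomes.

```rust
/// Outcome of the §26.4-style comparison. `Inconclusive` is an assumption:
/// the source only defines the >=10x and <2x cases.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum BindingVerdict {
    AprSideBug,
    NormalTrainedBehavior,
    Inconclusive,
}

/// Ratio of APR std to GGUF std for one stat at one layer, plus its verdict.
fn binding_criterion(apr_std: f64, gguf_std: f64) -> (f64, BindingVerdict) {
    let ratio = apr_std / gguf_std;
    let verdict = if !ratio.is_finite() {
        // Division by zero or NaN input: do not pretend to have a verdict.
        BindingVerdict::Inconclusive
    } else if ratio >= 10.0 {
        BindingVerdict::AprSideBug
    } else if ratio < 2.0 {
        BindingVerdict::NormalTrainedBehavior
    } else {
        BindingVerdict::Inconclusive
    };
    (ratio, verdict)
}

fn main() {
    // Layer-3 ffn_swigl values quoted in the table above.
    let (ratio, verdict) = binding_criterion(1.2216, 0.0670);
    println!("ratio = {ratio:.2}x, verdict = {verdict:?}");
}
```

Keeping the non-finite case separate matters when one side's slot is still at its zero default, which would otherwise turn into an infinite ratio and a false "APR-side" verdict.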
noahgift added a commit that referenced this pull request Apr 27, 2026
…s ARE byte-identical APR=GGUF
Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level
comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's
.apr and .gguf files. Result:
ffn_gate.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
ffn_up.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
ffn_down.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
layer-0 ffn_gate Q4K (sanity) → ✓ APR ≡ GGUF byte-for-byte
So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36×
ratio per existing trace) does NOT come from differing weight bytes.
This eliminates the GGUF→APR converter as the bug surface for layer 3.
Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias
byte-identical), the elimination chain is now:
- QKV matmul kernel: ✓ correct (§30)
- QKV bias bytes: ✓ correct (§32)
- Layer-3 FFN weight bytes: ✓ correct (this commit)
The remaining hypothesis: cumulative layer-by-layer F32 precision drift
through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36
(10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating
~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different
between the two formats to push silu into different saturation regions,
producing the 18× ffn_swigl ratio.
Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced
in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default
(zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to
populate them. The existing apr-trace.txt and gguf-trace.txt evidence
files (2026-04-27) were generated when PR B was applied locally to the
binary — those numbers are valid but require PR B to land on main for
reproducibility.
Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt
Coverage scoreboard unchanged. Investigation continues; PR E v3 scope
narrows further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
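The byte-level comparison described in this commit amounts to finding the first offset at which two tensor payloads differ. A minimal sketch, assuming both payloads are already loaded as byte slices; `first_mismatch` is a hypothetical helper, not the code in `diag_compare_layer3_ffn.rs`.

```rust
/// Returns the offset of the first differing byte, or `None` when the two
/// payloads are byte-for-byte identical. A length difference counts as a
/// mismatch at the end of the shorter slice.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    if let Some(i) = a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        return Some(i);
    }
    if a.len() != b.len() {
        Some(a.len().min(b.len()))
    } else {
        None
    }
}

fn main() {
    // Toy payloads standing in for the Q4K weight bytes of one tensor.
    let apr = [0x12_u8, 0x34, 0x56, 0x78];
    let gguf = [0x12_u8, 0x34, 0x56, 0x78];
    match first_mismatch(&apr, &gguf) {
        None => println!("identical ({} bytes)", apr.len()),
        Some(off) => println!("first mismatch at byte {off}"),
    }
}
```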
noahgift added a commit that referenced this pull request Apr 27, 2026

…=10.24), v2.75 → v2.76 (#1090)

* docs(ship-two-001): §31: SHIP-007 root cause PINNED to qkv_bias (std=10.24), spec v2.75.0 → v2.76.0

Live three-stage bisection on the canonical 7B teacher pinpoints the
divergence point exactly. Per §30.4's falsifiable next-investigation step,
captured layer-0 qkv at four stages with prompt "What is 2+2?":

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES (Q4K tolerance) |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |

The 9× std blowup happens ENTIRELY at the qkv_bias addition step
(pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias matches
APR's existing trace. K-part bias is most extreme (post-bias std=29.49).

PR E v2 is now scoped to ONE specific investigation per §31.4:
- dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
- dump GGUF's same 3 tensors
- byte-compare:
  - if APR != GGUF, the GGUF→APR converter is broken
  - if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting

§31 falsification chain (now closed at the root):
§15.4 GPU eliminated
→ §16 APR CPU isolated
→ §17 (layer 3, FFN)
→ §23 (layer 3, ffn_swigl)
→ §27 ratio 18.23×
→ §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel comparison)
→ §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages
found it in one pass.

Drift-prevention test for next session (per §31.5): assert per-layer
|APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF) < 0.10.

Files:
- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0

Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32: §31 REFUTED, qkv_bias is byte-identical APR=GGUF (trace point mismatch), v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer
(std=10.24). Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained
qkv_bias values; both formats store/load them correctly.

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)?
**TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):
- qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
- BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
- = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):
- matmul writes `qkv` then `add_bias(qkv, bias)` in-place
- Trace captured AFTER bias add
- = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in the
TRACE STATISTICS, not in the actual computation.

Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate
diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies to
18× ffn_swigl ratio. That's STILL where the bug surface is, exactly as §28
originally said. §30's investigation (which refuted §28) only tested LAYER 0
QKV matmul. LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path.

PR E v3 scope: Run §31-style bisection AT LAYER 3 with the proper trace
capture points, comparing APR sub-FFN stats vs GGUF sub-FFN stats (both
captured at matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun," ALWAYS
verify with byte-level comparison against the reference. Stats can mislead
when measurement points differ. Toyota Way: verify physical state (byte
equality), not just symptoms (statistical gaps).

Files:
- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific
bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 follow-up: layer-3 ffn_gate/up/down Q4K bytes ARE byte-identical APR=GGUF

Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level
comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's
.apr and .gguf files. Result:

ffn_gate.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
ffn_up.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
ffn_down.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
layer-0 ffn_gate Q4K (sanity) → ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36×
ratio per existing trace) does NOT come from differing weight bytes.
This eliminates the GGUF→APR converter as the bug surface for layer 3.

Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias
byte-identical), the elimination chain is now:
- QKV matmul kernel: ✓ correct (§30)
- QKV bias bytes: ✓ correct (§32)
- Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift
through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36
(10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating
~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different
between the two formats to push silu into different saturation regions,
producing the 18× ffn_swigl ratio.

Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced
in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default
(zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to
populate them. The existing apr-trace.txt and gguf-trace.txt evidence
files (2026-04-27) were generated when PR B was applied locally to the
binary; those numbers are valid but require PR B to land on main for
reproducibility.

Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope
narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
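The trace-capture-point mismatch from §32 is easy to reproduce on toy data: the same buffer measured before and after an in-place bias add gives very different standard deviations even though the computation is identical. The sketch below uses made-up values and hypothetical helpers (`std_dev`, `add_bias`, `relative_std_gap`); the last one follows the §31.5 drift-check formula quoted above, with `eps` chosen arbitrarily.

```rust
/// Population standard deviation of a buffer.
fn std_dev(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    (xs.iter().map(|&x| (x - mean).powi(2)).sum::<f32>() / n).sqrt()
}

/// In-place bias add, standing in for the step both forward passes perform.
fn add_bias(qkv: &mut [f32], bias: &[f32]) {
    for (q, b) in qkv.iter_mut().zip(bias) {
        *q += *b;
    }
}

/// |apr - gguf| / max(eps, gguf), as in the §31.5 drift-prevention check.
fn relative_std_gap(apr_std: f32, gguf_std: f32) -> f32 {
    const EPS: f32 = 1e-6; // assumption: the source does not give eps
    (apr_std - gguf_std).abs() / gguf_std.max(EPS)
}

fn main() {
    // Toy values, not taken from the model.
    let mut qkv = [0.5_f32, -0.5, 1.0, -1.0];
    let bias = [10.0_f32, -10.0, 20.0, -20.0];

    let pre_bias_std = std_dev(&qkv); // where the GGUF trace sampled
    add_bias(&mut qkv, &bias);
    let post_bias_std = std_dev(&qkv); // where the APR trace sampled

    println!("pre-bias std  = {pre_bias_std:.4}");
    println!("post-bias std = {post_bias_std:.4}");
    println!("relative gap  = {:.2}", relative_std_gap(post_bias_std, pre_bias_std));
}
```

A relative-gap check like this only means something when both sides sample at the same point, which is the methodology lesson the commit draws.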
noahgift added a commit that referenced this pull request Apr 28, 2026

…fills 4 sub-FFN ActivationStats slots for layer-3 ffn_swigl bisection (#1082)

P3 PR B completes the §26.4 SHIP-007 root-cause-pin chain by populating the
4 sub-FFN slots that PR A defaulted to zero.

New helper `scratch_swiglu_ffn_traced` mirrors `scratch_swiglu_ffn` exactly
(numerical path is byte-identical) but adds 4 capture points per
`project_ship_007_gguf_forward_traced_plan.md`:

| Stat | Capture point |
|------|---------------|
| ffn_gate_stats | scratch.ffn_gate AFTER bias, BEFORE silu |
| ffn_up_stats | scratch.ffn_up AFTER bias |
| ffn_silu_gate_stats | scratch.ffn_gate AFTER silu, BEFORE multiply |
| ffn_swiglu_inner_stats | scratch.ffn_gate AFTER multiply (silu(g) * u) |

The 4th capture point is the §23 17×-anomaly site that we need to compare
APR vs GGUF on. APR side at layer 3 = 1.222 std (17.2× layer 2 baseline).
GGUF side stat now observable via this PR.

PR A's `forward_traced` is updated to call the new helper for the SwiGLU path
and pass references to the 4 sub-FFN slots of the in-progress LayerActivation.
The GELU path is unchanged: it has no SwiGLU components, so the sub-FFN slots
stay at default-zero per APR semantics.

Validated:
- `cargo check -p aprender-serve --lib` exits 0
- `cargo clippy -p aprender-serve --lib -- -D warnings` exits 0

Stacked on PR #1081 (P3 PR A scaffold). Once PR A lands, this PR rebases
cleanly onto main.

After PR B merges, the §26.4 binding criterion can be falsified: run
`apr trace --payload <gguf-teacher>` and `apr trace --payload <apr-teacher>`,
then compare layer-3 ffn_swigl std:
- ratio ≥10× → SHIP-007 bug is APR-side in apr_transformer/inference.rs:160-164
- ratio <2× → 17× spike is normal Qwen2.5 trained behavior; bug elsewhere

Either outcome discharges all 5 transitively-blocked MODEL-1 PARTIALs at once
per §17.5: SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008.

Spec: SPEC-SHIP-TWO-001 §26.4 P3

References:
- PR #1081 (P3 PR A scaffold)
- §17 (layer-3 ffn_out 53× anomaly)
- §23 (layer-3 ffn_swigl is the first 17× anomaly site, APR side)
- project_ship_007_gguf_forward_traced_plan.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
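The four capture points in the table can be illustrated with a toy SwiGLU inner step that reuses the gate buffer in place, as the capture-point descriptions suggest. This is a sketch under assumptions: `swiglu_inner_traced` and `SubFfnStds` are hypothetical names, only the standard deviation is captured, and the real `scratch_swiglu_ffn_traced` (matmuls, biases, down projection) is not reproduced here.

```rust
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Population standard deviation of a buffer.
fn std_dev(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    (xs.iter().map(|&x| (x - mean).powi(2)).sum::<f32>() / n).sqrt()
}

/// Hypothetical container for the four sub-FFN measurements.
#[derive(Debug, Clone, Copy)]
struct SubFfnStds {
    gate: f32,         // ffn_gate_stats: gate before silu
    up: f32,           // ffn_up_stats
    silu_gate: f32,    // ffn_silu_gate_stats: after silu, before multiply
    swiglu_inner: f32, // ffn_swiglu_inner_stats: after silu(g) * u
}

/// Computes silu(gate) * up in place in `gate`, sampling at each stage.
/// The numerical path is the same as an untraced version; only reads are added.
fn swiglu_inner_traced(gate: &mut [f32], up: &[f32]) -> SubFfnStds {
    assert_eq!(gate.len(), up.len());
    let gate_std = std_dev(gate);
    let up_std = std_dev(up);
    for g in gate.iter_mut() {
        *g = silu(*g);
    }
    let silu_gate_std = std_dev(gate);
    for (g, u) in gate.iter_mut().zip(up) {
        *g *= *u;
    }
    let swiglu_inner_std = std_dev(gate);
    SubFfnStds {
        gate: gate_std,
        up: up_std,
        silu_gate: silu_gate_std,
        swiglu_inner: swiglu_inner_std,
    }
}

fn main() {
    let mut gate = [0.0_f32, 1.0];
    let up = [2.0_f32, 3.0];
    let stds = swiglu_inner_traced(&mut gate, &up);
    println!("{stds:?}");
    println!("swiglu inner = {gate:?}");
}
```

Because the gate buffer is overwritten twice, the order of the three reads of `gate` is what distinguishes the first, third, and fourth capture points.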
noahgift added a commit that referenced this pull request Apr 28, 2026

… emits per-layer LayerActivation telemetry (#1083)

P3 PR C completes the SHIP-007 §26.4 P3 chain by wiring the new
forward_traced method (PR A scaffold + PR B sub-FFN populate) into the apr-cli
trace dispatch. Without this, `apr trace --payload <model.gguf>` only does
generation + garbage-detection; it does NOT emit the per-layer telemetry needed
for the §23 layer-3 ffn_swigl APR-vs-GGUF bisection.

Changes:

1. crates/apr-cli/src/commands/trace.rs::run_traced_inference_gguf
   Now calls model.forward_traced(&test_tokens) BEFORE generation, and prints
   embed/per-layer/final-norm/logit/summary stats via the existing vector_stats
   helpers. Falls back gracefully on Err (e.g., encoder-decoder models from
   PR A's guard).

2. crates/apr-cli/src/commands/vector_stats.rs
   4 helpers flipped from private to pub(crate) so the trace.rs GGUF dispatch
   can reuse them (they were already used by the APR dispatch in
   run_traced_inference_apr):
   - print_layer_activations
   - print_logit_predictions
   - print_trace_summary
   - print_activation_stats / print_activation_stats_colored

Output format matches the APR side exactly, so `apr trace --payload <file>.apr`
and `apr trace --payload <file>.gguf` produce side-by-side comparable per-layer
stat blocks. The §23 layer-3 ffn_swigl line emits as `ffn_swigl: ...` between
ffn_silu and ffn_out (already handled by the print_layer_activations:137-142
suppression-when-zero pattern from PR #1066).

After this PR + PR A + PR B all merge, the §26.4 binding criterion becomes
runnable on noah-Lambda-Vector RTX 4090:

```
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr | grep -A1 "Layer 3"
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.gguf | grep -A1 "Layer 3"
```

Outcome:
- ratio ≥10× → SHIP-007 bug is APR-side at apr_transformer/inference.rs:160-164
- ratio <2× → 17× spike is normal Qwen2.5 trained behavior

Either discharges 5 MODEL-1 PARTIALs at once per §17.5
(SHIP-002/005/006/007/008).

Stacked on PR #1082 (PR B), which is stacked on PR #1081 (PR A). Will retarget
to main once both merge.

Validated:
- `cargo check -p apr-cli --features inference` exits 0
- `cargo clippy -p apr-cli --features inference -- -D warnings` exits 0

Spec: SPEC-SHIP-TWO-001 §26.4 P3 final wiring step

References:
- PR #1081 (P3 PR A: GGUF forward_traced scaffold)
- PR #1082 (P3 PR B: sub-FFN populate)
- §23 (layer-3 ffn_swigl is the first 17× anomaly site, APR side)
- project_ship_007_gguf_forward_traced_plan.md (CLI wiring step)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 11, 2026

…/6 sweep

Algorithm-level PARTIAL discharge for FALSIFY-APR-GGUF-PARITY-002 through 006
per `contracts/apr-vs-gguf-forward-parity-v1.yaml`. Combined with PARITY-001
(already bound), this closes 6/6.

## ✅ Closes 6/6 apr-vs-gguf-forward-parity-v1 sweep

**Twelve contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7)
- `apr-cli-command-safety-v1` (4/4)
- `apr-cli-publish-extra-v1` (10/10)
- `apr-cli-dep-migration-v1` (2/2)
- `apr-cli-distill-train-v1` (9/9)
- `apr-cli-pull-dataset-v1` (8/8)
- `apr-vs-gguf-forward-parity-v1` (6/6) ← this PR

## Why this matters for SHIP-007 / MODEL-1 ship

The SHIP-007 dispatch-layer bug is the actual blocker for MODEL-1 GPU ship
(per `feedback_model_1_ships_gpu_only`). This contract pins the parity gates
the eventual fix must satisfy:
- PARITY-002: layer-3 ffn_swigl ratio in `[0.5, 2.0]` (the 18.23× ratio
  observed in `2026-04-26 SHIP-007 narrowing` session would Fail).
- PARITY-003: layer-3 ffn_gate ratio in `[0.7, 1.4]` (tighter band: gate
  matmul is the pinned root cause per §28).
- PARITY-004 + 005: contract validity + non-Q4K regression.
- PARITY-006: 28-layer ffn_swigl trace coverage (regression guard for PR
  cascade #1081/#1082/#1083).

When the SHIP-007 fix lands, all 6 verdicts must Pass. This verdict pin gives
the fix a concrete acceptance criterion at algorithm level.

## Verdict shapes

- 002, 003: bounded-ratio with finite-check (catches NaN/±∞).
- 004, 005: shared exit-code-zero verdict.
- 006: count-threshold (≥ 28).

## Five-Whys

1. Why bind these now? Closes 6/6 sweep; pins SHIP-007 acceptance criterion
   at algorithm level.
2. Why distinct ratio bands for 002 + 003? 003 (gate matmul) is the pinned
   root cause; tighter band means more sensitive regression detection at the
   bisected location.
3. Why share verdict for 004+005? Identical exit-code-zero reduction.
4. Why pin 28-line min for 006? Canonical 28-layer Qwen2.5-Coder-7B teacher;
   PR cascade regression guard.
5. Why 24 tests across 4 verdict sections? Pass band + boundary + below/above
   + NaN/Inf + provenance per ratio verdict; minimal exit-code coverage;
   min-line boundary.

## Cross-reference

PARITY-002's `p002_fail_18_23x_ship_007_baseline` test explicitly captures the
observed regression value from `2026-04-26 session SHIP-007 narrowing` memory,
which provides a named regression-class sentinel for any future SHIP-007 work.

## Tests

24 unit tests, all green.
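The verdict shapes listed in this commit (bounded ratio with a finite check, exit-code-zero, count threshold) can be sketched as small pure functions. The names below are hypothetical and the inclusive treatment of the band endpoints is an assumption based on the `[lo, hi]` notation; the contract's actual implementation is not shown on this page.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Verdict {
    Pass,
    Fail,
}

/// Bounded-ratio verdict with finite check: NaN and ±∞ always fail.
/// Endpoints are treated as inclusive (assumption).
fn bounded_ratio(ratio: f64, lo: f64, hi: f64) -> Verdict {
    if ratio.is_finite() && ratio >= lo && ratio <= hi {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}

/// Exit-code-zero verdict, shared by checks that reduce to "the command succeeded".
fn exit_code_zero(code: i32) -> Verdict {
    if code == 0 { Verdict::Pass } else { Verdict::Fail }
}

/// Count-threshold verdict, e.g. at least 28 per-layer trace lines.
fn min_count(observed: usize, min: usize) -> Verdict {
    if observed >= min { Verdict::Pass } else { Verdict::Fail }
}

fn main() {
    // The 18.23x ratio quoted above falls outside the [0.5, 2.0] band.
    println!("{:?}", bounded_ratio(18.23, 0.5, 2.0));
    println!("{:?}", bounded_ratio(1.0, 0.5, 2.0));
    println!("{:?}", exit_code_zero(0));
    println!("{:?}", min_count(28, 28));
}
```

The explicit `is_finite` guard is the important design choice: every comparison with NaN is false, so a plain out-of-range test (`ratio < lo || ratio > hi`) would let NaN slip through as a pass.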
## Summary

P3 PR A of the SHIP-TWO-001 §26.4 root-cause-pin chain. Implements
`OwnedQuantizedModel::forward_traced(&[u32]) -> Result<ForwardTrace>`, mirroring `AprTransformer::forward_traced`.

Per `project_ship_007_gguf_forward_traced_plan.md` Option A: full clone of `forward_single_with_scratch` (`results.rs:447-587`) with inline `ActivationStats::from_slice()` at each layer boundary. Hot-path safety preserved: production `forward_single_with_scratch` is unchanged.

## What this PR does

Phase 1: prefill tokens 0..N-1 via existing path (fills KV cache, matches production semantics).
Phase 2: process the LAST token through an inlined orchestrator that captures stats per layer.

6 non-FFN LayerActivation fields populated:

| Field | Capture point |
|-------|---------------|
| `attn_norm_stats` | after `rms_norm_into` / `layer_norm_into` |
| `qkv_stats` | after `scratch_attention_block`, `scratch.qkv[..qkv_dim]` |
| `attn_out_stats` | after `scratch_attention_block`, `scratch.attn_proj[..hidden_dim]` |
| `ffn_norm_stats` | post FFN norm |
| `ffn_out_stats` | post ffn_down (`scratch.ffn_down`) |
| `output_stats` | post FFN residual (`scratch.hidden`) |

4 sub-FFN fields default-zero (PR B will populate):
`ffn_gate_stats`, `ffn_up_stats`, `ffn_silu_gate_stats`, `ffn_swiglu_inner_stats`

## Visibility changes

3 helpers in `results.rs` flipped from private to `pub(crate)` so the new sibling module can call them:
- `scratch_attention_block`
- `scratch_swiglu_ffn`
- `scratch_gelu_ffn`

## What this PR does NOT do (deferred to PR B)

PR B clones `scratch_swiglu_ffn` into `scratch_swiglu_ffn_traced` with stat capture at the 4 sub-FFN points (per `project_ship_007_gguf_forward_traced_plan.md` capture map).

## Validated

- `TracedForward` trait impl follows AprTransformer pattern (delegate `&mut self` → immutable inherent method).

## Spec references

- `SPEC-SHIP-TWO-001 §26.4`: P3 plan
- `§17`: layer-3 ffn_out 53× anomaly identified
- `§23`: layer-3 ffn_swigl is the first 17× anomaly site (APR side)
- `project_ship_007_gguf_forward_traced_plan.md` (Plan agent design 2026-04-26)

## Test plan
🤖 Generated with Claude Code