Skip to content

docs(ship-two-001): §28 — SHIP-007 root cause REFINED to F32 vs Q4K matmul precision mismatch — spec v2.72.0 → v2.73.0#1085

Closed
noahgift wants to merge 3 commits into
mainfrom
feat/spec-28-ship-007-root-cause-refined
Closed

docs(ship-two-001): §28 — SHIP-007 root cause REFINED to F32 vs Q4K matmul precision mismatch — spec v2.72.0 → v2.73.0#1085
noahgift wants to merge 3 commits into
mainfrom
feat/spec-28-ship-007-root-cause-refined

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

§27 located the bug at "(layer=3, silu_g * u)" with 18.23× ratio. §28 refines: the multiply is not the bug. The divergence STARTS upstream at the gate-projection matmul.

Stacked on PR #1084 (§27).

Re-reading evidence at layer 3

Stat APR GGUF Ratio
ffn_gate std 1.92 1.41 1.36× ← divergence starts
ffn_silu std 0.168 0.037 4.59× ← amplified by silu
ffn_swigl std 1.22 0.067 18.23× ← amplified by multiply

The 1.36× gate-matmul precision difference gets non-linearly amplified by silu 4.59× because gate values are deep in saturated regime (mean=−6) where silu(x) is exquisitely sensitive.

Root cause

Path Implementation Precision
APR helpers::f32_matmul (mod_apr_transformer.rs:138-140) F32 after Q4K dequant
GGUF fused_q4k_q8k_parallel_matvec_into Q4K-aware fused (no dequant)

Q4K dequant is not a perfect inverse — APR and GGUF produce per-element-different outputs within Q4K tolerance, but enough for silu at layer 3 to amplify 4.59×.

Five-whys reaches root

1. Why wrong logits? → APR forward path (§16)
2. Why APR forward path? → layer 3 FFN (§17)
3. Why layer 3 FFN? → ffn_swigl multiply (§23, symptomatic)
4. Why ffn_swigl multiply? → silu_g * u (§27, still symptomatic)
5. Why silu_g * u? → gate-matmul precision mismatch (§28, ROOT)

Fix surface refined

§-ref Surface Status
§27 inference.rs:160-164 silu_g*u symptomatic
§28 mod_apr_transformer.rs:138-140 helpers::f32_matmul causal

The §27.8 hypothesis #3 ("F32-vs-Q4K dequant defect at layer-3 input range") was correct.

Cascade-damping now fully explained

Layer 6+ gate values are NOT in the saturated silu regime — same matmul precision difference does NOT amplify dramatically. Hence ratios converge to ~1× past layer 5.

Falsifiable next PR sequence

  • PR D (drift-prevention test, ~50 LOC): assert APR-vs-GGUF per-layer ffn_swigl std ratio ∈ [0.5, 2.0] for all 28 layers. FAILS today.
  • PR E (the fix, ~150-300 LOC): replace helpers::f32_matmul with Q4K-aware fused kernel dispatch.

PR D + PR E together discharge 5 MODEL-1 PARTIALs (SHIP-002/005/006/007/008 per §17.5).

Coverage flip projection

State PARTIAL DISCHARGED
Now (§28) 33 12
PR D merged (test added, fails) 33 12
PR E merged (test passes, fix lands) 28 17

Test plan

  • CI workspace-test passes
  • CI gate passes
  • Spec banner v2.73.0 reflects §28
  • §28 references match §27 evidence files

🤖 Generated with Claude Code

noahgift and others added 2 commits April 27, 2026 10:10
…irmed APR-side at inference.rs:160-164 — spec v2.71.0 → v2.72.0

Live evidence on noah-Lambda-Vector RTX 4090 2026-04-27.
Built apr from PR #1083 branch (commits 77c016b + c657968
+ f249464 from PR A+B+C cascade). Ran `apr trace --payload`
on canonical 7B teacher in BOTH formats with identical prompt
+ tokenizer.

Result:
| Layer | APR ffn_swigl std | GGUF ffn_swigl std | Ratio |
|------:|------------------:|-------------------:|------:|
| 3     | 1.2216            | 0.0670             | 18.23x |

§26.4 binding criterion threshold: ≥10x → APR-side bug.
**Observed 18.23x — 8x past the threshold, decisive verdict.**

The investigation chain that started in §15.4 (GPU GQA
elimination) has reached its conclusion at §27:

§15.4 → §16 → §17 → §23 → §27 (this)
"Whole forward path" → "GPU eliminated" → "(layer=3, FFN sub-block)"
→ "(layer=3, ffn_swigl)" → "**APR-side at inference.rs:160-164**"

Cascade-damping signature confirmed:
- Layers 0-2: ratio ~1.1x (normal)
- Layer 3: 18.23x (anomaly)
- Layers 4-5: 3.3-4.5x (cascade)
- Layer 6+: ~1x (recovered)

This is consistent with a localized perturbation (off-by-one,
buffer aliasing, or F32-vs-Q4K dequant defect at layer-3-
specifically) rather than persistent residual-stream corruption.

Per §17.5, SHIP-007 fix discharges 5 MODEL-1 PARTIALs at once
(SHIP-002/005/006/007/008). §26.5 expected coverage flip: 33+12
→ 28+17 when fix lands.

§27 does NOT discharge by itself — it locates the bug for fixing.
Next investigation reads `inference.rs:160-164` and tests 4 hypotheses:
1. Off-by-one slice indexing
2. Buffer aliasing (scratch reuse pattern)
3. F32-vs-Q4K dequant defect at layer-3 input range
4. Activation overflow (SiLU saturation amplifies multiply)

Methodology held throughout: zero eprintln!, zero route-arounds,
apr is canonical (§26.8), all instrumentation via `apr trace
--payload`. Lambda-labs lane pre-authorized.

Evidence persisted to evidence/ship-007-apr-vs-gguf-2026-04-27/:
- apr-trace.txt (13.5 KB)
- gguf-trace.txt (13.7 KB)
- binding-criterion-summary.json

Note: §27 reproduction requires PR #1081 + #1082 + #1083
cascade to merge first (the apr trace --payload <gguf> wiring
is in PR C). Evidence was generated with a local build of PR
#1083 branch.

Spec v2.71.0 → v2.72.0. Coverage flip pending fix.

Spec: SPEC-SHIP-TWO-001 §26.4 P3 verdict
References:
- §15.4 (PR #1062) — GPU GQA eliminated
- §16 (PR #1063) — APR CPU isolated
- §17 (PR #1064) — layer-3 FFN sub-block
- §23 (PR #1075) — layer-3 ffn_swigl named
- §26.8 (PR #1079) — apr-is-canonical methodology rule
- PR #1081 (P3 PR A scaffold)
- PR #1082 (P3 PR B sub-FFN populate)
- PR #1083 (P3 PR C CLI wiring)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ul vs GGUF Q4K-fused matmul precision mismatch — spec v2.72.0 → v2.73.0

§27 located the bug at "(layer=3, silu_g * u multiply)" with
18.23x ratio. §28 refines: the multiply is not the bug. The
divergence STARTS upstream at the gate-projection matmul.

Re-reading evidence/ship-007-apr-vs-gguf-2026-04-27/ at layer 3:
| Stat | APR | GGUF | Ratio |
|------|----:|-----:|------:|
| ffn_gate std | 1.92 | 1.41 | **1.36x ← starts here** |
| ffn_silu std | 0.168 | 0.037 | 4.59x ← amplified by silu |
| ffn_swigl std | 1.22 | 0.067 | 18.23x ← amplified by multiply |

The 1.36x gate-matmul std difference gets non-linearly amplified
by silu (4.59x) at layer 3 because gate values are deep in the
saturated regime (mean=-6). Silu(-x) for x large is exquisitely
sensitive to small gate perturbations near the saturation
boundary.

Root cause:
- APR uses helpers::f32_matmul (mod_apr_transformer.rs:138-140) —
  pure F32 arithmetic on dequantized weights
- GGUF uses fused_q4k_q8k_parallel_matvec_into — Q4K-aware fused
  matmul directly on Q4K weights
- Q4K dequantization is not a perfect inverse: APR and GGUF
  produce per-element-different outputs within Q4K tolerance
- Silu at layer 3 amplifies the difference 4.59x

Five-whys chain reaches root:
1. Why wrong logits? - APR forward path
2. Why APR forward path? - layer 3 FFN
3. Why layer 3 FFN? - ffn_swigl multiply (symptom)
4. Why ffn_swigl multiply? - silu_g * u (still symptom)
5. Why silu_g * u? - **gate-matmul precision mismatch (root)**

Fix surface refined:
| §-ref | Surface | Status |
|-------|---------|--------|
| §27 | inference.rs:160-164 silu_g*u | symptomatic |
| §28 | mod_apr_transformer.rs:138-140 helpers::f32_matmul | causal |

Why GGUF doesn't have this issue: GGUF NEVER dequantizes Q4K to
F32 for matmul. It uses fused kernels throughout. APR's F32 path
is what introduces the precision gap.

The §27.8 hypothesis #3 ("F32-vs-Q4K dequant defect at layer-3
input range") was the correct one. §28 confirms it.

Cascade-damping signature now explained: layer 6+ gate values
are NOT in the saturated silu regime, so the same matmul
precision difference does NOT amplify dramatically. Layers 0-2
recovered after upstream attn anomalies. Only layers 3-5 cascade.

Falsifiable next PR sequence:
- PR D (drift-prevention test, ~50 LOC): assert APR vs GGUF
  per-layer ffn_swigl std ratio in [0.5, 2.0] for all 28 layers.
  FAILS today; PASSES once PR E lands.
- PR E (the fix, ~150-300 LOC): replace helpers::f32_matmul with
  Q4K-aware fused kernel dispatch in AprTransformer's matmul.

PR D + PR E together discharge 5 MODEL-1 PARTIALs
(SHIP-002/005/006/007/008 per §17.5).

Spec v2.72.0 → v2.73.0. Coverage flip pending PR E.

Spec: SPEC-SHIP-TWO-001 §28 root cause refinement of §27
References:
- §27 (PR #1084) — P3 binding criterion verdict
- §27.8 hypothesis #3 — F32-vs-Q4K dequant (now confirmed)
- evidence/ship-007-apr-vs-gguf-2026-04-27/ — full sub-FFN bisection
- crates/aprender-serve/src/apr_transformer/mod_apr_transformer.rs:138-140

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oard + critical-path map — spec v2.73.0 → v2.74.0 (#1087)

Session-end snapshot consolidating today's 10-PR cascade into a
single source-of-truth for next session.

The goal: ship two models to HF, both built end-to-end on the
in-tree Sovereign AI Stack.

Coverage scoreboard EOD 2026-04-27:
| Category    | DISCHARGED | PARTIAL | Total | %D  |
|-------------|-----------:|--------:|------:|----:|
| MODEL-1     |          5 |       5 |    10 | 50% |
| MODEL-2     |          3 |       9 |    12 | 25% |
| GPUTRAIN    |          7 |       0 |     7 |100% |
| Ship Gates  |          - |      12 |    12 |  0% |
| Falsifiers  |          - |       7 |     7 |  0% |
| Sum         |         15 |      33 |    48 | 31% |

Critical path — MODEL-1: PR E (replace helpers::f32_matmul with
Q4K-fused dispatch) discharges 5 PARTIALs at one fix site.
~150-300 LOC.

Critical path — MODEL-2: P1.1 (apr pull dataset extension) →
P1.4 (corpus pull) → P2 (100K-step training) discharges 9
PARTIALs.

10-PR session cascade (6 merged, 4 open + this):
- #1076-#1080: spec + contract foundation (MERGED)
- #1081: P3 PR A scaffold (MERGED)
- #1082-#1083: P3 PR B+C wiring (OPEN, stacked)
- #1084-#1085: §27/§28 binding criterion + root cause (OPEN)
- #1086: PR D forward-parity contract (OPEN)

Falsification chain (complete, root-reached):
§15.4 → §16 → §17 → §23 → §27 → §28 → PR D contract → PR E (next)
"forward path" → ... → "APR F32 vs GGUF Q4K matmul precision"
                            → "binding criterion as durable spec"
                            → "fix at mod_apr_transformer.rs:138-140"

Methodology preserved: zero eprintln!, zero route-arounds, apr
canonical, contract-first, lambda-labs pre-authorized, 5-whys
reaches root.

Next session: PR E first (5 ACs), then P1.1 + P1.4 + P2
(9 ACs).

Spec v2.73.0 → v2.74.0. No coverage flip at amendment — §29 is
a scoreboard, not a discharge.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Base automatically changed from feat/spec-27-ship-007-binding-criterion-decided to main April 27, 2026 10:45
@noahgift

Copy link
Copy Markdown
Contributor Author

Superseded by PR #1088 §30 (merged 2026-04-27). The §28 hypothesis (F32-vs-Q4K matmul precision) was empirically refuted by live diagnostics — see §30.1/§30.2 in the spec. The §28 mechanical fix would have changed <0.5% of std, not closed the 9× layer-0 qkv gap. Next-step bisection plan in §30.4.

@noahgift noahgift closed this Apr 27, 2026
noahgift added a commit that referenced this pull request Apr 27, 2026
…ct codifying the §28 binding criterion

PR D of the SHIP-TWO-001 §28.8 falsifiable PR sequence. Authors a
provable contract that defines the per-layer ffn_swigl parity
binding criterion as durable spec. Status PROPOSED until PR E
(the actual fix replacing helpers::f32_matmul with Q4K-fused
matmul dispatch) lands.

3 equations:
- per_layer_ffn_swigl_parity: r_i = APR.std / GGUF.std ∈ [0.5, 2.0]
  for all i ∈ [0, 28). Currently FAILS at layer 3 (r_3 = 18.23×).
- divergence_starts_at_gate_matmul: §28 evidence — divergence
  originates at gate-projection matmul (1.36×), amplified by
  silu (4.59×) into the 18.23× ffn_swigl ratio.
- fix_must_match_gguf_kernel_path: §28.4 — fix replaces
  f32_matmul with fused_q4k_q8k_parallel_matvec_into when
  weight.qtype == GGUF_TYPE_Q4_K.

6 falsification tests:
- FALSIFY-APR-GGUF-PARITY-001: per-layer ffn_swigl ratio bounds
- -002: layer 3 specifically
- -003: gate matmul precision is the root cause (Toyota Way
  enforcement — prevents route-around fix at silu_g*u)
- -004: pv validate
- -005: F32-native paths unchanged
- -006: apr trace --payload still emits ffn_swigl on GGUF

4 proof obligations + 2 Kani harnesses with bounds.

Validation:
  $ pv validate contracts/apr-vs-gguf-forward-parity-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

  $ pv score contracts/apr-vs-gguf-forward-parity-v1.yaml
  apr-vs-gguf-forward-parity-v1 — 0.71 (Grade C)
  Spec: 0.70 | Falsify: 1.00 | Kani: 0.25 | Lean: 0.50 | Bind: 1.00

Status: PROPOSED. Promotion to ACTIVE requires:
- PR E lands (replaces f32_matmul with Q4K-fused dispatch)
- Live drift-prevention test PASSES on canonical 7B teacher
- All 6 FALSIFY-APR-GGUF-PARITY-* gates pass

On PR E success:
- Coverage flip 33+12 → 28+17 (§26.5 / §28.9)
- Discharges SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008
  (5 MODEL-1 PARTIALs transitively gated on §17.5)

This PR (D) ships the binding criterion as durable spec. PR E
ships the fix. §29 records the discharge.

Spec: SPEC-SHIP-TWO-001 §28.8
References:
- §27 (PR #1084) — P3 binding criterion verdict (18.23× ratio)
- §28 (PR #1085) — root cause refined to F32 vs Q4K matmul
- evidence/ship-007-apr-vs-gguf-2026-04-27/ — full sub-FFN bisection
- feedback_fix_root_cause_never_route_around.md
- contracts/qwen2-e2e-verification-v1.yaml (sibling MODEL-1 contract)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…ct codifying the §28 binding criterion

PR D of the SHIP-TWO-001 §28.8 falsifiable PR sequence. Authors a
provable contract that defines the per-layer ffn_swigl parity
binding criterion as durable spec. Status PROPOSED until PR E
(the actual fix replacing helpers::f32_matmul with Q4K-fused
matmul dispatch) lands.

3 equations:
- per_layer_ffn_swigl_parity: r_i = APR.std / GGUF.std ∈ [0.5, 2.0]
  for all i ∈ [0, 28). Currently FAILS at layer 3 (r_3 = 18.23×).
- divergence_starts_at_gate_matmul: §28 evidence — divergence
  originates at gate-projection matmul (1.36×), amplified by
  silu (4.59×) into the 18.23× ffn_swigl ratio.
- fix_must_match_gguf_kernel_path: §28.4 — fix replaces
  f32_matmul with fused_q4k_q8k_parallel_matvec_into when
  weight.qtype == GGUF_TYPE_Q4_K.

6 falsification tests:
- FALSIFY-APR-GGUF-PARITY-001: per-layer ffn_swigl ratio bounds
- -002: layer 3 specifically
- -003: gate matmul precision is the root cause (Toyota Way
  enforcement — prevents route-around fix at silu_g*u)
- -004: pv validate
- -005: F32-native paths unchanged
- -006: apr trace --payload still emits ffn_swigl on GGUF

4 proof obligations + 2 Kani harnesses with bounds.

Validation:
  $ pv validate contracts/apr-vs-gguf-forward-parity-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

  $ pv score contracts/apr-vs-gguf-forward-parity-v1.yaml
  apr-vs-gguf-forward-parity-v1 — 0.71 (Grade C)
  Spec: 0.70 | Falsify: 1.00 | Kani: 0.25 | Lean: 0.50 | Bind: 1.00

Status: PROPOSED. Promotion to ACTIVE requires:
- PR E lands (replaces f32_matmul with Q4K-fused dispatch)
- Live drift-prevention test PASSES on canonical 7B teacher
- All 6 FALSIFY-APR-GGUF-PARITY-* gates pass

On PR E success:
- Coverage flip 33+12 → 28+17 (§26.5 / §28.9)
- Discharges SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008
  (5 MODEL-1 PARTIALs transitively gated on §17.5)

This PR (D) ships the binding criterion as durable spec. PR E
ships the fix. §29 records the discharge.

Spec: SPEC-SHIP-TWO-001 §28.8
References:
- §27 (PR #1084) — P3 binding criterion verdict (18.23× ratio)
- §28 (PR #1085) — root cause refined to F32 vs Q4K matmul
- evidence/ship-007-apr-vs-gguf-2026-04-27/ — full sub-FFN bisection
- feedback_fix_root_cause_never_route_around.md
- contracts/qwen2-e2e-verification-v1.yaml (sibling MODEL-1 contract)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…ct codifying the §28 binding criterion (#1086)

PR D of the SHIP-TWO-001 §28.8 falsifiable PR sequence. Authors a
provable contract that defines the per-layer ffn_swigl parity
binding criterion as durable spec. Status PROPOSED until PR E
(the actual fix replacing helpers::f32_matmul with Q4K-fused
matmul dispatch) lands.

3 equations:
- per_layer_ffn_swigl_parity: r_i = APR.std / GGUF.std ∈ [0.5, 2.0]
  for all i ∈ [0, 28). Currently FAILS at layer 3 (r_3 = 18.23×).
- divergence_starts_at_gate_matmul: §28 evidence — divergence
  originates at gate-projection matmul (1.36×), amplified by
  silu (4.59×) into the 18.23× ffn_swigl ratio.
- fix_must_match_gguf_kernel_path: §28.4 — fix replaces
  f32_matmul with fused_q4k_q8k_parallel_matvec_into when
  weight.qtype == GGUF_TYPE_Q4_K.

6 falsification tests:
- FALSIFY-APR-GGUF-PARITY-001: per-layer ffn_swigl ratio bounds
- -002: layer 3 specifically
- -003: gate matmul precision is the root cause (Toyota Way
  enforcement — prevents route-around fix at silu_g*u)
- -004: pv validate
- -005: F32-native paths unchanged
- -006: apr trace --payload still emits ffn_swigl on GGUF

4 proof obligations + 2 Kani harnesses with bounds.

Validation:
  $ pv validate contracts/apr-vs-gguf-forward-parity-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

  $ pv score contracts/apr-vs-gguf-forward-parity-v1.yaml
  apr-vs-gguf-forward-parity-v1 — 0.71 (Grade C)
  Spec: 0.70 | Falsify: 1.00 | Kani: 0.25 | Lean: 0.50 | Bind: 1.00

Status: PROPOSED. Promotion to ACTIVE requires:
- PR E lands (replaces f32_matmul with Q4K-fused dispatch)
- Live drift-prevention test PASSES on canonical 7B teacher
- All 6 FALSIFY-APR-GGUF-PARITY-* gates pass

On PR E success:
- Coverage flip 33+12 → 28+17 (§26.5 / §28.9)
- Discharges SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008
  (5 MODEL-1 PARTIALs transitively gated on §17.5)

This PR (D) ships the binding criterion as durable spec. PR E
ships the fix. §29 records the discharge.

Spec: SPEC-SHIP-TWO-001 §28.8
References:
- §27 (PR #1084) — P3 binding criterion verdict (18.23× ratio)
- §28 (PR #1085) — root cause refined to F32 vs Q4K matmul
- evidence/ship-007-apr-vs-gguf-2026-04-27/ — full sub-FFN bisection
- feedback_fix_root_cause_never_route_around.md
- contracts/qwen2-e2e-verification-v1.yaml (sibling MODEL-1 contract)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant