docs(ship-two-001): §31 — SHIP-007 root cause PINNED to qkv_bias (std=10.24) — v2.75 → v2.76 by noahgift · Pull Request #1090 · paiml/aprender

noahgift · 2026-04-27T14:27:43Z

Summary

A live staged bisection on the canonical 7B teacher pinpoints SHIP-007's root cause. The 9× std blowup at layer 0 (APR post-bias std=10.33 vs GGUF std=1.14), which propagates to layer 3's 18× ffn_swigl ratio, happens entirely at the qkv_bias addition step.

Stage	mean	std	Match GGUF (1.14)?
Embedding	1e-5	0.0174	input
Post-RMSNorm	-8e-5	0.221	input
Post-matmul, pre-bias	-0.0159	0.925	YES — Q4K tolerance
`qkv_bias` itself	+0.272	10.243	⚠ ~10× too large
Post-bias	+0.256	10.329	matches existing APR trace

Pre-bias APR matmul output (std=0.92) agrees with GGUF (std=1.14) within Q4K tolerance — the matmul is correct. The bug is in how qkv_bias is loaded or stored in the .apr file.
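
The per-stage numbers in the table are plain mean and standard-deviation summaries of one captured tensor. The sketch below shows how such a summary could be computed and why adding a wide bias dominates the post-bias std. It is not the code in diag_qkv_bisection_layer0.rs: the helper names `mean_std` and `add_bias`, the toy values, and the choice of population (not sample) standard deviation are all assumptions made for illustration.

```rust
/// Mean and population standard deviation of one captured tensor.
/// Accumulates in f64 so that large tensors do not lose precision.
fn mean_std(xs: &[f32]) -> (f64, f64) {
    if xs.is_empty() {
        return (0.0, 0.0);
    }
    let n = xs.len() as f64;
    let mean = xs.iter().map(|&x| f64::from(x)).sum::<f64>() / n;
    let var = xs.iter().map(|&x| (f64::from(x) - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

/// Element-wise bias add (hypothetical helper, standing in for the step
/// the table bisects around).
fn add_bias(pre: &[f32], bias: &[f32]) -> Vec<f32> {
    assert_eq!(pre.len(), bias.len(), "bias must match activation length");
    pre.iter().zip(bias).map(|(p, b)| p + b).collect()
}

fn main() {
    // Toy values (hypothetical): small pre-bias activations, wide bias.
    let pre_bias = [1.0_f32, -1.0, 1.0, -1.0];
    let bias = [10.0_f32, -10.0, -10.0, 10.0];
    let post_bias = add_bias(&pre_bias, &bias);

    let stages = [
        ("pre-bias", &pre_bias[..]),
        ("bias", &bias[..]),
        ("post-bias", &post_bias[..]),
    ];
    for (stage, data) in stages {
        let (mean, std) = mean_std(data);
        println!("{stage:<10} mean={mean:+.4} std={std:.4}");
    }

    // Ratio quoted in the summary, computed from the table's own numbers.
    println!("post-bias / GGUF std ratio = {:.2}", 10.329_f64 / 1.14);
}
```

With the toy inputs the post-bias std is close to the bias std, which is the same pattern as the table's 10.243 (bias) and 10.329 (post-bias) rows.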

Falsification chain (now closed at the root)

§15.4 GPU eliminated → §16 APR CPU isolated → §17 (layer 3, FFN)
→ §23 (layer 3, ffn_swigl) → §27 ratio 18.23×
→ §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel
  comparison — q4k_layers populated, F32 fused-qkv ≡ Q4K dispatch)
→ §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages found it in one pass.

PR E v2 scope (next session)

Per §31.4, ONE specific investigation:

Dump APR's blk.0.attn_q.bias / attn_k.bias / attn_v.bias bytes
Dump GGUF's same 3 tensors
Byte-compare:
- If APR != GGUF, the GGUF→APR converter is broken (crates/aprender-core/src/format/converter/)
- If APR == GGUF, the loader (load_qkv_bias in mod_dequant_q4k_apr.rs:210-236) is misinterpreting the stored bytes
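
The decision rule above can be sketched as a small triage function over the two dumped byte buffers. This is only an illustration of the branching logic: `Suspect`, `first_mismatch`, and `triage` are hypothetical names, and the buffers in `main` are made-up stand-ins rather than real tensor dumps.

```rust
/// Which component the byte comparison points at.
#[derive(Debug, PartialEq)]
enum Suspect {
    /// Stored bytes differ, so the GGUF→APR conversion changed them.
    Converter,
    /// Stored bytes are identical, so any difference must arise at load time.
    Loader,
}

/// Offset of the first differing byte, or the shorter length when one
/// buffer is a strict prefix of the other; `None` if byte-identical.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    a.iter()
        .zip(b)
        .position(|(x, y)| x != y)
        .or_else(|| (a.len() != b.len()).then(|| a.len().min(b.len())))
}

/// Applies the two-branch rule from the PR description.
fn triage(apr: &[u8], gguf: &[u8]) -> Suspect {
    match first_mismatch(apr, gguf) {
        Some(_) => Suspect::Converter,
        None => Suspect::Loader,
    }
}

fn main() {
    // Hypothetical stand-ins for a dumped bias tensor: two little-endian f32 values.
    let gguf: Vec<u8> = [0.5_f32, -2.0]
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .collect();
    let apr_same = gguf.clone();
    let mut apr_bad = gguf.clone();
    apr_bad[5] ^= 0x01; // flip one bit inside the second value

    println!("{:?}", triage(&apr_same, &gguf));
    println!(
        "{:?}, first mismatch at byte {:?}",
        triage(&apr_bad, &gguf),
        first_mismatch(&apr_bad, &gguf)
    );
}
```

Reporting the first mismatching offset, rather than only a boolean, makes it easier to tell a header or length problem from a corrupted payload.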

Coverage impact

Scoreboard unchanged this PR: 15+33 (still pre-DISCHARGE)
Will flip to 20+28 when PR E v2 lands (5 MODEL-1 PARTIALs discharge)

Files

Spec: §31 added (~80 lines, 8 subsections)
- 31.1 The decisive empirical bisection
- 31.2 The verdict
- 31.3 Falsification chain (now closed at the root)
- 31.4 PR E v2 scope (one named site to investigate)
- 31.5 Drift-prevention test (immediate)
- 31.6 Coverage scoreboard impact
- 31.7 Methodology note — why this iteration succeeded
- 31.8 Files
Diagnostic (re-runnable on noah-Lambda-Vector):
- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs
Evidence (live RTX 4090 output):
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md

Test plan

Diagnostic builds with cargo build --release -p aprender-serve --example diag_qkv_bisection_layer0
Diagnostic ran live on noah-Lambda-Vector RTX 4090; results match expected APR trace numbers (post-bias std=10.33)
Spec v2.76.0 self-consistent with §29 scoreboard

Methodology

§30 falsified §28's hypothesis. §31's bisection localized the bug ONE STAGE PER ITERATION (4 stages tested in one pass). Toyota Way "five whys" framework: §28's "ffn_swigl matmul precision" was a SYMPTOMATIC site; §31's "qkv_bias upstream of all this" is the CAUSAL site. The cascade through silu saturation amplifies a 9× layer-0 gap into 18× layer-3 ffn_swigl.

🤖 Generated with Claude Code

…=10.24) — spec v2.75.0 → v2.76.0

Live three-stage bisection on canonical 7B teacher pinpoints the divergence point exactly. Per §30.4's falsifiable next-investigation step, captured layer-0 qkv at four stages with prompt "What is 2+2?":

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES — Q4K tolerance |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |

The 9× std blowup happens ENTIRELY at the qkv_bias addition step (pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias matches APR's existing trace. K-part bias is most extreme (post-bias std=29.49).

PR E v2 is now scoped to ONE specific investigation per §31.4:
- dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
- dump GGUF's same 3 tensors
- byte-compare:
  - if APR != GGUF, the GGUF→APR converter is broken
  - if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting

§31 falsification chain (now closed at the root):
§15.4 GPU eliminated → §16 APR CPU isolated → §17 (layer 3, FFN)
→ §23 (layer 3, ffn_swigl) → §27 ratio 18.23×
→ §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel comparison)
→ §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages found it in one pass.

Drift-prevention test for next session (per §31.5): assert per-layer |APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF) < 0.10.

Files:
- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0

Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
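
The drift-prevention assertion quoted in this commit message can be written as a short check. The sketch below assumes the `GGUF` term in the denominator means the GGUF-side std, and it uses made-up per-layer values; `relative_std_drift` and `drifting_layers` are hypothetical names, not functions from the repository.

```rust
/// Relative drift between the APR and GGUF bias standard deviations:
/// |apr - gguf| / max(eps, gguf). `eps` guards against a zero GGUF std.
fn relative_std_drift(apr_std: f64, gguf_std: f64, eps: f64) -> f64 {
    (apr_std - gguf_std).abs() / gguf_std.max(eps)
}

/// Indices of layers whose drift reaches or exceeds `tol`.
fn drifting_layers(apr: &[f64], gguf: &[f64], eps: f64, tol: f64) -> Vec<usize> {
    assert_eq!(apr.len(), gguf.len(), "need one std per layer on both sides");
    apr.iter()
        .zip(gguf)
        .enumerate()
        .filter(|(_, (a, g))| relative_std_drift(**a, **g, eps) >= tol)
        .map(|(layer, _)| layer)
        .collect()
}

fn main() {
    // Hypothetical per-layer qkv_bias stds; layer 2 is deliberately off by 20%.
    let gguf = [3.0, 2.5, 4.0];
    let apr = [3.0, 2.5, 4.8];

    let bad = drifting_layers(&apr, &gguf, 1e-6, 0.10);
    println!("layers over tolerance: {bad:?}");

    // A real test would fail here instead of printing.
    assert_eq!(bad, vec![2]);
}
```

A check like this compares like with like only if both stds are taken from the stored bias tensors themselves; comparing activation stats captured at different points in the forward pass would not be covered by it.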

…=GGUF (trace point mismatch) — v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24). Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained qkv_bias values; both formats store/load them correctly.

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)? **TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):
- qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
- BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
- = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):
- matmul writes `qkv` then `add_bias(qkv, bias)` in-place
- Trace captured AFTER bias add
- = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in the TRACE STATISTICS, not in the actual computation.

Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies to 18× ffn_swigl ratio. That's STILL where the bug surface is — exactly as §28 originally said. §30's investigation (which refuted §28) only tested LAYER 0 QKV matmul. LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path.

PR E v3 scope: Run §31-style bisection AT LAYER 3 with the proper trace capture points, comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun," ALWAYS verify with byte-level comparison against the reference. Stats can mislead when measurement points differ. Toyota Way: verify physical state (byte equality), not just symptoms (statistical gaps).

Files:
- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
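
The trace-capture-point mismatch described in this commit message is easy to reproduce in miniature: two forward steps that compute the same output can still report different statistics if one records before the bias add and the other after it. The sketch below uses toy data and hypothetical function names; it does not reproduce the code in traced.rs or pmat-260.rs.

```rust
/// Population standard deviation, accumulated in f64.
fn std_dev(xs: &[f32]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().map(|&x| f64::from(x)).sum::<f64>() / n;
    (xs.iter().map(|&x| (f64::from(x) - mean).powi(2)).sum::<f64>() / n).sqrt()
}

/// Output of one traced step plus the statistic the trace recorded.
struct Traced {
    out: Vec<f32>,
    traced_std: f64,
}

/// Records the statistic BEFORE the bias add (the GGUF-style capture point).
fn forward_trace_pre_bias(matmul_out: &[f32], bias: &[f32]) -> Traced {
    let traced_std = std_dev(matmul_out);
    let out = matmul_out.iter().zip(bias).map(|(m, b)| m + b).collect();
    Traced { out, traced_std }
}

/// Records the statistic AFTER an in-place bias add (the APR-style capture point).
fn forward_trace_post_bias(matmul_out: &[f32], bias: &[f32]) -> Traced {
    let mut out = matmul_out.to_vec();
    for (o, b) in out.iter_mut().zip(bias) {
        *o += b;
    }
    let traced_std = std_dev(&out);
    Traced { out, traced_std }
}

fn main() {
    // Toy values (hypothetical): small matmul output, wide bias.
    let matmul_out = [1.0_f32, -1.0, 1.0, -1.0];
    let bias = [10.0_f32, -10.0, -10.0, 10.0];

    let pre = forward_trace_pre_bias(&matmul_out, &bias);
    let post = forward_trace_post_bias(&matmul_out, &bias);

    println!("outputs identical: {}", pre.out == post.out);
    println!("traced std (pre-bias capture):  {:.4}", pre.traced_std);
    println!("traced std (post-bias capture): {:.4}", post.traced_std);
}
```

The outputs compare equal while the traced stds differ by roughly an order of magnitude, so a gap in trace statistics alone does not show that the two computations diverge.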

…s ARE byte-identical APR=GGUF

Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's .apr and .gguf files.

Result:
- ffn_gate.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- ffn_up.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- ffn_down.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- layer-0 ffn_gate Q4K (sanity) → ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36× ratio per existing trace) does NOT come from differing weight bytes. This eliminates the GGUF→APR converter as the bug surface for layer 3.

Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias byte-identical), the elimination chain is now:
- QKV matmul kernel: ✓ correct (§30)
- QKV bias bytes: ✓ correct (§32)
- Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36 (10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating ~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different between the two formats to push silu into different saturation regions, producing the 18× ffn_swigl ratio.

Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default (zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to populate them. The existing apr-trace.txt and gguf-trace.txt evidence files (2026-04-27) were generated when PR B was applied locally to the binary — those numbers are valid but require PR B to land on main for reproducibility.

Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
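
The identical 38,191,104-byte size reported for all three FFN tensors can be sanity-checked from the Q4_K layout. The sketch below assumes ggml's Q4_K super-block layout (256 elements stored in 144 bytes) and the published Qwen2.5-7B dimensions (hidden size 3584, FFN intermediate size 18944); neither figure is stated in this PR, and `q4k_tensor_bytes` is a hypothetical helper.

```rust
/// Elements per Q4_K super-block (QK_K in ggml; assumed here).
const QK_K: usize = 256;

/// Bytes per Q4_K super-block: two f16 values (d, dmin), 12 bytes of packed
/// scales and mins, and 128 bytes holding 256 four-bit quants.
const Q4K_BLOCK_BYTES: usize = 2 * 2 + 12 + QK_K / 2;

/// Expected byte size of a Q4_K tensor with `rows` rows of `cols` elements.
/// Returns `None` when a row is not a whole number of super-blocks.
fn q4k_tensor_bytes(rows: usize, cols: usize) -> Option<usize> {
    if cols % QK_K != 0 {
        return None;
    }
    Some(rows * (cols / QK_K) * Q4K_BLOCK_BYTES)
}

fn main() {
    // Assumed Qwen2.5-7B shapes: hidden 3584, FFN intermediate 18944.
    let (hidden, inter) = (3584, 18944);

    // ffn_gate and ffn_up map hidden -> intermediate; ffn_down maps back.
    println!("ffn_gate / ffn_up: {:?}", q4k_tensor_bytes(inter, hidden));
    println!("ffn_down:          {:?}", q4k_tensor_bytes(hidden, inter));
}
```

Under these assumptions both calls give `Some(38191104)`, which matches the byte counts listed in the commit message and explains why gate, up, and down share one size: each holds 3584 × 18944 elements.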

noahgift enabled auto-merge (squash) April 27, 2026 14:27

noahgift force-pushed the feat/spec-31-ship-007-qkv-bisection branch 2 times, most recently from 1bd5dd3 to 72ea3b4 Compare April 27, 2026 15:54

noahgift and others added 3 commits April 27, 2026 18:56

noahgift force-pushed the feat/spec-31-ship-007-qkv-bisection branch from 72ea3b4 to c1967e7 Compare April 27, 2026 16:56

noahgift merged commit 0f37400 into main Apr 27, 2026
10 checks passed

noahgift deleted the feat/spec-31-ship-007-qkv-bisection branch April 27, 2026 17:14

noahgift merged 3 commits into main from feat/spec-31-ship-007-qkv-bisection
