docs(ship-two-001): §31 — SHIP-007 root cause PINNED to qkv_bias (std=10.24) — v2.75 → v2.76#1090
Merged
Merged
Conversation
1bd5dd3 to
72ea3b4
Compare
…=10.24) — spec v2.75.0 → v2.76.0
Live three-stage bisection on canonical 7B teacher pinpoints the divergence
point exactly. Per §30.4's falsifiable next-investigation step, captured layer-0
qkv at four stages with prompt "What is 2+2?":
| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES — Q4K tolerance |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |
The 9× std blowup happens ENTIRELY at the qkv_bias addition step
(pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias
matches APR's existing trace. K-part bias is most extreme (post-bias
std=29.49).
PR E v2 is now scoped to ONE specific investigation per §31.4:
- dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
- dump GGUF's same 3 tensors
- byte-compare:
- if APR != GGUF, the GGUF→APR converter is broken
- if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting
§31 falsification chain (now closed at the root):
§15.4 GPU eliminated → §16 APR CPU isolated → §17 (layer 3, FFN)
→ §23 (layer 3, ffn_swigl) → §27 ratio 18.23×
→ §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel
comparison)
→ §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)
The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages
found it in one pass.
Drift-prevention test for next session (per §31.5): assert per-layer
|APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF) < 0.10.
Files:
- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0
Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…=GGUF (trace point mismatch) — v2.76 → v2.77 §31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24). Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG: | Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff | |--------|---------:|--------:|----------:|---------:|---------:|---------:| | q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 | | k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 | | v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 | APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained qkv_bias values; both formats store/load them correctly. So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)? **TRACE-CAPTURE-POINT MISMATCH.** GGUF (gguf/inference/forward/traced.rs:144): - qkv_stats captured AFTER scratch_attention_block writes scratch.qkv - BUT BEFORE the per-Q/K/V bias add at results.rs:216-226 - = PRE-BIAS measurement → std=1.14 APR (apr_transformer/pmat-260.rs:331-334): - matmul writes `qkv` then `add_bias(qkv, bias)` in-place - Trace captured AFTER bias add - = POST-BIAS measurement → std=10.33 Both forward passes correctly apply qkv_bias. The 9× gap exists only in the TRACE STATISTICS, not in the actual computation. Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies to 18× ffn_swigl ratio. That's STILL where the bug surface is — exactly as §28 originally said. §30's investigation (which refuted §28) only tested LAYER 0 QKV matmul. LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path. PR E v3 scope: Run §31-style bisection AT LAYER 3 with the proper trace capture points, comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at matched points per PR #1066/#1067 forward_traced sub-FFN slots). Methodology lesson (§32.5): when stat-bisection finds a "smoking gun," ALWAYS verify with byte-level comparison against the reference. Stats can mislead when measurement points differ. Toyota Way: verify physical state (byte equality), not just symptoms (statistical gaps). Files: - crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable) - evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt - §32 spec section (6 subsections) - §31 marked SUPERSEDED in spec - Header v2.76.0 → v2.77.0 Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific bisection localizes the actual divergence point. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s ARE byte-identical APR=GGUF
Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level
comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's
.apr and .gguf files. Result:
ffn_gate.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
ffn_up.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
ffn_down.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
layer-0 ffn_gate Q4K (sanity) → ✓ APR ≡ GGUF byte-for-byte
So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36×
ratio per existing trace) does NOT come from differing weight bytes.
This eliminates the GGUF→APR converter as the bug surface for layer 3.
Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias
byte-identical), the elimination chain is now:
- QKV matmul kernel: ✓ correct (§30)
- QKV bias bytes: ✓ correct (§32)
- Layer-3 FFN weight bytes: ✓ correct (this commit)
The remaining hypothesis: cumulative layer-by-layer F32 precision drift
through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36
(10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating
~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different
between the two formats to push silu into different saturation regions,
producing the 18× ffn_swigl ratio.
Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced
in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default
(zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to
populate them. The existing apr-trace.txt and gguf-trace.txt evidence
files (2026-04-27) were generated when PR B was applied locally to the
binary — those numbers are valid but require PR B to land on main for
reproducibility.
Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt
Coverage scoreboard unchanged. Investigation continues; PR E v3 scope
narrows further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
72ea3b4 to
c1967e7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Live three-stage bisection on canonical 7B teacher pinpoints SHIP-007's root cause exactly. The 9× std blowup that propagates to layer-3's 18× ffn_swigl ratio happens entirely at the qkv_bias addition step.
qkv_biasitselfPre-bias APR matmul output (std=0.92) agrees with GGUF (std=1.14) within Q4K tolerance — the matmul is correct. The bug is in how
qkv_biasis loaded or stored in the .apr file.Falsification chain (now closed at the root)
The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages found it in one pass.
PR E v2 scope (next session)
Per §31.4, ONE specific investigation:
blk.0.attn_q.bias/attn_k.bias/attn_v.biasbytescrates/aprender-core/src/format/converter/)load_qkv_biasinmod_dequant_q4k_apr.rs:210-236) is misinterpretingCoverage impact
Files
crates/aprender-serve/examples/diag_qkv_bisection_layer0.rsevidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txtevidence/ship-007-qkv-bisection-2026-04-27/findings.mdTest plan
cargo build --release -p aprender-serve --example diag_qkv_bisection_layer0Methodology
§30 falsified §28's hypothesis. §31's bisection localized the bug ONE STAGE PER ITERATION (4 stages tested in one pass). Toyota Way "five whys" framework: §28's "ffn_swigl matmul precision" was a SYMPTOMATIC site; §31's "qkv_bias upstream of all this" is the CAUSAL site. The cascade through silu saturation amplifies a 9× layer-0 gap into 18× layer-3 ffn_swigl.
🤖 Generated with Claude Code