docs(ship-two-001): §31 — SHIP-007 root cause PINNED to qkv_bias (std=10.24) — v2.75 → v2.76#1090

Merged
noahgift merged 3 commits into main from feat/spec-31-ship-007-qkv-bisection on Apr 27, 2026
@noahgift

Contributor

Summary

Live three-stage bisection on canonical 7B teacher pinpoints SHIP-007's root cause exactly. The 9× std blowup that propagates to layer-3's 18× ffn_swigl ratio happens entirely at the qkv_bias addition step.

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | input |
| Post-RMSNorm | -8e-5 | 0.221 | input |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES (Q4K tolerance) |
| qkv_bias itself | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches existing APR trace |

Pre-bias APR matmul output (std=0.92) agrees with GGUF (std=1.14) within Q4K tolerance — the matmul is correct. The bug is in how qkv_bias is loaded or stored in the .apr file.
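
The per-stage numbers in the table are plain mean/std summaries over a flattened activation tensor. A minimal sketch of that statistic follows; `mean_std` and `std_ratio` are hypothetical names, and whether the actual diagnostic uses the population or the sample standard deviation is not stated here, so the population form is an assumption.

```rust
/// Mean and population standard deviation of a tensor flattened to f32.
/// Accumulates in f64 to limit rounding error on large tensors.
/// Returns None for an empty slice. (Hypothetical helper, not from the repo.)
fn mean_std(values: &[f32]) -> Option<(f64, f64)> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().map(|&v| v as f64).sum::<f64>() / n;
    let var = values
        .iter()
        .map(|&v| {
            let d = v as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    Some((mean, var.sqrt()))
}

/// Ratio of an observed std to a reference std,
/// e.g. post-bias APR (10.329) against the GGUF trace value (1.14).
fn std_ratio(observed: f64, reference: f64) -> f64 {
    observed / reference
}
```

With the table's values, `std_ratio(10.329, 1.14)` comes out just above 9, which is the "9×" gap quoted in the summary.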

Falsification chain (now closed at the root)

§15.4 GPU eliminated → §16 APR CPU isolated → §17 (layer 3, FFN)
→ §23 (layer 3, ffn_swigl) → §27 ratio 18.23×
→ §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel
  comparison — q4k_layers populated, F32 fused-qkv ≡ Q4K dispatch)
→ §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages found it in one pass.

PR E v2 scope (next session)

Per §31.4, ONE specific investigation:

  • Dump APR's blk.0.attn_q.bias / attn_k.bias / attn_v.bias bytes
  • Dump GGUF's same 3 tensors
  • Byte-compare:
    • If APR != GGUF, the GGUF→APR converter is broken (crates/aprender-core/src/format/converter/)
    • If APR == GGUF, the loader (load_qkv_bias in mod_dequant_q4k_apr.rs:210-236) is misinterpreting
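
The decision rule in the list above can be sketched as a raw byte comparison. This is only an illustration: `first_mismatch`, `suspect`, and the `Suspect` enum are hypothetical names, and the sketch assumes both tensors have already been read into byte buffers with the same dtype and layout.

```rust
/// Which component the byte comparison points at. (Hypothetical type.)
#[derive(Debug, PartialEq)]
enum Suspect {
    /// Bytes differ: the GGUF→APR converter wrote something different.
    Converter,
    /// Bytes match: the stored data is fine, so the loader is the suspect.
    Loader,
}

/// Offset of the first differing byte, or None if the buffers are identical.
/// A length mismatch with an identical common prefix reports the shorter length.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Apply the rule: APR != GGUF implicates the converter, APR == GGUF the loader.
fn suspect(apr_bytes: &[u8], gguf_bytes: &[u8]) -> Suspect {
    if first_mismatch(apr_bytes, gguf_bytes).is_some() {
        Suspect::Converter
    } else {
        Suspect::Loader
    }
}
```

Reporting the first mismatching offset, rather than a bare boolean, makes it easier to tell a header or alignment problem from a corruption in the middle of the tensor.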

Coverage impact

  • Scoreboard unchanged this PR: 15+33 (still pre-DISCHARGE)
  • Will flip to 20+28 when PR E v2 lands (5 MODEL-1 PARTIALs discharge)

Files

  • Spec: §31 added (~80 lines, 8 subsections)
    • 31.1 The decisive empirical bisection
    • 31.2 The verdict
    • 31.3 Falsification chain (now closed at the root)
    • 31.4 PR E v2 scope (one named site to investigate)
    • 31.5 Drift-prevention test (immediate)
    • 31.6 Coverage scoreboard impact
    • 31.7 Methodology note — why this iteration succeeded
    • 31.8 Files
  • Diagnostic (re-runnable on noah-Lambda-Vector):
    • crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs
  • Evidence (live RTX 4090 output):
    • evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
    • evidence/ship-007-qkv-bisection-2026-04-27/findings.md

Test plan

  • Diagnostic builds with cargo build --release -p aprender-serve --example diag_qkv_bisection_layer0
  • Diagnostic ran live on noah-Lambda-Vector RTX 4090; results match expected APR trace numbers (post-bias std=10.33)
  • Spec v2.76.0 self-consistent with §29 scoreboard

Methodology

§30 falsified §28's hypothesis. §31's bisection then tested 4 stages in one pass, localizing the bug one stage per iteration. In Toyota Way "five whys" terms, §28's "ffn_swigl matmul precision" was a SYMPTOMATIC site, while §31's "qkv_bias upstream of all this" is the CAUSAL site. The cascade through silu saturation amplifies a 9× layer-0 gap into the 18× layer-3 ffn_swigl ratio.

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 27, 2026 14:27
@noahgift noahgift force-pushed the feat/spec-31-ship-007-qkv-bisection branch 2 times, most recently from 1bd5dd3 to 72ea3b4 Compare April 27, 2026 15:54
noahgift and others added 3 commits April 27, 2026 18:56
…=10.24) — spec v2.75.0 → v2.76.0

Live three-stage bisection on canonical 7B teacher pinpoints the divergence
point exactly. Per §30.4's falsifiable next-investigation step, captured layer-0
qkv at four stages with prompt "What is 2+2?":

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES — Q4K tolerance |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |

The 9× std blowup happens ENTIRELY at the qkv_bias addition step
(pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias
matches APR's existing trace. K-part bias is most extreme (post-bias
std=29.49).

PR E v2 is now scoped to ONE specific investigation per §31.4:

  - dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
  - dump GGUF's same 3 tensors
  - byte-compare:
    - if APR != GGUF, the GGUF→APR converter is broken
    - if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting

§31 falsification chain (now closed at the root):

  §15.4 GPU eliminated → §16 APR CPU isolated → §17 (layer 3, FFN)
  → §23 (layer 3, ffn_swigl) → §27 ratio 18.23×
  → §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel
    comparison)
  → §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages
found it in one pass.

Drift-prevention test for next session (per §31.5): assert per-layer
|APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF) < 0.10.
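
A minimal sketch of that assertion, under stated assumptions: the "GGUF" in the denominator is read as the GGUF std, the biases are already loaded as f32 slices (one APR/GGUF pair per layer), and `std_dev`, `relative_std_drift`, and `check_qkv_bias_drift` are hypothetical names rather than functions from the repo.

```rust
/// Population standard deviation, accumulated in f64. Empty input yields 0.0.
fn std_dev(values: &[f32]) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    let n = values.len() as f64;
    let mean = values.iter().map(|&v| v as f64).sum::<f64>() / n;
    let var = values
        .iter()
        .map(|&v| (v as f64 - mean).powi(2))
        .sum::<f64>()
        / n;
    var.sqrt()
}

/// |std(APR) - std(GGUF)| / max(eps, std(GGUF)).
fn relative_std_drift(apr: &[f32], gguf: &[f32], eps: f64) -> f64 {
    let a = std_dev(apr);
    let g = std_dev(gguf);
    (a - g).abs() / g.max(eps)
}

/// Check every layer's (APR bias, GGUF bias) pair against the tolerance.
/// Returns the first offending layer index and its drift in the error message.
fn check_qkv_bias_drift(
    layers: &[(Vec<f32>, Vec<f32>)],
    eps: f64,
    tol: f64,
) -> Result<(), String> {
    for (i, (apr, gguf)) in layers.iter().enumerate() {
        let drift = relative_std_drift(apr, gguf, eps);
        if drift >= tol {
            return Err(format!("layer {i}: qkv_bias std drift {drift:.4} >= {tol}"));
        }
    }
    Ok(())
}
```

A test would call `check_qkv_bias_drift(&layers, 1e-12, 0.10)` and fail on `Err`; the `eps` floor only guards against a zero-variance reference bias.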

Files:
- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0

Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…=GGUF (trace point mismatch) — v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24).
Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained
qkv_bias values; both formats store/load them correctly.
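
The "max diff" and "RMS diff" columns above can be computed as in the sketch below. It is an illustration only, not the code in diag_compare_qkv_bias.rs; `TensorDiff` and `tensor_diff` are hypothetical names, and both tensors are assumed to be already dequantized to f32.

```rust
/// Element-wise difference summary between two tensors. (Hypothetical type.)
#[derive(Debug)]
struct TensorDiff {
    /// Largest absolute element-wise difference.
    max_abs: f64,
    /// Root-mean-square of the element-wise differences.
    rms: f64,
}

/// Compare two equally sized f32 tensors.
/// Returns None if the lengths differ or the tensors are empty.
fn tensor_diff(a: &[f32], b: &[f32]) -> Option<TensorDiff> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let mut max_abs = 0.0f64;
    let mut sum_sq = 0.0f64;
    for (&x, &y) in a.iter().zip(b) {
        let d = (x as f64 - y as f64).abs();
        max_abs = max_abs.max(d);
        sum_sq += d * d;
    }
    Some(TensorDiff {
        max_abs,
        rms: (sum_sq / a.len() as f64).sqrt(),
    })
}
```

A zero numeric diff is slightly weaker than byte equality (for example, +0.0 and -0.0 compare equal as floats), so a raw byte comparison remains the stricter check.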

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)?
**TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):
  - qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
  - BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
  - = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):
  - matmul writes `qkv` then `add_bias(qkv, bias)` in-place
  - Trace captured AFTER bias add
  - = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in
the TRACE STATISTICS, not in the actual computation.
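
A toy illustration of the mismatch, not the actual forward passes: one computation produces a single output, yet the reported std depends entirely on whether the trace is captured before or after the bias add. All names here (`std_dev`, `add_bias`, `QkvTrace`, `qkv_with_trace`) are hypothetical.

```rust
/// Population standard deviation, accumulated in f64. Empty input yields 0.0.
fn std_dev(values: &[f32]) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    let n = values.len() as f64;
    let mean = values.iter().map(|&v| v as f64).sum::<f64>() / n;
    let var = values
        .iter()
        .map(|&v| (v as f64 - mean).powi(2))
        .sum::<f64>()
        / n;
    var.sqrt()
}

/// In-place element-wise bias add.
fn add_bias(qkv: &mut [f32], bias: &[f32]) {
    for (x, b) in qkv.iter_mut().zip(bias) {
        *x += *b;
    }
}

/// Stats captured at the two candidate trace points.
struct QkvTrace {
    pre_bias_std: f64,  // the capture point described for the GGUF trace
    post_bias_std: f64, // the capture point described for the APR trace
}

/// Apply the bias once and record stats on both sides of the add.
fn qkv_with_trace(matmul_out: &[f32], bias: &[f32]) -> (Vec<f32>, QkvTrace) {
    let mut qkv = matmul_out.to_vec();
    let pre_bias_std = std_dev(&qkv);
    add_bias(&mut qkv, bias);
    let post_bias_std = std_dev(&qkv);
    (qkv, QkvTrace { pre_bias_std, post_bias_std })
}
```

With a matmul output of `[1, -1, 1, -1]` and a bias of `[10, -10, 10, -10]`, the pre-bias std is 1 and the post-bias std is 11. Comparing one runtime's pre-bias number with another's post-bias number would show an 11× "gap" even though both produce the same `qkv`.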

Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate
diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies
to 18× ffn_swigl ratio. That's STILL where the bug surface is — exactly as
§28 originally said.
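
For reference, a sketch of the standard SiLU and SwiGLU definitions shows why a linear scale difference in the gate does not stay linear at the output. This uses the textbook formulas and does not reproduce the 1.36× or 18× figures, which depend on the real activation distributions; `silu` and `swiglu` are illustrative names, not the repo's kernels.

```rust
/// SiLU (swish): x * sigmoid(x) = x / (1 + e^(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU gating: silu(gate) * up, element-wise over equally sized slices.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```

Because SiLU is nearly zero for large negative inputs and nearly the identity for large positive ones, scaling the gate by a constant changes the output by a different factor at each element. For example, `silu(1.36) / silu(1.0)` is about 1.48 rather than 1.36.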

§30's investigation (which refuted §28) only tested LAYER 0 QKV matmul.
LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path. PR E v3 scope:

  Run §31-style bisection AT LAYER 3 with the proper trace capture points,
  comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at
  matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun,"
ALWAYS verify with byte-level comparison against the reference. Stats can
mislead when measurement points differ. Toyota Way: verify physical state
(byte equality), not just symptoms (statistical gaps).

Files:
- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific
bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s ARE byte-identical APR=GGUF

Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level
comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's
.apr and .gguf files. Result:

  ffn_gate.weight Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  ffn_up.weight   Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  ffn_down.weight Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  layer-0 ffn_gate Q4K (sanity)           →  ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36×
ratio per existing trace) does NOT come from differing weight bytes.
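
The 38,191,104-byte figure can be sanity-checked by arithmetic. The sketch below assumes the standard ggml Q4_K layout (256 weights per 144-byte super-block) and Qwen2.5-7B's published dimensions (hidden size 3584, FFN intermediate size 18944); neither number appears in this PR, and `q4k_byte_len` is a hypothetical helper.

```rust
/// Weights per Q4_K super-block (assumed from the ggml format).
const Q4K_BLOCK_ELEMS: usize = 256;
/// Bytes per Q4_K super-block: 4 (two f16 scales) + 12 (packed sub-scales) + 128 (4-bit quants).
const Q4K_BLOCK_BYTES: usize = 144;

/// Expected byte length of a rows x cols tensor stored as Q4_K.
/// Returns None on overflow or if the element count is not a whole number of super-blocks.
fn q4k_byte_len(rows: usize, cols: usize) -> Option<usize> {
    let elems = rows.checked_mul(cols)?;
    if elems % Q4K_BLOCK_ELEMS != 0 {
        return None;
    }
    (elems / Q4K_BLOCK_ELEMS).checked_mul(Q4K_BLOCK_BYTES)
}
```

Under those assumptions, 18944 × 3584 = 67,895,296 weights, which is 265,216 super-blocks and 38,191,104 bytes, matching the size reported for all three layer-3 FFN tensors.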

This eliminates the GGUF→APR converter as the bug surface for layer 3.
Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias
byte-identical), the elimination chain is now:

  - QKV matmul kernel: ✓ correct (§30)
  - QKV bias bytes: ✓ correct (§32)
  - Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift
through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36
(10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating
~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different
between the two formats to push silu into different saturation regions,
producing the 18× ffn_swigl ratio.

Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced
in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default
(zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to
populate them. The existing apr-trace.txt and gguf-trace.txt evidence
files (2026-04-27) were generated when PR B was applied locally to the
binary — those numbers are valid but require PR B to land on main for
reproducibility.

Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope
narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/spec-31-ship-007-qkv-bisection branch from 72ea3b4 to c1967e7 Compare April 27, 2026 16:56
@noahgift noahgift merged commit 0f37400 into main Apr 27, 2026
10 checks passed
@noahgift noahgift deleted the feat/spec-31-ship-007-qkv-bisection branch April 27, 2026 17:14