docs(ship-007): five-whys + root-cause analysis recorded — spec v2.58.0 → v2.59.0#1060
Merged
Merged
Conversation
….0 → v2.59.0
Records the SHIP-007 GQA-7:1 parity bug investigation thread captured
during the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and
updates the atomic-next-action banner to reference it.
No discharge promotion (coverage tally unchanged at 33+12). This is
investigation-recording, not rule promotion.
What §15 contains:
15.1 Surface symptoms — the two independent observations:
- apr bench parity gate: CPU argmax 334 vs GPU argmax 8127,
cosine=−0.005 (anti-correlated, structural divergence)
- apr qa --json on GGUF: format_parity reports
GGUF argmax=17 != SafeTensors argmax=59260
- 370M MODEL-2 training works on the same RTX 4090, so the bug
is GQA-7:1-specific, not GPU-host-wide.
15.2 Five Whys — traces the surface symptom from the parity-gate
failure down to the load-bearing edge case: GQA's
num_heads ≠ num_kv_heads makes layout-then-reshape order
non-commutative, while MHA (num_heads = num_kv_heads, where
370M training lives) is invariant under either order.
15.3 Root-cause hypothesis: a GQA-7:1-specific layout-vs-reshape
ordering bug on K and/or V projections that causes CPU and
GPU forward to consume the same physical bytes with different
effective head-axis interpretations, compounding through 28
transformer blocks into anti-correlated logits.
15.4 Falsifiable next investigation step: a single-tensor Q × K^T
element-by-element comparison on
model.layers.0.self_attn.k_proj.weight from the row-major-
guaranteed APR (SHIP-003 PR #1059), then iterate through V,
attention scores, weights, output, and o_proj until the
divergent stage is named. Per feedback_apr_trace_not_eprintln.md,
this is the proper TraceStep-extension path, not eprintln!.
15.5 Side-bug noted: apr diff --transpose-aware appears not to apply
the transpose before cosine computation when shapes are [a,b]
vs [b,a]. Filed as a separate apr-cli ticket. Does not affect
SHIP-007 root-cause analysis — SafeTensors↔APR same-shape
comparison via SHIP-003 #1059 confirmed weight-byte parity at
cos≥0.9999999.
15.6 Blast radius inventory: the remaining 5 MODEL-1 PARTIALs
(SHIP-002 / 005 / 006 / 007 / 008) all transitively block on
this single fix. A single root-cause fix discharges all 5
simultaneously — highest-leverage MODEL-1 work item remaining.
15.7 Methodological note: entire investigation conducted using only
apr CLI tooling (apr diff, apr qa, apr bench, apr inspect).
Zero eprintln! injected into forward.rs / ffn_block.rs / CUDA
kernels. Honors feedback_apr_trace_not_eprintln.md.
Evidence chain:
1. apr diff --values --limit 339 (post-#1058 mmap fix, 192s on
15GB safetensors / 8GB APR pair) — SafeTensors↔APR cos≥0.9999999.
2. apr diff --values --limit 3 on GGUF↔APR, SafeTensors↔GGUF —
revealed shape asymmetry: GGUF [in,out] vs APR/SafeTensors
[out,in].
3. apr qa --json on both APR and GGUF teachers — revealed cross-
format argmax divergence.
4. SHIP-007 GPU parity gate's existing telemetry — confirmed
structural divergence.
Methodological consistency with the 6 PR cascade preceding this
amendment: pure stack tooling, contract-backed numbers, drift-
prevention pattern. This commit is documentation only — no Rust
changes, no contract changes — but pins the investigation thread
durably in the spec where future investigators (and the next
multi-PR TraceStep extension effort) will find it.
Spec v2.58.0 → v2.59.0. Atomic-next-action banner updated to point
at §15 as the load-bearing investigation surface.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4 tasks
noahgift
added a commit
that referenced
this pull request
Apr 26, 2026
…sifier The §15.4 falsifier from spec v2.59.0 (PR #1060): a CPU vs GPU incremental_attention_gpu kernel parity test on the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584). The peer test `gqa_attention_parity.rs` covers TinyLlama's GQA-8:1 (NUM_HEADS=32, head_dim=64) — that test passes on RTX 4090 but doesn't exercise the 7:1 ratio (non-power-of-2 q_per_kv = 7) the SHIP-007-blocking 7B teacher specifically uses. Three tests added to `crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs`: 1. `ship_007_qwen2_gqa_7_1_head_mapping_property` — pure arithmetic check on the GQA-7:1 head mapping (q_head/q_per_kv) for all 28 q_heads. Verifies the kernel formula `(q_head * NUM_KV_HEADS) / NUM_HEADS` produces identical mapping for the canonical 28:4 ratio. 2. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token` (#[ignore]) — first-token case (no cache). For attention over a single K/V position, softmax([single_score]) = [1.0] so output = current_v expanded across 4 KV heads to 28 Q heads. CPU reference uses the mirror of `cpu_gqa_attention` from the peer test parameterized at the Qwen shape. Tolerance: 1e-4 elementwise across 3584 outputs. 3. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token` (#[ignore]) — second-token case with one populated K/V cache position. Tests the full attention mechanism with KV cache state. Tolerance: 1e-3 elementwise (slightly looser to accommodate cumulative FP rounding over the 2-position softmax + weighted-sum). Result on noah-Lambda-Vector RTX 4090 (CUDA 8.9): test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token ... ok test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok test ship_007_qwen2_gqa_7_1_head_mapping_property ... ok All three pass. **The GQA-7:1 incremental_attention_gpu kernel is NOT the SHIP-007 root cause.** CPU and GPU outputs are bit-equivalent (within FP rounding tolerance) for the canonical Qwen2.5-Coder-7B shape on synthetic inputs. Materially narrows §15.4's surviving suspect list: - ✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass) - ✅ Q × K^T per-head correct - ✅ Softmax-weighted V aggregation correct - ✅ Scale factor (1/√head_dim) at head_dim=128 correct - ✅ KV cache state-management correct - 🟡 Surviving suspects: Q/K/V projection matmul (BEFORE attention), o_proj (AFTER attention), RMSNorm, FFN, LM head, multi-layer KV cache layout, residual stream propagation. This test serves as a durable regression guard against the GQA-7:1 attention kernel proper — any future refactor of incremental attention that breaks 7:1-specific behavior will flip these tests red on `cargo test --features cuda --release -- --ignored`. Spec §15.4 (PR #1060) anticipated this test and named the proper follow-up: a single-tensor matmul parity test on Q/K/V projection weights from the row-major-correct APR (sha256 a394dd28...0ddeb28, verified by SHIP-003 PR #1059). Verification: cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity \ --features cuda --release -- --ignored → 3 passed; 0 failed; 0 ignored 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
Apr 26, 2026
…sifier (#1061) The §15.4 falsifier from spec v2.59.0 (PR #1060): a CPU vs GPU incremental_attention_gpu kernel parity test on the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584). The peer test `gqa_attention_parity.rs` covers TinyLlama's GQA-8:1 (NUM_HEADS=32, head_dim=64) — that test passes on RTX 4090 but doesn't exercise the 7:1 ratio (non-power-of-2 q_per_kv = 7) the SHIP-007-blocking 7B teacher specifically uses. Three tests added to `crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs`: 1. `ship_007_qwen2_gqa_7_1_head_mapping_property` — pure arithmetic check on the GQA-7:1 head mapping (q_head/q_per_kv) for all 28 q_heads. Verifies the kernel formula `(q_head * NUM_KV_HEADS) / NUM_HEADS` produces identical mapping for the canonical 28:4 ratio. 2. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token` (#[ignore]) — first-token case (no cache). For attention over a single K/V position, softmax([single_score]) = [1.0] so output = current_v expanded across 4 KV heads to 28 Q heads. CPU reference uses the mirror of `cpu_gqa_attention` from the peer test parameterized at the Qwen shape. Tolerance: 1e-4 elementwise across 3584 outputs. 3. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token` (#[ignore]) — second-token case with one populated K/V cache position. Tests the full attention mechanism with KV cache state. Tolerance: 1e-3 elementwise (slightly looser to accommodate cumulative FP rounding over the 2-position softmax + weighted-sum). Result on noah-Lambda-Vector RTX 4090 (CUDA 8.9): test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token ... ok test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok test ship_007_qwen2_gqa_7_1_head_mapping_property ... ok All three pass. **The GQA-7:1 incremental_attention_gpu kernel is NOT the SHIP-007 root cause.** CPU and GPU outputs are bit-equivalent (within FP rounding tolerance) for the canonical Qwen2.5-Coder-7B shape on synthetic inputs. Materially narrows §15.4's surviving suspect list: - ✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass) - ✅ Q × K^T per-head correct - ✅ Softmax-weighted V aggregation correct - ✅ Scale factor (1/√head_dim) at head_dim=128 correct - ✅ KV cache state-management correct - 🟡 Surviving suspects: Q/K/V projection matmul (BEFORE attention), o_proj (AFTER attention), RMSNorm, FFN, LM head, multi-layer KV cache layout, residual stream propagation. This test serves as a durable regression guard against the GQA-7:1 attention kernel proper — any future refactor of incremental attention that breaks 7:1-specific behavior will flip these tests red on `cargo test --features cuda --release -- --ignored`. Spec §15.4 (PR #1060) anticipated this test and named the proper follow-up: a single-tensor matmul parity test on Q/K/V projection weights from the row-major-correct APR (sha256 a394dd28...0ddeb28, verified by SHIP-003 PR #1059). Verification: cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity \ --features cuda --release -- --ignored → 3 passed; 0 failed; 0 ignored 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Records the SHIP-007 GQA-7:1 parity bug investigation thread from the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and updates the atomic-next-action banner. No discharge promotion (coverage tally unchanged at 33+12) — this is investigation-recording, not rule promotion.
Surface symptoms
apr benchparity gate on 7B Q4_K APRapr qa --jsonon 7B Q4_K GGUF (format_paritygate)Counter-evidence: 370M MODEL-2 from-scratch training works on the same RTX 4090 — the bug is GQA-7:1-specific.
Five Whys (compressed)
apr benchfail at parity? — CPU/GPU argmax structurally differ.transpose-then-reshape≠reshape-then-transposewhennum_heads ≠ num_kv_heads. CPU and GPU forward pick different orderings.Root-cause hypothesis
A GQA-7:1-specific layout-vs-reshape ordering bug on K and/or V projections such that CPU forward and GPU forward consume the same physical bytes with different effective head-axis interpretations.
Consistent with: cosine=−0.005 (structural, not noisy); 370M training working; cross-format finding; SafeTensors↔APR weight-byte parity (cos≥0.9999999 across 339 tensors via SHIP-003 PR #1059); GGUF↔APR shape asymmetry (
down_projGGUF=[18944, 3584] vs APR=[3584, 18944], expected per LAYOUT-001/002).Falsifiable next investigation step
Single-tensor Q × K^T element-by-element comparison on
model.layers.0.self_attn.k_proj.weightfrom the row-major-guaranteed APR (SHIP-003 PR #1059), CPU vs GPU. If outputs match → bug is downstream of K projection. If not → divergent stage localized to K projection. Iterate through V, attention scores, weights, output, o_proj.Per
feedback_apr_trace_not_eprintln.md: extendTraceStepenum with intra-attention/intra-FFN variants behind a contract, add--device cpu|gputoapr trace, thenapr diff cpu_trace.json gpu_trace.json --values. No raweprintln!.Blast radius — single fix discharges 5 MODEL-1 PARTIALs
apr runapr evalapr qa(format_parity / ollama_parity / ptx_parity)apr bench(parity-gate itself)apr runHighest-leverage MODEL-1 work item remaining.
Side-bug noted
apr diff --transpose-awarereturns cos=0.0003 on transposed shapes despite the flag — the help text claims "Account for transpose when comparing (GGUF col-major vs APR row-major)" but the cosine computation doesn't appear to apply the transpose. Filed as a separate apr-cli ticket; does NOT affect SHIP-007 root-cause analysis.Methodological note
Entire investigation conducted using only apr CLI tooling (
apr diff,apr qa,apr bench,apr inspect). Zeroeprintln!injected into forward.rs / ffn_block.rs / CUDA kernels. Honorsfeedback_apr_trace_not_eprintln.md.Files changed
docs/specifications/aprender-train/ship-two-models-spec.md🤖 Generated with Claude Code