docs(ship-007): five-whys + root-cause analysis recorded — spec v2.58.0 → v2.59.0 by noahgift · Pull Request #1060 · paiml/aprender

noahgift · 2026-04-25T20:08:00Z

Summary

Records the SHIP-007 GQA-7:1 parity bug investigation thread from the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and updates the atomic-next-action banner. No discharge promotion (coverage tally unchanged at 33+12) — this is investigation-recording, not rule promotion.

Surface symptoms

| Surface | Observation | Numerical signature |
| --- | --- | --- |
| `apr bench` parity gate on 7B Q4_K APR | Fails at CUDA init | CPU argmax=334, GPU argmax=8127, cosine=−0.005, max abs logit Δ=19.5 |
| `apr qa --json` on 7B Q4_K GGUF (`format_parity` gate) | Cross-format parity BROKEN | GGUF argmax=17 != SafeTensors argmax=59260 |

Counter-evidence: 370M MODEL-2 from-scratch training works on the same RTX 4090 — the bug is GQA-7:1-specific.
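
For readers unfamiliar with the three numbers in the signature column, here is a minimal sketch of how argmax, cosine similarity, and max absolute logit delta can be computed from two logit vectors. The function names are illustrative assumptions and this is not the `apr bench` implementation. A cosine near 0 means the two logit vectors are close to orthogonal, which is far outside what rounding differences alone would produce.

```rust
/// Index of the largest logit (first one wins on ties); None for an empty slice.
/// Assumes the input contains no NaN values.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        match best {
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}

/// Cosine similarity accumulated in f64; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "logit vectors must have equal length");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na.sqrt() * nb.sqrt())
    }
}

/// Largest absolute elementwise difference between the two vectors.
fn max_abs_delta(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "logit vectors must have equal length");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

Calling these on the CPU and GPU logits for the same prompt would yield the kind of triple reported in the table.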

Five Whys (compressed)

1. Why does apr bench fail at parity? CPU and GPU argmax differ structurally (334 vs 8127).
2. Why is the cosine near zero (cos=−0.005)? The logits are effectively uncorrelated, which is not noise; it indicates structural divergence somewhere in the forward stack.
3. Why do the paths diverge given the same .apr bytes? They share weight bytes but dispatch through different inference code paths.
4. Why does it surface only on the 7B teacher? GQA-7:1 attention (28 Q heads / 4 KV heads) exercises a code path that the 370M model (full MHA, num_heads = num_kv_heads) does not.
5. Why does GQA expose it? transpose-then-reshape ≠ reshape-then-transpose when num_heads ≠ num_kv_heads, and the CPU and GPU forward passes pick different orderings.
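
A toy sketch of the general non-commutativity behind the last step: splitting a `[seq, heads * head_dim]` buffer into heads by reshape-then-transpose does not give the same element order as reinterpreting the same bytes directly as `[heads, seq, head_dim]`. The function names are hypothetical and this is not the aprender forward path; it only shows that the two orderings disagree in general, not where the real CPU and GPU paths diverge.

```rust
/// View `x` as [seq, heads, head_dim], then transpose to [heads, seq, head_dim].
/// Returns the transposed data as a new row-major buffer.
fn reshape_then_transpose(x: &[f32], seq: usize, heads: usize, head_dim: usize) -> Vec<f32> {
    assert_eq!(x.len(), seq * heads * head_dim);
    let mut out = vec![0.0; x.len()];
    for s in 0..seq {
        for h in 0..heads {
            for d in 0..head_dim {
                let src = (s * heads + h) * head_dim + d; // index in [seq, heads, head_dim]
                let dst = (h * seq + s) * head_dim + d; // index in [heads, seq, head_dim]
                out[dst] = x[src];
            }
        }
    }
    out
}

/// Reinterpret the same bytes directly as [heads, seq, head_dim].
/// A pure reshape moves no data, so the buffer is returned unchanged.
fn reshape_only(x: &[f32]) -> Vec<f32> {
    x.to_vec()
}
```

In this toy the two functions agree when `seq == 1` and disagree for `seq = 2, heads = 2, head_dim = 2`, so a layout mistake of this kind can be invisible on a single position and visible on longer inputs.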

Root-cause hypothesis

A GQA-7:1-specific layout-vs-reshape ordering bug on K and/or V projections such that CPU forward and GPU forward consume the same physical bytes with different effective head-axis interpretations.

Consistent with:
- cosine=−0.005 (structural, not noisy)
- 370M training working
- the cross-format finding
- SafeTensors↔APR weight-byte parity (cos≥0.9999999 across 339 tensors via SHIP-003 PR #1059)
- GGUF↔APR shape asymmetry (down_proj GGUF=[18944, 3584] vs APR=[3584, 18944], expected per LAYOUT-001/002)

Falsifiable next investigation step

Single-tensor Q × K^T element-by-element comparison on model.layers.0.self_attn.k_proj.weight from the row-major-guaranteed APR (SHIP-003 PR #1059), CPU vs GPU. If outputs match → bug is downstream of K projection. If not → divergent stage localized to K projection. Iterate through V, attention scores, weights, output, o_proj.
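
The stage-by-stage localization described above can be sketched as follows: compare CPU and GPU outputs elementwise under a tolerance, and report the first stage whose outputs diverge. All names here (`Divergence`, `first_divergence`, `first_divergent_stage`) are hypothetical, and the stage buffers are assumed to have been captured already; this is not the `apr trace` or `apr diff` implementation.

```rust
/// Location and values of the first elementwise mismatch.
#[derive(Debug, PartialEq)]
struct Divergence {
    index: usize,
    cpu: f32,
    gpu: f32,
}

/// First index where |cpu - gpu| exceeds `tol`. A NaN on either side counts as a mismatch.
fn first_divergence(cpu: &[f32], gpu: &[f32], tol: f32) -> Option<Divergence> {
    assert_eq!(cpu.len(), gpu.len(), "stage outputs must have equal length");
    cpu.iter().zip(gpu).enumerate().find_map(|(index, (&c, &g))| {
        let diff = (c - g).abs();
        if diff > tol || diff.is_nan() {
            Some(Divergence { index, cpu: c, gpu: g })
        } else {
            None
        }
    })
}

/// Walk stages in forward order and name the first one whose outputs diverge.
/// Each entry is (stage name, CPU output, GPU output).
fn first_divergent_stage<'a>(
    stages: &[(&'a str, &[f32], &[f32])],
    tol: f32,
) -> Option<(&'a str, Divergence)> {
    stages
        .iter()
        .find_map(|&(name, cpu, gpu)| first_divergence(cpu, gpu, tol).map(|d| (name, d)))
}
```

Feeding the stages in forward order (K projection, V projection, attention scores, and so on) means the first hit names the earliest divergent stage; everything upstream of it matched within tolerance.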

Per feedback_apr_trace_not_eprintln.md: extend TraceStep enum with intra-attention/intra-FFN variants behind a contract, add --device cpu|gpu to apr trace, then apr diff cpu_trace.json gpu_trace.json --values. No raw eprintln!.

Blast radius — single fix discharges 5 MODEL-1 PARTIALs

| Row | Falsification | Blocked path |
| --- | --- | --- |
| SHIP-002 | Python syntax | `apr run` |
| SHIP-005 | HumanEval ≥86% | `apr eval` |
| SHIP-006 | apr qa 8 gates | `apr qa` (format_parity / ollama_parity / ptx_parity) |
| SHIP-007 | decode tps ≥30 | `apr bench` (parity-gate itself) |
| SHIP-008 | Chat template render | `apr run` |

Highest-leverage MODEL-1 work item remaining.

Side-bug noted

apr diff --transpose-aware returns cos=0.0003 on transposed shapes despite the flag — the help text claims "Account for transpose when comparing (GGUF col-major vs APR row-major)" but the cosine computation doesn't appear to apply the transpose. Filed as a separate apr-cli ticket; does NOT affect SHIP-007 root-cause analysis.
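
As an illustration of what the help text appears to promise, a transpose-aware comparison would transpose one operand before computing the cosine whenever the shapes are `[a, b]` vs `[b, a]`. The sketch below uses hypothetical function names and assumes row-major buffers on both sides; it is not the `apr diff` code.

```rust
/// Transpose a row-major [rows, cols] buffer into a row-major [cols, rows] buffer.
fn transpose(buf: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(buf.len(), rows * cols);
    let mut out = vec![0.0; buf.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = buf[r * cols + c];
        }
    }
    out
}

/// Cosine similarity accumulated in f64; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na.sqrt() * nb.sqrt()) }
}

/// Compare directly when shapes match; transpose `b` first when shapes are swapped.
/// Returns None for incompatible shapes. Square matrices are compared without a
/// transpose, because shape alone cannot tell the two layouts apart.
fn transpose_aware_cosine(
    a: &[f32],
    a_shape: (usize, usize),
    b: &[f32],
    b_shape: (usize, usize),
) -> Option<f64> {
    if a_shape == b_shape {
        Some(cosine(a, b))
    } else if a_shape == (b_shape.1, b_shape.0) {
        let bt = transpose(b, b_shape.0, b_shape.1);
        Some(cosine(a, &bt))
    } else {
        None
    }
}
```

With a 2×3 matrix and its transposed buffer, the naive cosine over the raw buffers is below 1 while the transpose-aware version is 1 up to rounding, which is the gap the side-bug describes.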

Methodological note

Entire investigation conducted using only apr CLI tooling (apr diff, apr qa, apr bench, apr inspect). Zero eprintln! injected into forward.rs / ffn_block.rs / CUDA kernels. Honors feedback_apr_trace_not_eprintln.md.

Files changed

File	Change
`docs/specifications/aprender-train/ship-two-models-spec.md`	v2.58.0 → v2.59.0; new §15 (7 subsections); atomic-next-action banner updated

🤖 Generated with Claude Code

….0 → v2.59.0

Records the SHIP-007 GQA-7:1 parity bug investigation thread captured during the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and updates the atomic-next-action banner to reference it.

No discharge promotion (coverage tally unchanged at 33+12). This is investigation-recording, not rule promotion.

What §15 contains:

15.1 Surface symptoms: the two independent observations:
- apr bench parity gate: CPU argmax 334 vs GPU argmax 8127, cosine=−0.005 (effectively uncorrelated, structural divergence)
- apr qa --json on GGUF: format_parity reports GGUF argmax=17 != SafeTensors argmax=59260

370M MODEL-2 training works on the same RTX 4090, so the bug is GQA-7:1-specific, not GPU-host-wide.

15.2 Five Whys: traces the surface symptom from the parity-gate failure down to the load-bearing edge case: GQA's num_heads ≠ num_kv_heads makes layout-then-reshape order non-commutative, while MHA (num_heads = num_kv_heads, where 370M training lives) is invariant under either order.

15.3 Root-cause hypothesis: a GQA-7:1-specific layout-vs-reshape ordering bug on K and/or V projections that causes CPU and GPU forward to consume the same physical bytes with different effective head-axis interpretations, compounding through 28 transformer blocks into effectively uncorrelated logits.

15.4 Falsifiable next investigation step: a single-tensor Q × K^T element-by-element comparison on model.layers.0.self_attn.k_proj.weight from the row-major-guaranteed APR (SHIP-003 PR #1059), then iterate through V, attention scores, weights, output, and o_proj until the divergent stage is named. Per feedback_apr_trace_not_eprintln.md, this is the proper TraceStep-extension path, not eprintln!.

15.5 Side-bug noted: apr diff --transpose-aware appears not to apply the transpose before cosine computation when shapes are [a,b] vs [b,a]. Filed as a separate apr-cli ticket. Does not affect SHIP-007 root-cause analysis: SafeTensors↔APR same-shape comparison via SHIP-003 #1059 confirmed weight-byte parity at cos≥0.9999999.

15.6 Blast radius inventory: the remaining 5 MODEL-1 PARTIALs (SHIP-002 / 005 / 006 / 007 / 008) all transitively block on this single fix. A single root-cause fix discharges all 5 simultaneously, making it the highest-leverage MODEL-1 work item remaining.

15.7 Methodological note: entire investigation conducted using only apr CLI tooling (apr diff, apr qa, apr bench, apr inspect). Zero eprintln! injected into forward.rs / ffn_block.rs / CUDA kernels. Honors feedback_apr_trace_not_eprintln.md.

Evidence chain:
1. apr diff --values --limit 339 (post-#1058 mmap fix, 192s on 15GB safetensors / 8GB APR pair): SafeTensors↔APR cos≥0.9999999.
2. apr diff --values --limit 3 on GGUF↔APR, SafeTensors↔GGUF: revealed shape asymmetry, GGUF [in,out] vs APR/SafeTensors [out,in].
3. apr qa --json on both APR and GGUF teachers: revealed cross-format argmax divergence.
4. SHIP-007 GPU parity gate's existing telemetry: confirmed structural divergence.

Methodological consistency with the 6-PR cascade preceding this amendment: pure stack tooling, contract-backed numbers, drift-prevention pattern.

This commit is documentation only (no Rust changes, no contract changes), but it pins the investigation thread durably in the spec where future investigators (and the next multi-PR TraceStep extension effort) will find it.

Spec v2.58.0 → v2.59.0. Atomic-next-action banner updated to point at §15 as the load-bearing investigation surface.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…sifier

The §15.4 falsifier from spec v2.59.0 (PR #1060): a CPU vs GPU incremental_attention_gpu kernel parity test on the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584).

The peer test `gqa_attention_parity.rs` covers TinyLlama's GQA-8:1 (NUM_HEADS=32, head_dim=64). That test passes on RTX 4090 but doesn't exercise the 7:1 ratio (non-power-of-2 q_per_kv = 7) that the SHIP-007-blocking 7B teacher uses.

Three tests added to `crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs`:

1. `ship_007_qwen2_gqa_7_1_head_mapping_property`: pure arithmetic check on the GQA-7:1 head mapping (q_head/q_per_kv) for all 28 q_heads. Verifies that the kernel formula `(q_head * NUM_KV_HEADS) / NUM_HEADS` produces the same mapping as q_head/q_per_kv for the canonical 28:4 ratio.
2. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token` (#[ignore]): first-token case (no cache). For attention over a single K/V position, softmax([single_score]) = [1.0], so output = current_v expanded from 4 KV heads to 28 Q heads. CPU reference uses the mirror of `cpu_gqa_attention` from the peer test parameterized at the Qwen shape. Tolerance: 1e-4 elementwise across 3584 outputs.
3. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token` (#[ignore]): second-token case with one populated K/V cache position. Tests the full attention mechanism with KV cache state. Tolerance: 1e-3 elementwise (slightly looser to accommodate cumulative FP rounding over the 2-position softmax + weighted-sum).

Result on noah-Lambda-Vector RTX 4090 (CUDA compute capability 8.9):

    test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token ... ok
    test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
    test ship_007_qwen2_gqa_7_1_head_mapping_property ... ok

All three pass. **The GQA-7:1 incremental_attention_gpu kernel is NOT the SHIP-007 root cause.** CPU and GPU outputs agree within FP rounding tolerance for the canonical Qwen2.5-Coder-7B shape on synthetic inputs.

Materially narrows §15.4's surviving suspect list:
- ✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass)
- ✅ Q × K^T per-head correct
- ✅ Softmax-weighted V aggregation correct
- ✅ Scale factor (1/√head_dim) at head_dim=128 correct
- ✅ KV cache state-management correct
- 🟡 Surviving suspects: Q/K/V projection matmul (BEFORE attention), o_proj (AFTER attention), RMSNorm, FFN, LM head, multi-layer KV cache layout, residual stream propagation.

These tests serve as a durable regression guard for the GQA-7:1 attention kernel proper: any future refactor of incremental attention that breaks 7:1-specific behavior will flip them red on `cargo test --features cuda --release -- --ignored`.

Spec §15.4 (PR #1060) anticipated this test and named the proper follow-up: a single-tensor matmul parity test on Q/K/V projection weights from the row-major-correct APR (sha256 a394dd28...0ddeb28, verified by SHIP-003 PR #1059).

Verification:

    cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity \
      --features cuda --release -- --ignored
    → 3 passed; 0 failed; 0 ignored

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
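
The head-mapping property and the first-token shortcut described in this commit message can be illustrated with a small sketch. The function names and the `[q_head][head_dim]` output layout are assumptions made for illustration; this does not reproduce the contents of `qwen2_gqa_7_1_attention_parity.rs`.

```rust
const NUM_HEADS: usize = 28;
const NUM_KV_HEADS: usize = 4;
const HEAD_DIM: usize = 128;

/// KV head for a Q head via integer division by q_per_kv (= 7 for 28:4).
fn kv_head_by_division(q_head: usize) -> usize {
    q_head / (NUM_HEADS / NUM_KV_HEADS)
}

/// KV head for a Q head via the scaled form quoted in the commit message.
fn kv_head_by_scaling(q_head: usize) -> usize {
    (q_head * NUM_KV_HEADS) / NUM_HEADS
}

/// First-token attention output: softmax over a single score is [1.0],
/// so each Q head simply copies the V vector of its mapped KV head.
/// Input layout: [kv_head][head_dim]; output layout: [q_head][head_dim].
fn expand_v_first_token(v: &[f32]) -> Vec<f32> {
    assert_eq!(v.len(), NUM_KV_HEADS * HEAD_DIM);
    let mut out = Vec::with_capacity(NUM_HEADS * HEAD_DIM);
    for q_head in 0..NUM_HEADS {
        let kv = kv_head_by_division(q_head);
        out.extend_from_slice(&v[kv * HEAD_DIM..(kv + 1) * HEAD_DIM]);
    }
    out
}
```

Because 28 is an exact multiple of 4, both formulas reduce to `q_head / 7` under integer division, so Q heads 0..=6 map to KV head 0, 7..=13 to KV head 1, and so on.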


noahgift enabled auto-merge (squash) April 25, 2026 20:08

noahgift mentioned this pull request Apr 25, 2026

test(ship-007): Qwen2.5-Coder-7B GQA-7:1 CPU/GPU attention parity falsifier — kernel ruled out as root cause #1061

Merged

4 tasks

noahgift merged commit 6fe95c4 into main Apr 25, 2026
11 checks passed

noahgift deleted the docs/ship-007-five-whys-root-cause branch April 25, 2026 20:34
