test(ship-007): Qwen2.5-Coder-7B GQA-7:1 CPU/GPU attention parity falsifier — kernel ruled out as root cause by noahgift · Pull Request #1061 · paiml/aprender

noahgift · 2026-04-25T20:21:39Z

Summary

SHIP-007 §15.4 falsifier from spec v2.59.0 PR #1060. Adds CPU vs GPU GQA-7:1 attention parity tests on the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584) — a non-power-of-2 ratio (q_per_kv=7) that the existing TinyLlama 8:1 test (gqa_attention_parity.rs) does not exercise.

Result on noah-Lambda-Vector RTX 4090 (CUDA compute capability 8.9):

test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token  ... ok
test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
test ship_007_qwen2_gqa_7_1_head_mapping_property       ... ok

test result: ok. 3 passed; 0 failed; 0 ignored;

Material narrowing of SHIP-007 root-cause search

The GQA-7:1 incremental_attention_gpu kernel is NOT the SHIP-007 root cause. CPU and GPU outputs agree elementwise within FP rounding tolerance (1e-4 for the first-token case, 1e-3 for the second-token case) for the canonical 28:4:128:3584 shape on synthetic inputs.
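
As a minimal sketch of what "within FP rounding tolerance" means for a parity check like this one, the comparison can be written as an elementwise absolute-difference test. The helper names `max_abs_diff` and `within_tolerance` are hypothetical and are not taken from the PR's test file.

```rust
/// Largest elementwise absolute difference between two slices.
/// Returns None when the lengths differ, so a shape bug is not masked.
/// NaN differences are skipped by `f32::max`; use `within_tolerance` for
/// the pass/fail decision.
/// (Hypothetical helper; not the code in the PR.)
pub fn max_abs_diff(cpu: &[f32], gpu: &[f32]) -> Option<f32> {
    if cpu.len() != gpu.len() {
        return None;
    }
    Some(
        cpu.iter()
            .zip(gpu)
            .map(|(a, b)| (a - b).abs())
            .fold(0.0_f32, f32::max),
    )
}

/// True only when both slices have the same length and every element pair
/// differs by at most `tol`. A NaN on either side fails the check because
/// `NaN <= tol` is false.
pub fn within_tolerance(cpu: &[f32], gpu: &[f32], tol: f32) -> bool {
    cpu.len() == gpu.len()
        && cpu.iter().zip(gpu).all(|(a, b)| (a - b).abs() <= tol)
}
```

With the tolerances stated in this PR, a first-token check would call `within_tolerance(&cpu_out, &gpu_out, 1e-4)` and a second-token check would use `1e-3`. Reporting `max_abs_diff` alongside a failure makes it easier to tell rounding drift from a real divergence.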

Eliminated:

✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass)
✅ Q × K^T per-head correct
✅ Softmax-weighted V aggregation correct
✅ Scale factor (1/√head_dim) at head_dim=128 correct
✅ KV cache state-management correct (second-token populated cache works)

Surviving suspects per §15.4:

🟡 Q/K/V projection matmul (BEFORE attention)
🟡 o_proj (AFTER attention)
🟡 RMSNorm before/after attention or FFN
🟡 FFN (gate/up/down + swiglu)
🟡 LM head projection
🟡 Multi-layer KV cache layout (not state)
🟡 Residual stream propagation across 28 blocks
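
The eliminated items above correspond to the steps of a single-query GQA attention pass: head mapping, scaled Q × K^T, softmax, and the weighted sum over V. A minimal CPU reference for those steps might look like the sketch below. It is not the repository's `cpu_gqa_attention`; the function name, the argument list, and the position-major cache layout are assumptions made for illustration.

```rust
/// Single-query (incremental) GQA attention on the CPU.
/// Assumed layouts (hypothetical, not confirmed by the PR):
///   q        : [num_heads * head_dim]
///   k_cache  : [seq_len * num_kv_heads * head_dim], position-major,
///              with the current token's K already appended
///   v_cache  : same layout as k_cache
pub fn cpu_gqa_attention_sketch(
    q: &[f32],
    k_cache: &[f32],
    v_cache: &[f32],
    num_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> Vec<f32> {
    let kv_dim = num_kv_heads * head_dim;
    let seq_len = k_cache.len() / kv_dim;
    let scale = 1.0 / (head_dim as f32).sqrt();
    let mut out = vec![0.0_f32; num_heads * head_dim];

    for h in 0..num_heads {
        // Head mapping: several Q heads share one KV head.
        let kv_h = (h * num_kv_heads) / num_heads;
        let q_h = &q[h * head_dim..(h + 1) * head_dim];

        // Scaled dot product of this Q head with every cached K position.
        let mut scores: Vec<f32> = (0..seq_len)
            .map(|t| {
                let k = &k_cache[t * kv_dim + kv_h * head_dim..][..head_dim];
                q_h.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale
            })
            .collect();

        // Numerically stable softmax: subtract the max before exponentiating.
        let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let mut denom = 0.0_f32;
        for s in scores.iter_mut() {
            *s = (*s - max).exp();
            denom += *s;
        }

        // Softmax-weighted sum over the cached V positions.
        let out_h = &mut out[h * head_dim..(h + 1) * head_dim];
        for (t, s) in scores.iter().enumerate() {
            let w = s / denom;
            let v = &v_cache[t * kv_dim + kv_h * head_dim..][..head_dim];
            for (o, x) in out_h.iter_mut().zip(v) {
                *o += w * x;
            }
        }
    }
    out
}
```

With a single cached position the softmax weight is exactly 1.0, so each Q head's output equals the V slice of its mapped KV head. This is the property the first-token test relies on.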

Three tests added

ship_007_qwen2_gqa_7_1_head_mapping_property — pure arithmetic check on (q_head * NUM_KV_HEADS) / NUM_HEADS = q_head / q_per_kv for all 28 q_heads (always-on, no GPU required).
ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token (#[ignore]) — first-token case, no cache, tolerance 1e-4.
ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token (#[ignore]) — second-token case with 1-pos cache, tolerance 1e-3.

#[ignore] mirrors peer GQA test pattern (run via --ignored on hosts with CUDA).
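
The head-mapping property can be sketched in a few lines. Both expressions use integer (floor) division, and because NUM_HEADS = NUM_KV_HEADS × q_per_kv, the factor NUM_KV_HEADS cancels and the two formulas agree for every q_head. The function names below are hypothetical; only the constants and the two formulas come from the PR description.

```rust
const NUM_HEADS: usize = 28;
const NUM_KV_HEADS: usize = 4;
const Q_PER_KV: usize = NUM_HEADS / NUM_KV_HEADS; // 7 for the Qwen shape

/// Kernel-style mapping quoted in the PR: (q_head * NUM_KV_HEADS) / NUM_HEADS.
pub fn kv_head_kernel_formula(q_head: usize) -> usize {
    (q_head * NUM_KV_HEADS) / NUM_HEADS
}

/// Grouped mapping: q_head / q_per_kv.
pub fn kv_head_grouped(q_head: usize) -> usize {
    q_head / Q_PER_KV
}

/// True when both formulas pick the same KV head for all 28 Q heads.
pub fn head_mapping_agrees() -> bool {
    (0..NUM_HEADS).all(|h| kv_head_kernel_formula(h) == kv_head_grouped(h))
}
```

For example, Q heads 0 through 6 map to KV head 0, Q head 7 is the first to map to KV head 1, and Q head 27 maps to KV head 3.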

Test plan

cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity --features cuda --release -- --ignored — 3/3 pass on noah-Lambda-Vector RTX 4090
head_mapping_property runs in default test set (not ignored)
CI workspace-test green (auto)
ci / gate green (auto)

Spec reference

This test is the §15.4 falsifier from docs/specifications/aprender-train/ship-two-models-spec.md (PR #1060 spec v2.58.0 → v2.59.0). When that spec amendment lands, a follow-up PR will update §15.4 with the test result and §15.5 with the new "next investigation step" (Q/K/V projection matmul parity, since attention itself is now ruled out).

Files changed

File	Change
`crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs`	NEW — 3 tests, ~370 lines

Methodology

Pure stack tooling — exercises existing realizar::cuda::CudaExecutor::incremental_attention_gpu and a CPU reference fn (mirror of peer gqa_attention_parity.rs pattern). No eprintln!, no bash workaround, no parallel implementation. Honors feedback_apr_trace_not_eprintln.md.

🤖 Generated with Claude Code

…sifier

The §15.4 falsifier from spec v2.59.0 (PR #1060): a CPU vs GPU incremental_attention_gpu kernel parity test on the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584).

The peer test `gqa_attention_parity.rs` covers TinyLlama's GQA-8:1 (NUM_HEADS=32, head_dim=64) — that test passes on RTX 4090 but doesn't exercise the 7:1 ratio (non-power-of-2 q_per_kv = 7) the SHIP-007-blocking 7B teacher specifically uses.

Three tests added to `crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs`:

1. `ship_007_qwen2_gqa_7_1_head_mapping_property` — pure arithmetic check on the GQA-7:1 head mapping (q_head/q_per_kv) for all 28 q_heads. Verifies the kernel formula `(q_head * NUM_KV_HEADS) / NUM_HEADS` produces identical mapping for the canonical 28:4 ratio.
2. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token` (#[ignore]) — first-token case (no cache). For attention over a single K/V position, softmax([single_score]) = [1.0] so output = current_v expanded across 4 KV heads to 28 Q heads. CPU reference uses the mirror of `cpu_gqa_attention` from the peer test parameterized at the Qwen shape. Tolerance: 1e-4 elementwise across 3584 outputs.
3. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token` (#[ignore]) — second-token case with one populated K/V cache position. Tests the full attention mechanism with KV cache state. Tolerance: 1e-3 elementwise (slightly looser to accommodate cumulative FP rounding over the 2-position softmax + weighted-sum).

Result on noah-Lambda-Vector RTX 4090 (CUDA 8.9):

test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token ... ok
test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
test ship_007_qwen2_gqa_7_1_head_mapping_property ... ok

All three pass. **The GQA-7:1 incremental_attention_gpu kernel is NOT the SHIP-007 root cause.** CPU and GPU outputs are bit-equivalent (within FP rounding tolerance) for the canonical Qwen2.5-Coder-7B shape on synthetic inputs.

Materially narrows §15.4's surviving suspect list:

- ✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass)
- ✅ Q × K^T per-head correct
- ✅ Softmax-weighted V aggregation correct
- ✅ Scale factor (1/√head_dim) at head_dim=128 correct
- ✅ KV cache state-management correct
- 🟡 Surviving suspects: Q/K/V projection matmul (BEFORE attention), o_proj (AFTER attention), RMSNorm, FFN, LM head, multi-layer KV cache layout, residual stream propagation.

This test serves as a durable regression guard against the GQA-7:1 attention kernel proper — any future refactor of incremental attention that breaks 7:1-specific behavior will flip these tests red on `cargo test --features cuda --release -- --ignored`.

Spec §15.4 (PR #1060) anticipated this test and named the proper follow-up: a single-tensor matmul parity test on Q/K/V projection weights from the row-major-correct APR (sha256 a394dd28...0ddeb28, verified by SHIP-003 PR #1059).

Verification:

cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity \
  --features cuda --release -- --ignored
→ 3 passed; 0 failed; 0 ignored

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…led out (spec v2.59.0 → v2.60.0) (#1062)

Updates spec §15 with the result of the §15.4 falsifier test (PR #1061): three CPU vs GPU GQA parity tests on the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584) all PASS on noah-Lambda-Vector RTX 4090.

Result documented in §15.4 (now titled "Falsifier Run + RESULT"):

test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token ... ok
test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
test ship_007_qwen2_gqa_7_1_head_mapping_property ... ok
test result: ok. 3 passed; 0 failed; 0 ignored;

This conclusively rules out the GQA-7:1 incremental_attention_gpu kernel as the SHIP-007 root cause.

Eliminated suspects:

- Q/K/V head-mapping arithmetic (TinyLlama 8:1 + Qwen 7:1 both pass)
- Q × K^T per-head correctness
- Softmax-weighted V aggregation
- Scale factor 1/√head_dim at head_dim=128
- Per-head accumulation across 28 Q heads / 4 KV heads
- Single-position KV cache state-management

Surviving SHIP-007 root-cause candidates (per new §15.5):

- Q/K/V projection matmul (BEFORE attention) ← next falsifier target
- o_proj (AFTER attention)
- RMSNorm before/after attention or FFN
- FFN (gate/up/down + swiglu)
- LM head projection
- Multi-layer KV cache *layout* (across-layer indexing)
- Layer composition / residual stream propagation

Section 15 renumbering:

- §15.4 — Falsifier Run + RESULT (was: planned test)
- §15.5 — Next Investigation Step (was: §15.4 footer; now a full subsection naming Q/K/V projection matmul as the target)
- §15.6 — Side-Bug Surfaced During Investigation (was: §15.5)
- §15.7 — Blast Radius Inventory (was: §15.6)
- §15.8 — Methodological Note (was: §15.7)

Spec v2.59.0 → v2.60.0. No coverage tally change (no new discharge); this is investigation-result recording. The remaining 5 MODEL-1 PARTIALs still transitively block on the eventual SHIP-007 fix, but the root-cause search has been materially narrowed.

The §15.4 attention parity test (PR #1061) is now a durable regression guard against the GQA-7:1 attention kernel proper — any future refactor that breaks 7:1-specific behavior flips these tests red.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…cause — spec v2.59.0 → v2.61.0 (#1063)

Live `apr trace --payload` on the canonical paiml/qwen2.5-coder-7b-apache-q4k-v1 teacher (noah-Lambda-Vector RTX 4090, 2026-04-26) ran twice on CPU with the same prompt "What is 2+2?", same encoded tokens [3838, 374, 220, 17, 10, 17, 30], same embedded BPE tokenizer:

APR teacher → top-1 token=220 (" "), logit=16.7368 ← WRONG
GGUF teacher → " 2+2 is 4." ← CORRECT

Combined with §15.4 (PR #1061 — GPU GQA-7:1 attention parity tests all PASS), this eliminates: GPU stack, GQA attention kernel, tokenizer, loader-side data layout, Q4K dequantization, RMSNorm, embedding lookup.

Surviving suspects are all in the APR-format CPU forward path:

- Layer-composition glue in forward_single_with_scratch
- Multi-layer KV cache layout (across-layer indexing)
- Position embedding (RoPE) layout / sin/cos cache
- LM head projection

§16.4 specifies the falsifiable next investigation step: `apr trace --payload --layer 0` bisection across 28 layers. 1-2 sessions task, not multi-PR. Whatever fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs (SHIP-002/005/006/007/008) per §15.7's blast-radius inventory.

Spec v2.59.0 → v2.61.0 (jumps v2.60.0; reserved for #1062 conflict-merge). No coverage tally change — investigation-recording amendment, not rule promotion.

Methodological continuation per feedback_apr_trace_not_eprintln.md: zero eprintln! added, exact same `apr trace --payload` primitive used in §15.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
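
The layer bisection named in the commit message above can be sketched as a binary search for the first layer whose output diverges between the APR and GGUF paths. This is only an illustration under stated assumptions: `first_divergent_layer` and its `diverges` callback are hypothetical, the callback stands in for whatever per-layer comparison the trace output supports, and the search assumes that once one layer diverges every later layer does too.

```rust
/// Index of the first layer in 0..num_layers for which `diverges` returns
/// true, or None if no layer diverges.
/// Assumes divergence is monotonic: if layer i diverges, so does every
/// layer after it. (Hypothetical helper; not part of `apr`.)
pub fn first_divergent_layer(
    num_layers: usize,
    mut diverges: impl FnMut(usize) -> bool,
) -> Option<usize> {
    let (mut lo, mut hi) = (0, num_layers);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if diverges(mid) {
            // The first bad layer is at mid or earlier.
            hi = mid;
        } else {
            // Layers up to and including mid still match.
            lo = mid + 1;
        }
    }
    (lo < num_layers).then_some(lo)
}
```

For 28 layers the loop needs at most five comparisons, which is why a bisection is cheaper than inspecting every layer in order.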

noahgift enabled auto-merge (squash) April 25, 2026 20:21

noahgift force-pushed the test/qwen2-gqa-7-1-cpu-gpu-parity branch from 79e67cf to 4cf3330 Compare April 26, 2026 06:10

This was referenced Apr 26, 2026

docs(ship-007): §15.4 falsifier RESULT — attention kernel ruled out as root cause (spec v2.59.0 → v2.60.0) #1062

Merged

docs(ship-007): §16 APR forward CPU path isolated as root cause #1063

Merged

noahgift merged commit 05dbe6a into main Apr 26, 2026
10 checks passed

noahgift deleted the test/qwen2-gqa-7-1-cpu-gpu-parity branch April 26, 2026 06:51
