Skip to content

test(ship-007): Qwen2.5-Coder-7B GQA-7:1 CPU/GPU attention parity falsifier — kernel ruled out as root cause#1061

Merged
noahgift merged 1 commit into
mainfrom
test/qwen2-gqa-7-1-cpu-gpu-parity
Apr 26, 2026
Merged

test(ship-007): Qwen2.5-Coder-7B GQA-7:1 CPU/GPU attention parity falsifier — kernel ruled out as root cause#1061
noahgift merged 1 commit into
mainfrom
test/qwen2-gqa-7-1-cpu-gpu-parity

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

SHIP-007 §15.4 falsifier from spec v2.59.0 PR #1060. Adds CPU vs GPU GQA-7:1 attention parity tests on the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584) — a non-power-of-2 ratio (q_per_kv=7) that the existing TinyLlama 8:1 test (gqa_attention_parity.rs) does not exercise.

Result on noah-Lambda-Vector RTX 4090 (CUDA 8.9):

test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token  ... ok
test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
test ship_007_qwen2_gqa_7_1_head_mapping_property       ... ok

test result: ok. 3 passed; 0 failed; 0 ignored;

Material narrowing of SHIP-007 root-cause search

The GQA-7:1 incremental_attention_gpu kernel is NOT the SHIP-007 root cause. CPU and GPU outputs are bit-equivalent (within FP rounding tolerance) for the canonical 28:4:128:3584 shape on synthetic inputs.

Eliminated:

  • ✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass)
  • ✅ Q × K^T per-head correct
  • ✅ Softmax-weighted V aggregation correct
  • ✅ Scale factor (1/√head_dim) at head_dim=128 correct
  • ✅ KV cache state-management correct (second-token populated cache works)

Surviving suspects per §15.4:

  • 🟡 Q/K/V projection matmul (BEFORE attention)
  • 🟡 o_proj (AFTER attention)
  • 🟡 RMSNorm before/after attention or FFN
  • 🟡 FFN (gate/up/down + swiglu)
  • 🟡 LM head projection
  • 🟡 Multi-layer KV cache layout (not state)
  • 🟡 Residual stream propagation across 28 blocks

Three tests added

  1. ship_007_qwen2_gqa_7_1_head_mapping_property — pure arithmetic check on (q_head * NUM_KV_HEADS) / NUM_HEADS = q_head / q_per_kv for all 28 q_heads (always-on, no GPU required).
  2. ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token (#[ignore]) — first-token case, no cache, tolerance 1e-4.
  3. ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token (#[ignore]) — second-token case with 1-pos cache, tolerance 1e-3.

#[ignore] mirrors peer GQA test pattern (run via --ignored on hosts with CUDA).

Test plan

  • cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity --features cuda --release -- --ignored — 3/3 pass on noah-Lambda-Vector RTX 4090
  • head_mapping_property runs in default test set (not ignored)
  • CI workspace-test green (auto)
  • ci / gate green (auto)

Spec reference

This test is the §15.4 falsifier from docs/specifications/aprender-train/ship-two-models-spec.md (PR #1060 spec v2.58.0 → v2.59.0). When that spec amendment lands, a follow-up PR will update §15.4 with the test result and §15.5 with the new "next investigation step" (Q/K/V projection matmul parity, since attention itself is now ruled out).

Files changed

File Change
crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs NEW — 3 tests, ~370 lines

Methodology

Pure stack tooling — exercises existing realizar::cuda::CudaExecutor::incremental_attention_gpu and a CPU reference fn (mirror of peer gqa_attention_parity.rs pattern). No eprintln!, no bash workaround, no parallel implementation. Honors feedback_apr_trace_not_eprintln.md.

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 25, 2026 20:21
…sifier

The §15.4 falsifier from spec v2.59.0 (PR #1060): a CPU vs GPU
incremental_attention_gpu kernel parity test on the canonical
Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128,
HIDDEN=3584). The peer test `gqa_attention_parity.rs` covers TinyLlama's
GQA-8:1 (NUM_HEADS=32, head_dim=64) — that test passes on RTX 4090 but
doesn't exercise the 7:1 ratio (non-power-of-2 q_per_kv = 7) the
SHIP-007-blocking 7B teacher specifically uses.

Three tests added to
`crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs`:

1. `ship_007_qwen2_gqa_7_1_head_mapping_property` — pure arithmetic
   check on the GQA-7:1 head mapping (q_head/q_per_kv) for all 28 q_heads.
   Verifies the kernel formula `(q_head * NUM_KV_HEADS) / NUM_HEADS`
   produces identical mapping for the canonical 28:4 ratio.

2. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token` (#[ignore]) —
   first-token case (no cache). For attention over a single K/V
   position, softmax([single_score]) = [1.0] so output = current_v
   expanded across 4 KV heads to 28 Q heads. CPU reference uses the
   mirror of `cpu_gqa_attention` from the peer test parameterized at
   the Qwen shape. Tolerance: 1e-4 elementwise across 3584 outputs.

3. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token` (#[ignore]) —
   second-token case with one populated K/V cache position. Tests the
   full attention mechanism with KV cache state. Tolerance: 1e-3
   elementwise (slightly looser to accommodate cumulative FP rounding
   over the 2-position softmax + weighted-sum).

Result on noah-Lambda-Vector RTX 4090 (CUDA 8.9):

  test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token  ... ok
  test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
  test ship_007_qwen2_gqa_7_1_head_mapping_property       ... ok

All three pass. **The GQA-7:1 incremental_attention_gpu kernel is NOT
the SHIP-007 root cause.** CPU and GPU outputs are bit-equivalent
(within FP rounding tolerance) for the canonical Qwen2.5-Coder-7B
shape on synthetic inputs.

Materially narrows §15.4's surviving suspect list:
- ✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass)
- ✅ Q × K^T per-head correct
- ✅ Softmax-weighted V aggregation correct
- ✅ Scale factor (1/√head_dim) at head_dim=128 correct
- ✅ KV cache state-management correct
- 🟡 Surviving suspects: Q/K/V projection matmul (BEFORE attention),
  o_proj (AFTER attention), RMSNorm, FFN, LM head, multi-layer KV
  cache layout, residual stream propagation.

This test serves as a durable regression guard against the GQA-7:1
attention kernel proper — any future refactor of incremental
attention that breaks 7:1-specific behavior will flip these tests
red on `cargo test --features cuda --release -- --ignored`.

Spec §15.4 (PR #1060) anticipated this test and named the proper
follow-up: a single-tensor matmul parity test on Q/K/V projection
weights from the row-major-correct APR (sha256 a394dd28...0ddeb28,
verified by SHIP-003 PR #1059).

Verification:
  cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity \
    --features cuda --release -- --ignored
  → 3 passed; 0 failed; 0 ignored

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the test/qwen2-gqa-7-1-cpu-gpu-parity branch from 79e67cf to 4cf3330 Compare April 26, 2026 06:10
@noahgift noahgift merged commit 05dbe6a into main Apr 26, 2026
10 checks passed
@noahgift noahgift deleted the test/qwen2-gqa-7-1-cpu-gpu-parity branch April 26, 2026 06:51
noahgift added a commit that referenced this pull request Apr 26, 2026
…led out (spec v2.59.0 → v2.60.0) (#1062)

Updates spec §15 with the result of the §15.4 falsifier test
(PR #1061): three CPU vs GPU GQA parity tests on the canonical
Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128,
HIDDEN=3584) all PASS on noah-Lambda-Vector RTX 4090.

Result documented in §15.4 (now titled "Falsifier Run + RESULT"):

  test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token  ... ok
  test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
  test ship_007_qwen2_gqa_7_1_head_mapping_property       ... ok

  test result: ok. 3 passed; 0 failed; 0 ignored;

This conclusively rules out the GQA-7:1 incremental_attention_gpu
kernel as the SHIP-007 root cause. Eliminated suspects:
  - Q/K/V head-mapping arithmetic (TinyLlama 8:1 + Qwen 7:1 both pass)
  - Q × K^T per-head correctness
  - Softmax-weighted V aggregation
  - Scale factor 1/√head_dim at head_dim=128
  - Per-head accumulation across 28 Q heads / 4 KV heads
  - Single-position KV cache state-management

Surviving SHIP-007 root-cause candidates (per new §15.5):
  - Q/K/V projection matmul (BEFORE attention) ← next falsifier target
  - o_proj (AFTER attention)
  - RMSNorm before/after attention or FFN
  - FFN (gate/up/down + swiglu)
  - LM head projection
  - Multi-layer KV cache *layout* (across-layer indexing)
  - Layer composition / residual stream propagation

Section 15 renumbering:
  §15.4 — Falsifier Run + RESULT (was: planned test)
  §15.5 — Next Investigation Step (was: §15.4 footer; now a full
          subsection naming Q/K/V projection matmul as the target)
  §15.6 — Side-Bug Surfaced During Investigation (was: §15.5)
  §15.7 — Blast Radius Inventory (was: §15.6)
  §15.8 — Methodological Note (was: §15.7)

Spec v2.59.0 → v2.60.0. No coverage tally change (no new discharge);
this is investigation-result recording. The remaining 5 MODEL-1
PARTIALs still transitively block on the eventual SHIP-007 fix, but
the root-cause search has been materially narrowed.

The §15.4 attention parity test (PR #1061) is now a durable
regression guard against the GQA-7:1 attention kernel proper — any
future refactor that breaks 7:1-specific behavior flips these tests
red.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…cause — spec v2.59.0 → v2.61.0

Live `apr trace --payload` on the canonical paiml/qwen2.5-coder-7b-apache-q4k-v1
teacher (noah-Lambda-Vector RTX 4090, 2026-04-26) ran twice on CPU with the
same prompt "What is 2+2?", same encoded tokens [3838, 374, 220, 17, 10, 17, 30],
same embedded BPE tokenizer:

  APR teacher  → top-1 token=220 (" "), logit=16.7368  ← WRONG
  GGUF teacher → " 2+2 is 4."                          ← CORRECT

Combined with §15.4 (PR #1061 — GPU GQA-7:1 attention parity tests all PASS),
this eliminates: GPU stack, GQA attention kernel, tokenizer, loader-side data
layout, Q4K dequantization, RMSNorm, embedding lookup. Surviving suspects are
all in the APR-format CPU forward path:

  - Layer-composition glue in forward_single_with_scratch
  - Multi-layer KV cache layout (across-layer indexing)
  - Position embedding (RoPE) layout / sin/cos cache
  - LM head projection

§16.4 specifies the falsifiable next investigation step: `apr trace --payload
--layer 0` bisection across 28 layers. 1-2 sessions task, not multi-PR. Whatever
fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs
(SHIP-002/005/006/007/008) per §15.7's blast-radius inventory.

Spec v2.59.0 → v2.61.0 (jumps v2.60.0; reserved for #1062 conflict-merge).
No coverage tally change — investigation-recording amendment, not rule
promotion.

Methodological continuation per feedback_apr_trace_not_eprintln.md: zero
eprintln! added, exact same `apr trace --payload` primitive used in §15.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…cause — spec v2.59.0 → v2.61.0 (#1063)

Live `apr trace --payload` on the canonical paiml/qwen2.5-coder-7b-apache-q4k-v1
teacher (noah-Lambda-Vector RTX 4090, 2026-04-26) ran twice on CPU with the
same prompt "What is 2+2?", same encoded tokens [3838, 374, 220, 17, 10, 17, 30],
same embedded BPE tokenizer:

  APR teacher  → top-1 token=220 (" "), logit=16.7368  ← WRONG
  GGUF teacher → " 2+2 is 4."                          ← CORRECT

Combined with §15.4 (PR #1061 — GPU GQA-7:1 attention parity tests all PASS),
this eliminates: GPU stack, GQA attention kernel, tokenizer, loader-side data
layout, Q4K dequantization, RMSNorm, embedding lookup. Surviving suspects are
all in the APR-format CPU forward path:

  - Layer-composition glue in forward_single_with_scratch
  - Multi-layer KV cache layout (across-layer indexing)
  - Position embedding (RoPE) layout / sin/cos cache
  - LM head projection

§16.4 specifies the falsifiable next investigation step: `apr trace --payload
--layer 0` bisection across 28 layers. 1-2 sessions task, not multi-PR. Whatever
fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs
(SHIP-002/005/006/007/008) per §15.7's blast-radius inventory.

Spec v2.59.0 → v2.61.0 (jumps v2.60.0; reserved for #1062 conflict-merge).
No coverage tally change — investigation-recording amendment, not rule
promotion.

Methodological continuation per feedback_apr_trace_not_eprintln.md: zero
eprintln! added, exact same `apr trace --payload` primitive used in §15.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant