docs(ship-007): five-whys + root-cause analysis recorded — spec v2.58.0 → v2.59.0#1060

Merged
noahgift merged 1 commit into main from docs/ship-007-five-whys-root-cause
Apr 25, 2026

@noahgift

Contributor

Summary

Records the SHIP-007 GQA-7:1 parity bug investigation thread from the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and updates the atomic-next-action banner. No discharge promotion (coverage tally unchanged at 33+12) — this is investigation-recording, not rule promotion.

Surface symptoms

| Surface | Observation | Numerical signature |
|---|---|---|
| apr bench parity gate on 7B Q4_K APR | Fails at CUDA init | CPU argmax=334, GPU argmax=8127, cosine=−0.005, max abs logit Δ=19.5 |
| apr qa --json on 7B Q4_K GGUF (format_parity gate) | Cross-format parity BROKEN | GGUF argmax=17 != SafeTensors argmax=59260 |

Counter-evidence: 370M MODEL-2 from-scratch training works on the same RTX 4090 — the bug is GQA-7:1-specific.
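The three numbers quoted in the table (argmax, cosine, max abs logit Δ) can be computed from two logit vectors as in the sketch below. This is an illustration only, not the `apr bench` implementation; the function names and toy logits are hypothetical.

```rust
/// Index of the largest element (first one wins on ties); None for empty input.
fn argmax(xs: &[f32]) -> Option<usize> {
    if xs.is_empty() {
        return None;
    }
    let mut best = 0;
    for i in 1..xs.len() {
        if xs[i] > xs[best] {
            best = i;
        }
    }
    Some(best)
}

/// Cosine similarity accumulated in f64; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Largest elementwise absolute difference between the two vectors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

fn main() {
    // Toy logits standing in for the CPU and GPU forward outputs.
    let cpu = [0.1f32, 2.0, -1.0, 0.5];
    let gpu = [1.5f32, -0.3, 0.2, 2.5];
    println!("cpu argmax = {:?}", argmax(&cpu));
    println!("gpu argmax = {:?}", argmax(&gpu));
    println!("cosine     = {:.4}", cosine(&cpu, &gpu));
    println!("max |Δ|    = {:.4}", max_abs_diff(&cpu, &gpu));
}
```

A cosine near zero with differing argmax, as reported above, indicates the two vectors are essentially unrelated rather than one being a noisy copy of the other.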

Five Whys (compressed)

  1. Why does apr bench fail at parity? — CPU/GPU argmax structurally differ.
  2. Why anti-correlated (cos=−0.005)? — Not noise; structural divergence somewhere in the forward stack.
  3. Why divergent if same .apr bytes? — Two paths share weight bytes but dispatch through different inference codepaths.
  4. Why does it surface only on the 7B teacher? — GQA-7:1 attention (28 Q heads / 4 KV heads) exercises a code path that 370M (different head ratio, full MHA) doesn't.
  5. Why does GQA expose it? — transpose-then-reshape ≠ reshape-then-transpose when num_heads ≠ num_kv_heads. CPU and GPU forward pick different orderings.
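To make the fifth "why" concrete, the sketch below reads one flat K buffer under two different head-axis interpretations. It is a toy illustration with hypothetical sizes and function names; the orderings actually used by the CPU and GPU forward paths are not shown in this PR.

```rust
// Toy sizes; the 7B teacher uses 4 KV heads with head_dim 128.
const SEQ: usize = 2;
const NUM_KV_HEADS: usize = 4;
const HEAD_DIM: usize = 2;

/// Read head `h` at token `t`, assuming the buffer is laid out as [seq, kv_head, head_dim].
fn read_token_major(buf: &[f32], t: usize, h: usize) -> &[f32] {
    let start = (t * NUM_KV_HEADS + h) * HEAD_DIM;
    &buf[start..start + HEAD_DIM]
}

/// Read head `h` at token `t`, assuming the buffer is laid out as [kv_head, seq, head_dim].
fn read_head_major(buf: &[f32], t: usize, h: usize) -> &[f32] {
    let start = (h * SEQ + t) * HEAD_DIM;
    &buf[start..start + HEAD_DIM]
}

fn main() {
    // The same physical bytes, filled with 0, 1, 2, ... so positions are visible.
    let buf: Vec<f32> = (0..SEQ * NUM_KV_HEADS * HEAD_DIM).map(|i| i as f32).collect();
    for h in 0..NUM_KV_HEADS {
        let a = read_token_major(&buf, 0, h);
        let b = read_head_major(&buf, 0, h);
        println!("kv_head {h}, token 0: token-major {a:?} vs head-major {b:?}");
    }
}
```

Only head 0 at token 0 agrees between the two readings; every other head sees different values, which is the kind of structural (not noisy) disagreement the hypothesis describes.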

Root-cause hypothesis

A GQA-7:1-specific layout-vs-reshape ordering bug on K and/or V projections such that CPU forward and GPU forward consume the same physical bytes with different effective head-axis interpretations.

Consistent with: cosine=−0.005 (structural, not noisy); 370M training working; cross-format finding; SafeTensors↔APR weight-byte parity (cos≥0.9999999 across 339 tensors via SHIP-003 PR #1059); GGUF↔APR shape asymmetry (down_proj GGUF=[18944, 3584] vs APR=[3584, 18944], expected per LAYOUT-001/002).

Falsifiable next investigation step

Single-tensor Q × K^T element-by-element comparison on model.layers.0.self_attn.k_proj.weight from the row-major-guaranteed APR (SHIP-003 PR #1059), CPU vs GPU. If outputs match → bug is downstream of K projection. If not → divergent stage localized to K projection. Iterate through V, attention scores, weights, output, o_proj.
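The "iterate until the divergent stage is named" procedure can be sketched as a scan over per-stage outputs that returns the first stage where CPU and GPU disagree beyond a tolerance. The `StageOutput` type, the stage names, and the tolerance are hypothetical; this is not the `apr trace` or `apr diff` API.

```rust
/// CPU and GPU outputs captured at one named stage of the forward pass.
struct StageOutput<'a> {
    name: &'a str,
    cpu: &'a [f32],
    gpu: &'a [f32],
}

/// Name of the first stage whose outputs differ by more than `tol` anywhere.
/// A length mismatch or a NaN difference also counts as divergence.
fn first_divergent_stage<'a>(stages: &[StageOutput<'a>], tol: f32) -> Option<&'a str> {
    for s in stages {
        let mismatch = s.cpu.len() != s.gpu.len()
            || s.cpu.iter().zip(s.gpu).any(|(c, g)| !((c - g).abs() <= tol));
        if mismatch {
            return Some(s.name);
        }
    }
    None
}

fn main() {
    // Toy data: k_proj agrees, v_proj is the first stage to diverge.
    let stages = [
        StageOutput { name: "k_proj", cpu: &[0.5, -1.0], gpu: &[0.5, -1.0] },
        StageOutput { name: "v_proj", cpu: &[0.25, 2.0], gpu: &[0.25, -2.0] },
        StageOutput { name: "attn_scores", cpu: &[1.0], gpu: &[3.0] },
    ];
    match first_divergent_stage(&stages, 1e-4) {
        Some(name) => println!("first divergent stage: {name}"),
        None => println!("all stages match within tolerance"),
    }
}
```

Stopping at the first mismatch matters because every later stage consumes the already-divergent values and will disagree as well.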

Per feedback_apr_trace_not_eprintln.md: extend TraceStep enum with intra-attention/intra-FFN variants behind a contract, add --device cpu|gpu to apr trace, then apr diff cpu_trace.json gpu_trace.json --values. No raw eprintln!.

Blast radius — single fix discharges 5 MODEL-1 PARTIALs

| Row | Falsification | Blocked path |
|---|---|---|
| SHIP-002 | Python syntax | apr run |
| SHIP-005 | HumanEval ≥86% | apr eval |
| SHIP-006 | apr qa 8 gates | apr qa (format_parity / ollama_parity / ptx_parity) |
| SHIP-007 | decode tps ≥30 | apr bench (parity-gate itself) |
| SHIP-008 | Chat template render | apr run |

Highest-leverage MODEL-1 work item remaining.

Side-bug noted

apr diff --transpose-aware returns cos=0.0003 on transposed shapes despite the flag — the help text claims "Account for transpose when comparing (GGUF col-major vs APR row-major)" but the cosine computation doesn't appear to apply the transpose. Filed as a separate apr-cli ticket; does NOT affect SHIP-007 root-cause analysis.
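For reference, a transpose-aware cosine over a row-major [rows, cols] tensor and its row-major [cols, rows] counterpart might look like the sketch below. It is an assumption about what the flag's help text describes, not the `apr diff` implementation; both function names are hypothetical.

```rust
/// Cosine over two flat buffers compared position by position (no transpose applied).
fn cosine_flat(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na.sqrt() * nb.sqrt()) }
}

/// Cosine between row-major `a` of shape [rows, cols] and row-major `b` of
/// shape [cols, rows], pairing a[r][c] with b[c][r].
fn cosine_transposed(a: &[f32], b: &[f32], rows: usize, cols: usize) -> f64 {
    assert_eq!(a.len(), rows * cols);
    assert_eq!(b.len(), rows * cols);
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for r in 0..rows {
        for c in 0..cols {
            let x = a[r * cols + c] as f64;
            let y = b[c * rows + r] as f64;
            dot += x * y;
            na += x * x;
            nb += y * y;
        }
    }
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na.sqrt() * nb.sqrt()) }
}

fn main() {
    // a is 2x3 row-major; b holds the same matrix transposed (3x2 row-major).
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [1.0f32, 4.0, 2.0, 5.0, 3.0, 6.0];
    println!("flat cosine:            {:.4}", cosine_flat(&a, &b));
    println!("transpose-aware cosine: {:.4}", cosine_transposed(&a, &b, 2, 3));
}
```

On real weight tensors the flat comparison pairs unrelated elements, so a near-zero cosine on transposed shapes is what one would expect if the transpose is never applied.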

Methodological note

Entire investigation conducted using only apr CLI tooling (apr diff, apr qa, apr bench, apr inspect). Zero eprintln! injected into forward.rs / ffn_block.rs / CUDA kernels. Honors feedback_apr_trace_not_eprintln.md.

Files changed

| File | Change |
|---|---|
| docs/specifications/aprender-train/ship-two-models-spec.md | v2.58.0 → v2.59.0; new §15 (7 subsections); atomic-next-action banner updated |

🤖 Generated with Claude Code

….0 → v2.59.0

Records the SHIP-007 GQA-7:1 parity bug investigation thread captured
during the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and
updates the atomic-next-action banner to reference it.

No discharge promotion (coverage tally unchanged at 33+12). This is
investigation-recording, not rule promotion.

What §15 contains:

15.1 Surface symptoms — the two independent observations:
  - apr bench parity gate: CPU argmax 334 vs GPU argmax 8127,
    cosine=−0.005 (anti-correlated, structural divergence)
  - apr qa --json on GGUF: format_parity reports
    GGUF argmax=17 != SafeTensors argmax=59260
  - 370M MODEL-2 training works on the same RTX 4090, so the bug
    is GQA-7:1-specific, not GPU-host-wide.

15.2 Five Whys — traces the surface symptom from the parity-gate
     failure down to the load-bearing edge case: GQA's
     num_heads ≠ num_kv_heads makes layout-then-reshape order
     non-commutative, while MHA (num_heads = num_kv_heads, where
     370M training lives) is invariant under either order.

15.3 Root-cause hypothesis: a GQA-7:1-specific layout-vs-reshape
     ordering bug on K and/or V projections that causes CPU and
     GPU forward to consume the same physical bytes with different
     effective head-axis interpretations, compounding through 28
     transformer blocks into anti-correlated logits.

15.4 Falsifiable next investigation step: a single-tensor Q × K^T
     element-by-element comparison on
     model.layers.0.self_attn.k_proj.weight from the row-major-
     guaranteed APR (SHIP-003 PR #1059), then iterate through V,
     attention scores, weights, output, and o_proj until the
     divergent stage is named. Per feedback_apr_trace_not_eprintln.md,
     this is the proper TraceStep-extension path, not eprintln!.

15.5 Side-bug noted: apr diff --transpose-aware appears not to apply
     the transpose before cosine computation when shapes are [a,b]
     vs [b,a]. Filed as a separate apr-cli ticket. Does not affect
     SHIP-007 root-cause analysis — SafeTensors↔APR same-shape
     comparison via SHIP-003 #1059 confirmed weight-byte parity at
     cos≥0.9999999.

15.6 Blast radius inventory: the remaining 5 MODEL-1 PARTIALs
     (SHIP-002 / 005 / 006 / 007 / 008) all transitively block on
     this single fix. A single root-cause fix discharges all 5
     simultaneously — highest-leverage MODEL-1 work item remaining.

15.7 Methodological note: entire investigation conducted using only
     apr CLI tooling (apr diff, apr qa, apr bench, apr inspect).
     Zero eprintln! injected into forward.rs / ffn_block.rs / CUDA
     kernels. Honors feedback_apr_trace_not_eprintln.md.

Evidence chain:
1. apr diff --values --limit 339 (post-#1058 mmap fix, 192s on
   15GB safetensors / 8GB APR pair) — SafeTensors↔APR cos≥0.9999999.
2. apr diff --values --limit 3 on GGUF↔APR, SafeTensors↔GGUF —
   revealed shape asymmetry: GGUF [in,out] vs APR/SafeTensors
   [out,in].
3. apr qa --json on both APR and GGUF teachers — revealed cross-
   format argmax divergence.
4. SHIP-007 GPU parity gate's existing telemetry — confirmed
   structural divergence.

Methodological consistency with the 6 PR cascade preceding this
amendment: pure stack tooling, contract-backed numbers, drift-
prevention pattern. This commit is documentation only — no Rust
changes, no contract changes — but pins the investigation thread
durably in the spec where future investigators (and the next
multi-PR TraceStep extension effort) will find it.

Spec v2.58.0 → v2.59.0. Atomic-next-action banner updated to point
at §15 as the load-bearing investigation surface.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 25, 2026 20:08
@noahgift noahgift merged commit 6fe95c4 into main Apr 25, 2026
11 checks passed
@noahgift noahgift deleted the docs/ship-007-five-whys-root-cause branch April 25, 2026 20:34
noahgift added a commit that referenced this pull request Apr 26, 2026
…sifier

The §15.4 falsifier from spec v2.59.0 (PR #1060): a CPU vs GPU
incremental_attention_gpu kernel parity test on the canonical
Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128,
HIDDEN=3584). The peer test `gqa_attention_parity.rs` covers TinyLlama's
GQA-8:1 (NUM_HEADS=32, head_dim=64) — that test passes on RTX 4090 but
doesn't exercise the 7:1 ratio (non-power-of-2 q_per_kv = 7) the
SHIP-007-blocking 7B teacher specifically uses.

Three tests added to
`crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs`:

1. `ship_007_qwen2_gqa_7_1_head_mapping_property` — pure arithmetic
   check on the GQA-7:1 head mapping (q_head/q_per_kv) for all 28 q_heads.
   Verifies that the kernel formula `(q_head * NUM_KV_HEADS) / NUM_HEADS`
   produces the same mapping as q_head/q_per_kv for the canonical 28:4 ratio.

2. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token` (#[ignore]) —
   first-token case (no cache). For attention over a single K/V
   position, softmax([single_score]) = [1.0] so output = current_v
   expanded across 4 KV heads to 28 Q heads. CPU reference uses the
   mirror of `cpu_gqa_attention` from the peer test parameterized at
   the Qwen shape. Tolerance: 1e-4 elementwise across 3584 outputs.

3. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token` (#[ignore]) —
   second-token case with one populated K/V cache position. Tests the
   full attention mechanism with KV cache state. Tolerance: 1e-3
   elementwise (slightly looser to accommodate cumulative FP rounding
   over the 2-position softmax + weighted-sum).
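The head-mapping property in test 1 can be sketched as below. This is a standalone illustration of the arithmetic using the constants quoted in this commit message, not the contents of `qwen2_gqa_7_1_attention_parity.rs`; the two function names are hypothetical.

```rust
const NUM_HEADS: usize = 28;
const NUM_KV_HEADS: usize = 4;

/// KV head for a query head via the group size: q_head / q_per_kv.
fn kv_head_by_group(q_head: usize) -> usize {
    let q_per_kv = NUM_HEADS / NUM_KV_HEADS; // 7 for the 28:4 ratio
    q_head / q_per_kv
}

/// KV head for a query head via the kernel formula quoted above.
fn kv_head_by_kernel_formula(q_head: usize) -> usize {
    (q_head * NUM_KV_HEADS) / NUM_HEADS
}

fn main() {
    for q in 0..NUM_HEADS {
        // Both formulas must agree for every one of the 28 query heads.
        assert_eq!(kv_head_by_group(q), kv_head_by_kernel_formula(q));
        println!("q_head {q:2} -> kv_head {}", kv_head_by_group(q));
    }
}
```

The two formulas agree here because 28 is an exact multiple of 4, so `(q * 4) / 28` and `q / 7` floor to the same integer.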

Result on noah-Lambda-Vector RTX 4090 (CUDA compute capability 8.9):

  test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token  ... ok
  test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
  test ship_007_qwen2_gqa_7_1_head_mapping_property       ... ok

All three pass. **The GQA-7:1 incremental_attention_gpu kernel is NOT
the SHIP-007 root cause.** CPU and GPU outputs agree within the stated
FP tolerances (1e-4 and 1e-3 elementwise) for the canonical
Qwen2.5-Coder-7B shape on synthetic inputs.
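The first-token case from test 2 reduces to a simple CPU reference: with a single K/V position the softmax weight is 1.0, so each query head's output is the V vector of its KV head. The sketch below assumes flat row-major `[head, head_dim]` inputs and is not the `cpu_gqa_attention` helper from the peer test; the function names are hypothetical.

```rust
const NUM_HEADS: usize = 28;
const NUM_KV_HEADS: usize = 4;
const HEAD_DIM: usize = 128;

/// Numerically stable softmax over a slice of scores.
fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Attention for one token with no cache.
/// q: [NUM_HEADS * HEAD_DIM]; k and v: [NUM_KV_HEADS * HEAD_DIM].
fn first_token_attention(q: &[f32], k: &[f32], v: &[f32]) -> Vec<f32> {
    let q_per_kv = NUM_HEADS / NUM_KV_HEADS;
    let scale = 1.0 / (HEAD_DIM as f32).sqrt();
    let mut out = vec![0.0f32; NUM_HEADS * HEAD_DIM];
    for qh in 0..NUM_HEADS {
        let kvh = qh / q_per_kv;
        let qs = &q[qh * HEAD_DIM..(qh + 1) * HEAD_DIM];
        let ks = &k[kvh * HEAD_DIM..(kvh + 1) * HEAD_DIM];
        let vs = &v[kvh * HEAD_DIM..(kvh + 1) * HEAD_DIM];
        let score = qs.iter().zip(ks).map(|(a, b)| a * b).sum::<f32>() * scale;
        // One position only, so the softmax yields a single weight of 1.0.
        let weight = softmax(&[score])[0];
        for d in 0..HEAD_DIM {
            out[qh * HEAD_DIM + d] = weight * vs[d];
        }
    }
    out
}

fn main() {
    let q: Vec<f32> = (0..NUM_HEADS * HEAD_DIM).map(|i| (i % 7) as f32 * 0.01).collect();
    let k: Vec<f32> = (0..NUM_KV_HEADS * HEAD_DIM).map(|i| (i % 5) as f32 * 0.02).collect();
    let v: Vec<f32> = (0..NUM_KV_HEADS * HEAD_DIM).map(|i| i as f32).collect();
    let out = first_token_attention(&q, &k, &v);
    println!("outputs: {}", out.len());
}
```

Because the output does not depend on Q or K in this case, it isolates the KV-to-Q head expansion from the score computation.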

Materially narrows §15.4's surviving suspect list:
- ✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass)
- ✅ Q × K^T per-head correct
- ✅ Softmax-weighted V aggregation correct
- ✅ Scale factor (1/√head_dim) at head_dim=128 correct
- ✅ KV cache state-management correct
- 🟡 Surviving suspects: Q/K/V projection matmul (BEFORE attention),
  o_proj (AFTER attention), RMSNorm, FFN, LM head, multi-layer KV
  cache layout, residual stream propagation.

These tests serve as a durable regression guard for the GQA-7:1
attention kernel proper: any future refactor of incremental
attention that breaks 7:1-specific behavior will flip these tests
red on `cargo test --features cuda --release -- --ignored`.

Spec §15.4 (PR #1060) anticipated this test and named the proper
follow-up: a single-tensor matmul parity test on Q/K/V projection
weights from the row-major-correct APR (sha256 a394dd28...0ddeb28,
verified by SHIP-003 PR #1059).
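A CPU reference for the proposed single-tensor projection parity test could be as small as the sketch below: a row-major `[out, in]` matrix-vector product plus an elementwise tolerance check against a second implementation's output. The function names and the toy 2x3 weights are hypothetical; the real test would load the projection weights from the APR file.

```rust
/// y = W x, with W stored row-major as [out_dim, in_dim].
fn matvec_row_major(w: &[f32], x: &[f32], out_dim: usize, in_dim: usize) -> Vec<f32> {
    assert_eq!(w.len(), out_dim * in_dim);
    assert_eq!(x.len(), in_dim);
    (0..out_dim)
        .map(|o| {
            let row = &w[o * in_dim..(o + 1) * in_dim];
            row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>()
        })
        .collect()
}

/// True when both vectors have the same length and every pair differs by at most `tol`.
fn within_tolerance(a: &[f32], b: &[f32], tol: f32) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| (x - y).abs() <= tol)
}

fn main() {
    // Toy 2x3 weight matrix and input vector.
    let w = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let x = [1.0f32, 0.0, -1.0];
    let reference = matvec_row_major(&w, &x, 2, 3);
    // Stand-in for the output of the implementation under test.
    let candidate = vec![-2.0f32, -2.0];
    println!("reference = {reference:?}");
    println!("match     = {}", within_tolerance(&reference, &candidate, 1e-4));
}
```

For the K projection at the shape quoted above, `out_dim` would be NUM_KV_HEADS × HEAD_DIM = 512 and `in_dim` would be HIDDEN = 3584.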

Verification:
  cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity \
    --features cuda --release -- --ignored
  → 3 passed; 0 failed; 0 ignored

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…sifier (#1061)
