evidence(ship-007): v5 — ROOT CAUSE CONFIRMED, GPU forward path is broken by noahgift · Pull Request #1426 · paiml/aprender

noahgift · 2026-05-03T13:18:41Z

Summary

SHIP-007 root cause empirically confirmed: the GPU forward path is broken. Toggling --no-gpu on apr run flips output from gibberish to correct answer.

Configuration	Output for "What is 2+2?"	Time
`apr run` (GPU default)	"ampiezza = 0.5\ndiametro = 10"	72.95s
`apr run --no-gpu`	"2 + 2 equals 4."	9.81s

CPU produces correct answer 7.4× faster than GPU. Same model, same prompt, same greedy decode (--temperature 0.0).

What this eliminates from prior hypotheses

❌ Not an attention-math bug (CPU uses same attention math, works correctly)
❌ Not Q4K dequant precision (both paths dequantize at load)
❌ Not tokenizer/chat-template (same on both paths)
❌ Not autoregressive/KV cache (CPU has both, works)
✅ IS GPU kernel correctness (only varying piece)

The 1.4e-3 cosine drop at attention output that v1/v2/v3 found was real but benign — just Q4K dequant precision noise that doesn't flip argmax. The GIBBERISH from GPU is a separate, much larger error.

GPU surface

apr run GPU path diagnostic shows:

[trueno#243] Manual graph: 646 kernels — trueno_gpu manual graph
[wgpu] CPU fallback for lm_head — hybrid wgpu + CPU
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32

Bug is in those 646 trueno kernels OR the wgpu dispatch.

Implications

SHIP-007 root cause is now localized to a specific surface (GPU path).
Workaround exists: ship MODEL-1 with --no-gpu default (CPU produces correct + faster output).
Fix path actionable: audit trueno_gpu kernels.

Five Whys (full detail in commit)

Why GPU gibberish? Some kernel in trueno_gpu manual graph wrong.
Why missed by v1-v4? They only instrumented forward_traced (CPU path).
Why did attention bisection mislead? cos=0.998 noise was benign.
Why does GPU exist if CPU works? Performance — but GPU is currently wrong AND slower.
Why no test catches this? No CPU-vs-GPU output parity gate exists.

Test plan

Live ran both configurations on canonical 7B teacher
Single-token corroboration (--max-tokens 1)
Multi-token corroboration (--max-tokens 16)
Same prompt, same temperature, same model
Markdown only, no code changes
CI green
Auto-merge

🤖 Generated with Claude Code

…oken Decisive test: same model, same prompt, same greedy decode. Only --no-gpu flag differs. | Configuration | Output for "What is 2+2?" | Time | |---------------------|--------------------------------------|--------| | `apr run` (GPU) | "ampiezza = 0.5\ndiametro = 10" | 72.95s | | `apr run --no-gpu` | "2 + 2 equals 4." | 9.81s | CPU is **CORRECT and 7.4× FASTER**. GPU is **GIBBERISH and slower**. This eliminates every previous v1/v2/v3/v4 hypothesis: - ❌ Not attention math (CPU uses same attention math, works) - ❌ Not Q4K dequant precision (both paths dequantize at load) - ❌ Not tokenizer/chat template (same on both paths) - ❌ Not autoregressive/KV cache (CPU has both, works) - ✅ IS GPU kernel correctness (only varying piece) The GPU surface from `apr run` diagnostics: - `[trueno#243] Manual graph: 646 kernels` — trueno_gpu kernel pipeline - `[wgpu] CPU fallback for lm_head` — hybrid wgpu+CPU - 28.3 GB F32 weights uploaded to GPU after Q4K dequant The bug is in those 646 trueno kernels OR the wgpu dispatch. ## Five Whys 1. Why GPU gibberish? Some kernel in trueno_gpu manual graph computes attention or matmul incorrectly. 2. Why missed by v1/v2/v3? They only instrumented forward_traced (CPU path which IS correct). 3. Why did attention bisection look like bug? cos=0.998 attn_out drop was real but BENIGN — Q4K precision noise not flipping argmax. 4. Why does GPU exist if CPU works? Performance — but GPU is currently both wrong AND slower. Regression somewhere. 5. Why no test catches this? No CPU-vs-GPU output parity gate exists for the wgpu manual graph path. This regression slipped through. Workaround: ship MODEL-1 with `--no-gpu` default OR fix GPU before ship. Fix path is now actionable: audit trueno_gpu manual graph kernels. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…arity invariant (#1427) Authors a new schema-kind contract that codifies the invariant uncovered by the SHIP-007 v5 finding (PR #1426): for any model and prompt under greedy decode (`--temperature 0.0`), `apr run` GPU output MUST match `apr run --no-gpu` CPU output. ## Why now The v5 falsifier showed: - `apr run` GPU emits "ampiezza = 0.5\ndiametro = 10" (gibberish, 72.95s) - `apr run --no-gpu` emits "2 + 2 equals 4." (correct, 9.81s) Same model, same prompt, same temperature. This regression slipped through because the existing parity_gate (crates/aprender-serve/src/gguf/cuda/mod_parity_gate.rs) only fires for `OwnedQuantizedModelCuda` construction (used by `apr parity` and `apr run --force-gpu`). The default `apr run` path on .apr files goes through `OwnedQuantizedModel::from_apr` → trueno manual graph (646 kernels), which has NO parity gate. The code at `crates/aprender-serve/src/cli/apr_inference.rs:46` already acknowledges: "Both AprF32ToGpuAdapter and forward_token_apr_q4k produce garbage on GPU." ## Contract structure - **kind**: schema (matches `apr-cli-trace-save-tensor-v1` pattern) - **3 equations**: greedy_argmax_parity, cosine_parity, no_gpu_flag_honor - **4 falsifiers**: - FALSIFY-CPU-GPU-001 (greedy argmax match) — PARTIAL, currently FALSIFIED in live data - FALSIFY-CPU-GPU-002 (cosine ≥ 0.99) — PARTIAL, apr parity reports cos=-0.005 - FALSIFY-CPU-GPU-003 (parity gate enforced at apr run init) — PARTIAL, gap in trueno path - FALSIFY-CPU-GPU-004 (--no-gpu honored) — FUNCTIONAL, verified empirically - **4 proof obligations**: invariant + completeness + liveness gates ## Five Whys 1. Why this contract? To codify a regression class so it can never slip through silently again. 2. Why didn't existing apr-gpu-parity-consistency-v1 cover it? That contract is about CLI scope-clarity (apr parity command output messages), not about runtime output correctness gates in apr run. 3. Why a new contract instead of extending apr-vs-gguf-forward-parity-v1? That one is about APR vs GGUF format parity (different concern). 4. Why PARTIAL on 3 of 4 falsifiers? FALSIFY-CPU-GPU-001/002/003 are all currently FALSIFIED in live data — the gate doesn't exist for the .apr trueno graph path. They're algorithm-bound to existing parity_gate code but functional discharge requires the implementation to extend coverage. 5. Why FUNCTIONAL on 004? Live tested 2026-05-03: --no-gpu produces correct "2 + 2 equals 4." with no [trueno#243] or [PMAT-082] log lines. ## Implementation follow-up (separate PR) Extend mod_parity_gate.rs OR the OwnedQuantizedModel::from_apr loader path to call parity_gate during the trueno graph init. The pattern is already proven in mod.rs:268-279 (with SKIP_PARITY_GATE=1 escape hatch). `pv validate contracts/apr-cpu-vs-gpu-output-parity-v1.yaml` returns 0 errors / 0 warnings. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 3, 2026 13:18

noahgift merged commit 603696d into main May 3, 2026
11 checks passed

noahgift deleted the evidence/ship-007-v5-gpu-path-confirmed branch May 3, 2026 13:42

noahgift mentioned this pull request May 3, 2026

contract(apr-cpu-vs-gpu-output-parity-v1): codify CPU-vs-GPU output parity invariant #1427

Merged

5 tasks

noahgift mentioned this pull request May 4, 2026

docs: pre-v0.32.0 — fill [Unreleased] CHANGELOG + repair README drift gate #1448

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

evidence(ship-007): v5 — ROOT CAUSE CONFIRMED, GPU forward path is broken#1426

evidence(ship-007): v5 — ROOT CAUSE CONFIRMED, GPU forward path is broken#1426
noahgift merged 1 commit into
mainfrom
evidence/ship-007-v5-gpu-path-confirmed

noahgift commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 3, 2026

Summary

What this eliminates from prior hypotheses

GPU surface

Implications

Five Whys (full detail in commit)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant