evidence(ship-007): v5 — ROOT CAUSE CONFIRMED, GPU forward path is broken#1426
Merged
Conversation
…oken Decisive test: same model, same prompt, same greedy decode. Only --no-gpu flag differs. | Configuration | Output for "What is 2+2?" | Time | |---------------------|--------------------------------------|--------| | `apr run` (GPU) | "ampiezza = 0.5\ndiametro = 10" | 72.95s | | `apr run --no-gpu` | "2 + 2 equals 4." | 9.81s | CPU is **CORRECT and 7.4× FASTER**. GPU is **GIBBERISH and slower**. This eliminates every previous v1/v2/v3/v4 hypothesis: - ❌ Not attention math (CPU uses same attention math, works) - ❌ Not Q4K dequant precision (both paths dequantize at load) - ❌ Not tokenizer/chat template (same on both paths) - ❌ Not autoregressive/KV cache (CPU has both, works) - ✅ IS GPU kernel correctness (only varying piece) The GPU surface from `apr run` diagnostics: - `[trueno#243] Manual graph: 646 kernels` — trueno_gpu kernel pipeline - `[wgpu] CPU fallback for lm_head` — hybrid wgpu+CPU - 28.3 GB F32 weights uploaded to GPU after Q4K dequant The bug is in those 646 trueno kernels OR the wgpu dispatch. ## Five Whys 1. Why GPU gibberish? Some kernel in trueno_gpu manual graph computes attention or matmul incorrectly. 2. Why missed by v1/v2/v3? They only instrumented forward_traced (CPU path which IS correct). 3. Why did attention bisection look like bug? cos=0.998 attn_out drop was real but BENIGN — Q4K precision noise not flipping argmax. 4. Why does GPU exist if CPU works? Performance — but GPU is currently both wrong AND slower. Regression somewhere. 5. Why no test catches this? No CPU-vs-GPU output parity gate exists for the wgpu manual graph path. This regression slipped through. Workaround: ship MODEL-1 with `--no-gpu` default OR fix GPU before ship. Fix path is now actionable: audit trueno_gpu manual graph kernels. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5 tasks
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…arity invariant (#1427) Authors a new schema-kind contract that codifies the invariant uncovered by the SHIP-007 v5 finding (PR #1426): for any model and prompt under greedy decode (`--temperature 0.0`), `apr run` GPU output MUST match `apr run --no-gpu` CPU output. ## Why now The v5 falsifier showed: - `apr run` GPU emits "ampiezza = 0.5\ndiametro = 10" (gibberish, 72.95s) - `apr run --no-gpu` emits "2 + 2 equals 4." (correct, 9.81s) Same model, same prompt, same temperature. This regression slipped through because the existing parity_gate (crates/aprender-serve/src/gguf/cuda/mod_parity_gate.rs) only fires for `OwnedQuantizedModelCuda` construction (used by `apr parity` and `apr run --force-gpu`). The default `apr run` path on .apr files goes through `OwnedQuantizedModel::from_apr` → trueno manual graph (646 kernels), which has NO parity gate. The code at `crates/aprender-serve/src/cli/apr_inference.rs:46` already acknowledges: "Both AprF32ToGpuAdapter and forward_token_apr_q4k produce garbage on GPU." ## Contract structure - **kind**: schema (matches `apr-cli-trace-save-tensor-v1` pattern) - **3 equations**: greedy_argmax_parity, cosine_parity, no_gpu_flag_honor - **4 falsifiers**: - FALSIFY-CPU-GPU-001 (greedy argmax match) — PARTIAL, currently FALSIFIED in live data - FALSIFY-CPU-GPU-002 (cosine ≥ 0.99) — PARTIAL, apr parity reports cos=-0.005 - FALSIFY-CPU-GPU-003 (parity gate enforced at apr run init) — PARTIAL, gap in trueno path - FALSIFY-CPU-GPU-004 (--no-gpu honored) — FUNCTIONAL, verified empirically - **4 proof obligations**: invariant + completeness + liveness gates ## Five Whys 1. Why this contract? To codify a regression class so it can never slip through silently again. 2. Why didn't existing apr-gpu-parity-consistency-v1 cover it? That contract is about CLI scope-clarity (apr parity command output messages), not about runtime output correctness gates in apr run. 3. Why a new contract instead of extending apr-vs-gguf-forward-parity-v1? That one is about APR vs GGUF format parity (different concern). 4. Why PARTIAL on 3 of 4 falsifiers? FALSIFY-CPU-GPU-001/002/003 are all currently FALSIFIED in live data — the gate doesn't exist for the .apr trueno graph path. They're algorithm-bound to existing parity_gate code but functional discharge requires the implementation to extend coverage. 5. Why FUNCTIONAL on 004? Live tested 2026-05-03: --no-gpu produces correct "2 + 2 equals 4." with no [trueno#243] or [PMAT-082] log lines. ## Implementation follow-up (separate PR) Extend mod_parity_gate.rs OR the OwnedQuantizedModel::from_apr loader path to call parity_gate during the trueno graph init. The pattern is already proven in mod.rs:268-279 (with SKIP_PARITY_GATE=1 escape hatch). `pv validate contracts/apr-cpu-vs-gpu-output-parity-v1.yaml` returns 0 errors / 0 warnings. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SHIP-007 root cause empirically confirmed: the GPU forward path is broken. Toggling
--no-gpuonapr runflips output from gibberish to correct answer.apr run(GPU default)apr run --no-gpuCPU produces correct answer 7.4× faster than GPU. Same model, same prompt, same greedy decode (
--temperature 0.0).What this eliminates from prior hypotheses
The 1.4e-3 cosine drop at attention output that v1/v2/v3 found was real but benign — just Q4K dequant precision noise that doesn't flip argmax. The GIBBERISH from GPU is a separate, much larger error.
GPU surface
apr runGPU path diagnostic shows:[trueno#243] Manual graph: 646 kernels— trueno_gpu manual graph[wgpu] CPU fallback for lm_head— hybrid wgpu + CPU[PMAT-333] Dequantized 337 weights, 28282.5 MB F32Bug is in those 646 trueno kernels OR the wgpu dispatch.
Implications
--no-gpudefault (CPU produces correct + faster output).Five Whys (full detail in commit)
Test plan
--max-tokens 1)--max-tokens 16)🤖 Generated with Claude Code