Skip to content

evidence(ship-007): v5 — ROOT CAUSE CONFIRMED, GPU forward path is broken#1426

Merged
noahgift merged 1 commit into
mainfrom
evidence/ship-007-v5-gpu-path-confirmed
May 3, 2026
Merged

evidence(ship-007): v5 — ROOT CAUSE CONFIRMED, GPU forward path is broken#1426
noahgift merged 1 commit into
mainfrom
evidence/ship-007-v5-gpu-path-confirmed

Conversation

@noahgift

@noahgift noahgift commented May 3, 2026

Copy link
Copy Markdown
Contributor

Summary

SHIP-007 root cause empirically confirmed: the GPU forward path is broken. Toggling --no-gpu on apr run flips output from gibberish to correct answer.

Configuration Output for "What is 2+2?" Time
apr run (GPU default) "ampiezza = 0.5\ndiametro = 10" 72.95s
apr run --no-gpu "2 + 2 equals 4." 9.81s

CPU produces correct answer 7.4× faster than GPU. Same model, same prompt, same greedy decode (--temperature 0.0).

What this eliminates from prior hypotheses

  • ❌ Not an attention-math bug (CPU uses same attention math, works correctly)
  • ❌ Not Q4K dequant precision (both paths dequantize at load)
  • ❌ Not tokenizer/chat-template (same on both paths)
  • ❌ Not autoregressive/KV cache (CPU has both, works)
  • IS GPU kernel correctness (only varying piece)

The 1.4e-3 cosine drop at attention output that v1/v2/v3 found was real but benign — just Q4K dequant precision noise that doesn't flip argmax. The GIBBERISH from GPU is a separate, much larger error.

GPU surface

apr run GPU path diagnostic shows:

  • [trueno#243] Manual graph: 646 kernels — trueno_gpu manual graph
  • [wgpu] CPU fallback for lm_head — hybrid wgpu + CPU
  • [PMAT-333] Dequantized 337 weights, 28282.5 MB F32

Bug is in those 646 trueno kernels OR the wgpu dispatch.

Implications

  • SHIP-007 root cause is now localized to a specific surface (GPU path).
  • Workaround exists: ship MODEL-1 with --no-gpu default (CPU produces correct + faster output).
  • Fix path actionable: audit trueno_gpu kernels.

Five Whys (full detail in commit)

  1. Why GPU gibberish? Some kernel in trueno_gpu manual graph wrong.
  2. Why missed by v1-v4? They only instrumented forward_traced (CPU path).
  3. Why did attention bisection mislead? cos=0.998 noise was benign.
  4. Why does GPU exist if CPU works? Performance — but GPU is currently wrong AND slower.
  5. Why no test catches this? No CPU-vs-GPU output parity gate exists.

Test plan

  • Live ran both configurations on canonical 7B teacher
  • Single-token corroboration (--max-tokens 1)
  • Multi-token corroboration (--max-tokens 16)
  • Same prompt, same temperature, same model
  • Markdown only, no code changes
  • CI green
  • Auto-merge

🤖 Generated with Claude Code

…oken

Decisive test: same model, same prompt, same greedy decode. Only --no-gpu
flag differs.

| Configuration       | Output for "What is 2+2?"            | Time   |
|---------------------|--------------------------------------|--------|
| `apr run` (GPU)     | "ampiezza = 0.5\ndiametro = 10"      | 72.95s |
| `apr run --no-gpu`  | "2 + 2 equals 4."                    |  9.81s |

CPU is **CORRECT and 7.4× FASTER**. GPU is **GIBBERISH and slower**.

This eliminates every previous v1/v2/v3/v4 hypothesis:
- ❌ Not attention math (CPU uses same attention math, works)
- ❌ Not Q4K dequant precision (both paths dequantize at load)
- ❌ Not tokenizer/chat template (same on both paths)
- ❌ Not autoregressive/KV cache (CPU has both, works)
- ✅ IS GPU kernel correctness (only varying piece)

The GPU surface from `apr run` diagnostics:
- `[trueno#243] Manual graph: 646 kernels` — trueno_gpu kernel pipeline
- `[wgpu] CPU fallback for lm_head` — hybrid wgpu+CPU
- 28.3 GB F32 weights uploaded to GPU after Q4K dequant

The bug is in those 646 trueno kernels OR the wgpu dispatch.

## Five Whys
1. Why GPU gibberish? Some kernel in trueno_gpu manual graph computes
   attention or matmul incorrectly.
2. Why missed by v1/v2/v3? They only instrumented forward_traced (CPU
   path which IS correct).
3. Why did attention bisection look like bug? cos=0.998 attn_out drop
   was real but BENIGN — Q4K precision noise not flipping argmax.
4. Why does GPU exist if CPU works? Performance — but GPU is currently
   both wrong AND slower. Regression somewhere.
5. Why no test catches this? No CPU-vs-GPU output parity gate exists
   for the wgpu manual graph path. This regression slipped through.

Workaround: ship MODEL-1 with `--no-gpu` default OR fix GPU before ship.
Fix path is now actionable: audit trueno_gpu manual graph kernels.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 3, 2026 13:18
@noahgift noahgift merged commit 603696d into main May 3, 2026
11 checks passed
@noahgift noahgift deleted the evidence/ship-007-v5-gpu-path-confirmed branch May 3, 2026 13:42
noahgift added a commit that referenced this pull request May 3, 2026
…arity invariant (#1427)

Authors a new schema-kind contract that codifies the invariant uncovered by
the SHIP-007 v5 finding (PR #1426): for any model and prompt under greedy
decode (`--temperature 0.0`), `apr run` GPU output MUST match
`apr run --no-gpu` CPU output.

## Why now

The v5 falsifier showed:
- `apr run` GPU emits "ampiezza = 0.5\ndiametro = 10" (gibberish, 72.95s)
- `apr run --no-gpu` emits "2 + 2 equals 4." (correct, 9.81s)
Same model, same prompt, same temperature.

This regression slipped through because the existing parity_gate
(crates/aprender-serve/src/gguf/cuda/mod_parity_gate.rs) only fires for
`OwnedQuantizedModelCuda` construction (used by `apr parity` and
`apr run --force-gpu`). The default `apr run` path on .apr files goes
through `OwnedQuantizedModel::from_apr` → trueno manual graph (646
kernels), which has NO parity gate.

The code at `crates/aprender-serve/src/cli/apr_inference.rs:46` already
acknowledges: "Both AprF32ToGpuAdapter and forward_token_apr_q4k produce
garbage on GPU."

## Contract structure

- **kind**: schema (matches `apr-cli-trace-save-tensor-v1` pattern)
- **3 equations**: greedy_argmax_parity, cosine_parity, no_gpu_flag_honor
- **4 falsifiers**:
  - FALSIFY-CPU-GPU-001 (greedy argmax match) — PARTIAL, currently FALSIFIED in live data
  - FALSIFY-CPU-GPU-002 (cosine ≥ 0.99) — PARTIAL, apr parity reports cos=-0.005
  - FALSIFY-CPU-GPU-003 (parity gate enforced at apr run init) — PARTIAL, gap in trueno path
  - FALSIFY-CPU-GPU-004 (--no-gpu honored) — FUNCTIONAL, verified empirically
- **4 proof obligations**: invariant + completeness + liveness gates

## Five Whys

1. Why this contract? To codify a regression class so it can never slip
   through silently again.
2. Why didn't existing apr-gpu-parity-consistency-v1 cover it? That
   contract is about CLI scope-clarity (apr parity command output messages),
   not about runtime output correctness gates in apr run.
3. Why a new contract instead of extending apr-vs-gguf-forward-parity-v1?
   That one is about APR vs GGUF format parity (different concern).
4. Why PARTIAL on 3 of 4 falsifiers? FALSIFY-CPU-GPU-001/002/003 are all
   currently FALSIFIED in live data — the gate doesn't exist for the .apr
   trueno graph path. They're algorithm-bound to existing parity_gate code
   but functional discharge requires the implementation to extend coverage.
5. Why FUNCTIONAL on 004? Live tested 2026-05-03: --no-gpu produces correct
   "2 + 2 equals 4." with no [trueno#243] or [PMAT-082] log lines.

## Implementation follow-up (separate PR)

Extend mod_parity_gate.rs OR the OwnedQuantizedModel::from_apr loader path
to call parity_gate during the trueno graph init. The pattern is already
proven in mod.rs:268-279 (with SKIP_PARITY_GATE=1 escape hatch).

`pv validate contracts/apr-cpu-vs-gpu-output-parity-v1.yaml` returns
0 errors / 0 warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant