Skip to content

contract(apr-cpu-vs-gpu-output-parity-v1): codify CPU-vs-GPU output parity invariant#1427

Merged
noahgift merged 1 commit into
mainfrom
contract/apr-cpu-vs-gpu-output-parity-v1
May 3, 2026
Merged

contract(apr-cpu-vs-gpu-output-parity-v1): codify CPU-vs-GPU output parity invariant#1427
noahgift merged 1 commit into
mainfrom
contract/apr-cpu-vs-gpu-output-parity-v1

Conversation

@noahgift

@noahgift noahgift commented May 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Authors a new schema-kind contract contracts/apr-cpu-vs-gpu-output-parity-v1.yaml that codifies the invariant uncovered by the SHIP-007 v5 finding: for any model and prompt under greedy decode, apr run GPU output MUST match apr run --no-gpu CPU output.

This regression-class capture prevents future SHIP-007-like silent gibberish-on-GPU bugs.

v5 finding (motivation)

Configuration Output for "What is 2+2?" Time
apr run (GPU default) "ampiezza = 0.5\ndiametro = 10" 72.95s
apr run --no-gpu "2 + 2 equals 4." 9.81s

The bug slipped through because the existing parity_gate (mod_parity_gate.rs) only fires for OwnedQuantizedModelCuda construction (used by apr parity and apr run --force-gpu). The default apr run path on .apr files goes through OwnedQuantizedModel::from_apr → trueno manual graph (646 kernels), which has NO parity gate. The code comment at apr_inference.rs:46 even acknowledges: "Both AprF32ToGpuAdapter and forward_token_apr_q4k produce garbage on GPU."

Contract structure

  • kind: schema (matches existing apr-cli-trace-save-tensor-v1 pattern)
  • 3 equations: greedy_argmax_parity, cosine_parity, no_gpu_flag_honor
  • 4 falsifiers:
    • FALSIFY-CPU-GPU-001 — argmax match (PARTIAL_ALGORITHM_LEVEL, FALSIFIED in live data)
    • FALSIFY-CPU-GPU-002 — cosine ≥ 0.99 (PARTIAL, apr parity reports cos=-0.005)
    • FALSIFY-CPU-GPU-003 — gate enforced at apr run init (PARTIAL, gap on trueno path)
    • FALSIFY-CPU-GPU-004 — --no-gpu honored (FUNCTIONAL, verified empirically)
  • 4 proof obligations: invariant + completeness + liveness gates

pv validate

pv validate contracts/apr-cpu-vs-gpu-output-parity-v1.yaml returns 0 errors / 0 warnings.

Five Whys (full detail in commit)

  1. Why this contract? Codify regression class.
  2. Why not extend apr-gpu-parity-consistency-v1? That one is about CLI scope-clarity, not runtime correctness gates.
  3. Why not extend apr-vs-gguf-forward-parity-v1? Different concern (format parity vs backend parity).
  4. Why PARTIAL on 3 of 4? They're FALSIFIED in live data — the gate doesn't exist for the trueno path yet.
  5. Why FUNCTIONAL on 004? Verified empirically — --no-gpu produces correct output with no GPU log lines.

Implementation follow-up (separate PR)

Extend mod_parity_gate.rs to fire on OwnedQuantizedModel::from_apr GPU init. Pattern already proven in mod.rs:268-279.

Test plan

🤖 Generated with Claude Code

…arity invariant

Authors a new schema-kind contract that codifies the invariant uncovered by
the SHIP-007 v5 finding (PR #1426): for any model and prompt under greedy
decode (`--temperature 0.0`), `apr run` GPU output MUST match
`apr run --no-gpu` CPU output.

## Why now

The v5 falsifier showed:
- `apr run` GPU emits "ampiezza = 0.5\ndiametro = 10" (gibberish, 72.95s)
- `apr run --no-gpu` emits "2 + 2 equals 4." (correct, 9.81s)
Same model, same prompt, same temperature.

This regression slipped through because the existing parity_gate
(crates/aprender-serve/src/gguf/cuda/mod_parity_gate.rs) only fires for
`OwnedQuantizedModelCuda` construction (used by `apr parity` and
`apr run --force-gpu`). The default `apr run` path on .apr files goes
through `OwnedQuantizedModel::from_apr` → trueno manual graph (646
kernels), which has NO parity gate.

The code at `crates/aprender-serve/src/cli/apr_inference.rs:46` already
acknowledges: "Both AprF32ToGpuAdapter and forward_token_apr_q4k produce
garbage on GPU."

## Contract structure

- **kind**: schema (matches `apr-cli-trace-save-tensor-v1` pattern)
- **3 equations**: greedy_argmax_parity, cosine_parity, no_gpu_flag_honor
- **4 falsifiers**:
  - FALSIFY-CPU-GPU-001 (greedy argmax match) — PARTIAL, currently FALSIFIED in live data
  - FALSIFY-CPU-GPU-002 (cosine ≥ 0.99) — PARTIAL, apr parity reports cos=-0.005
  - FALSIFY-CPU-GPU-003 (parity gate enforced at apr run init) — PARTIAL, gap in trueno path
  - FALSIFY-CPU-GPU-004 (--no-gpu honored) — FUNCTIONAL, verified empirically
- **4 proof obligations**: invariant + completeness + liveness gates

## Five Whys

1. Why this contract? To codify a regression class so it can never slip
   through silently again.
2. Why didn't existing apr-gpu-parity-consistency-v1 cover it? That
   contract is about CLI scope-clarity (apr parity command output messages),
   not about runtime output correctness gates in apr run.
3. Why a new contract instead of extending apr-vs-gguf-forward-parity-v1?
   That one is about APR vs GGUF format parity (different concern).
4. Why PARTIAL on 3 of 4 falsifiers? FALSIFY-CPU-GPU-001/002/003 are all
   currently FALSIFIED in live data — the gate doesn't exist for the .apr
   trueno graph path. They're algorithm-bound to existing parity_gate code
   but functional discharge requires the implementation to extend coverage.
5. Why FUNCTIONAL on 004? Live tested 2026-05-03: --no-gpu produces correct
   "2 + 2 equals 4." with no [trueno#243] or [PMAT-082] log lines.

## Implementation follow-up (separate PR)

Extend mod_parity_gate.rs OR the OwnedQuantizedModel::from_apr loader path
to call parity_gate during the trueno graph init. The pattern is already
proven in mod.rs:268-279 (with SKIP_PARITY_GATE=1 escape hatch).

`pv validate contracts/apr-cpu-vs-gpu-output-parity-v1.yaml` returns
0 errors / 0 warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 3, 2026 13:50
@noahgift noahgift merged commit 040a7a1 into main May 3, 2026
11 checks passed
@noahgift noahgift deleted the contract/apr-cpu-vs-gpu-output-parity-v1 branch May 3, 2026 14:13
noahgift added a commit that referenced this pull request May 3, 2026
…vs-gpu-output-parity-v1` chain (PRs #1427-#1430) (#1431)

Canonical record of today's 4-PR session that lands a 3-layer jidoka armor
at the GPU-CPU dispatch boundary, closing §40's silent-gibberish loophole
*as a regression class* (separate from the underlying GPU kernel fix).

Net effect: default `apr run <model.apr>` on a SHIP-007-broken GPU build
now emits the full backend-fallback chain on stderr without `--verbose`:

  [apr-cpu-vs-gpu-output-parity-v1] CUDA path rejected, attempting fallback: ...
  Backend: wgpu (Vulkan)
  ...

so users always know which backend is actually serving their tokens — the
`--no-gpu` documented workaround is now self-evidently the correct path on
this build. MODEL-1 ship % nudges 80% → 87% because shipping `apr run` users
with the documented workaround is now jidoka-safe.

This section explicitly does NOT claim the SHIP-007 kernel bug is fixed —
that remains an open track per §40. What §41 codifies is that the failure
mode is now LOUD instead of SILENT.

Five Whys + table of all 4 PRs + next-session pickup list (FALSIFY-CPU-GPU-005
part b OR MODEL-2 distill-train scaffolding) included.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant