feat(apr-cpu-vs-gpu-output-parity-v1): cpu_vs_gpu_cosine_similarity helper for FALSIFY-CPU-GPU-005 part b #1440

Merged
noahgift merged 3 commits into main from feat/cpu-gpu-005-wgpu-cosine-parity-gate
May 3, 2026

Conversation

@noahgift noahgift commented May 3, 2026

Summary

Lifts the cosine-similarity primitive from `cuda::mod_parity_gate` (which lives behind `cfg(feature = "cuda")`) into `infer/gguf_gpu_generate.rs` at module scope. The future wgpu cosine gate (FALSIFY-CPU-GPU-005 part b, contract v1.2.0 implementation_evidence line 201) can now call this helper without a `--features cuda` build dependency.

Helper

```rust
pub(crate) fn cpu_vs_gpu_cosine_similarity(a: &[f32], b: &[f32]) -> f32
```

  • f64-accumulated for numerical stability
  • Fail-closed: returns 0.0 on length mismatch, zero norm, or empty input; 0.0 sits below the 0.99 floor, so the gate triggers the fallback

Tests (3 pass)

  • `cpu_vs_gpu_cosine_similarity_parallel_returns_one` — positive case (no fallback)
  • `cpu_vs_gpu_cosine_similarity_orthogonal_returns_zero` — negative case (fallback triggers)
  • `cpu_vs_gpu_cosine_similarity_fails_closed` — conservative-default for bad input
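
As a rough sketch of what these three tests pin down, the block below pairs one possible body for the helper with one test per case. Only the signature and test names come from this PR; the function body, the handling of a non-finite denominator, and the test vectors are assumptions made for illustration and may differ from the merged code.

```rust
/// Illustrative body only; the merged implementation may differ in detail.
pub(crate) fn cpu_vs_gpu_cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    // Fail closed on shape problems: empty or mismatched inputs score 0.0.
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    // Accumulate in f64 so long logit vectors do not lose precision.
    let (mut dot, mut norm_a, mut norm_b) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    let denom = norm_a.sqrt() * norm_b.sqrt();
    // Assumption: a zero, NaN, or infinite denominator also scores 0.0.
    if !(denom > 0.0 && denom.is_finite()) {
        return 0.0;
    }
    (dot / denom) as f32
}

#[cfg(test)]
mod tests {
    use super::cpu_vs_gpu_cosine_similarity;

    #[test]
    fn cpu_vs_gpu_cosine_similarity_parallel_returns_one() {
        // Same direction, different magnitude: cosine should be ~1.
        let cos = cpu_vs_gpu_cosine_similarity(&[1.0, 2.0, 3.0], &[2.0, 4.0, 6.0]);
        assert!((cos - 1.0).abs() < 1e-6);
    }

    #[test]
    fn cpu_vs_gpu_cosine_similarity_orthogonal_returns_zero() {
        // Perpendicular vectors: cosine should be ~0, below any 0.99 floor.
        let cos = cpu_vs_gpu_cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]);
        assert!(cos.abs() < 1e-6);
    }

    #[test]
    fn cpu_vs_gpu_cosine_similarity_fails_closed() {
        // Empty, length-mismatched, and zero-norm inputs all score 0.0.
        assert_eq!(cpu_vs_gpu_cosine_similarity(&[], &[]), 0.0);
        assert_eq!(cpu_vs_gpu_cosine_similarity(&[1.0], &[1.0, 2.0]), 0.0);
        assert_eq!(cpu_vs_gpu_cosine_similarity(&[0.0, 0.0], &[1.0, 2.0]), 0.0);
    }
}
```

Returning 0.0 rather than an error or `NaN` keeps the caller's check to a single comparison against the floor: every bad-input path lands on the same side of 0.99 as a genuinely divergent output.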

Five Whys

  1. Why lift now? Part b is the single piece of work that closes FALSIFY-CPU-GPU-005 from PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL. The helper is the smallest independently-landable piece.
  2. Why not import from cuda? That module is `cfg(feature = "cuda")`-gated; the wgpu path is gated `cfg(feature = "gpu")` (not cuda) — importing would force both features for cosine math.
  3. Why fail-closed? Per `feedback_fix_root_cause_never_route_around` and §40/§41 jidoka: the gate must NEVER ship silent gibberish. NaN/zero/wrong-length input → return 0.0 → user sees the fallback log.
  4. Why 3 tests? The cosine gate has 3 cases to pin down (positive, negative, conservative default). Each must be independently locked in.
  5. Why bounded? ~80 LOC total. No behavior change (helper currently unused). Builds without `--features cuda`. Lays groundwork for part b PR (~100-150 LOC).

Net effect

  • FALSIFY-CPU-GPU-005 status unchanged (still PARTIAL_ALGORITHM_LEVEL)
  • MODEL-1 ship % unchanged at 88%
  • Discharge happens when part b's wgpu single-step decode lands and uses this helper

Test plan

  • `cargo test -p aprender-serve --lib cpu_vs_gpu_cosine` (3 pass)
  • CI green on required gates

🤖 Generated with Claude Code

…ty helper for FALSIFY-CPU-GPU-005 part b

Lifts the cosine-similarity primitive from `cuda::mod_parity_gate` (which
lives behind `cfg(feature = "cuda")`) into `infer/gguf_gpu_generate.rs`
at module scope so the future wgpu cosine gate (predicted by contract
v1.2.0 FALSIFY-CPU-GPU-005 part b implementation_evidence line 201) can
compare a wgpu single-step decode against a CPU reference forward
without taking a `--features cuda` build dependency.

Helper signature:
  pub(crate) fn cpu_vs_gpu_cosine_similarity(a: &[f32], b: &[f32]) -> f32

Numerically stable (f64-accumulated), with fail-closed semantics: returns
0.0 on length-mismatch / zero-norm / empty input so the future gate
TRIGGERS fallback (cosine 0.0 < 0.99 floor) rather than dividing by
zero or panicking.

3 unit tests added in `mod tests`:
- cpu_vs_gpu_cosine_similarity_parallel_returns_one — locks the
  positive case (gate must NOT trigger fallback when wgpu = CPU).
- cpu_vs_gpu_cosine_similarity_orthogonal_returns_zero — locks the
  negative case (gate MUST trigger fallback when divergent).
- cpu_vs_gpu_cosine_similarity_fails_closed — locks the
  conservative-default case for zero-norm / length-mismatch / empty.

Five Whys
1. Why lift the cosine helper now? Because part b's implementation gap
   (per contract notes line 201-202) is the single piece of work that
   would close FALSIFY-CPU-GPU-005 from PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL.
   The helper is the smallest piece of that work that can land
   independently and without --features cuda dependency.
2. Why not just import from `cuda::mod_parity_gate`? That module is
   `cfg(feature = "cuda")`-gated; importing into the wgpu codepath
   (gated `cfg(feature = "gpu")`, NOT cuda) would force users to enable
   both features just to get the cosine math.
3. Why fail-closed on bad input? Per `feedback_fix_root_cause_never_route_around`
   and the spec §40+§41 jidoka pattern: the gate must NEVER ship silent
   gibberish. If the probe produces NaN/zeros/wrong-length data, the
   safe action is to return 0.0 (which fails the 0.99 floor) and let
   the user see the wgpu fallback log and re-run with --no-gpu.
4. Why 3 tests, not 1? The cosine gate has three cases to pin down
   (positive, negative, conservative-default). Each must be locked in
   independently, so that a refactor touching only one branch cannot
   silently weaken the others.
5. Why bounded? ~30 LOC helper + ~50 LOC tests = ~80 LOC total. No
   behavior change to the existing wgpu fallback path (helper is
   currently unused). Builds without --features cuda. Lays groundwork
   for the part b implementation PR (~100-150 LOC).

Net effect
- FALSIFY-CPU-GPU-005 status unchanged (still PARTIAL_ALGORITHM_LEVEL)
  but the cosine primitive needed for full discharge is now in place.
- Coverage tally unchanged — this is infrastructure, not a new bind.
- MODEL-1 ship % unchanged at 88%; the discharge happens when part b's
  wgpu single-step decode lands and uses this helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 3, 2026 20:19
@noahgift noahgift merged commit b9a5f3c into main May 3, 2026
10 checks passed
@noahgift noahgift deleted the feat/cpu-gpu-005-wgpu-cosine-parity-gate branch May 3, 2026 21:34
noahgift added a commit that referenced this pull request May 3, 2026
…gpu cosine parity gate (#1442)

Lands the wgpu cosine parity gate inline at try_apr_wgpu_inference
(crates/aprender-serve/src/infer/gguf_gpu_generate.rs ~line 441-510),
between kv_caches init and the autoregressive loop start. Closes the
implementation gap that contract v1.2.0 documented as deferred.

Algorithm (symmetric to FALSIFY-CPU-GPU-003 CUDA parity_gate):
1. Take input_tokens.first() as the probe token (typically BOS).
2. CPU reference logits via OwnedQuantizedModel::forward_single_with_cache
   with a tiny temporary OwnedQuantizedKVCache::from_config(cfg, 2) — gives
   reference logits without contaminating the real autoregressive cache.
3. wgpu single-step replay: same fwd.forward_layer code path the
   autoregressive loop uses, with a separate probe_kv_caches vec
   (max_seq=2). Output norm + LM head argmax math mirrors the loop body.
4. cpu_vs_gpu_cosine_similarity (helper from PR #1440 — module-scope, no
   --features cuda dep) → if !(cos.is_finite() && cos >= 0.99) emit
   WGPU_FALLBACK_LOG_PREFIX tagged stderr line and return None.
5. Probe error paths (CPU forward failure, wgpu probe layer failure)
   also emit the contract-tagged log + return None — fail-closed.

Cost: one extra forward pass at init (~2-5ms on 7B), paid once per
`apr run`, not per token. Real autoregressive kv_caches are NOT
touched by the probe.
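
The decision in steps 4 and 5 can be sketched as below. This is a minimal illustration, not the code at the call site: the `0.99` floor and the `!(cos.is_finite() && cos >= 0.99)` condition come from the commit message, while `FALLBACK_PREFIX_PLACEHOLDER`, `parity_gate_passes`, `check_probe`, and the `Result<f32, String>` probe type are hypothetical names and shapes. The real value of `WGPU_FALLBACK_LOG_PREFIX` is not shown here, so a placeholder string stands in for it.

```rust
/// Hypothetical placeholder; the real WGPU_FALLBACK_LOG_PREFIX value is not shown in this PR.
const FALLBACK_PREFIX_PLACEHOLDER: &str = "[wgpu-fallback-placeholder]";

/// Cosine floor quoted in the commit message.
const COSINE_FLOOR: f32 = 0.99;

/// True only for a finite cosine at or above the floor.
/// `NaN >= 0.99` is already false; `is_finite` additionally rejects +inf.
fn parity_gate_passes(cos: f32) -> bool {
    cos.is_finite() && cos >= COSINE_FLOOR
}

/// `Some(())` means "stay on the wgpu path"; `None` means "fall back to CPU".
/// A probe error is treated like a failed comparison: log and fall back.
fn check_probe(probe: Result<f32, String>) -> Option<()> {
    match probe {
        Ok(cos) if parity_gate_passes(cos) => Some(()),
        Ok(cos) => {
            eprintln!(
                "{} cosine {} is below the {} floor",
                FALLBACK_PREFIX_PLACEHOLDER, cos, COSINE_FLOOR
            );
            None
        }
        Err(reason) => {
            eprintln!("{} probe failed: {}", FALLBACK_PREFIX_PLACEHOLDER, reason);
            None
        }
    }
}
```

Folding the error arm into the same `None` result is what makes the gate fail-closed: a caller cannot distinguish "probe diverged" from "probe could not run", and in both cases it leaves the wgpu path.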

Contract v1.2.0 → v1.3.0 ACTIVE: FALSIFY-CPU-GPU-005 algorithm_evidence
updated to reference the implementation and the inline call site;
v1.3.0 changelog entry added; status remains PARTIAL_ALGORITHM_LEVEL
pending live broken-GPU smoke (~5min on canonical 7B teacher).

Five Whys
1. Why land part b now? §43.6 (a) bounded next-best lever; cosine
   helper from #1440 unblocked the impl path with no --features cuda
   dependency.
2. Why inline, not extracted helper? Loop body is ~30 LOC; extracting
   a separate fn would either pass 8+ borrowed locals (max_seq, eps,
   vocab_size, hidden_dim, num_layers, output_norm, lm_head_f32, fwd)
   or wrap them in a struct that exists for a single call site. Inline
   block scope localizes the temporary probe_kv_caches and shadows
   `hidden`/`normed`/`wgpu_logits` cleanly.
3. Why fail-closed on probe errors (return None instead of propagating)?
   Per feedback_fix_root_cause_never_route_around + §40/§41 jidoka:
   the gate's job is to NEVER ship silent gibberish. CPU probe failure
   or wgpu kernel failure both indicate the wgpu path is unsafe — the
   correct user experience is fall-to-CPU with a tagged stderr line,
   not crash or hide.
4. Why max_seq=2 for probe caches? Probe runs at position 0 with a
   single token. max_seq=1 would work, but max_seq=2 gives one slot of
   slack and matches the intuition that the
   OwnedQuantizedKVCache::from_config cap is a "max forward window",
   not an "exact" length.
5. Why bounded? ~70 LOC inline + ~9 LOC contract YAML. Builds clean
   with --features gpu. 696 aprender-serve tests pass, 0 regressions.
   The 3 cosine helper unit tests from #1440 still cover the math
   primitive used here.

Net effect
- MODEL-1 ship %: 88% → 89% (silent-gibberish loophole closed at the
  wgpu init boundary; SHIP-007 GPU kernel root-cause fix remains
  separate per §40).
- FALSIFY-CPU-GPU-005 status: PARTIAL_ALGORITHM_LEVEL (gate impl in
  place, live smoke deferred to a verification PR).
- Contract: v1.2.0 → v1.3.0 ACTIVE.
- pv validate exits 0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…osine helper for FALSIFY-CPU-GPU-005 part b (#1441)

Canonical record of today's split-track cycle (PRs #1438-#1440).
Maintains the §41/§42 amendment cadence — each /loop iteration that
lands ≥3 PRs gets a single audit story.

Chain landed:
- #1438: FALSIFY-APR-DISTILL-TRAIN-005 PARTIAL_ALGORITHM_LEVEL
  (precompute byte-determinism, 2 unit tests, local + remote-stub
  branches)
- #1439: FALSIFY-APR-DISTILL-TRAIN-006 PARTIAL_ALGORITHM_LEVEL
  (train cache-resume idempotency, 2 unit tests, negative + positive
  halves)
- #1440: cpu_vs_gpu_cosine_similarity helper at module scope + 3 tests
  (parallel=1, orthogonal=0, fail-closed; cosine math now callable
  without --features cuda for the future part b wgpu cosine gate)

§43 documents: what landed (table), coverage flips (TRAIN-005, TRAIN-006
unbound → PARTIAL_ALGORITHM_LEVEL), why for MODEL-1+MODEL-2 (parallel
contract drift closure + part b infrastructure), Five Whys, ship %
effects (MODEL-1 87→88, MODEL-2 54→56), and next-session pickup
options (CPU-GPU-005 part b OR distill-train real implementation).

Coverage tally: 15+33 → 15+35 (+2 PARTIAL_ALGORITHM_LEVEL closed).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…+ distill-train 9/9 sweep close (#1444)

Canonical record of today's continuation cycle (PRs #1442 + #1443).
Closes the two §43.6 next-session pickup items in one v2.89.0 amendment.

Chain landed (post-§43 v2.88.0):
- #1442: FALSIFY-CPU-GPU-005 part b implementation
  ~70 LOC inline at try_apr_wgpu_inference (gguf_gpu_generate.rs
  ~441-510). Probe-token CPU forward via
  OwnedQuantizedModel::forward_single_with_cache (tiny max_seq=2
  cache) + wgpu single-step replay using the same fwd.forward_layer
  code path the autoregressive loop uses + cosine compare via
  cpu_vs_gpu_cosine_similarity (helper from #1440). < 0.99 → emit
  WGPU_FALLBACK_LOG_PREFIX + return None. Probe error paths
  fail-closed. Symmetric to §41 CUDA parity_gate. Contract
  apr-cpu-vs-gpu-output-parity-v1 v1.2.0 → v1.3.0 ACTIVE.

- #1443: distill-train 9/9 falsifier sweep close
  TRAIN-007 PARTIAL via pv validate (live: 0 errors / 0 warnings).
  TRAIN-008 PARTIAL via cargo test cli_commands registered_commands
  (live: 1 pass; test_no_unregistered_commands enforces the 3-surface
  invariant per feedback_cli_subcommand_three_surface_drift).
  TRAIN-009 BLOCKER_FIXTURE_ABSENT pending §35 real-training impl
  (no val_loss to compare without gradient descent).
  All 9 TRAIN-* falsifiers now have explicit algorithm_evidence
  blocks (8× PARTIAL_ALGORITHM_LEVEL + 1× BLOCKER_FIXTURE_ABSENT) —
  the distill contract has reached terminal-binding state.

§44 documents: what landed (table), coverage flips (FALSIFY-CPU-GPU-005
PARTIAL→PARTIAL deeper, TRAIN-007/008 unbound→PARTIAL, TRAIN-009
unbound→BLOCKER), why for MODEL-1+MODEL-2 (jidoka armor complete +
distill contract terminal-bound), Five Whys, ship % effects (MODEL-1
88→89, MODEL-2 56→57), and next-session pickup options (live
FALSIFY-CPU-GPU-005 discharge OR MODEL-2 §35 real-training OR
MODEL-1 SHIP-007 GPU kernel root-cause fix).

Coverage tally: 15+35 → 15+37 (+2 PARTIAL closed; TRAIN-009 blocked).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>