feat(apr-cpu-vs-gpu-output-parity-v1): cpu_vs_gpu_cosine_similarity helper for FALSIFY-CPU-GPU-005 part b by noahgift · Pull Request #1440 · paiml/aprender

noahgift · 2026-05-03T20:19:15Z

Summary

Lifts the cosine-similarity primitive from `cuda::mod_parity_gate` (which lives behind `cfg(feature = "cuda")`) into `infer/gguf_gpu_generate.rs` at module scope. The future wgpu cosine gate (FALSIFY-CPU-GPU-005 part b, contract v1.2.0 implementation_evidence line 201) can now call this helper without a `--features cuda` build dependency.

Helper

```rust
pub(crate) fn cpu_vs_gpu_cosine_similarity(a: &[f32], b: &[f32]) -> f32
```

Accumulates in f64 for numerical stability
Fail-closed: returns 0.0 on length mismatch, zero norm, or empty input, so the future gate triggers fallback (0.0 is below the 0.99 floor)
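
For readers who want to see the shape of such a helper, the following is a minimal sketch of an f64-accumulated, fail-closed cosine similarity. It is not the code merged in this PR: the name `cosine_similarity_sketch` is hypothetical, and mapping NaN or infinite inputs to 0.0 is an assumption based on the "NaN/zero/wrong-length input" wording in the Five Whys.

```rust
/// Hypothetical sketch of a fail-closed cosine similarity.
/// Not the merged `cpu_vs_gpu_cosine_similarity`; it only illustrates the described semantics.
pub fn cosine_similarity_sketch(a: &[f32], b: &[f32]) -> f32 {
    // Fail closed on empty or length-mismatched input.
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }

    // Accumulate the dot product and squared norms in f64 to limit rounding error.
    let (mut dot, mut norm_a, mut norm_b) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }

    // Zero norm would divide by zero; a non-finite norm means NaN/inf input.
    // Both cases return 0.0 (the NaN/inf handling is an assumption of this sketch).
    let denom = norm_a.sqrt() * norm_b.sqrt();
    if !(denom.is_finite() && denom > 0.0) {
        return 0.0;
    }

    let cos = dot / denom;
    if cos.is_finite() {
        cos as f32
    } else {
        0.0
    }
}
```

Returning 0.0 rather than an error or a panic keeps the caller's logic to a single threshold comparison: any bad input lands below the 0.99 floor and takes the fallback path.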

Tests (3 pass)

`cpu_vs_gpu_cosine_similarity_parallel_returns_one` — positive case (no fallback)
`cpu_vs_gpu_cosine_similarity_orthogonal_returns_zero` — negative case (fallback triggers)
`cpu_vs_gpu_cosine_similarity_fails_closed` — conservative-default for bad input

Five Whys

Why lift now? Part b is the single piece of work that closes FALSIFY-CPU-GPU-005 from PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL. The helper is the smallest independently-landable piece.
Why not import from cuda? That module is `cfg(feature = "cuda")`-gated; the wgpu path is gated `cfg(feature = "gpu")` (not cuda) — importing would force both features for cosine math.
Why fail-closed? Per `feedback_fix_root_cause_never_route_around` and §40/§41 jidoka: the gate must NEVER ship silent gibberish. NaN/zero/wrong-length input → return 0.0 → user sees the fallback log.
Why 3 tests? The cosine surface has 3 cases (positive, negative, conservative-default). Each must be locked in independently so that a refactor touching one branch cannot silently weaken the others.
Why bounded? ~80 LOC total. No behavior change (helper currently unused). Builds without `--features cuda`. Lays groundwork for part b PR (~100-150 LOC).

Net effect

FALSIFY-CPU-GPU-005 status unchanged (still PARTIAL_ALGORITHM_LEVEL)
MODEL-1 ship % unchanged at 88%
Discharge happens when part b's wgpu single-step decode lands and uses this helper

Test plan

`cargo test -p aprender-serve --lib cpu_vs_gpu_cosine` (3 pass)
CI green on required gates

🤖 Generated with Claude Code

…ty helper for FALSIFY-CPU-GPU-005 part b

Lifts the cosine-similarity primitive from `cuda::mod_parity_gate` (which lives behind `cfg(feature = "cuda")`) into `infer/gguf_gpu_generate.rs` at module scope so the future wgpu cosine gate (predicted by contract v1.2.0 FALSIFY-CPU-GPU-005 part b implementation_evidence line 201) can compare a wgpu single-step decode against a CPU reference forward without taking a `--features cuda` build dependency.

Helper signature: `pub(crate) fn cpu_vs_gpu_cosine_similarity(a: &[f32], b: &[f32]) -> f32`

Numerically-stable f64-accumulated; fail-closed semantics: returns 0.0 on length-mismatch / zero-norm / empty input so the future gate TRIGGERS fallback (cosine 0.0 < 0.99 floor) rather than dividing by zero or panicking.

3 unit tests added in `mod tests`:
- `cpu_vs_gpu_cosine_similarity_parallel_returns_one`: locks the positive case (gate must NOT trigger fallback when wgpu = CPU).
- `cpu_vs_gpu_cosine_similarity_orthogonal_returns_zero`: locks the negative case (gate MUST trigger fallback when divergent).
- `cpu_vs_gpu_cosine_similarity_fails_closed`: locks the conservative-default case for zero-norm / length-mismatch / empty.

Five Whys

1. Why lift the cosine helper now? Because part b's implementation gap (per contract notes line 201-202) is the single piece of work that would close FALSIFY-CPU-GPU-005 from PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL. The helper is the smallest piece of that work that can land independently and without a `--features cuda` dependency.
2. Why not just import from `cuda::mod_parity_gate`? That module is `cfg(feature = "cuda")`-gated; importing into the wgpu codepath (gated `cfg(feature = "gpu")`, NOT cuda) would force users to enable both features just to get the cosine math.
3. Why fail-closed on bad input? Per `feedback_fix_root_cause_never_route_around` and the spec §40+§41 jidoka pattern: the gate must NEVER ship silent gibberish. If the probe produces NaN/zeros/wrong-length data, the safe action is to return 0.0 (which fails the 0.99 floor) and let the user see the wgpu fallback log and re-run with `--no-gpu`.
4. Why 3 tests, not 1? The cosine surface has three failure modes (positive, negative, conservative-default). Each must be locked in independently: a refactor that touches only one branch must not silently weaken the others.
5. Why bounded? ~30 LOC helper + ~50 LOC tests = ~80 LOC total. No behavior change to the existing wgpu fallback path (helper is currently unused). Builds without `--features cuda`. Lays groundwork for the part b implementation PR (~100-150 LOC).

Net effect

- FALSIFY-CPU-GPU-005 status unchanged (still PARTIAL_ALGORITHM_LEVEL) but the cosine primitive needed for full discharge is now in place.
- Coverage tally unchanged: this is infrastructure, not a new bind.
- MODEL-1 ship % unchanged at 88%; the discharge happens when part b's wgpu single-step decode lands and uses this helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…gpu cosine parity gate (#1442)

Lands the wgpu cosine parity gate inline at `try_apr_wgpu_inference` (crates/aprender-serve/src/infer/gguf_gpu_generate.rs ~line 441-510), between kv_caches init and the autoregressive loop start. Closes the implementation gap that contract v1.2.0 documented as deferred.

Algorithm (symmetric to FALSIFY-CPU-GPU-003 CUDA parity_gate):

1. Take `input_tokens.first()` as the probe token (typically BOS).
2. CPU reference logits via `OwnedQuantizedModel::forward_single_with_cache` with a tiny temporary `OwnedQuantizedKVCache::from_config(cfg, 2)`, which gives reference logits without contaminating the real autoregressive cache.
3. wgpu single-step replay: same `fwd.forward_layer` code path the autoregressive loop uses, with a separate `probe_kv_caches` vec (max_seq=2). Output norm + LM head argmax math mirrors the loop body.
4. `cpu_vs_gpu_cosine_similarity` (helper from PR #1440: module-scope, no `--features cuda` dep) → if `!(cos.is_finite() && cos >= 0.99)` emit a `WGPU_FALLBACK_LOG_PREFIX`-tagged stderr line and return None.
5. Probe error paths (CPU forward failure, wgpu probe layer failure) also emit the contract-tagged log + return None: fail-closed.

Cost: one extra forward pass at init (~2-5ms on 7B), paid once per `apr run`, not per token. Real autoregressive kv_caches are NOT touched by the probe.

Contract v1.2.0 → v1.3.0 ACTIVE: FALSIFY-CPU-GPU-005 algorithm_evidence updated to reference the implementation and the inline call site; v1.3.0 changelog entry added; status remains PARTIAL_ALGORITHM_LEVEL pending live broken-GPU smoke (~5min on canonical 7B teacher).

Five Whys

1. Why land part b now? §43.6 (a) bounded next-best lever; cosine helper from #1440 unblocked the impl path with no `--features cuda` dependency.
2. Why inline, not extracted helper? Loop body is ~30 LOC; extracting a separate fn would either pass 8+ borrowed locals (max_seq, eps, vocab_size, hidden_dim, num_layers, output_norm, lm_head_f32, fwd) or wrap them in a struct that exists for a single call site. Inline block scope localizes the temporary `probe_kv_caches` and shadows `hidden`/`normed`/`wgpu_logits` cleanly.
3. Why fail-closed on probe errors (return None instead of propagating)? Per `feedback_fix_root_cause_never_route_around` + §40/§41 jidoka: the gate's job is to NEVER ship silent gibberish. CPU probe failure or wgpu kernel failure both indicate the wgpu path is unsafe; the correct user experience is fall-to-CPU with a tagged stderr line, not crash or hide.
4. Why max_seq=2 for probe caches? Probe runs at position 0 with a single token. max_seq=1 would work but max_seq=2 gives one slot of slack and matches the `OwnedQuantizedKVCache::from_config` minimum intuition (cap is "max forward window", not "exact").
5. Why bounded? ~70 LOC inline + ~9 LOC contract YAML. Builds clean with `--features gpu`. 696 aprender-serve tests pass, 0 regressions. The 3 cosine helper unit tests from #1440 still cover the math primitive used here.

Net effect

- MODEL-1 ship %: 88% → 89% (silent-gibberish loophole closed at the wgpu init boundary; SHIP-007 GPU kernel root-cause fix remains separate per §40).
- FALSIFY-CPU-GPU-005 status: PARTIAL_ALGORITHM_LEVEL (gate impl in place, live smoke deferred to a verification PR).
- Contract: v1.2.0 → v1.3.0 ACTIVE.
- pv validate exits 0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
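
The decision in step 4 of the #1442 commit message can be sketched in isolation. This is only an illustration under stated assumptions: `parity_gate_passes`, `gate_or_fallback`, and the prefix string are hypothetical names and values (the real value of `WGPU_FALLBACK_LOG_PREFIX` does not appear in this page), and the generic `T` stands in for whatever the wgpu path would return.

```rust
/// Hypothetical placeholder; the real WGPU_FALLBACK_LOG_PREFIX value is not shown in the PR text.
const FALLBACK_LOG_PREFIX_SKETCH: &str = "[wgpu-fallback-sketch]";

/// Cosine floor quoted in the PR description.
const COSINE_FLOOR: f32 = 0.99;

/// True only when the cosine is a finite number at or above the floor.
/// NaN and infinities fail the check, so the gate is fail-closed.
pub fn parity_gate_passes(cos: f32) -> bool {
    cos.is_finite() && cos >= COSINE_FLOOR
}

/// Returns `Some(gpu_result)` when the gate passes; otherwise logs a tagged
/// line to stderr and returns `None` so the caller can fall back to CPU.
pub fn gate_or_fallback<T>(cos: f32, gpu_result: T) -> Option<T> {
    if !parity_gate_passes(cos) {
        eprintln!(
            "{} cosine {} below floor {}; falling back to CPU",
            FALLBACK_LOG_PREFIX_SKETCH, cos, COSINE_FLOOR
        );
        return None;
    }
    Some(gpu_result)
}
```

Writing the check as `cos.is_finite() && cos >= 0.99` and negating it, rather than testing `cos < 0.99`, matters for NaN: every comparison with NaN is false, so a `cos < 0.99` test alone would let a NaN cosine through.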

…osine helper for FALSIFY-CPU-GPU-005 part b (#1441)

Canonical record of today's split-track cycle (PRs #1438-#1440). Maintains the §41/§42 amendment cadence: each /loop iteration that lands ≥3 PRs gets a single audit story.

Chain landed:
- #1438: FALSIFY-APR-DISTILL-TRAIN-005 PARTIAL_ALGORITHM_LEVEL (precompute byte-determinism, 2 unit tests, local + remote-stub branches)
- #1439: FALSIFY-APR-DISTILL-TRAIN-006 PARTIAL_ALGORITHM_LEVEL (train cache-resume idempotency, 2 unit tests, negative + positive halves)
- #1440: cpu_vs_gpu_cosine_similarity helper at module scope + 3 tests (parallel=1, orthogonal=0, fail-closed; cosine math now callable without `--features cuda` for the future part b wgpu cosine gate)

§43 documents: what landed (table), coverage flips (TRAIN-005, TRAIN-006 unbound → PARTIAL_ALGORITHM_LEVEL), why for MODEL-1+MODEL-2 (parallel contract drift closure + part b infrastructure), Five Whys, ship % effects (MODEL-1 87→88, MODEL-2 54→56), and next-session pickup options (CPU-GPU-005 part b OR distill-train real implementation).

Coverage tally: 15+33 → 15+35 (+2 PARTIAL_ALGORITHM_LEVEL closed).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…+ distill-train 9/9 sweep close (#1444)

Canonical record of today's continuation cycle (PRs #1442 + #1443). Closes the two §43.6 next-session pickup items in one v2.89.0 amendment.

Chain landed (post-§43 v2.88.0):
- #1442: FALSIFY-CPU-GPU-005 part b implementation, ~70 LOC inline at `try_apr_wgpu_inference` (gguf_gpu_generate.rs ~441-510). Probe-token CPU forward via `OwnedQuantizedModel::forward_single_with_cache` (tiny max_seq=2 cache) + wgpu single-step replay using the same `fwd.forward_layer` code path the autoregressive loop uses + cosine compare via `cpu_vs_gpu_cosine_similarity` (helper from #1440). < 0.99 → emit `WGPU_FALLBACK_LOG_PREFIX` + return None. Probe error paths fail-closed. Symmetric to §41 CUDA parity_gate. Contract apr-cpu-vs-gpu-output-parity-v1 v1.2.0 → v1.3.0 ACTIVE.
- #1443: distill-train 9/9 falsifier sweep close:
  - TRAIN-007 PARTIAL via pv validate (live: 0 errors / 0 warnings).
  - TRAIN-008 PARTIAL via cargo test cli_commands registered_commands (live: 1 pass; test_no_unregistered_commands enforces the 3-surface invariant per feedback_cli_subcommand_three_surface_drift).
  - TRAIN-009 BLOCKER_FIXTURE_ABSENT pending §35 real-training impl (no val_loss to compare without gradient descent).

All 9 TRAIN-* falsifiers now have explicit algorithm_evidence blocks (8× PARTIAL_ALGORITHM_LEVEL + 1× BLOCKER_FIXTURE_ABSENT); the distill contract has reached terminal-binding state.

§44 documents: what landed (table), coverage flips (FALSIFY-CPU-GPU-005 PARTIAL→PARTIAL deeper, TRAIN-007/008 unbound→PARTIAL, TRAIN-009 unbound→BLOCKER), why for MODEL-1+MODEL-2 (jidoka armor complete + distill contract terminal-bound), Five Whys, ship % effects (MODEL-1 88→89, MODEL-2 56→57), and next-session pickup options (live FALSIFY-CPU-GPU-005 discharge OR MODEL-2 §35 real-training OR MODEL-1 SHIP-007 GPU kernel root-cause fix).

Coverage tally: 15+35 → 15+37 (+2 PARTIAL closed; TRAIN-009 blocked).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 3, 2026 20:19

Merge branch 'main' into feat/cpu-gpu-005-wgpu-cosine-parity-gate

e442b66

noahgift mentioned this pull request May 3, 2026

spec(ship-two-models): v2.88.0 — §43 distill-train algorithm-bind + cosine helper for FALSIFY-CPU-GPU-005 part b #1441

Merged

1 task

Merge branch 'main' into feat/cpu-gpu-005-wgpu-cosine-parity-gate

a363d33

noahgift merged commit b9a5f3c into main May 3, 2026
10 checks passed

noahgift deleted the feat/cpu-gpu-005-wgpu-cosine-parity-gate branch May 3, 2026 21:34

noahgift mentioned this pull request May 3, 2026

feat(apr-cpu-vs-gpu-output-parity-v1): FALSIFY-CPU-GPU-005 part b — wgpu cosine parity gate #1442

Merged

5 tasks

noahgift mentioned this pull request May 3, 2026

spec(ship-two-models): v2.89.0 — §44 FALSIFY-CPU-GPU-005 part b impl + distill-train 9/9 sweep close #1444

Merged

1 task

noahgift merged 3 commits into main from feat/cpu-gpu-005-wgpu-cosine-parity-gate
