feat(apr-cpu-vs-gpu-output-parity-v1): cpu_vs_gpu_cosine_similarity helper for FALSIFY-CPU-GPU-005 part b #1440
Merged
…ty helper for FALSIFY-CPU-GPU-005 part b

Lifts the cosine-similarity primitive from `cuda::mod_parity_gate` (which lives behind `cfg(feature = "cuda")`) into `infer/gguf_gpu_generate.rs` at module scope so the future wgpu cosine gate (predicted by contract v1.2.0 FALSIFY-CPU-GPU-005 part b implementation_evidence line 201) can compare a wgpu single-step decode against a CPU reference forward without taking a `--features cuda` build dependency.

Helper signature:

    pub(crate) fn cpu_vs_gpu_cosine_similarity(a: &[f32], b: &[f32]) -> f32

Numerically-stable f64-accumulated; fail-closed semantics: returns 0.0 on length-mismatch / zero-norm / empty input so the future gate TRIGGERS fallback (cosine 0.0 < 0.99 floor) rather than dividing by zero or panicking.

3 unit tests added in `mod tests`:
- cpu_vs_gpu_cosine_similarity_parallel_returns_one: locks the positive case (gate must NOT trigger fallback when wgpu = CPU).
- cpu_vs_gpu_cosine_similarity_orthogonal_returns_zero: locks the negative case (gate MUST trigger fallback when divergent).
- cpu_vs_gpu_cosine_similarity_fails_closed: locks the conservative-default case for zero-norm / length-mismatch / empty.

Five Whys
1. Why lift the cosine helper now?
   Because part b's implementation gap (per contract notes line 201-202) is the single piece of work that would close FALSIFY-CPU-GPU-005 from PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL. The helper is the smallest piece of that work that can land independently and without --features cuda dependency.
2. Why not just import from `cuda::mod_parity_gate`?
   That module is `cfg(feature = "cuda")`-gated; importing into the wgpu codepath (gated `cfg(feature = "gpu")`, NOT cuda) would force users to enable both features just to get the cosine math.
3. Why fail-closed on bad input?
   Per `feedback_fix_root_cause_never_route_around` and the spec §40+§41 jidoka pattern: the gate must NEVER ship silent gibberish. If the probe produces NaN/zeros/wrong-length data, the safe action is to return 0.0 (which fails the 0.99 floor) and let the user see the wgpu fallback log and re-run with --no-gpu.
4. Why 3 tests, not 1?
   The cosine surface has three failure modes (positive, negative, conservative-default). Each must be locked in independently: a refactor that touches only one branch must not silently weaken the others.
5. Why bounded?
   ~30 LOC helper + ~50 LOC tests = ~80 LOC total. No behavior change to the existing wgpu fallback path (helper is currently unused). Builds without --features cuda. Lays groundwork for the part b implementation PR (~100-150 LOC).

Net effect
- FALSIFY-CPU-GPU-005 status unchanged (still PARTIAL_ALGORITHM_LEVEL) but the cosine primitive needed for full discharge is now in place.
- Coverage tally unchanged; this is infrastructure, not a new bind.
- MODEL-1 ship % unchanged at 88%; the discharge happens when part b's wgpu single-step decode lands and uses this helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 3, 2026
…gpu cosine parity gate (#1442)

Lands the wgpu cosine parity gate inline at try_apr_wgpu_inference (crates/aprender-serve/src/infer/gguf_gpu_generate.rs ~line 441-510), between kv_caches init and the autoregressive loop start. Closes the implementation gap that contract v1.2.0 documented as deferred.

Algorithm (symmetric to FALSIFY-CPU-GPU-003 CUDA parity_gate):
1. Take input_tokens.first() as the probe token (typically BOS).
2. CPU reference logits via OwnedQuantizedModel::forward_single_with_cache with a tiny temporary OwnedQuantizedKVCache::from_config(cfg, 2); this gives reference logits without contaminating the real autoregressive cache.
3. wgpu single-step replay: same fwd.forward_layer code path the autoregressive loop uses, with a separate probe_kv_caches vec (max_seq=2). Output norm + LM head argmax math mirrors the loop body.
4. cpu_vs_gpu_cosine_similarity (helper from PR #1440; module-scope, no --features cuda dep) → if !(cos.is_finite() && cos >= 0.99) emit WGPU_FALLBACK_LOG_PREFIX tagged stderr line and return None.
5. Probe error paths (CPU forward failure, wgpu probe layer failure) also emit the contract-tagged log + return None (fail-closed).

Cost: one extra forward pass at init (~2-5ms on 7B), paid once per `apr run`, not per token. Real autoregressive kv_caches are NOT touched by the probe.

Contract v1.2.0 → v1.3.0 ACTIVE: FALSIFY-CPU-GPU-005 algorithm_evidence updated to reference the implementation and the inline call site; v1.3.0 changelog entry added; status remains PARTIAL_ALGORITHM_LEVEL pending live broken-GPU smoke (~5min on canonical 7B teacher).

Five Whys
1. Why land part b now?
   §43.6 (a) bounded next-best lever; cosine helper from #1440 unblocked the impl path with no --features cuda dependency.
2. Why inline, not extracted helper?
   Loop body is ~30 LOC; extracting a separate fn would either pass 8+ borrowed locals (max_seq, eps, vocab_size, hidden_dim, num_layers, output_norm, lm_head_f32, fwd) or wrap them in a struct that exists for a single call site. Inline block scope localizes the temporary probe_kv_caches and shadows `hidden`/`normed`/`wgpu_logits` cleanly.
3. Why fail-closed on probe errors (return None instead of propagating)?
   Per feedback_fix_root_cause_never_route_around + §40/§41 jidoka: the gate's job is to NEVER ship silent gibberish. CPU probe failure or wgpu kernel failure both indicate the wgpu path is unsafe; the correct user experience is fall-to-CPU with a tagged stderr line, not crash or hide.
4. Why max_seq=2 for probe caches?
   Probe runs at position 0 with a single token. max_seq=1 would work but max_seq=2 gives one slot of slack and matches the OwnedQuantizedKVCache::from_config minimum intuition (cap is "max forward window", not "exact").
5. Why bounded?
   ~70 LOC inline + ~9 LOC contract YAML. Builds clean with --features gpu. 696 aprender-serve tests pass, 0 regressions. The 3 cosine helper unit tests from #1440 still cover the math primitive used here.

Net effect
- MODEL-1 ship %: 88% → 89% (silent-gibberish loophole closed at the wgpu init boundary; SHIP-007 GPU kernel root-cause fix remains separate per §40).
- FALSIFY-CPU-GPU-005 status: PARTIAL_ALGORITHM_LEVEL (gate impl in place, live smoke deferred to a verification PR).
- Contract: v1.2.0 → v1.3.0 ACTIVE.
- pv validate exits 0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
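The pass/fail decision in step 4 of the commit message above can be sketched in isolation. This is a minimal illustration, not the code from #1442: `parity_gate_passes`, `check_probe`, and `COSINE_FLOOR` are hypothetical names, and the string given to `WGPU_FALLBACK_LOG_PREFIX` is a placeholder because the real value is not shown on this page. Only the predicate `cos.is_finite() && cos >= 0.99` and the "log to stderr, return None" behavior come from the commit message.

```rust
/// Placeholder value (assumption): the real WGPU_FALLBACK_LOG_PREFIX string
/// is defined in the crate and is not shown in this PR.
const WGPU_FALLBACK_LOG_PREFIX: &str = "[wgpu-fallback]";

/// The 0.99 floor named in the commit message (constant name is hypothetical).
const COSINE_FLOOR: f32 = 0.99;

/// True only when the cosine is a finite number at or above the floor.
/// Written as a positive check so that NaN, whose comparisons are always
/// false, can never pass the gate.
fn parity_gate_passes(cos: f32) -> bool {
    cos.is_finite() && cos >= COSINE_FLOOR
}

/// Hypothetical wrapper: `Some(())` means "continue on wgpu",
/// `None` means "fall back to CPU" after emitting a tagged stderr line.
fn check_probe(cos: f32) -> Option<()> {
    if !parity_gate_passes(cos) {
        eprintln!(
            "{} cosine {} did not reach floor {}; falling back to CPU",
            WGPU_FALLBACK_LOG_PREFIX, cos, COSINE_FLOOR
        );
        return None;
    }
    Some(())
}
```

A simpler-looking check such as `cos < 0.99` would let a NaN cosine through, because every comparison against NaN is false. The negated positive form avoids that, which appears to be why the commit message spells out the `is_finite` test.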
noahgift added a commit that referenced this pull request on May 3, 2026
…osine helper for FALSIFY-CPU-GPU-005 part b (#1441)

Canonical record of today's split-track cycle (PRs #1438-#1440). Maintains the §41/§42 amendment cadence: each /loop iteration that lands ≥3 PRs gets a single audit story.

Chain landed:
- #1438: FALSIFY-APR-DISTILL-TRAIN-005 PARTIAL_ALGORITHM_LEVEL (precompute byte-determinism, 2 unit tests, local + remote-stub branches)
- #1439: FALSIFY-APR-DISTILL-TRAIN-006 PARTIAL_ALGORITHM_LEVEL (train cache-resume idempotency, 2 unit tests, negative + positive halves)
- #1440: cpu_vs_gpu_cosine_similarity helper at module scope + 3 tests (parallel=1, orthogonal=0, fail-closed; cosine math now callable without --features cuda for the future part b wgpu cosine gate)

§43 documents: what landed (table), coverage flips (TRAIN-005, TRAIN-006 unbound → PARTIAL_ALGORITHM_LEVEL), why for MODEL-1+MODEL-2 (parallel contract drift closure + part b infrastructure), Five Whys, ship % effects (MODEL-1 87→88, MODEL-2 54→56), and next-session pickup options (CPU-GPU-005 part b OR distill-train real implementation).

Coverage tally: 15+33 → 15+35 (+2 PARTIAL_ALGORITHM_LEVEL closed).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 3, 2026
…+ distill-train 9/9 sweep close (#1444)

Canonical record of today's continuation cycle (PRs #1442 + #1443). Closes the two §43.6 next-session pickup items in one v2.89.0 amendment.

Chain landed (post-§43 v2.88.0):
- #1442: FALSIFY-CPU-GPU-005 part b implementation
  ~70 LOC inline at try_apr_wgpu_inference (gguf_gpu_generate.rs ~441-510). Probe-token CPU forward via OwnedQuantizedModel::forward_single_with_cache (tiny max_seq=2 cache) + wgpu single-step replay using the same fwd.forward_layer code path the autoregressive loop uses + cosine compare via cpu_vs_gpu_cosine_similarity (helper from #1440). < 0.99 → emit WGPU_FALLBACK_LOG_PREFIX + return None. Probe error paths fail-closed. Symmetric to §41 CUDA parity_gate. Contract apr-cpu-vs-gpu-output-parity-v1 v1.2.0 → v1.3.0 ACTIVE.
- #1443: distill-train 9/9 falsifier sweep close
  - TRAIN-007 PARTIAL via pv validate (live: 0 errors / 0 warnings).
  - TRAIN-008 PARTIAL via cargo test cli_commands registered_commands (live: 1 pass; test_no_unregistered_commands enforces the 3-surface invariant per feedback_cli_subcommand_three_surface_drift).
  - TRAIN-009 BLOCKER_FIXTURE_ABSENT pending §35 real-training impl (no val_loss to compare without gradient descent).
  All 9 TRAIN-* falsifiers now have explicit algorithm_evidence blocks (8× PARTIAL_ALGORITHM_LEVEL + 1× BLOCKER_FIXTURE_ABSENT): the distill contract has reached terminal-binding state.

§44 documents: what landed (table), coverage flips (FALSIFY-CPU-GPU-005 PARTIAL→PARTIAL deeper, TRAIN-007/008 unbound→PARTIAL, TRAIN-009 unbound→BLOCKER), why for MODEL-1+MODEL-2 (jidoka armor complete + distill contract terminal-bound), Five Whys, ship % effects (MODEL-1 88→89, MODEL-2 56→57), and next-session pickup options (live FALSIFY-CPU-GPU-005 discharge OR MODEL-2 §35 real-training OR MODEL-1 SHIP-007 GPU kernel root-cause fix).

Coverage tally: 15+35 → 15+37 (+2 PARTIAL closed; TRAIN-009 blocked).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Lifts the cosine-similarity primitive from `cuda::mod_parity_gate` (which lives behind `cfg(feature = "cuda")`) into `infer/gguf_gpu_generate.rs` at module scope. The future wgpu cosine gate (FALSIFY-CPU-GPU-005 part b, contract v1.2.0 implementation_evidence line 201) can now call this helper without a `--features cuda` build dependency.
Helper
```rust
pub(crate) fn cpu_vs_gpu_cosine_similarity(a: &[f32], b: &[f32]) -> f32
```
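The diff itself is not reproduced on this page, so the following is only a sketch of what a body matching the described contract could look like, assuming f64 accumulation and a 0.0 return for empty, length-mismatched, or zero-norm input. The extra guard against a non-finite denominator is an assumption added here for NaN or infinite inputs; the merged helper may handle those differently.

```rust
/// Sketch (not the merged code): cosine similarity of two f32 slices,
/// accumulated in f64, returning 0.0 on any input the gate should reject.
pub(crate) fn cpu_vs_gpu_cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    // Fail closed: empty or mismatched inputs score 0.0, below the 0.99 floor.
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }

    // Accumulate the dot product and squared norms in f64 to limit
    // rounding error over long logit vectors.
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }

    let denom = norm_a.sqrt() * norm_b.sqrt();
    // A zero norm would divide by zero. The non-finite check (an assumption
    // of this sketch) also maps NaN or infinite inputs to 0.0.
    if denom == 0.0 || !denom.is_finite() {
        return 0.0;
    }

    (dot / denom) as f32
}
```

Under this sketch, two parallel vectors such as `[1.0, 2.0, 3.0]` and `[2.0, 4.0, 6.0]` score approximately 1.0, the orthogonal pair `[1.0, 0.0]` and `[0.0, 1.0]` scores 0.0, and an empty, mismatched, or all-zero input also scores 0.0, matching the three cases the unit tests are described as locking in.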
Tests (3 pass)
Five Whys
Net effect
Test plan
🤖 Generated with Claude Code