test(m-gpu-moe-3): PR-3g multi-prompt argmax agreement — L47 NOT BENIGN (#1583) by noahgift · Pull Request #1745 · paiml/aprender

noahgift · 2026-05-17T11:02:45Z

Summary

PR-3g of the #1583 M-GPU-MOE-3 cascade. Adds the canonical "is L47 actually user-visible" falsifier — runs 4 canonical prompts through both CPU and GPU full forwards, prints the argmax agreement table + verdict.

Hardware result (lambda-vector RTX 4090, 2026-05-17)

PROMPT             | CPU argmax (val)     | GPU argmax (val)
canonical_3tok     |    944 ( 13.7270)  |    944 ( 14.4133)  ✓
single_tok_785     |    220 ( 15.5523)  |     25 ( 18.5098)  ✗ MISMATCH
multi_tok_short    |    315 ( 26.2279)  |    315 ( 25.5230)  ✓
multi_tok_code     |    198 ( 17.7453)  |    198 ( 17.8433)  ✓

3/4 prompts agree, 1 disagrees. L47 cliff is NOT benign — the expert-set divergence DOES flip the top-1 predicted token for some prompts (~25% in this small sample). Option E (Accept) is off the table; must pursue Option C (fp64 in per-expert SwiGLU intermediates).

Cascade falsifier sequence

The full cascade is now:

PR	Result	Why this fix didn't close L47
PR-2 #1737	✅ closed 47/48 layers	fp64 q6k_gemv acc was necessary but not sufficient
PR-3d	H(i) FALSIFIED	qtypes are identical L0/L46/L47
PR-3e2 #1743	H(ii) CONFIRMED	2 of 8 experts swap at L47
PR-3f1	FALSIFIED	drift is upstream of softmax
PR-3f2	FALSIFIED	drift is upstream of weighted-sum
PR-3g (this)	L47 NOT BENIGN	argmax flips on 1/4 prompts
PR-3h (next)	pending	Option C: fp64 in per-expert SwiGLU

By elimination, the drift source must be inside each per-expert SwiGLU's intermediate chain: silu(gate) × up at f32 + the hidden-dim×4 intermediate state at f32 (the down-proj q6k_gemv acc is already fp64 thanks to PR-2).

What this PR adds

crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs:

new test falsify_qw3_moe_gpu_argmax_agreement — multi-prompt PROBE that builds CPU + GPU models once, runs 4 canonical prompts, prints agreement table + verdict (BENIGN / NOT BENIGN + diff). Same #[ignore] + #[cfg(feature = \"cuda\")] gates as siblings.

Test plan

Compiles with --features cuda
Runs on RTX 4090 in 24.45s
Verdict line classifies BENIGN / NOT BENIGN
Prints disagreement details so PR-3h knows which prompts to test against
PROBE not hard-fail (no assertion)

Reproduction

cargo test --release --features cuda \
  -p aprender-serve --test qwen3_moe_gpu_parity \
  falsify_qw3_moe_gpu_argmax_agreement \
  -- --ignored --nocapture

🤖 Generated with Claude Code

…GN (#1583) PR-3g of the M-GPU-MOE-3 cascade. Adds the canonical "is L47 actually user-visible" falsifier, runs 4 canonical prompts through both CPU and GPU full forwards, and asserts argmax agreement. ## Result (lambda-vector RTX 4090, 2026-05-17) PROMPT | CPU argmax (val) | GPU argmax (val) canonical_3tok | 944 ( 13.7270) | 944 ( 14.4133) ✓ single_tok_785 | 220 ( 15.5523) | 25 ( 18.5098) ✗ MISMATCH multi_tok_short | 315 ( 26.2279) | 315 ( 25.5230) ✓ multi_tok_code | 198 ( 17.7453) | 198 ( 17.8433) ✓ **3/4 prompts agree, 1 disagrees.** L47 cliff is NOT benign — the expert-set divergence DOES flip the top-1 predicted token for some prompts (~25% in this small sample). Option E (Accept) is off the table; must pursue Option C (fp64 in per-expert SwiGLU). ## What this PR adds crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs: + new test `falsify_qw3_moe_gpu_argmax_agreement` — multi-prompt probe that builds CPU + GPU models once, runs 4 canonical prompts through both full forwards, and prints argmax agreement table + verdict. PROBE not hard-assert; prints "BENIGN" if all agree or "NOT BENIGN" + disagreeing prompts otherwise. ## Cascade context - PR-1 #1713 ✅ per-layer cos falsifier - PR-2 #1737 ✅ q6k_gemv fp64 accumulators - PR-3 ✅ hardware verify — 47/48 PASS, L47 surfaces - PR-3b #1739 ✅ contract v1.7.0 → v1.7.1 - PR-3c #1740 ✅ scope-doc + L47 sub-cascade - PR-3d ✅ H(i) qtype-mismatch FALSIFIED - PR-3e #1741 ✅ router-weight probe - PR-3e2 #1743 ✅ H(ii) CONFIRMED (2-of-8 expert swap) - PR-3f1 ❌ falsified (fp64 softmax) — dropped - PR-3f2 ❌ falsified (f64 weighted-sum) — dropped - PR-3g ✅ **THIS PR** — L47 NOT BENIGN, must pursue fix - PR-3h pending — Option C fp64 in per-expert SwiGLU intermediates ## Why the cascade kept eliminating candidates The 3-falsifier sequence ruled out the "easy" fix locations: 1. PR-3f1 (gate softmax precision) — drift upstream of softmax 2. PR-3f2 (weighted-sum precision) — drift upstream of weighted-sum 3. **Remaining**: drift inside each per-expert SwiGLU's intermediate chain (silu × up at f32, down-proj at f32 except its q6k_gemv acc which PR-2 already promoted to fp64) PR-3h must promote the silu(gate) × up element-wise multiply and the hidden-dim×4 intermediate state to f64. ~30-50 LOC across both CPU and CUDA expert_swiglu helpers. ## Reproduction cargo test --release --features cuda \ -p aprender-serve --test qwen3_moe_gpu_parity \ falsify_qw3_moe_gpu_argmax_agreement \ -- --ignored --nocapture ~25s on RTX 4090. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN (#1747) * docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.0 → v1.7.1 — M-GPU-MOE-3 PR-2 verified, L47 surfaced Hardware-verification amendment after M-GPU-MOE-3 PR-2 landed on main (#1737, 88ce47f — q6k_gemv fp64 accumulators). PR-3 ran the per-layer FALSIFY-QW3-MOE-PER-LAYER-001 falsifier on lambda-vector (RTX 4090) against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M on 2026-05-17. Result: 47/48 decoder layers cos ≥ 0.99 (PASS). One layer (L47, the final decoder layer) sits at cos=0.961236 — 3σ below the L40-L46 cluster (~0.998). Full 48-layer cos vector logged in GitHub comment on #1583 (issuecomment-4470195446). The 7 originally-cited problem layers (L7/L9/L12/L20/L23/L29/L46, v1.7.0 amendment lines 41-45) ALL lifted above 0.99 — PR-2 was a real win. L47 was previously undetected because no per-layer falsifier existed in-tree; PR-1 of this cascade (#1713) closed that gap and surfaced the L47 anomaly. WHAT FLIPS: metadata.version 1.7.0 → 1.7.1 bottom-of-file version: "1.7.0" → "1.7.1" bottom-of-file status comment refreshed: "1.x cascade DISCHARGED — wgpu (2) + throughput (3) PENDING" → "47/48 layers cos≥0.99 post-PR #1737; L47 single-layer cascade PENDING" AC_GPU_MOE_001 stage status text refresh (text-only — not yet refactored into a new amendment_history entry since this PR is scoped to the v1.7.1 amendment block only). WHAT STAYS PENDING: - L47 single-layer cascade — root cause unknown. Three candidate hypotheses captured in the v1.7.1 amendment block (qtype mismatch, MoE expert distribution, stride/shape boundary). Forthcoming PR-3c surfaces §85 (or next-available section) covering the L47 cascade. Forthcoming PR-3d+: per-tensor histogram on L47 before authoring fix. - M-GPU-MOE-2 (wgpu fallback) — unchanged - M-GPU-MOE-3 PR-4 throughput — unchanged YAML-ONLY: Production hot paths byte-unchanged. Additive-purity invariant pinned in v1.1.0 still holds. Contract validates via: cargo run -p aprender-contracts-cli --bin pv -- \ validate contracts/qwen3-moe-forward-gpu-v1.yaml → 0 error(s), 0 warning(s), Contract is valid. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2 — M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN Terminal amendment for the M-GPU-MOE-3 PR-3 sub-cascade. After v1.7.1 surfaced L47 as a single-layer cliff (cos=0.961236 post fp64 q6k_gemv acc, PR-2 #1737), the cascade ran a 5-step falsifier sequence (PRs #1737, #1739-1745 + 4 #1583 comments) to pin the root cause and verify user-visible impact. OUTCOME PR-3 ✅ 47/48 layers cos ≥ 0.99, L47 alone at 0.961236 PR-3d ❌ H(i) qtype-mismatch FALSIFIED PR-3e ✅ #1741 — L47 first divergent router (cos 0.9926) PR-3e2 ✅ #1743 — H(ii) CONFIRMED, 2-of-8 expert swap at L47 PR-3f1 ❌ fp64 gate softmax FALSIFIED — drift upstream PR-3f2 ❌ f64 weighted-sum FALSIFIED — drift upstream PR-3g ✅ #1745 — multi-prompt argmax: 3/4 agree, 1/4 disagrees → L47 NOT BENIGN (~25% prompt-dependent impact) ROOT CAUSE (by elimination) Per-expert SwiGLU f32 intermediates: 1. gate_proj @ hidden ← fp64 acc thanks to PR-2 ✅ 2. silu(gate) ← f32 ✗ 3. silu(gate) × up_proj ← f32 multiply on 8192-element vector ✗ 4. down_proj @ above ← fp64 acc thanks to PR-2 ✅ Fix scope = PR-3h: promote silu × up multiply + intermediate state to f64 in both expert_swiglu_quantized (CPU, simple) and expert_swiglu_cuda (GPU, requires unfusing/refusing the SwiGLU kernel). Multi-week kernel work. STATUS FLIPS metadata.version: 1.7.1 → 1.7.2 metadata.status: ACTIVE_ALGORITHM_LEVEL (unchanged) AC_GPU_MOE_001: 47/48 layers ALGORITHM_LEVEL_DISCHARGED + L47 KNOWN_DIVERGENCE_NOT_BENIGN WHAT STAYS PENDING - PR-3h fp64 per-expert SwiGLU (multi-week) - M-GPU-MOE-2 wgpu fallback (#1582) - M-GPU-MOE-3 PR-4 throughput (independent of L47 fix; unblocked by this amendment) WHY NOT KNOWN_BUG L47 is a numerical-precision artifact, not a correctness bug. CPU and GPU follow the same algorithm against the same weights; only the order of f32 accumulation inside the per-expert SwiGLU differs. Both pick legitimate top-8 sets at L47 — neither is wrong — but the small score-perturbation crosses a top-k boundary. Same class as gemv reduction-order variance, one call-stack level higher. REGRESSION GATE FOR PR-3h - falsify_qw3_moe_l47_router_indices (#1743): expect CPU L47 sorted top-8 == GPU L47 sorted top-8 - falsify_qw3_moe_gpu_argmax_agreement (#1745): expect 4/4 prompts argmax agreement YAML-ONLY Production hot paths byte-unchanged. Additive-purity invariant pinned in v1.1.0 still holds. Contract validates via: cargo run -p aprender-contracts-cli --bin pv -- \ validate contracts/qwen3-moe-forward-gpu-v1.yaml → 0 error(s), 0 warning(s), Contract is valid. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 17, 2026 11:02

noahgift mentioned this pull request May 17, 2026

M-GPU-MOE-3 — throughput ≥150 tok/s on RTX 4090 + VRAM ≤95% + fp-accumulator-order alignment #1583

Open

Merge branch 'main' into feat/m-gpu-moe-3-pr3g-argmax-agreement

9e8fa14

noahgift mentioned this pull request May 17, 2026

docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2 — M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN #1747

Merged

2 tasks

noahgift added 3 commits May 17, 2026 14:36

Merge branch 'main' into feat/m-gpu-moe-3-pr3g-argmax-agreement

d338366

Merge branch 'main' into feat/m-gpu-moe-3-pr3g-argmax-agreement

c363a24

Merge branch 'main' into feat/m-gpu-moe-3-pr3g-argmax-agreement

aac37bb

noahgift merged commit 5f8d315 into main May 17, 2026
10 checks passed

noahgift deleted the feat/m-gpu-moe-3-pr3g-argmax-agreement branch May 17, 2026 14:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(m-gpu-moe-3): PR-3g multi-prompt argmax agreement — L47 NOT BENIGN (#1583)#1745

test(m-gpu-moe-3): PR-3g multi-prompt argmax agreement — L47 NOT BENIGN (#1583)#1745
noahgift merged 5 commits into
mainfrom
feat/m-gpu-moe-3-pr3g-argmax-agreement

noahgift commented May 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 17, 2026

Summary

Hardware result (lambda-vector RTX 4090, 2026-05-17)

Cascade falsifier sequence

What this PR adds

Test plan

Reproduction

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant