test(m-gpu-moe-3): PR-3g multi-prompt argmax agreement — L47 NOT BENIGN (#1583)#1745

Merged
noahgift merged 5 commits into main from feat/m-gpu-moe-3-pr3g-argmax-agreement
May 17, 2026
@noahgift

Summary

PR-3g of the #1583 M-GPU-MOE-3 cascade. Adds the canonical "is L47 actually user-visible" falsifier — runs 4 canonical prompts through both CPU and GPU full forwards, prints the argmax agreement table + verdict.

Hardware result (lambda-vector RTX 4090, 2026-05-17)

PROMPT             | CPU argmax (val)     | GPU argmax (val)
canonical_3tok     |    944 ( 13.7270)  |    944 ( 14.4133)  ✓
single_tok_785     |    220 ( 15.5523)  |     25 ( 18.5098)  ✗ MISMATCH
multi_tok_short    |    315 ( 26.2279)  |    315 ( 25.5230)  ✓
multi_tok_code     |    198 ( 17.7453)  |    198 ( 17.8433)  ✓

3/4 prompts agree, 1 disagrees. L47 cliff is NOT benign — the expert-set divergence DOES flip the top-1 predicted token for some prompts (~25% in this small sample). Option E (Accept) is off the table; must pursue Option C (fp64 in per-expert SwiGLU intermediates).
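The agreement check behind the table can be sketched as follows. This is a minimal illustration, not the test's actual code: the helper names `argmax`, `disagreeing_prompts`, and `verdict` are hypothetical, and the logits passed in are assumed to be the final-position vectors from each backend.

```rust
/// Index and value of the largest logit. NaNs are skipped; ties keep the lowest index.
fn argmax(logits: &[f32]) -> Option<(usize, f32)> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best
}

/// Compare CPU and GPU top-1 token ids per prompt and return the prompt names
/// whose argmax differs. Each row is (prompt name, CPU logits, GPU logits).
fn disagreeing_prompts<'a>(rows: &[(&'a str, Vec<f32>, Vec<f32>)]) -> Vec<&'a str> {
    rows.iter()
        .filter(|(_, cpu, gpu)| argmax(cpu).map(|t| t.0) != argmax(gpu).map(|t| t.0))
        .map(|(name, _, _)| *name)
        .collect()
}

/// Probe-style verdict: a label, not an assertion.
fn verdict(mismatches: &[&str]) -> &'static str {
    if mismatches.is_empty() {
        "BENIGN"
    } else {
        "NOT BENIGN"
    }
}
```

Only token ids are compared, not logit values, so a row such as canonical_3tok (13.7270 vs 14.4133) still counts as agreement.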

Cascade falsifier sequence

The full cascade is now:

PR              | Result                  | Why this fix didn't close L47
PR-2 #1737      | ✅ closed 47/48 layers  | fp64 q6k_gemv acc was necessary but not sufficient
PR-3d           | H(i) FALSIFIED          | qtypes are identical L0/L46/L47
PR-3e2 #1743    | H(ii) CONFIRMED         | 2 of 8 experts swap at L47
PR-3f1          | FALSIFIED               | drift is upstream of softmax
PR-3f2          | FALSIFIED               | drift is upstream of weighted-sum
PR-3g (this)    | L47 NOT BENIGN          | argmax flips on 1/4 prompts
PR-3h (next)    | pending                 | Option C: fp64 in per-expert SwiGLU

By elimination, the drift source must be inside each per-expert SwiGLU's intermediate chain: silu(gate) × up at f32 + the hidden-dim×4 intermediate state at f32 (the down-proj q6k_gemv acc is already fp64 thanks to PR-2).
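To make the suspected chain concrete, here is a minimal CPU-only sketch of the element-wise step at the two precisions. It assumes the gate and up projections have already been computed. The function names are hypothetical and are not the `expert_swiglu` helpers in the repository.

```rust
/// SiLU at f32: x * sigmoid(x), written as x / (1 + e^-x).
fn silu_f32(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SiLU at f64, same formula.
fn silu_f64(x: f64) -> f64 {
    x / (1.0 + (-x).exp())
}

/// f32 path: the activation and the element-wise product both round to f32.
fn swiglu_f32(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu_f32(g) * u).collect()
}

/// f64 path: widen the inputs once and keep the intermediate state in f64.
fn swiglu_f64(gate: &[f32], up: &[f32]) -> Vec<f64> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| silu_f64(g as f64) * (u as f64))
        .collect()
}

/// Largest absolute gap between the two paths over one intermediate vector.
fn max_abs_gap(gate: &[f32], up: &[f32]) -> f64 {
    let lo = swiglu_f32(gate, up);
    let hi = swiglu_f64(gate, up);
    lo.iter()
        .zip(hi.iter())
        .map(|(&a, &b)| (a as f64 - b).abs())
        .fold(0.0, f64::max)
}
```

The per-element gap is tiny. The concern in this cascade is that such gaps feed a down-projection and later a top-k router decision, where a small perturbation can change which experts are selected.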

What this PR adds

crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs:

  • new test falsify_qw3_moe_gpu_argmax_agreement: multi-prompt PROBE that builds CPU + GPU models once, runs 4 canonical prompts, prints agreement table + verdict (BENIGN / NOT BENIGN + diff). Same #[ignore] + #[cfg(feature = "cuda")] gates as siblings.

Test plan

  • Compiles with --features cuda
  • Runs on RTX 4090 in 24.45s
  • Verdict line classifies BENIGN / NOT BENIGN
  • Prints disagreement details so PR-3h knows which prompts to test against
  • PROBE not hard-fail (no assertion)

Reproduction

cargo test --release --features cuda \
  -p aprender-serve --test qwen3_moe_gpu_parity \
  falsify_qw3_moe_gpu_argmax_agreement \
  -- --ignored --nocapture

🤖 Generated with Claude Code

…GN (#1583)

PR-3g of the M-GPU-MOE-3 cascade. Adds the canonical "is L47 actually
user-visible" falsifier, runs 4 canonical prompts through both CPU and
GPU full forwards, and prints the argmax agreement table and verdict.

## Result (lambda-vector RTX 4090, 2026-05-17)

  PROMPT             | CPU argmax (val)     | GPU argmax (val)
  canonical_3tok     |    944 ( 13.7270)  |    944 ( 14.4133)  ✓
  single_tok_785     |    220 ( 15.5523)  |     25 ( 18.5098)  ✗ MISMATCH
  multi_tok_short    |    315 ( 26.2279)  |    315 ( 25.5230)  ✓
  multi_tok_code     |    198 ( 17.7453)  |    198 ( 17.8433)  ✓

**3/4 prompts agree, 1 disagrees.** L47 cliff is NOT benign — the
expert-set divergence DOES flip the top-1 predicted token for some
prompts (~25% in this small sample). Option E (Accept) is off the
table; must pursue Option C (fp64 in per-expert SwiGLU).

## What this PR adds

  crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs:
    + new test `falsify_qw3_moe_gpu_argmax_agreement` — multi-prompt
      probe that builds CPU + GPU models once, runs 4 canonical prompts
      through both full forwards, and prints argmax agreement table +
      verdict. PROBE not hard-assert; prints "BENIGN" if all agree or
      "NOT BENIGN" + disagreeing prompts otherwise.

## Cascade context

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3   ✅ hardware verify — 47/48 PASS, L47 surfaces
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d   ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 #1743 ✅ H(ii) CONFIRMED (2-of-8 expert swap)
- PR-3f1 ❌ falsified (fp64 softmax) — dropped
- PR-3f2 ❌ falsified (f64 weighted-sum) — dropped
- PR-3g  ✅ **THIS PR** — L47 NOT BENIGN, must pursue fix
- PR-3h  pending — Option C fp64 in per-expert SwiGLU intermediates

## Why the cascade kept eliminating candidates

The 3-falsifier sequence ruled out the "easy" fix locations:

1. PR-3f1 (gate softmax precision) — drift upstream of softmax
2. PR-3f2 (weighted-sum precision) — drift upstream of weighted-sum
3. **Remaining**: drift inside each per-expert SwiGLU's intermediate
   chain (silu × up at f32, down-proj at f32 except its q6k_gemv acc
   which PR-2 already promoted to fp64)

PR-3h must promote the silu(gate) × up element-wise multiply and the
hidden-dim×4 intermediate state to f64. ~30-50 LOC across both CPU
and CUDA expert_swiglu helpers.

## Reproduction

  cargo test --release --features cuda \
    -p aprender-serve --test qwen3_moe_gpu_parity \
    falsify_qw3_moe_gpu_argmax_agreement \
    -- --ignored --nocapture

~25s on RTX 4090.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN (#1747)

* docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.0 → v1.7.1 — M-GPU-MOE-3 PR-2 verified, L47 surfaced

Hardware-verification amendment after M-GPU-MOE-3 PR-2 landed on main
(#1737, 88ce47f — q6k_gemv fp64 accumulators).

PR-3 ran the per-layer FALSIFY-QW3-MOE-PER-LAYER-001 falsifier on
lambda-vector (RTX 4090) against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
on 2026-05-17. Result: 47/48 decoder layers cos ≥ 0.99 (PASS). One
layer (L47, the final decoder layer) sits at cos=0.961236 — 3σ below
the L40-L46 cluster (~0.998). Full 48-layer cos vector logged in
GitHub comment on #1583 (issuecomment-4470195446).
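For readers unfamiliar with the per-layer metric, the following is a minimal sketch of a cosine check against the 0.99 threshold. The names `cosine` and `failing_layers` are hypothetical, and the in-tree falsifier may compute or report the metric differently.

```rust
/// Cosine similarity between two hidden-state vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Indices of layers whose CPU-vs-GPU cosine falls below the threshold.
fn failing_layers(cos_per_layer: &[f64], threshold: f64) -> Vec<usize> {
    cos_per_layer
        .iter()
        .enumerate()
        .filter(|&(_, &c)| c < threshold)
        .map(|(i, _)| i)
        .collect()
}
```

With a threshold of 0.99, a layer at 0.961236 is flagged while layers near 0.998 pass.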

The 7 originally-cited problem layers (L7/L9/L12/L20/L23/L29/L46,
v1.7.0 amendment lines 41-45) ALL lifted above 0.99 — PR-2 was a
real win. L47 was previously undetected because no per-layer
falsifier existed in-tree; PR-1 of this cascade (#1713) closed that
gap and surfaced the L47 anomaly.

WHAT FLIPS:

  metadata.version 1.7.0 → 1.7.1
  bottom-of-file version: "1.7.0" → "1.7.1"
  bottom-of-file status comment refreshed:
    "1.x cascade DISCHARGED — wgpu (2) + throughput (3) PENDING"
    → "47/48 layers cos≥0.99 post-PR #1737; L47 single-layer cascade PENDING"

  AC_GPU_MOE_001 stage status text refresh (text-only — not yet
  refactored into a new amendment_history entry since this PR is
  scoped to the v1.7.1 amendment block only).

WHAT STAYS PENDING:

  - L47 single-layer cascade — root cause unknown. Three candidate
    hypotheses captured in the v1.7.1 amendment block (qtype mismatch,
    MoE expert distribution, stride/shape boundary). Forthcoming PR-3c
    surfaces §85 (or next-available section) covering the L47 cascade.
    Forthcoming PR-3d+: per-tensor histogram on L47 before authoring
    fix.
  - M-GPU-MOE-2 (wgpu fallback) — unchanged
  - M-GPU-MOE-3 PR-4 throughput — unchanged

YAML-ONLY:

  Production hot paths byte-unchanged. Additive-purity invariant
  pinned in v1.1.0 still holds. Contract validates via:
    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2 — M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN

Terminal amendment for the M-GPU-MOE-3 PR-3 sub-cascade.

After v1.7.1 surfaced L47 as a single-layer cliff (cos=0.961236 post
fp64 q6k_gemv acc, PR-2 #1737), the cascade ran a 5-step falsifier
sequence (PRs #1737, #1739-1745 + 4 #1583 comments) to pin the root
cause and verify user-visible impact.

OUTCOME

  PR-3   ✅ 47/48 layers cos ≥ 0.99, L47 alone at 0.961236
  PR-3d  ❌ H(i) qtype-mismatch FALSIFIED
  PR-3e  ✅ #1741 — L47 first divergent router (cos 0.9926)
  PR-3e2 ✅ #1743 — H(ii) CONFIRMED, 2-of-8 expert swap at L47
  PR-3f1 ❌ fp64 gate softmax FALSIFIED — drift upstream
  PR-3f2 ❌ f64 weighted-sum FALSIFIED — drift upstream
  PR-3g  ✅ #1745 — multi-prompt argmax: 3/4 agree, 1/4 disagrees
                    → L47 NOT BENIGN (~25% prompt-dependent impact)

ROOT CAUSE (by elimination)

  Per-expert SwiGLU f32 intermediates:
    1. gate_proj @ hidden   ← fp64 acc thanks to PR-2 ✅
    2. silu(gate)           ← f32 ✗
    3. silu(gate) × up_proj ← f32 multiply on 8192-element vector ✗
    4. down_proj @ above    ← fp64 acc thanks to PR-2 ✅

  Fix scope = PR-3h: promote silu × up multiply + intermediate state
  to f64 in both expert_swiglu_quantized (CPU, simple) and
  expert_swiglu_cuda (GPU, requires unfusing/re-fusing the SwiGLU
  kernel). Multi-week kernel work.

STATUS FLIPS

  metadata.version:  1.7.1 → 1.7.2
  metadata.status:   ACTIVE_ALGORITHM_LEVEL (unchanged)
  AC_GPU_MOE_001:    47/48 layers ALGORITHM_LEVEL_DISCHARGED + L47
                     KNOWN_DIVERGENCE_NOT_BENIGN

WHAT STAYS PENDING

  - PR-3h fp64 per-expert SwiGLU (multi-week)
  - M-GPU-MOE-2 wgpu fallback (#1582)
  - M-GPU-MOE-3 PR-4 throughput (independent of L47 fix; unblocked
    by this amendment)

WHY NOT KNOWN_BUG

  L47 is a numerical-precision artifact, not a correctness bug. CPU
  and GPU follow the same algorithm against the same weights; only
  the order of f32 accumulation inside the per-expert SwiGLU differs.
  Both pick legitimate top-8 sets at L47 — neither is wrong — but
  the small score-perturbation crosses a top-k boundary. Same class
  as gemv reduction-order variance, one call-stack level higher.
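The reduction-order effect can be reproduced on a toy input. This is a minimal sketch with made-up values chosen to exaggerate the effect. It is not drawn from the model, and `sum_f32` and `sum_f64` are hypothetical helpers.

```rust
/// Left-to-right sum with an f32 accumulator.
fn sum_f32(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Left-to-right sum of the same f32 inputs with an f64 accumulator.
fn sum_f64(xs: &[f32]) -> f64 {
    xs.iter().fold(0.0f64, |acc, &x| acc + x as f64)
}

/// Same three values, two orders. At f32, 1.0e8 + 1.0 rounds back to 1.0e8
/// (adjacent f32 values are 8 apart at that magnitude), so the small term is
/// lost in the first order and kept in the second.
const ORDER_A: [f32; 3] = [1.0e8, 1.0, -1.0e8];
const ORDER_B: [f32; 3] = [1.0e8, -1.0e8, 1.0];
```

At f32 the two orders give different sums, while the f64 accumulator gives the same sum for both. That is the motivation for promoting accumulators and intermediates to f64.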

REGRESSION GATE FOR PR-3h

  - falsify_qw3_moe_l47_router_indices (#1743): expect CPU L47 sorted
    top-8 == GPU L47 sorted top-8
  - falsify_qw3_moe_gpu_argmax_agreement (#1745): expect 4/4 prompts
    argmax agreement
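The first gate compares sorted expert sets. A minimal sketch of that comparison, using a hypothetical `sorted_top_k` helper and toy router scores, shows how a small perturbation near the k-th score changes the selected set:

```rust
/// Indices of the k largest scores, returned in ascending index order so that
/// two selections can be compared as sets.
fn sorted_top_k(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Descending by score; the stable sort keeps the lower index first on ties.
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx.sort_unstable();
    idx
}

/// True when both backends selected the same expert set.
fn same_expert_set(cpu_scores: &[f32], gpu_scores: &[f32], k: usize) -> bool {
    sorted_top_k(cpu_scores, k) == sorted_top_k(gpu_scores, k)
}
```

In the toy case below, experts 2 and 3 sit on either side of the top-3 boundary, and a perturbation of 0.0002 on one score swaps them.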

YAML-ONLY

  Production hot paths byte-unchanged. Additive-purity invariant pinned
  in v1.1.0 still holds. Contract validates via:
    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 5f8d315 into main May 17, 2026
10 checks passed
@noahgift noahgift deleted the feat/m-gpu-moe-3-pr3g-argmax-agreement branch May 17, 2026 14:07