contract(qwen3-moe-forward-gpu-v1): v1.3.0 → v1.4.0 — NaN/Inf bisection plan by noahgift · Pull Request #1492 · paiml/aprender

noahgift · 2026-05-04T23:25:16Z

Summary

Pre-implementation contract amendment for M-GPU-MOE-1.4 (downstream NaN/Inf bug discovered after M-GPU-MOE-1.3 partial discharge).

What 1.3 (PR #1491 `f0cbe37f9`) discharged

✅ FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction now succeeds for qwen3_moe GGUFs

What 1.3 exposed (NOT YET fixed)

The heavy test now reaches GPU forward execution but produces NaN/Inf logits, failing this assertion:

assert!(gpu_logits.iter().all(|v| v.is_finite()),
        "all GPU logits must be finite (no NaN/Inf)");

Bisection plan (M-GPU-MOE-1.4)

1. Instrumentation: extend `apr trace --json --payload` to capture per-stage tensors on the MoE GPU path
2. Bisection: diff CPU vs GPU per stage on lambda-vector to find the first NaN/Inf-producing stage
3. Fix: apply at the bisected stage; class TBD per bisection result
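The bisection step can be pictured with a minimal Rust sketch. It assumes the per-stage tensors have already been captured as flat `f32` buffers in forward order for both backends. `StageDump`, `FirstBad`, and `first_non_finite_stage` are hypothetical names for illustration and are not the actual `apr trace` payload schema or API.

```rust
/// One captured stage: a label plus the flattened tensor values.
/// Hypothetical type for illustration; not the real `apr trace` payload schema.
pub struct StageDump {
    pub name: String,
    pub values: Vec<f32>,
}

/// The first stage (in forward order) whose GPU tensor holds a NaN or Inf.
#[derive(Debug, PartialEq)]
pub struct FirstBad {
    pub stage: String,
    /// Index of the first non-finite element within that stage's tensor.
    pub index: usize,
    /// Whether the CPU tensor for the same stage was entirely finite.
    pub cpu_finite: bool,
}

/// Walks CPU and GPU dumps in lockstep (assumed aligned by position) and
/// returns the earliest stage where the GPU side stops being finite.
pub fn first_non_finite_stage(cpu: &[StageDump], gpu: &[StageDump]) -> Option<FirstBad> {
    cpu.iter().zip(gpu.iter()).find_map(|(c, g)| {
        let index = g.values.iter().position(|v| !v.is_finite())?;
        Some(FirstBad {
            stage: g.name.clone(),
            index,
            cpu_finite: c.values.iter().all(|v| v.is_finite()),
        })
    })
}
```

If `cpu_finite` is true at the reported stage, the divergence is GPU-specific and the fix belongs at or before that stage on the GPU path. If it is false, the problem is shared by both backends.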

Likely candidates: Q4K matvec accumulator overflow, SwiGLU silu Inf, top-k router renorm div-by-zero, missing per-head Q/K RMSNorm in MoE GPU path.
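Two of these candidates can be reproduced in plain IEEE 754 arithmetic. The sketch below is an illustration only and makes no claim about how the actual GPU kernels are written: `silu`, `renorm_top_k`, and `renorm_top_k_guarded` are hypothetical helper names, and the uniform fallback in the guarded variant is one possible choice, not the project's fix.

```rust
/// SiLU written naively as x / (1 + exp(-x)).
/// Finite inputs give finite outputs, but an Inf input does not:
/// +Inf stays +Inf, and -Inf becomes -Inf / Inf = NaN.
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Naive top-k renormalization: divide each selected router weight by the sum.
/// If every selected weight is 0.0, this computes 0.0 / 0.0 = NaN.
pub fn renorm_top_k(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    weights.iter().map(|w| w / sum).collect()
}

/// Guarded variant (assumption: uniform weights are an acceptable fallback).
/// Only divides when the sum is a positive finite number.
pub fn renorm_top_k_guarded(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    if sum.is_finite() && sum > 0.0 {
        weights.iter().map(|w| w / sum).collect()
    } else {
        vec![1.0 / weights.len() as f32; weights.len()]
    }
}
```

In the SwiGLU product `silu(gate) * up`, an Inf from an overflowing gate matvec is enough to create a NaN as soon as the matching `up` element is 0.0, since `Inf * 0.0` is NaN.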

What this PR ships

- 110-line v1.4.0 amendment_history block
- NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
- M-GPU-MOE-1.3 status: PENDING → PARTIALLY_DISCHARGED
- Top-level version + status block updated

Validation

`pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` → 0 errors, 0 warnings

Test plan

- `pv validate` clean
- CI `ci/gate` green

🤖 Generated with Claude Code

…on plan

Records the next-bug-class finding from the M-GPU-MOE-1.3 partial discharge (PR #1491 squash f0cbe37, MERGED 2026-05-04 on aprender main).

WHAT 1.3 DISCHARGED:
FALSIFY-QW3-MOE-GPU-PRELOAD-001: wrapper construction now succeeds for qwen3_moe GGUFs.

WHAT 1.3 EXPOSED (NOT YET FIXED):
Heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF runs end-to-end but fails:

    assert!(gpu_logits.iter().all(|v| v.is_finite()),
            "all GPU logits must be finite (no NaN/Inf)");

Blocks FALSIFY-QW3-MOE-GPU-PARITY-001 + PARTIAL-discharges FALSIFY-QW3-MOE-GPU-INVARIANTS-001.

BISECTION PLAN (M-GPU-MOE-1.4):
(a) Instrumentation: extend apr trace --json --payload to capture per-stage tensors on MoE GPU path
(b) Bisection: diff CPU vs GPU per-stage to find first NaN/Inf-producing stage. Candidates:
    * Q4K matvec accumulator overflow
    * SwiGLU silu Inf
    * top-k router renorm div-by-zero
    * missing per-head Q/K RMSNorm in MoE GPU path
(c) Fix: apply at bisected stage; class TBD by result

THIS PR ADDS:
* v1.3.0 → v1.4.0 amendment_history block (~110 lines)
* NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
* M-GPU-MOE-1.3 status updated PENDING → PARTIALLY_DISCHARGED
* Top-level version + status block updated

VALIDATION: pv validate → 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract", this PR pins the bisection-and-fix plan BEFORE code. Code follows in M-GPU-MOE-1.4 fix PR (separate scope).

Refs: M52, M53, M54, M55, M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0, FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds diagnostic eprintln BEFORE the finiteness assertion at qwen3_moe_gpu_parity.rs:168 to provide bisection data when the test fires. Per qwen3-moe-forward-gpu-v1 v1.4.0 (PR #1492 OPEN) M-GPU-MOE-1.4 bisection plan.

Stats printed:
    total      = N
    finite     = M
    nan        = N-M (first idx: i)
    inf        = K (first idx: j)
    finite_min = (min over finite)
    finite_max = (max over finite)

LIVE EVIDENCE on lambda-vector RTX 4090 against cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct GGUF (post-M-GPU-MOE-1.3, 2026-05-04):
    total      = 151936
    finite     = 0
    nan        = 151936 (first idx: Some(0))
    inf        = 0 (first idx: None)
    finite_min = inf
    finite_max = -inf

CRITICAL FINDING: ALL output logits are NaN. None Inf, none finite.

ANALYSIS for M-GPU-MOE-1.4 bisection scope narrowing:
100% NaN at lm_head means NaN poisoning happens EARLY in the pipeline and propagates. Once any one element of a hidden state becomes NaN, downstream matmul produces NaN rows, then NaN matrices, then 100% NaN logits.

Steps 1-9 of forward_qwen3_moe_cuda (token_embed → attn_norm → QKV → per-head Q/K norm → RoPE → attention → attn_output → residual → ffn_norm) are CPU-only and SHARED with the CPU forward path which produces finite output. So step 10 (MoE FFN via moe_ffn_forward_layer_cuda → expert_swiglu_cuda) is the strongest candidate.

Within step 10 (GPU per-expert SwiGLU):
- gate_q4k_matvec (GPU): Q4K matvec, possible accumulator overflow
- up_q4k_matvec (GPU): same
- silu(gate) * up (CPU): finite if gate+up finite
- down_q6k_gemv (GPU): Q6K matvec

HIGH-CONFIDENCE NEXT STEP: instrument q4k_matvec / q6k_gemv outputs from layer 0 expert 0 of the canonical 3-token prompt; check if those are already NaN (vs the CPU LAZY-FUSED-MATVEC output for the same input).

This diagnostic is permanent. It stays in the test file because:
1. It costs ~10 lines of code (negligible)
2. It runs only when the assertion fires (no perf cost on PASS)
3. Future regressions will produce immediately-actionable diagnostic output

Refs: M52-M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0, FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
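The stats and the propagation argument above can be sketched in a few lines of Rust. This is not the code at qwen3_moe_gpu_parity.rs:168. `LogitStats`, `logit_stats`, and `matvec` are hypothetical names, and the dense `matvec` stands in for the quantized kernels purely to show NaN propagation.

```rust
/// Finiteness summary of a logit vector (hypothetical type for illustration).
#[derive(Debug)]
pub struct LogitStats {
    pub total: usize,
    pub finite: usize,
    pub nan: usize,
    pub inf: usize,
    pub first_nan: Option<usize>,
    pub first_inf: Option<usize>,
    pub finite_min: f32,
    pub finite_max: f32,
}

/// Single pass over the logits. The min/max start at +inf/-inf, so with
/// zero finite elements they are reported unchanged as `inf` and `-inf`.
pub fn logit_stats(logits: &[f32]) -> LogitStats {
    let mut s = LogitStats {
        total: logits.len(),
        finite: 0,
        nan: 0,
        inf: 0,
        first_nan: None,
        first_inf: None,
        finite_min: f32::INFINITY,
        finite_max: f32::NEG_INFINITY,
    };
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            s.nan += 1;
            if s.first_nan.is_none() {
                s.first_nan = Some(i);
            }
        } else if v.is_infinite() {
            s.inf += 1;
            if s.first_inf.is_none() {
                s.first_inf = Some(i);
            }
        } else {
            s.finite += 1;
            s.finite_min = s.finite_min.min(v);
            s.finite_max = s.finite_max.max(v);
        }
    }
    s
}

impl LogitStats {
    /// Formats the summary in the same key = value shape as the stats above.
    pub fn report(&self) -> String {
        format!(
            "total = {} finite = {} nan = {} (first idx: {:?}) inf = {} (first idx: {:?}) finite_min = {} finite_max = {}",
            self.total, self.finite, self.nan, self.first_nan,
            self.inf, self.first_inf, self.finite_min, self.finite_max
        )
    }
}

/// Dense row-major matrix-vector product, standing in for a real kernel.
/// Every output row sums over every input element, so a single NaN input
/// (even multiplied by a 0.0 weight) turns every output into NaN.
pub fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}
```

Feeding `matvec` an input with one NaN element and passing the result to `logit_stats` reproduces the shape of the live evidence: zero finite values, `first_nan` at `Some(0)`, and the `finite_min = inf` / `finite_max = -inf` sentinels, which are just the untouched initial values.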


noahgift enabled auto-merge (squash) May 4, 2026 23:25

noahgift mentioned this pull request May 4, 2026

test(aprender-serve): qwen3_moe_gpu_parity — NaN/Inf diagnostic stats + live findings #1493

Merged

3 tasks

noahgift merged commit fe88774 into main May 4, 2026
11 checks passed

noahgift deleted the contract/qwen3-moe-forward-gpu-v1-4-0-nan-inf-bisection branch May 4, 2026 23:53

noahgift mentioned this pull request May 5, 2026

contract(trace-moe-gpu-sub-stages-v1): v1.0.0 PROPOSED scaffold for M-GPU-MOE-1.4 bisection #1498

Merged

2 tasks

noahgift merged 1 commit into main from contract/qwen3-moe-forward-gpu-v1-4-0-nan-inf-bisection
