contract(qwen3-moe-forward-gpu-v1): v1.3.0 → v1.4.0 — NaN/Inf bisection plan #1492
Merged
noahgift merged 1 commit on May 4, 2026
…on plan

Records the next-bug-class finding from the M-GPU-MOE-1.3 partial discharge (PR #1491 squash f0cbe37, MERGED 2026-05-04 on aprender main).

WHAT 1.3 DISCHARGED:
  FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction now succeeds for qwen3_moe GGUFs.

WHAT 1.3 EXPOSED (NOT YET FIXED):
  Heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF runs end-to-end but fails:

    assert!(gpu_logits.iter().all(|v| v.is_finite()),
            "all GPU logits must be finite (no NaN/Inf)");

  Blocks FALSIFY-QW3-MOE-GPU-PARITY-001 + PARTIAL-discharges FALSIFY-QW3-MOE-GPU-INVARIANTS-001.

BISECTION PLAN (M-GPU-MOE-1.4):
  (a) Instrumentation — extend apr trace --json --payload to capture per-stage tensors on MoE GPU path
  (b) Bisection — diff CPU vs GPU per-stage to find first NaN/Inf-producing stage. Candidates:
      * Q4K matvec accumulator overflow
      * SwiGLU silu Inf
      * top-k router renorm div-by-zero
      * missing per-head Q/K RMSNorm in MoE GPU path
  (c) Fix — apply at bisected stage; class TBD by result

THIS PR ADDS:
  * v1.3.0 → v1.4.0 amendment_history block (~110 lines)
  * NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
  * M-GPU-MOE-1.3 status updated PENDING → PARTIALLY_DISCHARGED
  * Top-level version + status block updated

VALIDATION: pv validate → 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract" — this PR pins the bisection-and-fix plan BEFORE code. Code follows in M-GPU-MOE-1.4 fix PR (separate scope).

Refs: M52, M53, M54, M55, M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0, FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
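Step (b) of the bisection plan can be sketched as a scan over paired per-stage dumps. This is a minimal illustration, not the `apr trace` implementation: the `StageDump` type, the `first_divergent_stage` function, and the stage names used in `main` are all hypothetical, and the sketch assumes both paths emit stages in the same pipeline order.

```rust
/// One captured stage: a label plus the flattened tensor values.
/// Hypothetical type; the real trace payload format is not shown in this PR.
struct StageDump<'a> {
    name: &'a str,
    values: &'a [f32],
}

/// Returns the first stage, in pipeline order, where the GPU dump contains a
/// non-finite value while the CPU dump of the same stage is fully finite.
fn first_divergent_stage<'a>(
    cpu: &[StageDump<'a>],
    gpu: &[StageDump<'a>],
) -> Option<&'a str> {
    cpu.iter().zip(gpu.iter()).find_map(|(c, g)| {
        let cpu_ok = c.values.iter().all(|v| v.is_finite());
        let gpu_bad = g.values.iter().any(|v| !v.is_finite());
        // Only compare stages that line up by name.
        (c.name == g.name && cpu_ok && gpu_bad).then_some(g.name)
    })
}

fn main() {
    // Illustrative stage names and values only.
    let cpu = [
        StageDump { name: "ffn_norm", values: &[0.5, -1.0] },
        StageDump { name: "moe_gate", values: &[2.0, 3.0] },
        StageDump { name: "moe_down", values: &[0.1, 0.2] },
    ];
    let gpu = [
        StageDump { name: "ffn_norm", values: &[0.5, -1.0] },
        StageDump { name: "moe_gate", values: &[f32::NAN, 3.0] },
        StageDump { name: "moe_down", values: &[f32::NAN, f32::NAN] },
    ];
    println!("{:?}", first_divergent_stage(&cpu, &gpu));
}
```

Stopping at the first divergent stage matters because every later stage inherits the NaN, so later hits carry no extra information about the root cause.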
noahgift added a commit that referenced this pull request on May 4, 2026
Adds diagnostic eprintln BEFORE the finiteness assertion at qwen3_moe_gpu_parity.rs:168 to provide bisection data when the test fires. Per qwen3-moe-forward-gpu-v1 v1.4.0 (PR #1492 OPEN) M-GPU-MOE-1.4 bisection plan.

Stats printed:
  total      = N
  finite     = M
  nan        = N-M (first idx: i)
  inf        = K (first idx: j)
  finite_min = (min over finite)
  finite_max = (max over finite)

LIVE EVIDENCE on lambda-vector RTX 4090 against cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct GGUF (post-M-GPU-MOE-1.3, 2026-05-04):
  total      = 151936
  finite     = 0
  nan        = 151936 (first idx: Some(0))
  inf        = 0 (first idx: None)
  finite_min = inf
  finite_max = -inf

CRITICAL FINDING: ALL output logits are NaN. None Inf, none finite.

ANALYSIS for M-GPU-MOE-1.4 bisection scope narrowing:
100% NaN at lm_head means NaN poisoning happens EARLY in the pipeline and propagates. Once any one element of a hidden state becomes NaN, downstream matmul produces NaN rows, then NaN matrices, then 100% NaN logits.

Steps 1-9 of forward_qwen3_moe_cuda (token_embed → attn_norm → QKV → per-head Q/K norm → RoPE → attention → attn_output → residual → ffn_norm) are CPU-only and SHARED with the CPU forward path which produces finite output. So step 10 (MoE FFN via moe_ffn_forward_layer_cuda → expert_swiglu_cuda) is the strongest candidate.

Within step 10 (GPU per-expert SwiGLU):
  - gate_q4k_matvec (GPU): Q4K matvec, possible accumulator overflow
  - up_q4k_matvec (GPU): same
  - silu(gate) * up (CPU): finite if gate+up finite
  - down_q6k_gemv (GPU): Q6K matvec

HIGH-CONFIDENCE NEXT STEP: instrument q4k_matvec / q6k_gemv outputs from layer 0 expert 0 of the canonical 3-token prompt; check if those are already NaN (vs the CPU LAZY-FUSED-MATVEC output for the same input).

This diagnostic is permanent — it stays in the test file because:
  1. It costs ~10 lines of code (negligible)
  2. It runs only when the assertion fires (no perf cost on PASS)
  3. Future regressions will produce immediately-actionable diagnostic output

Refs: M52-M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0, FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
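The statistics listed in this commit message can be computed along the following lines. This is a sketch under assumptions, not the code added at qwen3_moe_gpu_parity.rs:168: `LogitStats` and `logit_stats` are hypothetical names, and NaN values are counted directly rather than as `total - finite`. Folding from `+inf` and `-inf` reproduces the `finite_min = inf`, `finite_max = -inf` readings reported above when no logit is finite.

```rust
/// Summary of a logit vector's finiteness. Hypothetical struct for illustration.
#[derive(Debug, PartialEq)]
struct LogitStats {
    total: usize,
    finite: usize,
    nan: usize,
    inf: usize,
    first_nan: Option<usize>,
    first_inf: Option<usize>,
    finite_min: f32,
    finite_max: f32,
}

fn logit_stats(logits: &[f32]) -> LogitStats {
    // Fresh iterator over only the finite values, reused for count/min/max.
    let finite_vals = || logits.iter().copied().filter(|v| v.is_finite());
    LogitStats {
        total: logits.len(),
        finite: finite_vals().count(),
        nan: logits.iter().filter(|v| v.is_nan()).count(),
        inf: logits.iter().filter(|v| v.is_infinite()).count(),
        first_nan: logits.iter().position(|v| v.is_nan()),
        first_inf: logits.iter().position(|v| v.is_infinite()),
        // With no finite values the folds return their seeds:
        // finite_min = inf and finite_max = -inf.
        finite_min: finite_vals().fold(f32::INFINITY, f32::min),
        finite_max: finite_vals().fold(f32::NEG_INFINITY, f32::max),
    }
}

fn main() {
    // An all-NaN vector, mirroring the shape of the reported failure.
    let all_nan = vec![f32::NAN; 8];
    eprintln!("{:#?}", logit_stats(&all_nan));

    // A mixed vector exercises every counter.
    let mixed = [1.0, f32::INFINITY, -2.0, f32::NAN];
    eprintln!("{:#?}", logit_stats(&mixed));
}
```

Reporting the first NaN and first Inf index separately helps distinguish a single poisoned element from a fully poisoned vector.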
noahgift added a commit that referenced this pull request on May 5, 2026: …#1493), with the same diagnostic commit message as above.
Summary
Pre-implementation contract amendment for M-GPU-MOE-1.4 (downstream NaN/Inf bug discovered after M-GPU-MOE-1.3 partial discharge).
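What 1.3 (PR #1491 f0cbe37f9) discharged
FALSIFY-QW3-MOE-GPU-PRELOAD-001: wrapper construction now succeeds for qwen3_moe GGUFs.
What 1.3 exposed (NOT YET fixed)
Heavy test progresses to GPU forward execution but produces NaN/Inf logits, failing:
    assert!(gpu_logits.iter().all(|v| v.is_finite()), "all GPU logits must be finite (no NaN/Inf)");
Bisection plan (M-GPU-MOE-1.4)
- Instrumentation: extend apr trace --json --payload to capture per-stage tensors on MoE GPU path
- Bisection: diff CPU vs GPU per stage to find the first NaN/Inf-producing stage
- Fix: apply at the bisected stage; class TBD by result
Likely candidates: Q4K matvec accumulator overflow, SwiGLU silu Inf, top-k router renorm div-by-zero, missing per-head Q/K RMSNorm in MoE GPU path.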
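Two of the candidates listed above can be reproduced in isolation with plain f32 arithmetic. The sketch below is illustrative only and is not taken from the aprender code: `renorm_unguarded`, `renorm_guarded`, and `silu` are hypothetical helpers, and the uniform-weight fallback in the guarded variant is an assumption of this sketch rather than a proposed fix.

```rust
/// Renormalizes selected top-k router weights so they sum to 1.
/// Unguarded: if every selected weight is 0.0 this computes 0.0 / 0.0 = NaN.
fn renorm_unguarded(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    weights.iter().map(|w| w / sum).collect()
}

/// Guarded variant: falls back to uniform weights when the sum is not a
/// positive finite number (fallback policy is an assumption of this sketch).
fn renorm_guarded(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    if sum.is_finite() && sum > 0.0 {
        weights.iter().map(|w| w / sum).collect()
    } else {
        vec![1.0 / weights.len() as f32; weights.len()]
    }
}

/// silu(x) = x * sigmoid(x), written as x / (1 + e^(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

fn main() {
    // Router renorm: an all-zero selection turns into NaN weights.
    println!("unguarded: {:?}", renorm_unguarded(&[0.0, 0.0]));
    println!("guarded:   {:?}", renorm_guarded(&[0.0, 0.0]));

    // SwiGLU: an Inf gate stays Inf through silu, and Inf * 0.0 is NaN.
    let gate = f32::INFINITY;
    let up = 0.0_f32;
    println!("silu(gate) * up = {}", silu(gate) * up);
}
```

Both cases show how a value that is not yet NaN (a zero sum, an Inf gate) becomes NaN one operation later, which is why the bisection compares every stage rather than only the final logits.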
What this PR ships
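- v1.3.0 → v1.4.0 amendment_history block (~110 lines)
- NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
- M-GPU-MOE-1.3 status updated PENDING → PARTIALLY_DISCHARGED
- Top-level version + status block updated
Validation
pv validate → 0 errors, 0 warnings.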
Test plan
🤖 Generated with Claude Code