contract(qwen3-moe-forward-gpu-v1): v1.3.0 → v1.4.0 — NaN/Inf bisection plan#1492

Merged
noahgift merged 1 commit into main from contract/qwen3-moe-forward-gpu-v1-4-0-nan-inf-bisection
May 4, 2026

Conversation

@noahgift commented May 4, 2026

Summary

Pre-implementation contract amendment for M-GPU-MOE-1.4 (downstream NaN/Inf bug discovered after M-GPU-MOE-1.3 partial discharge).

What 1.3 (PR #1491 f0cbe37f9) discharged

  • FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction now succeeds for qwen3_moe GGUFs

What 1.3 exposed (NOT YET fixed)

Heavy test progresses to GPU forward execution but produces NaN/Inf logits, failing:

assert!(gpu_logits.iter().all(|v| v.is_finite()),
        "all GPU logits must be finite (no NaN/Inf)");

Bisection plan (M-GPU-MOE-1.4)

  1. Instrumentation — extend apr trace --json --payload to capture per-stage tensors on MoE GPU path
  2. Bisection — diff CPU vs GPU per-stage on lambda-vector to find first NaN/Inf-producing stage
  3. Fix — apply at bisected stage; class TBD per bisection result
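Step 2 of the plan can be sketched as a scan over paired per-stage dumps. This is a minimal illustration only: `StageDump` and `first_divergent_stage` are hypothetical names, and the sketch assumes the trace yields one flattened `f32` slice per stage, in the same order on both paths. It does not reflect the actual `apr trace` payload format.

```rust
/// One captured activation: a stage label plus its flattened values.
/// Hypothetical type for illustration; not the real trace payload.
struct StageDump<'a> {
    name: &'a str,
    values: &'a [f32],
}

/// Walks the stages in order and returns the first stage whose GPU tensor
/// contains NaN/Inf while the CPU tensor for that stage is fully finite,
/// together with the index of the first offending GPU element.
fn first_divergent_stage<'a>(
    cpu: &[StageDump<'a>],
    gpu: &[StageDump<'a>],
) -> Option<(&'a str, usize)> {
    for (c, g) in cpu.iter().zip(gpu.iter()) {
        assert_eq!(c.name, g.name, "stage order must match on both paths");
        let cpu_ok = c.values.iter().all(|v| v.is_finite());
        if let Some(idx) = g.values.iter().position(|v| !v.is_finite()) {
            if cpu_ok {
                return Some((g.name, idx));
            }
        }
    }
    None
}
```

Returning the earliest stage rather than every bad stage matters because all stages downstream of the first NaN are expected to be bad as well.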

Likely candidates: Q4K matvec accumulator overflow, SwiGLU silu Inf, top-k router renorm div-by-zero, missing per-head Q/K RMSNorm in MoE GPU path.
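Two of these candidate classes are easy to reproduce in isolation. The sketch below is not the kernel code under test; the function names are hypothetical and the guarded variants are shown only for contrast, not as the chosen fix. It shows how a SiLU written as `x * e^x / (1 + e^x)` yields NaN once `e^x` overflows in `f32`, and how renormalizing router weights that sum to zero yields `0 / 0 = NaN`.

```rust
/// SiLU written as x * e^x / (1 + e^x). For large x, e^x overflows to +Inf
/// in f32, and Inf / Inf is NaN.
fn silu_naive(x: f32) -> f32 {
    let e = x.exp();
    x * (e / (1.0 + e))
}

/// SiLU written as x / (1 + e^-x). For large positive x the denominator
/// tends to 1; for large negative x it overflows to +Inf and the result is -0.
fn silu_stable(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Renormalizes selected router weights so they sum to 1.
/// If every weight is zero, each element becomes 0 / 0 = NaN.
fn renorm(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    weights.iter().map(|w| w / sum).collect()
}

/// Same renormalization with the denominator clamped to at least `eps`
/// (illustrative guard; assumes non-negative weights).
fn renorm_guarded(weights: &[f32], eps: f32) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    let denom = sum.max(eps);
    weights.iter().map(|w| w / denom).collect()
}
```

For example, `silu_naive(100.0)` is NaN while `silu_stable(100.0)` stays finite, and `renorm(&[0.0, 0.0])` is all NaN while `renorm_guarded(&[0.0, 0.0], 1e-6)` is all zeros.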

What this PR ships

  • 110-line v1.4.0 amendment_history block
  • NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
  • M-GPU-MOE-1.3 status: PENDING → PARTIALLY_DISCHARGED
  • Top-level version + status block updated

Validation

pv validate contracts/qwen3-moe-forward-gpu-v1.yaml → 0 errors, 0 warnings

Test plan

  • pv validate clean
  • CI ci/gate green

🤖 Generated with Claude Code

…on plan

Records the next-bug-class finding from the M-GPU-MOE-1.3 partial
discharge (PR #1491 squash f0cbe37, MERGED 2026-05-04 on aprender
main).

WHAT 1.3 DISCHARGED:

  FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction now
  succeeds for qwen3_moe GGUFs.

WHAT 1.3 EXPOSED (NOT YET FIXED):

  Heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 +
  cached 17.3 GB Qwen3-Coder GGUF runs end-to-end but fails:

    assert!(gpu_logits.iter().all(|v| v.is_finite()),
            "all GPU logits must be finite (no NaN/Inf)");

  Blocks FALSIFY-QW3-MOE-GPU-PARITY-001 + PARTIAL-discharges
  FALSIFY-QW3-MOE-GPU-INVARIANTS-001.

BISECTION PLAN (M-GPU-MOE-1.4):

  (a) Instrumentation — extend apr trace --json --payload to
      capture per-stage tensors on MoE GPU path
  (b) Bisection — diff CPU vs GPU per-stage to find first
      NaN/Inf-producing stage. Candidates:
        * Q4K matvec accumulator overflow
        * SwiGLU silu Inf
        * top-k router renorm div-by-zero
        * missing per-head Q/K RMSNorm in MoE GPU path
  (c) Fix — apply at bisected stage; class TBD by result

THIS PR ADDS:

  * v1.3.0 → v1.4.0 amendment_history block (~110 lines)
  * NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
  * M-GPU-MOE-1.3 status updated PENDING → PARTIALLY_DISCHARGED
  * Top-level version + status block updated

VALIDATION: pv validate → 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract"
— this PR pins the bisection-and-fix plan BEFORE code. Code
follows in M-GPU-MOE-1.4 fix PR (separate scope).

Refs: M52, M53, M54, M55, M56, R10,
      qwen3-moe-forward-gpu-v1 v1.4.0,
      FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 4, 2026 23:25
@noahgift noahgift merged commit fe88774 into main May 4, 2026
11 checks passed
@noahgift noahgift deleted the contract/qwen3-moe-forward-gpu-v1-4-0-nan-inf-bisection branch May 4, 2026 23:53
noahgift added a commit that referenced this pull request May 4, 2026
Adds diagnostic eprintln BEFORE the finiteness assertion at
qwen3_moe_gpu_parity.rs:168 to provide bisection data when the
test fires. Per qwen3-moe-forward-gpu-v1 v1.4.0 (PR #1492 OPEN)
M-GPU-MOE-1.4 bisection plan.

Stats printed:
  total      = N
  finite     = M
  nan        = N-M-K (first idx: i)
  inf        = K (first idx: j)
  finite_min = (min over finite)
  finite_max = (max over finite)

LIVE EVIDENCE on lambda-vector RTX 4090 against cached 17.3 GB
Qwen3-Coder-30B-A3B-Instruct GGUF (post-M-GPU-MOE-1.3, 2026-05-04):

  total      = 151936
  finite     = 0
  nan        = 151936 (first idx: Some(0))
  inf        = 0 (first idx: None)
  finite_min = inf
  finite_max = -inf
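A minimal sketch of how such statistics could be computed is given below. `LogitStats` and `logit_stats` are hypothetical names; the actual diagnostic in `qwen3_moe_gpu_parity.rs` may be structured differently. The sketch assumes min and max are folded from `+inf` and `-inf`, which would be consistent with the `finite_min = inf` / `finite_max = -inf` readings above when no finite value exists.

```rust
/// Summary of a logits vector, mirroring the fields printed above.
/// Hypothetical struct for illustration.
#[derive(Debug)]
struct LogitStats {
    total: usize,
    finite: usize,
    nan: usize,
    first_nan: Option<usize>,
    inf: usize,
    first_inf: Option<usize>,
    finite_min: f32,
    finite_max: f32,
}

fn logit_stats(logits: &[f32]) -> LogitStats {
    // Fresh iterator over the finite values only; called once per statistic.
    let finite_vals = || logits.iter().copied().filter(|v| v.is_finite());
    LogitStats {
        total: logits.len(),
        finite: finite_vals().count(),
        nan: logits.iter().filter(|v| v.is_nan()).count(),
        first_nan: logits.iter().position(|v| v.is_nan()),
        inf: logits.iter().filter(|v| v.is_infinite()).count(),
        first_inf: logits.iter().position(|v| v.is_infinite()),
        // With no finite values the folds return their identities:
        // +inf for min and -inf for max.
        finite_min: finite_vals().fold(f32::INFINITY, f32::min),
        finite_max: finite_vals().fold(f32::NEG_INFINITY, f32::max),
    }
}
```

Because every `f32` is exactly one of finite, NaN, or infinite, `total == finite + nan + inf` always holds for this sketch.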

CRITICAL FINDING: ALL output logits are NaN. None Inf, none finite.

ANALYSIS for M-GPU-MOE-1.4 bisection scope narrowing:

  100% NaN at lm_head means NaN poisoning happens EARLY in the
  pipeline and propagates. Once any one element of a hidden state
  becomes NaN, downstream matmul produces NaN rows, then NaN
  matrices, then 100% NaN logits.
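  The propagation step can be illustrated with a plain dense matrix-vector
  product. This is a sketch over unquantized `f32` rows, not the Q4K/Q6K
  kernels; `matvec` is a hypothetical helper. A single NaN in the input
  contaminates every output row, even rows whose weight at that position is
  zero, because `0.0 * NaN` is NaN.

```rust
/// Dense matrix-vector product: one output element per weight row.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}
```

  With `x = [NaN, 1.0]`, both `[1.0, 2.0]` and `[0.0, 4.0]` as weight rows
  produce NaN outputs.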

  Steps 1-9 of forward_qwen3_moe_cuda (token_embed → attn_norm →
  QKV → per-head Q/K norm → RoPE → attention → attn_output →
  residual → ffn_norm) are CPU-only and SHARED with the CPU forward
  path which produces finite output. So step 10 (MoE FFN via
  moe_ffn_forward_layer_cuda → expert_swiglu_cuda) is the strongest
  candidate.

  Within step 10 (GPU per-expert SwiGLU):
    - gate_q4k_matvec (GPU): Q4K matvec, possible accumulator overflow
    - up_q4k_matvec (GPU): same
    - silu(gate) * up (CPU): finite if gate and up are both finite
    - down_q6k_gemv (GPU): Q6K matvec

  HIGH-CONFIDENCE NEXT STEP: instrument q4k_matvec / q6k_gemv
  outputs from layer 0 expert 0 of the canonical 3-token prompt;
  check if those are already NaN (vs the CPU LAZY-FUSED-MATVEC
  output for the same input).

This diagnostic is permanent — it stays in the test file because:
  1. It costs ~10 lines of code (negligible)
  2. It runs only when the assertion fires (no perf cost on PASS)
  3. Future regressions will produce immediately-actionable
     diagnostic output

Refs: M52-M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0,
      FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…#1493)