test(aprender-serve): qwen3_moe_gpu_parity — NaN/Inf diagnostic stats + live findings #1493

Merged
noahgift merged 1 commit into main from test/qwen3-moe-gpu-nan-inf-diag-m-1-4
May 5, 2026

@noahgift noahgift commented May 4, 2026

Summary

Adds a diagnostic eprintln immediately before the finiteness assertion at qwen3_moe_gpu_parity.rs:168, so that a failing run prints the statistics needed to bisect the failure.

Live finding (2026-05-04, lambda-vector RTX 4090)

FALSIFY-QW3-MOE-GPU-PARITY-001 finiteness diagnostic:
  total      = 151936
  finite     = 0
  nan        = 151936 (first idx: Some(0))
  inf        = 0 (first idx: None)
  finite_min = inf
  finite_max = -inf

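All 151936 output logits are NaN; none are Inf and none are finite. With no finite values, finite_min and finite_max stay at their initial values (inf and -inf).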
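As an illustration of how statistics of this shape can be computed, here is a minimal sketch. It is not the code from qwen3_moe_gpu_parity.rs; the `FinitenessStats` struct and `finiteness_stats` function are hypothetical names chosen for this example. It also shows why an all-NaN input reports `finite_min = inf` and `finite_max = -inf`: those are the starting values of the min/max fold, and no finite element ever updates them.

```rust
/// Hypothetical summary type for this sketch (not taken from the test file).
#[derive(Debug, PartialEq)]
struct FinitenessStats {
    total: usize,
    finite: usize,
    nan: usize,
    inf: usize,
    first_nan: Option<usize>,
    first_inf: Option<usize>,
    finite_min: f32,
    finite_max: f32,
}

/// Classify every logit as finite, NaN, or infinite in a single pass.
fn finiteness_stats(logits: &[f32]) -> FinitenessStats {
    let mut s = FinitenessStats {
        total: logits.len(),
        finite: 0,
        nan: 0,
        inf: 0,
        first_nan: None,
        first_inf: None,
        // Fold identities: they remain unchanged if no element is finite.
        finite_min: f32::INFINITY,
        finite_max: f32::NEG_INFINITY,
    };
    for (i, &x) in logits.iter().enumerate() {
        if x.is_nan() {
            s.nan += 1;
            s.first_nan = s.first_nan.or(Some(i));
        } else if x.is_infinite() {
            s.inf += 1;
            s.first_inf = s.first_inf.or(Some(i));
        } else {
            s.finite += 1;
            s.finite_min = s.finite_min.min(x);
            s.finite_max = s.finite_max.max(x);
        }
    }
    s
}

fn main() {
    // Small stand-in for an all-NaN logits vector.
    let logits = vec![f32::NAN; 8];
    let s = finiteness_stats(&logits);
    eprintln!("  total      = {}", s.total);
    eprintln!("  finite     = {}", s.finite);
    eprintln!("  nan        = {} (first idx: {:?})", s.nan, s.first_nan);
    eprintln!("  inf        = {} (first idx: {:?})", s.inf, s.first_inf);
    eprintln!("  finite_min = {}", s.finite_min);
    eprintln!("  finite_max = {}", s.finite_max);
}
```

Tracking the first NaN and first Inf index separately is what makes the output useful for bisection: a first index of 0 with a 100% NaN count points at wholesale poisoning rather than a single bad vocabulary row.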

Bisection narrowing for M-GPU-MOE-1.4

Steps 1-9 of forward_qwen3_moe_cuda (embedding → attn_norm → QKV → Q/K norm → RoPE → attention → attn_output → residual → ffn_norm) are CPU-only and shared with the CPU forward path. Since CPU produces finite output, those steps are clean.

Step 10 (GPU MoE FFN) is the strongest candidate. Within step 10:

  • gate_q4k_matvec / up_q4k_matvec / down_q6k_gemv (GPU Q4K/Q6K matvecs)
  • silu(gate) * up (CPU, deterministic if inputs finite)

High-confidence next step: instrument q4k_matvec / q6k_gemv outputs from layer-0 expert-0 against the same CPU input; check if those are already NaN.

Why permanent

Diagnostic is ~10 lines, runs only when the assertion fires (no PASS-path cost), and future regressions get immediately-actionable output.

Stacks on

PR #1492 (v1.4.0 amendment pinning the bisection plan).

Test plan

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 4, 2026 23:33
@noahgift noahgift force-pushed the test/qwen3-moe-gpu-nan-inf-diag-m-1-4 branch from 7f678c6 to 89d91b7 Compare May 4, 2026 23:57
Adds diagnostic eprintln BEFORE the finiteness assertion at
qwen3_moe_gpu_parity.rs:168 to provide bisection data when the
test fires. Per qwen3-moe-forward-gpu-v1 v1.4.0 (PR #1492 OPEN)
M-GPU-MOE-1.4 bisection plan.

Stats printed:
  total      = N
  finite     = M
  nan        = N-M (first idx: i)
  inf        = K (first idx: j)
  finite_min = (min over finite)
  finite_max = (max over finite)

LIVE EVIDENCE on lambda-vector RTX 4090 against cached 17.3 GB
Qwen3-Coder-30B-A3B-Instruct GGUF (post-M-GPU-MOE-1.3, 2026-05-04):

  total      = 151936
  finite     = 0
  nan        = 151936 (first idx: Some(0))
  inf        = 0 (first idx: None)
  finite_min = inf
  finite_max = -inf

CRITICAL FINDING: ALL output logits are NaN. None Inf, none finite.

ANALYSIS for M-GPU-MOE-1.4 bisection scope narrowing:

  100% NaN at lm_head means NaN poisoning happens EARLY in the
  pipeline and propagates. Once any one element of a hidden state
  becomes NaN, downstream matmul produces NaN rows, then NaN
  matrices, then 100% NaN logits.
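The propagation argument above can be checked with a small sketch. The `matvec` helper below is a plain dense f32 product written for this example; it is an assumption standing in for the real quantized kernels, which are not reproduced here. One NaN in the input vector is enough to make every output row NaN, even rows whose matching weight is zero, because `0.0 * NaN` is NaN under IEEE 754.

```rust
/// Dense row-major matrix-vector product: y[r] = sum over c of w[r * cols + c] * x[c].
/// Hypothetical helper for illustration only.
fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| (0..cols).map(|c| w[r * cols + c] * x[c]).sum::<f32>())
        .collect()
}

fn main() {
    // Hidden state with a single poisoned element.
    let x = [1.0_f32, f32::NAN, 2.0];
    // 2x3 weight matrix; the second row multiplies the NaN element by zero.
    let w = [
        0.5_f32, 1.0, -1.0, //
        2.0, 0.0, 3.0,
    ];
    let y = matvec(&w, 2, 3, &x);
    eprintln!("y = {:?}", y);
    assert!(y.iter().all(|v| v.is_nan()));
}
```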

  Steps 1-9 of forward_qwen3_moe_cuda (token_embed → attn_norm →
  QKV → per-head Q/K norm → RoPE → attention → attn_output →
  residual → ffn_norm) are CPU-only and SHARED with the CPU forward
  path which produces finite output. So step 10 (MoE FFN via
  moe_ffn_forward_layer_cuda → expert_swiglu_cuda) is the strongest
  candidate.

  Within step 10 (GPU per-expert SwiGLU):
    - gate_q4k_matvec (GPU): Q4K matvec, possible accumulator overflow
    - up_q4k_matvec (GPU): same
    - silu(gate) * up (CPU): finite if gate and up are both finite (barring overflow in the product)
    - down_q6k_gemv (GPU): Q6K matvec

  HIGH-CONFIDENCE NEXT STEP: instrument q4k_matvec / q6k_gemv
  outputs from layer 0 expert 0 of the canonical 3-token prompt;
  check if those are already NaN (vs the CPU LAZY-FUSED-MATVEC
  output for the same input).
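A minimal sketch of the comparison this next step calls for, assuming both outputs have already been copied into host slices. `Comparison` and `compare_outputs` are hypothetical names for this example; the sketch does not call the real q4k_matvec or q6k_gemv kernels, whose signatures are not shown in this PR.

```rust
/// Result of comparing a GPU output against its CPU reference (hypothetical type).
#[derive(Debug, PartialEq)]
struct Comparison {
    /// Index of the first NaN or Inf in the GPU output, if any.
    first_non_finite_gpu: Option<usize>,
    /// Largest absolute difference over positions where both values are finite.
    max_abs_diff: f32,
}

fn compare_outputs(cpu: &[f32], gpu: &[f32]) -> Comparison {
    assert_eq!(cpu.len(), gpu.len());
    let first_non_finite_gpu = gpu.iter().position(|v| !v.is_finite());
    let max_abs_diff = cpu
        .iter()
        .zip(gpu)
        .filter(|(c, g)| c.is_finite() && g.is_finite())
        .map(|(c, g)| (c - g).abs())
        .fold(0.0_f32, f32::max);
    Comparison {
        first_non_finite_gpu,
        max_abs_diff,
    }
}

fn main() {
    // Stand-in data: the GPU output agrees closely until index 2, which is NaN.
    let cpu = [0.25_f32, -1.5, 3.0, 0.0];
    let gpu = [0.25_f32, -1.5, f32::NAN, 0.125];
    eprintln!("{:?}", compare_outputs(&cpu, &gpu));
}
```

If `first_non_finite_gpu` is already `Some(_)` for layer 0, expert 0, the fault lies in the matvec itself; if it is `None` and `max_abs_diff` is small, the search moves to later stages or layers.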

This diagnostic is permanent — it stays in the test file because:
  1. It costs ~10 lines of code (negligible)
  2. It runs only when the assertion fires (no perf cost on PASS)
  3. Future regressions will produce immediately-actionable
     diagnostic output

Refs: M52-M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0,
      FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the test/qwen3-moe-gpu-nan-inf-diag-m-1-4 branch from 89d91b7 to 82ed76b Compare May 5, 2026 00:33
@noahgift noahgift merged commit 36b6a79 into main May 5, 2026
10 checks passed
@noahgift noahgift deleted the test/qwen3-moe-gpu-nan-inf-diag-m-1-4 branch May 5, 2026 00:50
noahgift added a commit that referenced this pull request May 6, 2026
…on DISCHARGED on gx10 (#1527)

Operator-dispatched run of the M80 heavy harness on Blackwell GB10
(gx10) against cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF
completed cleanly in 23.18s and produced the expected bisection
signal pinpointing the M-GPU-MOE-1.4 NaN root cause to **layer 6
moe_ffn_out**.

Per-layer cos-sim summary (full output:
evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt):
- L0–L5 ALL MATCH (cos > 0.99986 on both moe_router AND moe_ffn_out)
- L6 first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+ all DIVERGE on router (downstream NaN poisoning)
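For readers unfamiliar with the per-layer labels, the sketch below shows one way a MATCH / NaN_GPU / DIVERGE classification could be derived from a CPU and a GPU tensor. It is not the M80 harness code: the `LayerStatus` enum, the `classify` function, and the `threshold` parameter are assumptions for this example, and the harness's actual threshold is not stated here (the summary only reports that observed values exceeded 0.99986).

```rust
/// Hypothetical per-layer verdict mirroring the labels in the summary.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LayerStatus {
    Match,
    NanGpu,
    Diverge,
}

/// Cosine similarity accumulated in f64 to limit rounding error.
/// Returns NaN if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// NaN in the GPU tensor takes priority; otherwise compare against the threshold.
/// A NaN similarity (zero-norm input) fails the comparison and is reported as Diverge.
fn classify(cpu: &[f32], gpu: &[f32], threshold: f64) -> LayerStatus {
    if gpu.iter().any(|v| v.is_nan()) {
        LayerStatus::NanGpu
    } else if cosine_similarity(cpu, gpu) > threshold {
        LayerStatus::Match
    } else {
        LayerStatus::Diverge
    }
}

fn main() {
    let cpu = [1.0_f32, 2.0, 3.0];
    // 0.999 is an illustrative threshold, not the harness's value.
    eprintln!("{:?}", classify(&cpu, &[1.0, 2.0, 3.0], 0.999));
    eprintln!("{:?}", classify(&cpu, &[1.0, f32::NAN, 3.0], 0.999));
    eprintln!("{:?}", classify(&cpu, &[-3.0, 0.5, 1.0], 0.999));
}
```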

Decision tree firing per harness output:
  "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH:
   bug is layer-N specific (rare)."

Status promotions:
- FALSIFY-MOE-SUB-002: ALGORITHM_LEVEL_DISCHARGED → DISCHARGED
  (heavy harness ran cleanly with --include-ignored)
- FALSIFY-MOE-SUB-003: PROPOSED → DISCHARGED
  (bisection-pinpoints-stage; stage = L6 moe_ffn_out)
- FALSIFY-MOE-SUB-004: unchanged PROPOSED (M-GPU-MOE-1.4 fix PR
  pending — must cite L6 moe_ffn_out by name)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.

Architectural portability finding:
This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN
(PR #1493) was characterized on sm_89 (Ada RTX 4090). Both
architectures produce NaN at the same layer → bug is algorithmic /
numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT
pre-warming did NOT block the dispatch.

Bug surface narrowed for M-GPU-MOE-1.4 fix scope:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

YAML + evidence-only — production hot paths byte-unchanged
(additive-purity invariant pinned in v1.1.0 still holds).

`pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…tep (b) bisection result (#1528)

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b)
from operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output:
  "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH:
   bug is layer-N specific (rare)."
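The quoted branch of the decision tree can be expressed as a small predicate over per-layer verdicts. This is a sketch under stated assumptions: `LayerStatus` and `layer_specific_nan` are hypothetical names, only the one branch quoted above is modelled, and the harness's other branches are not reproduced because they are not shown here.

```rust
/// Hypothetical per-layer verdict for moe_ffn_out.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LayerStatus {
    Match,
    NanGpu,
    Diverge,
}

/// Returns Some(n) when layer n is the first NaN_GPU layer, n > 0,
/// and every earlier layer is MATCH; otherwise returns None.
fn layer_specific_nan(statuses: &[LayerStatus]) -> Option<usize> {
    let first_nan = statuses.iter().position(|s| *s == LayerStatus::NanGpu)?;
    let earlier_match = statuses[..first_nan]
        .iter()
        .all(|s| *s == LayerStatus::Match);
    (first_nan > 0 && earlier_match).then_some(first_nan)
}

fn main() {
    // Shape of the reported result: L0 to L5 match, L6 is the first NaN layer.
    let mut statuses = vec![LayerStatus::Match; 6];
    statuses.push(LayerStatus::NanGpu);
    statuses.push(LayerStatus::Diverge);
    eprintln!("layer-specific NaN at: {:?}", layer_specific_nan(&statuses));
}
```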

ARCHITECTURAL PORTABILITY:
Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3
NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both
architectures produce NaN at the same layer → bug is algorithmic /
numerical, NOT kernel codegen. A single fix at the bisected stage
discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6).
It's in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow
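All three hypotheses involve overflow, yet the observed output is NaN rather than Inf. The sketch below shows two standard IEEE 754 routes from an overflowed f32 to NaN. It is an illustration only, not a confirmed root cause: the `silu` and `swiglu` functions are plain scalar stand-ins written for this example, not the code in expert_swiglu_cuda.rs.

```rust
/// Scalar SiLU: x * sigmoid(x), written as x / (1 + e^(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Scalar SwiGLU combine step: silu(gate) * up.
fn swiglu(gate: f32, up: f32) -> f32 {
    silu(gate) * up
}

fn main() {
    // Finite inputs of moderate size give a finite result.
    assert!(swiglu(1.0, 2.0).is_finite());

    // Route 1: an overflowed `up` (Inf) meets a gate that SiLU drives to zero.
    // exp(200) overflows f32, so silu(-200) is -0.0, and -0.0 * Inf is NaN.
    let up_overflowed = f32::MAX * 2.0;
    assert!(up_overflowed.is_infinite());
    assert!(swiglu(-200.0, up_overflowed).is_nan());

    // Route 2: an accumulator that sees both +Inf and -Inf partial sums.
    let acc = f32::INFINITY + f32::NEG_INFINITY;
    assert!(acc.is_nan());

    eprintln!("overflow-to-NaN routes reproduced");
}
```

Either route is consistent with logits that are 100% NaN and 0% Inf: once a single NaN appears, it displaces any Inf values in later products and sums.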

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED —
Q/K norm runs in attention which is BEFORE FFN; if it were missing
the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0
amendment in PR #1527 records the same bisection result from
the falsifier-side. This PR records it from the parent kernel-
contract side; both refer to the same evidence dir.

YAML-only — production hot paths byte-unchanged.

`pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>