test(aprender-serve): qwen3_moe_gpu_parity — NaN/Inf diagnostic stats + live findings by noahgift · Pull Request #1493 · paiml/aprender

noahgift · 2026-05-04T23:32:55Z

Summary

Adds diagnostic eprintln before the finiteness assertion to provide bisection data when the test fires.

Live finding (2026-05-04, lambda-vector RTX 4090)

FALSIFY-QW3-MOE-GPU-PARITY-001 finiteness diagnostic:
  total      = 151936
  finite     = 0
  nan        = 151936 (first idx: Some(0))
  inf        = 0 (first idx: None)
  finite_min = inf
  finite_max = -inf

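All 151936 output logits are NaN; none are Inf and none are finite. The finite_min = inf and finite_max = -inf values are consistent with that: with no finite element to fold in, the min/max presumably never leave their initial values.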
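The statistics above can be gathered with a single pass over the logits. The following is a minimal sketch, not the code in qwen3_moe_gpu_parity.rs: the `FiniteStats` struct and `finite_stats` function are hypothetical names, and the min/max seeds of +inf / -inf are an assumption chosen to reproduce the output shown when nothing is finite.

```rust
/// Hypothetical container for the finiteness diagnostic (names are illustrative).
#[derive(Debug)]
struct FiniteStats {
    total: usize,
    finite: usize,
    nan: usize,
    first_nan: Option<usize>,
    inf: usize,
    first_inf: Option<usize>,
    finite_min: f32,
    finite_max: f32,
}

/// One pass over the logits, classifying each value as NaN, Inf, or finite.
fn finite_stats(logits: &[f32]) -> FiniteStats {
    let mut s = FiniteStats {
        total: logits.len(),
        finite: 0,
        nan: 0,
        first_nan: None,
        inf: 0,
        first_inf: None,
        // Assumed seeds: they stay at +inf / -inf if no finite value is seen.
        finite_min: f32::INFINITY,
        finite_max: f32::NEG_INFINITY,
    };
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            s.nan += 1;
            if s.first_nan.is_none() {
                s.first_nan = Some(i);
            }
        } else if v.is_infinite() {
            s.inf += 1;
            if s.first_inf.is_none() {
                s.first_inf = Some(i);
            }
        } else {
            s.finite += 1;
            s.finite_min = s.finite_min.min(v);
            s.finite_max = s.finite_max.max(v);
        }
    }
    s
}
```

Calling `finite_stats` on a slice of 151936 NaN values yields the same counts as the live finding, which is why an all-NaN output reports finite_min = inf and finite_max = -inf.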

Bisection narrowing for M-GPU-MOE-1.4

Steps 1-9 of forward_qwen3_moe_cuda (embedding → attn_norm → QKV → Q/K norm → RoPE → attention → attn_output → residual → ffn_norm) are CPU-only and shared with the CPU forward path. Since CPU produces finite output, those steps are clean.

→ Step 10 (GPU MoE FFN) is the strongest candidate. Within step 10:

- gate_q4k_matvec / up_q4k_matvec / down_q6k_gemv (GPU Q4K/Q6K matvecs)
- silu(gate) * up (CPU, deterministic if inputs finite)

High-confidence next step: instrument q4k_matvec / q6k_gemv outputs from layer-0 expert-0 against the same CPU input; check if those are already NaN.
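One way to run that check is to compare a GPU matvec output against the CPU reference for the same input and report where the GPU vector first stops being finite. This is a sketch under assumptions: `ParityReport` and `compare_outputs` are hypothetical helpers, and both outputs are assumed to have already been copied to host `f32` slices.

```rust
/// Hypothetical summary of a GPU-vs-CPU comparison for one matvec output.
#[derive(Debug)]
struct ParityReport {
    /// Index of the first NaN or Inf in the GPU output, if any.
    first_non_finite_gpu: Option<usize>,
    /// Largest absolute difference over positions where both sides are finite.
    max_abs_diff: f32,
}

fn compare_outputs(gpu: &[f32], cpu: &[f32]) -> ParityReport {
    assert_eq!(gpu.len(), cpu.len(), "outputs must have the same length");
    let first_non_finite_gpu = gpu.iter().position(|v| !v.is_finite());
    let max_abs_diff = gpu
        .iter()
        .zip(cpu)
        // Skip positions where either side is NaN/Inf so the max stays meaningful.
        .filter(|(g, c)| g.is_finite() && c.is_finite())
        .map(|(g, c)| (g - c).abs())
        .fold(0.0_f32, f32::max);
    ParityReport { first_non_finite_gpu, max_abs_diff }
}
```

If `first_non_finite_gpu` is already `Some(_)` for layer-0 expert-0, the matvec kernels themselves are implicated; if it is `None`, the NaN must enter later.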

Why permanent

Diagnostic is ~10 lines, runs only when the assertion fires (no PASS-path cost), and future regressions get immediately-actionable output.

Stacks on

PR #1492 (v1.4.0 amendment pinning the bisection plan).

Test plan

- Live-verified on RTX 4090: produces useful diagnostic
- CI ci/gate green
- PR contract(qwen3-moe-forward-gpu-v1): v1.3.0 → v1.4.0 — NaN/Inf bisection plan #1492 lands first

🤖 Generated with Claude Code

Adds diagnostic eprintln BEFORE the finiteness assertion at qwen3_moe_gpu_parity.rs:168 to provide bisection data when the test fires. Per qwen3-moe-forward-gpu-v1 v1.4.0 (PR #1492 OPEN) M-GPU-MOE-1.4 bisection plan.

Stats printed:
  total      = N
  finite     = M
  nan        = N-M (first idx: i)
  inf        = K (first idx: j)
  finite_min = (min over finite)
  finite_max = (max over finite)

LIVE EVIDENCE on lambda-vector RTX 4090 against cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct GGUF (post-M-GPU-MOE-1.3, 2026-05-04):
  total      = 151936
  finite     = 0
  nan        = 151936 (first idx: Some(0))
  inf        = 0 (first idx: None)
  finite_min = inf
  finite_max = -inf

CRITICAL FINDING: ALL output logits are NaN. None Inf, none finite.

ANALYSIS for M-GPU-MOE-1.4 bisection scope narrowing:

100% NaN at lm_head means NaN poisoning happens EARLY in the pipeline and propagates. Once any one element of a hidden state becomes NaN, downstream matmul produces NaN rows, then NaN matrices, then 100% NaN logits.

Steps 1-9 of forward_qwen3_moe_cuda (token_embed → attn_norm → QKV → per-head Q/K norm → RoPE → attention → attn_output → residual → ffn_norm) are CPU-only and SHARED with the CPU forward path which produces finite output. So step 10 (MoE FFN via moe_ffn_forward_layer_cuda → expert_swiglu_cuda) is the strongest candidate.

Within step 10 (GPU per-expert SwiGLU):
- gate_q4k_matvec (GPU): Q4K matvec, possible accumulator overflow
- up_q4k_matvec (GPU): same
- silu(gate) * up (CPU): finite if gate+up finite
- down_q6k_gemv (GPU): Q6K matvec

HIGH-CONFIDENCE NEXT STEP: instrument q4k_matvec / q6k_gemv outputs from layer 0 expert 0 of the canonical 3-token prompt; check if those are already NaN (vs the CPU LAZY-FUSED-MATVEC output for the same input).

This diagnostic is permanent: it stays in the test file because:
1. It costs ~10 lines of code (negligible)
2. It runs only when the assertion fires (no perf cost on PASS)
3. Future regressions will produce immediately-actionable diagnostic output

Refs: M52-M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0, FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
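The propagation argument in the commit message (one NaN in a hidden state poisons every downstream output) can be illustrated with a plain dense matvec. This is a minimal sketch on unquantized `f32` rows; the `matvec` function is hypothetical and is not the Q4K/Q6K kernel used by the project.

```rust
/// Dense matrix-vector product: one dot product per weight row.
fn matvec(weights: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    weights
        .iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}
```

Because every output row reads every input element, and `0.0 * NaN` is itself NaN under IEEE 754, a single NaN input makes every output NaN even when the matching weights are zero.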

…on DISCHARGED on gx10 (#1527)

Operator-dispatched run of the M80 heavy harness on Blackwell GB10 (gx10) against cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF completed cleanly in 23.18s and produced the expected bisection signal pinpointing the M-GPU-MOE-1.4 NaN root cause to **layer 6 moe_ffn_out**.

Per-layer cos-sim summary (full output: evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt):
- L0–L5 ALL MATCH (cos > 0.99986 on both moe_router AND moe_ffn_out)
- L6 first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+ all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

Status promotions:
- FALSIFY-MOE-SUB-002: ALGORITHM_LEVEL_DISCHARGED → DISCHARGED (heavy harness ran cleanly with --include-ignored)
- FALSIFY-MOE-SUB-003: PROPOSED → DISCHARGED (bisection-pinpoints-stage; stage = L6 moe_ffn_out)
- FALSIFY-MOE-SUB-004: unchanged PROPOSED (M-GPU-MOE-1.4 fix PR pending; must cite L6 moe_ffn_out by name)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.

Architectural portability finding: This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT pre-warming did NOT block the dispatch.

Bug surface narrowed for M-GPU-MOE-1.4 fix scope:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

YAML + evidence-only: production hot paths byte-unchanged (additive-purity invariant pinned in v1.1.0 still holds). `pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
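The per-layer classification reported by the harness (MATCH, DIVERGE, NaN_GPU) can be sketched as a cosine-similarity check followed by a search for the first NaN layer. This is an illustration only: `LayerStatus`, `classify`, and `first_nan_layer` are hypothetical names, the match threshold is a caller-supplied assumption (the source only reports that matching layers scored above 0.99986), and the real M80 harness may differ.

```rust
/// Hypothetical per-layer verdict, mirroring the labels in the harness output.
#[derive(Debug, PartialEq)]
enum LayerStatus {
    Match,
    Diverge,
    NanGpu,
}

/// Cosine similarity between two equal-length vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// NaN on the GPU side takes priority; otherwise compare against the threshold.
fn classify(gpu: &[f32], cpu: &[f32], threshold: f32) -> LayerStatus {
    if gpu.iter().any(|v| v.is_nan()) {
        return LayerStatus::NanGpu;
    }
    if cosine_similarity(gpu, cpu) > threshold {
        LayerStatus::Match
    } else {
        LayerStatus::Diverge
    }
}

/// Index of the first layer whose GPU output contains NaN.
fn first_nan_layer(statuses: &[LayerStatus]) -> Option<usize> {
    statuses.iter().position(|s| *s == LayerStatus::NanGpu)
}
```

Applied to the result above, layers 0 through 5 would classify as `Match` and `first_nan_layer` would return `Some(6)` for moe_ffn_out.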

…tep (b) bisection result (#1528)

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b) from operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

ARCHITECTURAL PORTABILITY: Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. A single fix at the bisected stage discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6). It's in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED: Q/K norm runs in attention, which is BEFORE FFN; if it were missing the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0 amendment in PR #1527 records the same bisection result from the falsifier-side. This PR records it from the parent kernel-contract side; both refer to the same evidence dir.

YAML-only: production hot paths byte-unchanged. `pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
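To make hypothesis 1 concrete, the sketch below shows one mechanism by which overflow in a SwiGLU expert can turn into NaN in `f32`: the `silu(gate) * up` product overflows to ±inf, and the down projection then sums infinities of opposite sign. It is an illustration of the mechanism only, not a diagnosis of the L6 bug; `expert_swiglu` here is a hypothetical single-output, unquantized stand-in for the real kernels, and the input values are made up.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// One output element of a SwiGLU expert: dot(down_row, silu(gate) * up).
fn expert_swiglu(gate: &[f32], up: &[f32], down_row: &[f32]) -> f32 {
    gate.iter()
        .zip(up)
        .zip(down_row)
        .map(|((g, u), d)| silu(*g) * u * d)
        .sum()
}
```

With gate and up activations of 1e20 in two channels and down weights of 1.0 and -1.0, each product overflows (1e40 exceeds the `f32` maximum of about 3.4e38) and the sum becomes inf + (-inf), which is NaN. With small inputs the same function stays finite.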

noahgift enabled auto-merge (squash) May 4, 2026 23:33

noahgift force-pushed the test/qwen3-moe-gpu-nan-inf-diag-m-1-4 branch from 7f678c6 to 89d91b7 Compare May 4, 2026 23:57

noahgift force-pushed the test/qwen3-moe-gpu-nan-inf-diag-m-1-4 branch from 89d91b7 to 82ed76b Compare May 5, 2026 00:33

noahgift merged commit 36b6a79 into main May 5, 2026
10 checks passed

noahgift deleted the test/qwen3-moe-gpu-nan-inf-diag-m-1-4 branch May 5, 2026 00:50

noahgift mentioned this pull request May 6, 2026

contract(trace-moe-gpu-sub-stages-v1): v1.4.0 → v1.5.0 — LIVE bisection DISCHARGED on gx10 #1527

Merged

4 tasks
