test(aprender-serve): qwen3_moe_gpu_parity — NaN/Inf diagnostic stats + live findings#1493
Merged
Adds diagnostic eprintln BEFORE the finiteness assertion at qwen3_moe_gpu_parity.rs:168 to provide bisection data when the test fires. Per qwen3-moe-forward-gpu-v1 v1.4.0 (PR #1492 OPEN) M-GPU-MOE-1.4 bisection plan.

Stats printed:
  total = N
  finite = M
  nan = N-M (first idx: i)
  inf = K (first idx: j)
  finite_min = (min over finite)
  finite_max = (max over finite)

LIVE EVIDENCE on lambda-vector RTX 4090 against cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct GGUF (post-M-GPU-MOE-1.3, 2026-05-04):
  total = 151936
  finite = 0
  nan = 151936 (first idx: Some(0))
  inf = 0 (first idx: None)
  finite_min = inf
  finite_max = -inf

CRITICAL FINDING: ALL output logits are NaN. None Inf, none finite.

ANALYSIS for M-GPU-MOE-1.4 bisection scope narrowing:

100% NaN at lm_head means NaN poisoning happens EARLY in the pipeline and propagates. Once any one element of a hidden state becomes NaN, downstream matmul produces NaN rows, then NaN matrices, then 100% NaN logits.

Steps 1-9 of forward_qwen3_moe_cuda (token_embed → attn_norm → QKV → per-head Q/K norm → RoPE → attention → attn_output → residual → ffn_norm) are CPU-only and SHARED with the CPU forward path, which produces finite output. So step 10 (MoE FFN via moe_ffn_forward_layer_cuda → expert_swiglu_cuda) is the strongest candidate.

Within step 10 (GPU per-expert SwiGLU):
- gate_q4k_matvec (GPU): Q4K matvec, possible accumulator overflow
- up_q4k_matvec (GPU): same
- silu(gate) * up (CPU): finite if gate+up finite
- down_q6k_gemv (GPU): Q6K matvec

HIGH-CONFIDENCE NEXT STEP: instrument q4k_matvec / q6k_gemv outputs from layer 0 expert 0 of the canonical 3-token prompt; check if those are already NaN (vs the CPU LAZY-FUSED-MATVEC output for the same input).

This diagnostic is permanent; it stays in the test file because:
1. It costs ~10 lines of code (negligible)
2. It runs only when the assertion fires (no perf cost on PASS)
3. Future regressions will produce immediately-actionable diagnostic output

Refs: M52-M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0, FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
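The stats listed in the commit message could be computed along the following lines. This is only a sketch, not the code in qwen3_moe_gpu_parity.rs: the names `FiniteStats`, `finite_stats`, and `report_if_non_finite` are hypothetical, and NaN is counted directly here rather than derived as N-M. Starting the min/max folds at +inf and -inf reproduces the `finite_min = inf`, `finite_max = -inf` lines seen when no value is finite.

```rust
/// Hypothetical summary of a logit vector: counts of finite, NaN, and
/// infinite values plus the range of the finite ones.
#[derive(Debug, PartialEq)]
pub struct FiniteStats {
    pub total: usize,
    pub finite: usize,
    pub nan: usize,
    pub first_nan: Option<usize>,
    pub inf: usize,
    pub first_inf: Option<usize>,
    pub finite_min: f32,
    pub finite_max: f32,
}

/// Single pass over the values; no allocation.
pub fn finite_stats(values: &[f32]) -> FiniteStats {
    let mut s = FiniteStats {
        total: values.len(),
        finite: 0,
        nan: 0,
        first_nan: None,
        inf: 0,
        first_inf: None,
        // Identity elements for min/max; they survive if nothing is finite.
        finite_min: f32::INFINITY,
        finite_max: f32::NEG_INFINITY,
    };
    for (i, &v) in values.iter().enumerate() {
        if v.is_nan() {
            s.nan += 1;
            if s.first_nan.is_none() {
                s.first_nan = Some(i);
            }
        } else if v.is_infinite() {
            s.inf += 1;
            if s.first_inf.is_none() {
                s.first_inf = Some(i);
            }
        } else {
            s.finite += 1;
            s.finite_min = s.finite_min.min(v);
            s.finite_max = s.finite_max.max(v);
        }
    }
    s
}

/// Prints the stats to stderr only when something is non-finite, then
/// returns whether every value was finite (suitable for an `assert!`).
pub fn report_if_non_finite(values: &[f32]) -> bool {
    let s = finite_stats(values);
    let all_finite = s.finite == s.total;
    if !all_finite {
        eprintln!("total = {}", s.total);
        eprintln!("finite = {}", s.finite);
        eprintln!("nan = {} (first idx: {:?})", s.nan, s.first_nan);
        eprintln!("inf = {} (first idx: {:?})", s.inf, s.first_inf);
        eprintln!("finite_min = {}", s.finite_min);
        eprintln!("finite_max = {}", s.finite_max);
    }
    all_finite
}
```

In a test, `assert!(report_if_non_finite(&logits))` would keep the PASS path silent and print the breakdown only on failure, which is the property the commit message relies on.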
noahgift added a commit that referenced this pull request on May 6, 2026
…on DISCHARGED on gx10 (#1527)

Operator-dispatched run of the M80 heavy harness on Blackwell GB10 (gx10) against cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF completed cleanly in 23.18s and produced the expected bisection signal pinpointing the M-GPU-MOE-1.4 NaN root cause to **layer 6 moe_ffn_out**.

Per-layer cos-sim summary (full output: evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt):
- L0–L5 ALL MATCH (cos > 0.99986 on both moe_router AND moe_ffn_out)
- L6 first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+ all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

Status promotions:
- FALSIFY-MOE-SUB-002: ALGORITHM_LEVEL_DISCHARGED → DISCHARGED (heavy harness ran cleanly with --include-ignored)
- FALSIFY-MOE-SUB-003: PROPOSED → DISCHARGED (bisection-pinpoints-stage; stage = L6 moe_ffn_out)
- FALSIFY-MOE-SUB-004: unchanged PROPOSED (M-GPU-MOE-1.4 fix PR pending; must cite L6 moe_ffn_out by name)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.

Architectural portability finding: This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT pre-warming did NOT block the dispatch.

Bug surface narrowed for M-GPU-MOE-1.4 fix scope:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

YAML + evidence-only: production hot paths byte-unchanged (additive-purity invariant pinned in v1.1.0 still holds). `pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
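The per-layer MATCH / NaN_GPU / DIVERGE classification described above can be sketched as follows. This is an illustration under stated assumptions, not the M80 harness itself: the names `LayerVerdict`, `classify`, and `first_nan_layer` are hypothetical, the threshold is a caller-supplied parameter (the 0.99986 figure above is an observed value, not a documented cutoff), and each layer's stage output is assumed to be available as a flat `f32` slice for both CPU and GPU.

```rust
/// Hypothetical per-layer verdict for one traced stage (e.g. moe_ffn_out).
#[derive(Debug, PartialEq)]
pub enum LayerVerdict {
    Match,
    Diverge,
    NanGpu,
}

/// Cosine similarity accumulated in f64 to limit rounding error.
/// Returns NaN if either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// NaN on the GPU side takes priority; otherwise compare against the
/// CPU reference. A NaN similarity fails the `>` test and reads as Diverge.
pub fn classify(cpu: &[f32], gpu: &[f32], threshold: f64) -> LayerVerdict {
    if gpu.iter().any(|v| v.is_nan()) {
        return LayerVerdict::NanGpu;
    }
    if cosine_similarity(cpu, gpu) > threshold {
        LayerVerdict::Match
    } else {
        LayerVerdict::Diverge
    }
}

/// Index of the first layer whose GPU output contains NaN, if any.
pub fn first_nan_layer(
    cpu_layers: &[Vec<f32>],
    gpu_layers: &[Vec<f32>],
    threshold: f64,
) -> Option<usize> {
    cpu_layers
        .iter()
        .zip(gpu_layers)
        .position(|(c, g)| classify(c, g, threshold) == LayerVerdict::NanGpu)
}
```

With traces shaped like the summary above, `first_nan_layer` would return `Some(6)` for moe_ffn_out, and checking that every earlier layer classifies as `Match` corresponds to the "layer-N specific" branch of the decision tree.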
noahgift added a commit that referenced this pull request on May 6, 2026
…tep (b) bisection result (#1528)

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b) from operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

ARCHITECTURAL PORTABILITY: Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. A single fix at the bisected stage discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6). It's in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED: Q/K norm runs in attention, which is BEFORE FFN; if it were missing the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0 amendment in PR #1527 records the same bisection result from the falsifier-side. This PR records it from the parent kernel-contract side; both refer to the same evidence dir.

YAML-only: production hot paths byte-unchanged. `pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
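To make the first hypothesis concrete, the sketch below shows how an f32 overflow in a SwiGLU-style product can turn into NaN one step later. It is a generic IEEE-754 illustration, not a diagnosis of the L6 bug and not the code in expert_swiglu_cuda.rs: `silu`, `swiglu`, and `dot` are hypothetical scalar stand-ins, and the input magnitudes are chosen only to force the overflow.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^(-x)).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU combination of a gate and an up projection.
pub fn swiglu(gate: f32, up: f32) -> f32 {
    silu(gate) * up
}

/// Plain f32 dot product, standing in for one row of a down projection.
pub fn dot(weights: &[f32], acts: &[f32]) -> f32 {
    weights.iter().zip(acts).map(|(w, a)| w * a).sum()
}

/// Walks the overflow-to-NaN chain and returns the three intermediate
/// values: (overflowed activation, signed-sum result, zero-weight result).
pub fn overflow_chain() -> (f32, f32, f32) {
    // A gate near f32::MAX times a modest `up` exceeds f32 range: +inf.
    let h = swiglu(3.0e38, 10.0);
    // Opposite-signed weights give inf + (-inf), which is NaN.
    let signed_sum = dot(&[1.0, -1.0], &[h, h]);
    // Even a zero weight does not mask it: 0 * inf is NaN.
    let zero_weight = dot(&[0.0], &[h]);
    (h, signed_sum, zero_weight)
}
```

An overflow therefore shows up as Inf at the stage where it happens and as NaN only after the next reduction, so a sub-stage trace that records Inf counts separately from NaN counts would help distinguish hypothesis 1 from a kernel that emits NaN directly.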
Summary
Adds diagnostic eprintln before the finiteness assertion to provide bisection data when the test fires.
Live finding (2026-05-04, lambda-vector RTX 4090)
100% of output logits are NaN. None Inf. None finite.
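A 100% NaN result is what IEEE-754 arithmetic predicts once a single NaN reaches a dense projection, since every output row reads every input element. The sketch below demonstrates this with a toy matvec; `matvec` and `nan_fraction` are hypothetical helpers written for this illustration, not functions from aprender-serve.

```rust
/// Dense row-major matrix-vector product in f32.
pub fn matvec(rows: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    rows.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}

/// Fraction of elements that are NaN (0.0 for an empty slice).
pub fn nan_fraction(values: &[f32]) -> f32 {
    if values.is_empty() {
        return 0.0;
    }
    let nan_count = values.iter().filter(|v| v.is_nan()).count();
    nan_count as f32 / values.len() as f32
}

/// Feeds one NaN into a small projection and reports the NaN fraction
/// of the clean and poisoned outputs.
pub fn poisoning_demo() -> (f32, f32) {
    let weights = vec![
        vec![0.5, 0.0, 1.0],
        vec![0.0, 0.0, 0.0], // even an all-zero row yields NaN: 0 * NaN = NaN
        vec![1.0, 2.0, 3.0],
    ];
    let clean = matvec(&weights, &[1.0, 2.0, 3.0]);
    let poisoned = matvec(&weights, &[1.0, f32::NAN, 3.0]);
    (nan_fraction(&clean), nan_fraction(&poisoned))
}
```

Because the poisoning saturates after one dense layer, the all-NaN logits say only that a NaN appeared somewhere upstream; locating the first stage that produces one requires per-stage tracing.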
Bisection narrowing for M-GPU-MOE-1.4
Steps 1-9 of forward_qwen3_moe_cuda (embedding → attn_norm → QKV → Q/K norm → RoPE → attention → attn_output → residual → ffn_norm) are CPU-only and shared with the CPU forward path. Since CPU produces finite output, those steps are clean.
→ Step 10 (GPU MoE FFN) is the strongest candidate. Within step 10:
- gate_q4k_matvec / up_q4k_matvec / down_q6k_gemv (GPU Q4K/Q6K matvecs)
- silu(gate) * up (CPU, deterministic if inputs finite)
High-confidence next step: instrument q4k_matvec / q6k_gemv outputs from layer-0 expert-0 against the same CPU input; check if those are already NaN.
Why permanent
Diagnostic is ~10 lines, runs only when the assertion fires (no PASS-path cost), and future regressions get immediately-actionable output.
Stacks on
PR #1492 (v1.4.0 amendment pinning the bisection plan).
Test plan
🤖 Generated with Claude Code