contract(qwen3-moe-forward-gpu-v1): v1.4.0 → v1.5.0 — M-GPU-MOE-1.4 step (b) bisection result #1528
Merged
…tep (b) bisection result

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b) from an operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output:
"If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

ARCHITECTURAL PORTABILITY:
Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. A single fix at the bisected stage discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6). It is in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED: Q/K norm runs in attention, which is BEFORE FFN; if it were missing, the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0 amendment in PR #1527 records the same bisection result from the falsifier side. This PR records it from the parent kernel-contract side; both refer to the same evidence dir.

YAML-only; production hot paths byte-unchanged. `pv validate` 0/0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
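Parallel companion PR to #1527. Records the same M-GPU-MOE-1.4 bisection result from the parent kernel-contract side. Both PRs reference the same evidence dir at `evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/`.

Bisection result: first NaN_GPU on `moe_ffn_out` = layer 6, on Blackwell GB10 (gx10), 23.18s wall.

Implementation stage promotion

M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN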
Key findings
Layer 6 is the first NaN-emitting GPU stage. L0–L5 all MATCH on both moe_router AND moe_ffn_out (cos > 0.99986). L7+ diverge from downstream NaN poisoning.
Bug is arch-portable. Reproduced on sm_120 (Blackwell GB10) — same defect class as sm_89 (Ada RTX 4090, original M-GPU-MOE-1.3 finding). → algorithmic / numerical, NOT kernel codegen.
v1.4.0's "missing Q/K RMSNorm" hypothesis is REFUTED. Q/K norm runs in attention which is before FFN; if it were missing, divergence would appear earlier than L6.
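Bug surface narrowed to:
- `moe_ffn_forward_layer_cuda.rs`
- `expert_swiglu_cuda.rs`
- `CudaExecutor::q4k_matvec` / `q6k_gemv`

Hypotheses refined for step (c) fix

1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow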
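To make the leading hypothesis for step (c) concrete (numerical overflow in the expert SwiGLU), here is a minimal CPU-side sketch in plain `f32`. It is not the code in `expert_swiglu_cuda.rs`, and it does not claim to be the actual failure at L6; the names `silu`, `swiglu`, and `first_non_finite` are hypothetical. It only illustrates how large but finite gate/up pre-activations can overflow to ±inf, and how a later reduction (as in a down projection) can then turn those infinities into NaN.

```rust
/// SiLU written the direct way: x * sigmoid(x) = x / (1 + e^{-x}).
/// Hypothetical helper for illustration; not the production kernel.
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU gating: silu(gate[i]) * up[i].
pub fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}

/// Index of the first element that is NaN or +/-inf, if any.
/// A check like this between sub-stages would localize where
/// non-finite values first appear.
pub fn first_non_finite(xs: &[f32]) -> Option<usize> {
    xs.iter().position(|v| !v.is_finite())
}
```

In this sketch, gate and up values around `1e30` are finite on their own, but their product exceeds the `f32` range and becomes ±inf; summing `+inf` and `-inf` then gives NaN. A gate that is already `-inf` gives NaN directly, since the formula evaluates to `-inf / inf`.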
Verification
Evidence + raw harness output: see #1527.
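For readers without access to the harness, the decision rule it reports ("first NaN_GPU layer > 0 and earlier layers MATCH means the bug is layer-N specific") can be sketched as follows. This is an illustrative reconstruction, not the harness source: the `Verdict` enum, the function names, and the use of a single cosine threshold are assumptions.

```rust
/// Per-layer verdict for one traced stage (e.g. `moe_ffn_out`).
/// Hypothetical type; the real harness may represent this differently.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// Cosine similarity of two equally sized activations, accumulated in f64.
pub fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Classify one layer: any NaN on the GPU side wins, otherwise compare
/// the cosine against an assumed threshold.
pub fn classify(cpu: &[f32], gpu: &[f32], threshold: f64) -> Verdict {
    if gpu.iter().any(|v| v.is_nan()) {
        return Verdict::NanGpu;
    }
    if cosine(cpu, gpu) >= threshold {
        Verdict::Match
    } else {
        Verdict::Diverge
    }
}

/// First layer index whose verdict is NanGpu, if any.
pub fn first_nan_layer(verdicts: &[Verdict]) -> Option<usize> {
    verdicts.iter().position(|v| *v == Verdict::NanGpu)
}

/// True when the first NaN layer is > 0 and every earlier layer matched.
pub fn is_layer_specific(verdicts: &[Verdict]) -> bool {
    match first_nan_layer(verdicts) {
        Some(n) if n > 0 => verdicts[..n].iter().all(|v| *v == Verdict::Match),
        _ => false,
    }
}
```

Fed a verdict sequence shaped like the one reported here (six `Match` layers, then `NanGpu`, then `Diverge`), `first_nan_layer` returns `Some(6)` and `is_layer_specific` returns `true`.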
Production hot paths
Byte-unchanged. YAML-only.
Test plan
`pv validate`: 0/0

🤖 Generated with Claude Code