contract(qwen3-moe-forward-gpu-v1): v1.4.0 → v1.5.0 — M-GPU-MOE-1.4 step (b) bisection result by noahgift · Pull Request #1528 · paiml/aprender

noahgift · 2026-05-06T07:11:07Z

Summary

Parallel companion PR to #1527. Records the same M-GPU-MOE-1.4 bisection result from the parent kernel-contract side. Both PRs reference the same evidence dir at evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/.

Bisection result: First NaN_GPU on moe_ffn_out = layer 6, on Blackwell GB10 (gx10), 23.18s wall.

Implementation stage promotion

Stage	v1.4.0	v1.5.0
M-GPU-MOE-1.4	PENDING	PARTIALLY_DISCHARGED (steps a+b done; step c fix OPEN)

Key findings

Layer 6 is the first NaN-emitting GPU stage. L0–L5 all MATCH on both moe_router AND moe_ffn_out (cos > 0.99986). L7+ diverge from downstream NaN poisoning.
Bug is arch-portable. Reproduced on sm_120 (Blackwell GB10), the same defect class as on sm_89 (Ada RTX 4090, original M-GPU-MOE-1.3 finding), so the bug is algorithmic / numerical, NOT kernel codegen.
v1.4.0's "missing Q/K RMSNorm" hypothesis is REFUTED. Q/K norm runs in attention which is before FFN; if it were missing, divergence would appear earlier than L6.
Bug surface narrowed to:
- moe_ffn_forward_layer_cuda.rs
- expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv
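The per-layer verdicts above (MATCH, DIVERGE, NaN_GPU) and the "first NaN layer" result can be sketched roughly as follows. This is an illustrative sketch only, not the harness code: the `Verdict` enum, the `classify` and `first_nan_layer` functions, and the `threshold` parameter are hypothetical names, and the actual match threshold used by the harness is not stated in this PR.

```rust
/// Per-layer verdict for one traced tensor (e.g. `moe_ffn_out`).
/// Hypothetical type; the real harness may represent this differently.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// Cosine similarity between two slices, accumulated in f64.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Classify one layer: any NaN in the GPU tensor wins, otherwise
/// compare CPU and GPU outputs by cosine similarity.
/// `threshold` is an assumed parameter, not a value taken from the harness.
fn classify(cpu: &[f32], gpu: &[f32], threshold: f64) -> Verdict {
    if gpu.iter().any(|v| v.is_nan()) {
        Verdict::NanGpu
    } else if cosine(cpu, gpu) > threshold {
        Verdict::Match
    } else {
        Verdict::Diverge
    }
}

/// Index of the first layer whose GPU tensor contains NaN, if any.
fn first_nan_layer(verdicts: &[Verdict]) -> Option<usize> {
    verdicts.iter().position(|v| *v == Verdict::NanGpu)
}
```

With verdicts shaped like the result recorded here (layers 0 to 5 `Match`, layer 6 `NanGpu`), `first_nan_layer` would return `Some(6)`, which is the case the harness decision tree treats as a layer-specific bug.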

Hypotheses refined for step (c) fix

In priority order:
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator overflow at L6
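To illustrate the mechanism behind the first hypothesis, here is a minimal CPU-side sketch of how an overflow in SwiGLU could surface as NaN one step later. It is not the code in expert_swiglu_cuda.rs and does not establish the root cause; `silu`, `swiglu`, `dot`, and `first_non_finite` are hypothetical helpers written for this example. The only facts it relies on are IEEE 754 f32 behavior: a product beyond the f32 range becomes infinity, and infinity times zero is NaN.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^-x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU: silu(gate) * up.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}

/// Plain f32 dot product, standing in for one row of a down-projection.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| x * y).sum()
}

/// Index of the first NaN or infinite element, if any.
/// Scanning intermediate buffers like this localizes where values leave range.
fn first_non_finite(xs: &[f32]) -> Option<usize> {
    xs.iter().position(|v| !v.is_finite())
}
```

For example, `swiglu(&[1.0e20, 1.0], &[1.0e20, 2.0])` yields an infinite first element, and taking its `dot` with `[0.0, 1.0]` then yields NaN, so the NaN appears in the projection output even though the overflow happened in the activation. Checking for non-finite values (not just NaN) at each sub-stage would distinguish that case from a NaN produced directly by the matvec.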

Verification

$ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.

Evidence + raw harness output: see #1527.

Production hot paths

Byte-unchanged. YAML-only.

Test plan

pv validate 0/0
Cross-references #1527, "contract(trace-moe-gpu-sub-stages-v1): v1.4.0 → v1.5.0 — LIVE bisection DISCHARGED on gx10" (sibling falsifier-side amendment)
Implementation stage status promoted PENDING → PARTIALLY_DISCHARGED
Refuted v1.4.0 Q/K-norm hypothesis is documented

🤖 Generated with Claude Code

…tep (b) bisection result

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b) from operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

ARCHITECTURAL PORTABILITY:
Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. A single fix at the bisected stage discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6). It's in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED — Q/K norm runs in attention which is BEFORE FFN; if it were missing the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0 amendment in PR #1527 records the same bisection result from the falsifier-side. This PR records it from the parent kernel-contract side; both refer to the same evidence dir.

YAML-only — production hot paths byte-unchanged. `pv validate` 0/0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 6, 2026 07:11

Merge branch 'main' into contract/qwen3-moe-gpu-v1.5.0-bisection-result

7bceb7a

noahgift merged commit 22d4e00 into main May 6, 2026
10 checks passed

noahgift deleted the contract/qwen3-moe-gpu-v1.5.0-bisection-result branch May 6, 2026 07:49

noahgift mentioned this pull request May 6, 2026

fix(M-GPU-MOE-1.4 step c): qtype-aware dispatch in expert_swiglu_cuda — closes L6 moe_ffn_out NaN #1529

Merged

5 tasks

noahgift merged 2 commits into main from contract/qwen3-moe-gpu-v1.5.0-bisection-result
