contracts(qwen3-moe-forward-gpu): v1.7.2 → v1.8.0 — M-GPU-MOE-3 cascade DISCHARGE — true root cause is CPU/CUDA activation-qtype mismatch (#1583)#1825
Merged
noahgift merged 2 commits intoMay 19, 2026
Conversation
…de DISCHARGE amendment (#1583 spec advancement) Discharge amendment for the full M-GPU-MOE-3 cascade. Seven falsifier PRs (#1801, #1805, #1811, #1816, #1818, #1821, #1822) empirically pinned the true root cause of the 0.94-cos drop on real Qwen3 layers L7/L9/L12/L20/L23/L29/L46. ## TRUE root cause CPU fused_q4k_parallel_matvec = Q4_K(weights) × Q8_K(activations) CUDA q4k_matvec = Q4_K(weights) × f32_activations Different mathematical operations. The 2.88% per-matvec delta is lossy Q8_K activation quantization. Compounded across 128 experts × 48 layers, it produces the observed ~6% cumulative cos drop. ## What this amendment REFUTES - v1.7.2's 'per-expert SwiGLU f32 intermediates' attribution → refuted by #1818 (SwiGLU intrinsic precision is ulp-scale) - v1.0.0..v1.7.1 'Q6_K fp-accumulator-order' framing → refuted by #1801 (synthetic ulp-scale) + #1816 (real Q6_K ulp-scale) - Q6_K-specific root-cause hypothesis → refuted by #1816's structural finding (L7/L9/L12 are Q4_K, not Q6_K) ## Status change v1.7.2: ACTIVE_ALGORITHM_LEVEL (with wrong SwiGLU attribution) v1.8.0: ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE - 47/48 layers cos≥0.99 stands - root cause documented (activation-qtype algorithm mismatch) - L47 cliff is the natural compositional consequence - 0.94-cos on 7 problem layers is documented, not a bug ## Fix paths (OUT OF SCOPE for this PR; tracked as M-GPU-MOE-3 PR-4) OPTION 1: CPU uses f32 activations (slow CPU) OPTION 2: CUDA uses Q8_K activations (RECOMMENDED — DP4A faster) PackedDp4aQ4KQ8Kernel already exists; just need CUDA f32→Q8_K activation quant kernel to feed it. OPTION 3: Document divergence; relax cos threshold ## Validation - python3 yaml.safe_load: PASS - pv validate contracts/qwen3-moe-forward-gpu-v1.yaml: 0 errors, 0 warnings ## Cross-refs - Issue: #1583 (M-GPU-MOE-3) - Cascade: #1801, #1805, #1811, #1816, #1818, #1821, #1822 - Sibling: tests/qwen3_moe_per_layer_gpu_parity.rs (FALSIFY-QW3-MOE-PER-LAYER-001) — real-model parity gate Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Discharge amendment for the full M-GPU-MOE-3 cascade. Seven falsifier PRs (#1801, #1805, #1811, #1816, #1818, #1821, #1822) empirically pinned the true root cause.
True root cause
They compute DIFFERENT MATHEMATICAL OPERATIONS. The 2.88% per-matvec delta is lossy Q8_K activation quantization. Compounded across 128 experts × 48 layers, it produces the observed ~6% cumulative cos drop on the 7 problem layers (L7/L9/L12/L20/L23/L29/L46).
What v1.8.0 REFUTES
Status change
The 47/48 layers cos≥0.99 measurement stands; the L47 cliff and the 0.94-cos drop on 7 problem layers are now documented as the natural compositional consequence of the CPU/CUDA algorithm mismatch.
Fix paths (OUT OF SCOPE for this PR; tracked as M-GPU-MOE-3 PR-4)
Test plan
Cross-refs
🤖 Generated with Claude Code