docs(m-gpu-moe-3): scope update — PR-2 landed in-tree, L47 sub-cascade surfaced (#1583) #1740
Merged
…e surfaced (#1583)

PR-3c of the M-GPU-MOE-3 cascade. Updates m-gpu-moe-3-scope.md with the actual landed state and the new L47 single-layer sub-cascade.

WHAT CHANGED

Cascade table now reflects shipped PRs:

- PR-1 ✅ shipped (#1713): per-layer cos falsifier.
- PR-2 ✅ shipped (#1737): fp64 accumulators in Q6KGemvKernel. Note: in-tree at crates/aprender-gpu/src/kernels/quantize/q6k/gemv.rs (the original "../trueno" reference was stale after the monorepo consolidation subsumed trueno-gpu).
- PR-3 ✅ ran (manual hardware verification on lambda-vector RTX 4090, 2026-05-17): 47/48 layers cos ≥ 0.99, L47 alone at cos=0.961236. Evidence in #1583 comment-4470195446.
- PR-3b ✅ shipped (#1739): contract v1.7.0 → v1.7.1.
- PR-3c ✅ this update.
- PR-3d ✅ ran: H(i) qtype-mismatch FALSIFIED. apr tensors shows L0, L46, L47 have identical shapes + qtypes. Evidence in #1583 comment-4470216021.

New sub-cascade for L47:

- PR-3e (pending): routing-divergence falsifier for H(ii). Hypothesis: per-layer cosine is ACCUMULATED drift, not per-kernel divergence. By L47 the CPU-vs-GPU hidden state has drifted by ~0.002. If that drift straddles a top-k expert boundary at L47, CPU and GPU pick different expert sets and the FFN output diverges by O(1), matching the 0.961 cliff. The falsifier extends SaveTensorStage::MoeRouter (or adds a sibling stage) to persist top-k EXPERT INDICES alongside the weights, then asserts CPU index set == GPU index set at L47.
- PR-3f+ (pending): L47 fix based on PR-3e outcome.
  - If H(ii) confirmed: deterministic tie-breaking in expert ordering OR fp64 MoE gate softmax OR f64 expert selection with f32 post-conversion.
  - If H(ii) dead: per-expert weight cancellation pathology investigation (capture FfnGate + FfnUp + FfnSwigl at L47).

Parallel work:

- PR-4 (throughput ≥150 tok/s + VRAM ≤95%): independent of L47 sub-cascade.
- PR-5 (contract v1.7.1 → v1.8.0 ACTIVE_RUNTIME): gates on PR-3f+ AND PR-4.

REPRODUCTION

    cargo test --release --features cuda \
      -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
      -- --ignored --nocapture

27.92s on RTX 4090. Test source: crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs

DOC-ONLY PR

No code changes. Production hot paths byte-unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
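The H(ii) hypothesis in the message above (a small drift straddling a top-k boundary changes the selected expert set) can be illustrated with a toy example. This is a minimal sketch, not the project's router: `top_k_set` is a hypothetical helper, the four scores are invented, and only the ~0.002 drift magnitude comes from the message. Ties are broken by lower index, which is one reading of the "deterministic tie-breaking" candidate fix.

```rust
/// Hypothetical helper: indices of the k largest scores, returned in
/// ascending index order. Equal scores are broken by lower index first,
/// so the result does not depend on the sort implementation.
fn top_k_set(scores: &[f32], k: usize) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    // Descending by score, then ascending by index.
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]).then(a.cmp(&b)));
    let mut picked: Vec<usize> = order.into_iter().take(k).collect();
    picked.sort_unstable();
    picked
}

fn main() {
    // Invented router scores: experts 1 and 2 sit on the k = 2 boundary.
    let cpu = [0.900_f32, 0.501, 0.500, 0.100];

    // Same scores after a drift of 0.002 on expert 2.
    let mut gpu = cpu;
    gpu[2] += 0.002;

    println!("cpu top-2 = {:?}", top_k_set(&cpu, 2));
    println!("gpu top-2 = {:?}", top_k_set(&gpu, 2));
}
```

The two score vectors differ by 0.002 in a single entry, yet the selected sets differ in one of two experts, which is the kind of discontinuity the falsifier is meant to detect.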
noahgift added a commit that referenced this pull request on May 17, 2026
…router weights diverge (#1583) (#1741)

PR-3e of the M-GPU-MOE-3 cascade. Adds `falsify_qw3_moe_l47_router_probe` to disambiguate H(ii) routing-divergence from post-routing divergence at L47.

## Result (lambda-vector RTX 4090, 2026-05-17)

    L## | MoeRouter | MoeFfnOut
    L00 | 1.000000  | 0.999999
    L01 | 1.000000  | 1.000000
    ...
    L46 | 1.000000  | 0.998498
    L47 | 0.992558  | 0.961236   <-- FfnOut BELOW 0.99

**Dispositive evidence:** L0..L46 router cos = 1.000000 (byte-identical between CPU and GPU). L47 router cos = 0.992558, the FIRST and ONLY layer where router weights diverge. This is a sharp transition, exactly the signature of an accumulated-drift threshold being crossed at the L47 softmax/top-k boundary.

## Verdict

H(ii) routing-divergence is INCONCLUSIVE from weights alone (router cos 0.992558 is between 0.99 and 0.995). The saved MoeRouter tensor is `[k=8]` post-softmax+renormalize WEIGHTS in descending order. If CPU and GPU pick DIFFERENT 8 experts, both vectors are still sorted descending and cos can be near 1.0 even with disjoint sets. Indices are required to definitively confirm or falsify SET divergence.

PR-3e2 will add `SaveTensorStage::MoeRouterIndices` to persist the top-k INDICES alongside the weights. Once indices are captured, the test can assert CPU set == GPU set at L47 to lock in H(ii).

## What's in this PR

crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs:
+ new helper `make_router_and_ffn_out_plan` capturing both `moe_router` and `moe_ffn_out` stages for all 48 layers
+ new test `falsify_qw3_moe_l47_router_probe` printing the per-layer router+ffn_out cos vector side-by-side and a verdict line classifying H(ii) status (FALSIFIED / ALIVE / INCONCLUSIVE)
+ same `#[ignore]` + `#![cfg(feature = "cuda")]` gates as the existing falsifier; runs only on RTX 4090 with the cached 30B GGUF

This is a PROBE, not a hard-fail falsifier. The test prints the verdict; it does not assert. The verdict drives the next PR's investigation target.

## Cascade context

- PR-1 #1713: per-layer cos falsifier
- PR-2 #1737: q6k_gemv fp64 accumulators
- PR-3 hardware-verify (manual): 47/48 PASS, L47 surfaces
- PR-3b #1739: contract v1.7.0 → v1.7.1 (in flight)
- PR-3c #1740: scope-doc update + L47 sub-cascade (in flight)
- PR-3d: H(i) qtype-mismatch FALSIFIED (#1583 comment-4470216021)
- PR-3e: **this PR**, H(ii) router-weight probe
- PR-3e2 (pending): capture top-k INDICES via new `SaveTensorStage::MoeRouterIndices` variant
- PR-3f+ (pending): L47 fix based on PR-3e/PR-3e2 outcome

## Reproduction

    cargo test --release --features cuda \
      -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
      falsify_qw3_moe_l47_router_probe \
      -- --ignored --nocapture

7s on RTX 4090 (after build).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
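The verdict's central point, that sorted-descending router weights can have cosine near 1.0 even when the selected experts are disjoint, can be checked with a small sketch. Everything here is illustrative: `cosine` and `shared_experts` are hypothetical helpers, and the weights and indices are invented rather than taken from the trace.

```rust
/// Cosine similarity with the dot product and norms accumulated in f64.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Number of expert indices that appear in both top-k lists.
fn shared_experts(a: &[u32], b: &[u32]) -> usize {
    a.iter().filter(|&&i| b.contains(&i)).count()
}

fn main() {
    // Invented post-softmax weights, sorted descending (k = 8).
    let weights = [0.30_f32, 0.20, 0.15, 0.10, 0.09, 0.07, 0.05, 0.04];

    // Two routers that picked completely different experts but happen
    // to assign the same sorted weight profile.
    let cpu_idx: [u32; 8] = [0, 1, 2, 3, 4, 5, 6, 7];
    let gpu_idx: [u32; 8] = [8, 9, 10, 11, 12, 13, 14, 15];

    println!("weight cos     = {:.6}", cosine(&weights, &weights));
    println!("shared experts = {}", shared_experts(&cpu_idx, &gpu_idx));
}
```

The weight vectors carry no information about which experts produced them, so a cosine check on weights alone cannot distinguish "same experts, slightly different weights" from "different experts". That is why the indices have to be captured separately.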
noahgift added a commit that referenced this pull request on May 17, 2026
…GN (#1583) (#1745)

PR-3g of the M-GPU-MOE-3 cascade. Adds the canonical "is L47 actually user-visible" falsifier: it runs 4 canonical prompts through both CPU and GPU full forwards and reports argmax agreement.

## Result (lambda-vector RTX 4090, 2026-05-17)

    PROMPT          | CPU argmax (val) | GPU argmax (val)
    canonical_3tok  | 944 ( 13.7270)   | 944 ( 14.4133)  ✓
    single_tok_785  | 220 ( 15.5523)   |  25 ( 18.5098)  ✗ MISMATCH
    multi_tok_short | 315 ( 26.2279)   | 315 ( 25.5230)  ✓
    multi_tok_code  | 198 ( 17.7453)   | 198 ( 17.8433)  ✓

**3/4 prompts agree, 1 disagrees.** L47 cliff is NOT benign: the expert-set divergence DOES flip the top-1 predicted token for some prompts (~25% in this small sample). Option E (Accept) is off the table; must pursue Option C (fp64 in per-expert SwiGLU).

## What this PR adds

crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs:
+ new test `falsify_qw3_moe_gpu_argmax_agreement`: multi-prompt probe that builds CPU + GPU models once, runs 4 canonical prompts through both full forwards, and prints argmax agreement table + verdict. PROBE, not hard-assert; prints "BENIGN" if all agree or "NOT BENIGN" + disagreeing prompts otherwise.

## Cascade context

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3 ✅ hardware verify: 47/48 PASS, L47 surfaces
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 #1743 ✅ H(ii) CONFIRMED (2-of-8 expert swap)
- PR-3f1 ❌ falsified (fp64 softmax), dropped
- PR-3f2 ❌ falsified (f64 weighted-sum), dropped
- PR-3g ✅ **THIS PR**: L47 NOT BENIGN, must pursue fix
- PR-3h pending: Option C fp64 in per-expert SwiGLU intermediates

## Why the cascade kept eliminating candidates

The 3-falsifier sequence ruled out the "easy" fix locations:

1. PR-3f1 (gate softmax precision): drift upstream of softmax
2. PR-3f2 (weighted-sum precision): drift upstream of weighted-sum
3. **Remaining**: drift inside each per-expert SwiGLU's intermediate chain (silu × up at f32, down-proj at f32 except its q6k_gemv acc which PR-2 already promoted to fp64)

PR-3h must promote the silu(gate) × up element-wise multiply and the hidden-dim×4 intermediate state to f64. ~30-50 LOC across both CPU and CUDA expert_swiglu helpers.

## Reproduction

    cargo test --release --features cuda \
      -p aprender-serve --test qwen3_moe_gpu_parity \
      falsify_qw3_moe_gpu_argmax_agreement \
      -- --ignored --nocapture

~25s on RTX 4090.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
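The promotion that PR-3h is described as needing (silu(gate) × up computed and held in f64) might look roughly like the following on the CPU side. This is a sketch under assumptions: the function names are hypothetical, the real `expert_swiglu` helpers are not shown in this thread, and the inputs are invented.

```rust
/// SiLU (x * sigmoid(x)) evaluated in f64.
fn silu_f64(x: f64) -> f64 {
    x / (1.0 + (-x).exp())
}

/// Baseline: silu(gate) * up with every intermediate held in f32.
fn swiglu_f32(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| (g / (1.0 + (-g).exp())) * u)
        .collect()
}

/// Promoted variant: widen the inputs to f64 first and keep the
/// element-wise product in f64 for the next stage to consume.
fn swiglu_f64(gate: &[f32], up: &[f32]) -> Vec<f64> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| silu_f64(f64::from(g)) * f64::from(u))
        .collect()
}

/// Largest absolute gap between the f32 and f64 intermediates.
fn max_abs_gap(gate: &[f32], up: &[f32]) -> f64 {
    swiglu_f32(gate, up)
        .iter()
        .zip(swiglu_f64(gate, up))
        .map(|(&lo, hi)| (f64::from(lo) - hi).abs())
        .fold(0.0, f64::max)
}

fn main() {
    // Invented activations, for illustration only.
    let gate = [1.0_f32, -2.0, 0.25, 3.5];
    let up = [0.5_f32, 4.0, -1.5, 0.125];
    println!("max |f32 - f64| = {:e}", max_abs_gap(&gate, &up));
}
```

The per-element gap is tiny in isolation. The cascade's argument is that such gaps compound across 47 layers until they move a router score across the top-k boundary, so whether this promotion is sufficient would still need to be verified by the existing falsifiers.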
noahgift added a commit that referenced this pull request on May 18, 2026
…sifier — H(ii) CONFIRMED (#1583) (#1743)

* feat(m-gpu-moe-3): PR-3e2 MoeRouterIndices stage + L47 expert-set falsifier — H(ii) CONFIRMED (#1583)

PR-3e2 of the M-GPU-MOE-3 cascade. Adds `SaveTensorStage::MoeRouterIndices` to definitively confirm or falsify H(ii) expert-set divergence at L47.

    L47 sorted top-8:
    cpu = [   2,  20,  36,  57,  60,  73, 111, 120 ]
    gpu = [   2,  12,  36,  57,  60, 103, 111, 120 ]
                 ^^^                 ^^^
    cpu-only={20, 73}; gpu-only={12, 103}

CPU and GPU agree on 6 of 8 experts at L47 but disagree on 2 (mild H(ii) confirmation). All other 47 layers produce IDENTICAL expert SETS between CPU and GPU.

Root cause: by L47 the accumulated post-routing drift from per-expert q6k_gemv fp64 accumulation through 47 layers of MoeFfnOut has perturbed the gate input enough that two boundary expert scores swap. The resulting FFN output diverges by O(1) because the disjoint experts produce unrelated outputs.

- **Deterministic tie-breaking**: sort top-k by (-prob, +index)
- **fp64 gate softmax**: W_gate @ x → softmax → renormalize at fp64
- **Reorder-stable top-k**: stable partial sort + ε-tolerance on the (k+1)-th vs k-th score boundary

inference_trace/save_tensor_stage.rs:
+ `MoeRouterIndices` enum variant + "moe_router_indices" name
+ `is_index_payload(&self)` helper
+ `ALL` array 22 → 23; per_layer count 20 → 21; tests renamed

gguf/qwen3_moe_load.rs + gguf/cuda/moe_ffn_forward_layer_cuda.rs:
+ traced `_with_router` helpers now return `(output, weights, indices)` instead of `(output, weights)`

gguf/inference/forward/forward_qwen3_moe_traced.rs (CPU)
gguf/cuda/forward_qwen3_moe_cuda_traced.rs (CUDA):
+ capture `last_router_top_k_indices` from helper
+ emit `MoeRouterIndices` stage (indices cast to f32, lossless for num_experts ≤ 2^24)

tests/qwen3_moe_per_layer_gpu_parity.rs:
+ helpers `make_router_indices_plan` + `read_indices_stage_file`
+ new test `falsify_qw3_moe_l47_router_indices`: definitive H(ii) falsifier; captures top-k INDICES at every layer for both CPU and GPU, sorts each, asserts set equality, prints L47-specific verdict

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3 ✅ hardware verify (47/48 PASS, L47 surfaces)
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 ✅ **THIS PR**: H(ii) CONFIRMED
- PR-3f+ pending: apply one of the 3 candidate fixes

    cargo test --release --features cuda \
      -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
      falsify_qw3_moe_l47_router_indices \
      -- --ignored --nocapture

29.5s on RTX 4090.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(m-gpu-moe-3): update save_tensor_plan tests for 23 stages (PR-3e2 #1583)

PR-3e2 added `SaveTensorStage::MoeRouterIndices` (22 → 23 stages) but missed updating the parallel tests in `save_tensor_plan.rs` that asserted on the constant `22`. Workspace-test CI surfaced this:

    test inference_trace::save_tensor_plan::tests::all_keyword_expands_to_twenty_two_stages ... FAILED
    test inference_trace::save_tensor_plan::tests::all_keyword_case_insensitive ... FAILED

Two fixes:

1. Rename `all_keyword_expands_to_twenty_two_stages` → `all_keyword_expands_to_all_stages` and assert against `SaveTensorStage::ALL.len()` (currently 23) instead of the hardcoded `22`. Future stage additions won't require touching this test.
2. Same change in `all_keyword_case_insensitive`: assert against `SaveTensorStage::ALL.len()` instead of `22`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
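The set comparison reported above can be reproduced from the two L47 index lists in the commit message. The sketch below is not the test's actual implementation: `only_in` and `roundtrips_through_f32` are hypothetical helpers. The second one illustrates why storing indices as f32 is described as lossless up to 2^24.

```rust
use std::collections::BTreeSet;

/// L47 sorted top-8 expert indices, copied from the commit message.
const CPU_L47: [u32; 8] = [2, 20, 36, 57, 60, 73, 111, 120];
const GPU_L47: [u32; 8] = [2, 12, 36, 57, 60, 103, 111, 120];

/// Elements of `a` that do not occur in `b`, in ascending order.
fn only_in(a: &[u32], b: &[u32]) -> Vec<u32> {
    let a: BTreeSet<u32> = a.iter().copied().collect();
    let b: BTreeSet<u32> = b.iter().copied().collect();
    a.difference(&b).copied().collect()
}

/// True if an index survives a u32 -> f32 -> u32 round trip unchanged.
/// f32 has a 24-bit significand, so every integer up to 2^24 is exact.
fn roundtrips_through_f32(idx: u32) -> bool {
    (idx as f32) as u32 == idx
}

fn main() {
    println!("cpu-only = {:?}", only_in(&CPU_L47, &GPU_L47));
    println!("gpu-only = {:?}", only_in(&GPU_L47, &CPU_L47));
    println!(
        "all indices f32-safe = {}",
        CPU_L47.iter().chain(&GPU_L47).all(|&i| roundtrips_through_f32(i))
    );
}
```

Comparing sets rather than ordered lists keeps the check insensitive to the order in which each backend emits its top-k, which matches the "sorts each, asserts set equality" description.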
Summary

PR-3c of the #1583 M-GPU-MOE-3 cascade. Updates docs/specifications/aprender-gpu/m-gpu-moe-3-scope.md to reflect the actual landed state and surface the new L47 single-layer sub-cascade.

Doc-only: no code changes.

Cascade landed state (cascade map now reflects this)

Q6KGemvKernel, in-tree)

L47 sub-cascade plan

Independent of:

Why this matters

The original cascade map said PR-2 would land in ../trueno and PR-3 would be "contiguous super-block chunking". Both stale: crates/aprender-gpu. PR-2 landed there.

Test plan
🤖 Generated with Claude Code