feat(aprender-gpu): q6k_gemv fp64 accumulators — M-GPU-MOE-3 PR-2 (#1583) by noahgift · Pull Request #1737 · paiml/aprender

noahgift · 2026-05-17T07:30:54Z

Summary

PR-2 of the #1583 M-GPU-MOE-3 cascade. Closes the 7-layer MoE FFN-out cosine regression (L7 / L9 / L12 / L20 / L23 / L29 / L46 sitting at 0.94–0.987 vs CPU fused_q6k_parallel_matvec) by promoting the q6k_gemv per-lane accumulator path to fp64, mirroring the GH-561 pattern already shipped in nf4_rmsnorm_gemv and the NF4 GEMM kernel.

PR-1 of this cascade (#1713) shipped the per-layer cosine falsifier this fix is targeted at.

Kernel diff (`crates/aprender-gpu/src/kernels/quantize/q6k/gemv.rs`)

acc and thread_partial are now mov_f64_imm_zero (was mov_f32_imm).
Inner 8-FMA-per-thread loop uses fma_f64_acc_inplace: x_val and dequant (both f32) are promoted to f64 inside the helper; thread_partial stays f64 across iterations.
Per-superblock add_f64_inplace(acc, thread_partial) (new helper).
Final cvt.rn.f32.f64 just before the warp-reduce — the 5 shfl.sync.down.b32 adds and st.global.f32 stay f32 (shfl is 32-bit only).
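The accumulator change above can be illustrated on the CPU. The following is a minimal sketch, not the kernel itself: `dot_f32_acc`, `dot_f64_acc`, and the `superblock` chunk size are hypothetical names chosen for illustration, and the real kernel emits PTX rather than Rust arithmetic.

```rust
/// Pre-change shape: every FMA accumulates into an f32 register.
fn dot_f32_acc(x: &[f32], w: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (&xv, &wv) in x.iter().zip(w) {
        acc = xv.mul_add(wv, acc);
    }
    acc
}

/// Post-change shape: f32 inputs are widened per FMA, the per-superblock
/// partial and the running total stay f64, and the result is narrowed once.
fn dot_f64_acc(x: &[f32], w: &[f32], superblock: usize) -> f32 {
    assert!(superblock > 0, "chunk size must be non-zero");
    let mut acc = 0.0f64;
    for (xs, ws) in x.chunks(superblock).zip(w.chunks(superblock)) {
        let mut thread_partial = 0.0f64;
        for (&xv, &wv) in xs.iter().zip(ws) {
            thread_partial = f64::from(xv).mul_add(f64::from(wv), thread_partial);
        }
        acc += thread_partial; // one f64 add per superblock
    }
    acc as f32 // single downcast, analogous to cvt.rn.f32.f64
}
```

With inputs such as `x = [1e8, 1.0, -1e8]` and unit weights, the f32 accumulator absorbs the `1.0` and returns `0.0`, while the f64 path returns `1.0`.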

Builder addition (`crates/aprender-gpu/src/ptx/builder/inplace_ops.rs`)

New add_f64_inplace(dst, src): dst += src, both f64, RN. Pairs with the existing GH-561 helpers (mov_f64_imm_zero, fma_f64_acc_inplace, cvt_f32_f64_rn).
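As a rough picture of what such a helper does, here is a minimal sketch of a text-emitting builder. `PtxBody` and its methods are hypothetical stand-ins, not the aprender-gpu builder type, and the register names are placeholders. It relies on PTX `add.f64` defaulting to round-to-nearest-even when no rounding modifier is given.

```rust
/// Hypothetical stand-in for a PTX text builder (not the aprender-gpu type).
struct PtxBody {
    lines: Vec<String>,
}

impl PtxBody {
    fn new() -> Self {
        Self { lines: Vec::new() }
    }

    /// Emit `dst += src` for two f64 registers.
    /// With no rounding modifier, PTX `add.f64` rounds to nearest even.
    fn add_f64_inplace(&mut self, dst: &str, src: &str) {
        self.lines.push(format!("add.f64 {dst}, {dst}, {src};"));
    }

    /// Join the emitted instructions into one PTX fragment.
    fn render(&self) -> String {
        self.lines.join("\n")
    }
}
```

Calling `add_f64_inplace("%fd1", "%fd2")` on a fresh builder renders the single line `add.f64 %fd1, %fd1, %fd2;`.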

Falsification test

falsify_m_gpu_moe_3_pr2_kernel_emits_fp64_accumulators asserts the emitted PTX contains:

fma.f64 for the per-lane FMA loop
add.f64 for the thread_partial → acc accumulation
mov.f64 / mov.b64 for the f64 accumulator init
cvt.rn.f32.f64 downcast before warp-reduce
shfl.sync.down.b32 warp-reduce (unchanged)
st.global.f32 final store (unchanged)

If this drifts back to all-.f32 the 7-layer regression returns.
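An emit-assertion of this kind can be sketched as a substring check over the generated PTX text. The marker strings below are copied from the list above; `missing_markers` is a hypothetical helper, not the in-tree test, and the sample inputs in the usage note are marker lists rather than valid PTX.

```rust
/// Markers that must each appear in the emitted PTX (taken from the list above).
const REQUIRED_MARKERS: [&str; 5] = [
    "fma.f64",
    "add.f64",
    "cvt.rn.f32.f64",
    "shfl.sync.down.b32",
    "st.global.f32",
];

/// Return every required marker that is absent from `ptx`.
/// The accumulator init may be spelled either `mov.f64` or `mov.b64`.
fn missing_markers(ptx: &str) -> Vec<&'static str> {
    let mut missing: Vec<&'static str> = REQUIRED_MARKERS
        .iter()
        .copied()
        .filter(|m| !ptx.contains(*m))
        .collect();
    if !(ptx.contains("mov.f64") || ptx.contains("mov.b64")) {
        missing.push("mov.f64 | mov.b64");
    }
    missing
}
```

A test would then assert that `missing_markers(&emitted_ptx)` is empty; an all-`.f32` kernel reports the four fp64 markers as missing.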

Test plan

82/82 q6k unit tests pass (81 existing + this new emit assertion)
No q6k regressions
rustfmt --check clean on touched files
Follow-up PR-3: run cosine measurement on lambda-labs / gx10 against the 7 problem MoE layers — cos ≥ 0.99 target
Follow-up PR-4+: throughput tuning to ≥ 150 tok/s on RTX 4090, VRAM ≤ 95%

Cost / precision tradeoff

Extra cvt.rn.f64.f32 per FMA, plus one f64 add per superblock.
Amortised over the ~64 superblocks × 8 FMAs per thread, the fp64 precision win dominates the added cost.
nf4_rmsnorm_gemv and NF4 GEMM already pay this cost shape with no measurable throughput regression on sm_89 / sm_121.
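The cost shape can be checked with simple arithmetic. This sketch assumes one 32-lane warp per output element (inferred from the 5-step shuffle reduce, since 2^5 = 32) and takes the superblock and per-thread FMA counts from the text. The constant names are illustrative only.

```rust
const SUPERBLOCKS: usize = 64; // "~64 superblocks" from the text
const LANES_PER_WARP: usize = 32; // assumption: 5 shfl-down steps imply 2^5 lanes
const FMAS_PER_THREAD: usize = 8; // "8-FMA-per-thread"

/// Total FMAs contributing to one output element across the warp.
fn fmas_per_output() -> usize {
    SUPERBLOCKS * LANES_PER_WARP * FMAS_PER_THREAD
}

/// f64 adds each lane performs when folding thread_partial into acc.
fn f64_adds_per_lane() -> usize {
    SUPERBLOCKS
}
```

Under these assumptions the product is 16384 FMAs per output element, in line with the "~16K" figure quoted in the commit message, against only 64 extra f64 adds per lane.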

🤖 Generated with Claude Code

Closes 7-layer MoE cosine regression (L7/L9/L12/L20/L23/L29/L46 at 0.94–0.987 vs CPU fused_q6k_parallel_matvec) by promoting the per-lane accumulator path in q6k_gemv to f64, mirroring the GH-561 pattern already shipped in nf4_rmsnorm_gemv and the NF4 GEMM kernel.

Kernel changes (`crates/aprender-gpu/src/kernels/quantize/q6k/gemv.rs`):
- `acc` and `thread_partial` are now `mov_f64_imm_zero` (was `mov_f32_imm`)
- Inner 8-FMA-per-thread loop uses `fma_f64_acc_inplace`: x_val + dequant (both f32) are promoted to f64 inside the helper; thread_partial stays f64 across iterations
- Per-superblock `add_f64_inplace(acc, thread_partial)` (new helper)
- Final `cvt.rn.f32.f64` just before the warp-reduce — the 5 shfl-down-add reductions stay f32 (shfl.sync.down.b32 is the only primitive we expose) and the final `st.global.f32` is unchanged
- Cost: ~16K extra cvt.rn.f64.f32 + ~64 f64 adds per output element; amortised over the ~64 superblocks of FMAs the precision win dominates

Builder addition (`crates/aprender-gpu/src/ptx/builder/inplace_ops.rs`):
- New `add_f64_inplace(dst, src)` — `dst += src`, both f64, round-to-nearest. Pairs with the existing `mov_f64_imm_zero`, `fma_f64_acc_inplace`, and `cvt_f32_f64_rn` GH-561 helpers.

Falsification test:
- `falsify_m_gpu_moe_3_pr2_kernel_emits_fp64_accumulators` asserts the emitted PTX contains `fma.f64`, `add.f64`, `mov.f64`/`mov.b64`, the `cvt.rn.f32.f64` downcast, the unchanged `shfl.sync.down.b32` warp-reduce, and the final `st.global.f32`. If this drifts back to all-`.f32` the 7-layer regression returns.

Test status (host without CUDA driver):
- 82/82 q6k unit tests pass (81 existing + this new emit assertion)
- 0 q6k regressions
- Broader cargo-test failures are pre-existing CUDA-driver tests (cublas/cuda_graph/driver/memory_fuzz) requiring a live GPU — not touched by this PR
- Cosine measurement on the 7 problem MoE layers is the next cascade-step (PR-3) and runs on lambda-labs / gx10

References:
- GH-561 (fp32 accumulator order divergence — same root cause class)
- #1583 (M-GPU-MOE-3 umbrella)
- PR-1 of this cascade was #1713 (per-layer CPU vs GPU MoE FFN out cosine falsifier) — this PR ships the kernel fix that falsifier was authored to validate

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…e surfaced (#1583) (#1740)

PR-3c of the M-GPU-MOE-3 cascade. Updates m-gpu-moe-3-scope.md with the actual landed state and the new L47 single-layer sub-cascade.

WHAT CHANGED

Cascade table now reflects shipped PRs:
- PR-1 ✅ shipped (#1713) — per-layer cos falsifier
- PR-2 ✅ shipped (#1737) — fp64 accumulators in Q6KGemvKernel. Note: in-tree at crates/aprender-gpu/src/kernels/quantize/q6k/gemv.rs (the original "../trueno" reference was stale after the monorepo consolidation subsumed trueno-gpu).
- PR-3 ✅ ran (manual hardware verification on lambda-vector RTX 4090, 2026-05-17) — 47/48 layers cos ≥ 0.99, L47 alone at cos=0.961236. Evidence in #1583 comment-4470195446.
- PR-3b ✅ shipped (#1739) — contract v1.7.0 → v1.7.1.
- PR-3c ✅ this update.
- PR-3d ✅ ran — H(i) qtype-mismatch FALSIFIED. apr tensors shows L0, L46, L47 have identical shapes + qtypes. Evidence in #1583 comment-4470216021.

New sub-cascade for L47:
- PR-3e — pending: routing-divergence falsifier for H(ii). Hypothesis: per-layer cosine is ACCUMULATED drift, not per-kernel divergence. By L47 the CPU-vs-GPU hidden state has drifted by ~0.002. If that drift straddles a top-k expert boundary at L47, CPU and GPU pick different expert sets and the FFN output diverges by O(1) — matching the 0.961 cliff. The falsifier extends SaveTensorStage::MoeRouter (or adds a sibling stage) to persist top-k EXPERT INDICES alongside the weights, then asserts CPU index set == GPU index set at L47.
- PR-3f+ — pending: L47 fix based on PR-3e outcome.
  - If H(ii) confirmed: deterministic tie-breaking in expert ordering OR fp64 MoE gate softmax OR f64 expert selection with f32 post-conversion.
  - If H(ii) dead: per-expert weight cancellation pathology investigation (capture FfnGate + FfnUp + FfnSwigl at L47).

Parallel work:
- PR-4 (throughput ≥150 tok/s + VRAM ≤95%) — independent of L47 sub-cascade.
- PR-5 (contract v1.7.1 → v1.8.0 ACTIVE_RUNTIME) — gates on PR-3f+ AND PR-4.

REPRODUCTION

    cargo test --release --features cuda \
      -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
      -- --ignored --nocapture

27.92s on RTX 4090. Test source: crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs

DOC-ONLY PR

No code changes. Production hot paths byte-unchanged.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…router weights diverge (#1583) (#1741)

PR-3e of the M-GPU-MOE-3 cascade. Adds `falsify_qw3_moe_l47_router_probe` to disambiguate H(ii) routing-divergence from post-routing divergence at L47.

## Result (lambda-vector RTX 4090, 2026-05-17)

    L## | MoeRouter | MoeFfnOut
    L00 | 1.000000  | 0.999999
    L01 | 1.000000  | 1.000000
    ...
    L46 | 1.000000  | 0.998498
    L47 | 0.992558  | 0.961236  <-- FfnOut BELOW 0.99

**Dispositive evidence:** L0..L46 router cos = 1.000000 (byte-identical between CPU and GPU). L47 router cos = 0.992558 — the FIRST and ONLY layer where router weights diverge. This is a sharp transition, exactly the signature of an accumulated-drift threshold being crossed at the L47 softmax/top-k boundary.

## Verdict

H(ii) routing-divergence is INCONCLUSIVE from weights alone (router cos 0.992558 is between 0.99 and 0.995). The saved MoeRouter tensor is `[k=8]` post-softmax+renormalize WEIGHTS in descending order — if CPU and GPU pick DIFFERENT 8 experts, both vectors are still sorted descending and cos can be near 1.0 even with disjoint sets. Indices are required to definitively confirm or falsify SET divergence.

PR-3e2 will add `SaveTensorStage::MoeRouterIndices` to persist the top-k INDICES alongside the weights. Once indices are captured, the test can assert CPU set == GPU set at L47 to lock in H(ii).

## What's in this PR

crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs:
+ new helper `make_router_and_ffn_out_plan` capturing both `moe_router` and `moe_ffn_out` stages for all 48 layers
+ new test `falsify_qw3_moe_l47_router_probe` printing the per-layer router+ffn_out cos vector side-by-side and a verdict line classifying H(ii) status (FALSIFIED / ALIVE / INCONCLUSIVE)
+ same `#[ignore]` + `#![cfg(feature = "cuda")]` gates as the existing falsifier — runs only on RTX 4090 with the cached 30B GGUF

This is a PROBE, not a hard-fail falsifier. The test prints the verdict; it does not assert. The verdict drives the next PR's investigation target.

## Cascade context

- PR-1 #1713 — per-layer cos falsifier
- PR-2 #1737 — q6k_gemv fp64 accumulators
- PR-3 hardware-verify (manual) — 47/48 PASS, L47 surfaces
- PR-3b #1739 — contract v1.7.0 → v1.7.1 (in flight)
- PR-3c #1740 — scope-doc update + L47 sub-cascade (in flight)
- PR-3d — H(i) qtype-mismatch FALSIFIED (#1583 comment-4470216021)
- PR-3e — **this PR** — H(ii) router-weight probe
- PR-3e2 — pending: capture top-k INDICES via new `SaveTensorStage::MoeRouterIndices` variant
- PR-3f+ — pending: L47 fix based on PR-3e/PR-3e2 outcome

## Reproduction

    cargo test --release --features cuda \
      -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
      falsify_qw3_moe_l47_router_probe \
      -- --ignored --nocapture

7s on RTX 4090 (after build).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
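The verdict's point about sorted weights can be made concrete with a small cosine helper. This is a sketch under assumptions: `cosine` is a hypothetical function, not the in-tree metric, and it returns NaN for an all-zero input.

```rust
/// Cosine similarity of two equal-length f32 vectors, accumulated in f64.
/// Returns NaN if either vector is all zeros.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}
```

If the CPU routes to experts `{1, 2, 3}` with descending weights `[0.5, 0.3, 0.2]` and the GPU routes to experts `{7, 8, 9}` with the same weights, the saved weight vectors are identical and `cosine` reports 1.0 even though the expert sets are disjoint. That is why the weights alone cannot settle H(ii).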

…-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN (#1747)

* docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.0 → v1.7.1 — M-GPU-MOE-3 PR-2 verified, L47 surfaced

Hardware-verification amendment after M-GPU-MOE-3 PR-2 landed on main (#1737, 88ce47f — q6k_gemv fp64 accumulators). PR-3 ran the per-layer FALSIFY-QW3-MOE-PER-LAYER-001 falsifier on lambda-vector (RTX 4090) against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M on 2026-05-17.

Result: 47/48 decoder layers cos ≥ 0.99 (PASS). One layer (L47, the final decoder layer) sits at cos=0.961236 — 3σ below the L40-L46 cluster (~0.998). Full 48-layer cos vector logged in GitHub comment on #1583 (issuecomment-4470195446).

The 7 originally-cited problem layers (L7/L9/L12/L20/L23/L29/L46, v1.7.0 amendment lines 41-45) ALL lifted above 0.99 — PR-2 was a real win. L47 was previously undetected because no per-layer falsifier existed in-tree; PR-1 of this cascade (#1713) closed that gap and surfaced the L47 anomaly.

WHAT FLIPS:
- metadata.version 1.7.0 → 1.7.1
- bottom-of-file version: "1.7.0" → "1.7.1"
- bottom-of-file status comment refreshed: "1.x cascade DISCHARGED — wgpu (2) + throughput (3) PENDING" → "47/48 layers cos≥0.99 post-PR #1737; L47 single-layer cascade PENDING"
- AC_GPU_MOE_001 stage status text refresh (text-only — not yet refactored into a new amendment_history entry since this PR is scoped to the v1.7.1 amendment block only).

WHAT STAYS PENDING:
- L47 single-layer cascade — root cause unknown. Three candidate hypotheses captured in the v1.7.1 amendment block (qtype mismatch, MoE expert distribution, stride/shape boundary). Forthcoming PR-3c surfaces §85 (or next-available section) covering the L47 cascade. Forthcoming PR-3d+: per-tensor histogram on L47 before authoring fix.
- M-GPU-MOE-2 (wgpu fallback) — unchanged
- M-GPU-MOE-3 PR-4 throughput — unchanged

YAML-ONLY: Production hot paths byte-unchanged. Additive-purity invariant pinned in v1.1.0 still holds.

Contract validates via:

    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml

→ 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2 — M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN

Terminal amendment for the M-GPU-MOE-3 PR-3 sub-cascade. After v1.7.1 surfaced L47 as a single-layer cliff (cos=0.961236 post fp64 q6k_gemv acc, PR-2 #1737), the cascade ran a 5-step falsifier sequence (PRs #1737, #1739-1745 + 4 #1583 comments) to pin the root cause and verify user-visible impact.

OUTCOME
- PR-3 ✅ 47/48 layers cos ≥ 0.99, L47 alone at 0.961236
- PR-3d ❌ H(i) qtype-mismatch FALSIFIED
- PR-3e ✅ #1741 — L47 first divergent router (cos 0.9926)
- PR-3e2 ✅ #1743 — H(ii) CONFIRMED, 2-of-8 expert swap at L47
- PR-3f1 ❌ fp64 gate softmax FALSIFIED — drift upstream
- PR-3f2 ❌ f64 weighted-sum FALSIFIED — drift upstream
- PR-3g ✅ #1745 — multi-prompt argmax: 3/4 agree, 1/4 disagrees → L47 NOT BENIGN (~25% prompt-dependent impact)

ROOT CAUSE (by elimination)

Per-expert SwiGLU f32 intermediates:
1. gate_proj @ hidden ← fp64 acc thanks to PR-2 ✅
2. silu(gate) ← f32 ✗
3. silu(gate) × up_proj ← f32 multiply on 8192-element vector ✗
4. down_proj @ above ← fp64 acc thanks to PR-2 ✅

Fix scope = PR-3h: promote silu × up multiply + intermediate state to f64 in both expert_swiglu_quantized (CPU, simple) and expert_swiglu_cuda (GPU, requires unfusing/refusing the SwiGLU kernel). Multi-week kernel work.

STATUS FLIPS
- metadata.version: 1.7.1 → 1.7.2
- metadata.status: ACTIVE_ALGORITHM_LEVEL (unchanged)
- AC_GPU_MOE_001: 47/48 layers ALGORITHM_LEVEL_DISCHARGED + L47 KNOWN_DIVERGENCE_NOT_BENIGN

WHAT STAYS PENDING
- PR-3h fp64 per-expert SwiGLU (multi-week)
- M-GPU-MOE-2 wgpu fallback (#1582)
- M-GPU-MOE-3 PR-4 throughput (independent of L47 fix; unblocked by this amendment)

WHY NOT KNOWN_BUG

L47 is a numerical-precision artifact, not a correctness bug. CPU and GPU follow the same algorithm against the same weights; only the order of f32 accumulation inside the per-expert SwiGLU differs. Both pick legitimate top-8 sets at L47 — neither is wrong — but the small score-perturbation crosses a top-k boundary. Same class as gemv reduction-order variance, one call-stack level higher.

REGRESSION GATE FOR PR-3h
- falsify_qw3_moe_l47_router_indices (#1743): expect CPU L47 sorted top-8 == GPU L47 sorted top-8
- falsify_qw3_moe_gpu_argmax_agreement (#1745): expect 4/4 prompts argmax agreement

YAML-ONLY

Production hot paths byte-unchanged. Additive-purity invariant pinned in v1.1.0 still holds.

Contract validates via:

    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml

→ 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
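The f32 intermediates named in the root-cause list (steps 2 and 3) can be contrasted with an f64 variant in a few lines. This is only a sketch of the idea behind the pending PR-3h: the function names are hypothetical, it is not the planned implementation, and it ignores quantization and the gate/up/down projections entirely.

```rust
/// SiLU at f32: x * sigmoid(x).
fn silu_f32(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SiLU at f64.
fn silu_f64(x: f64) -> f64 {
    x / (1.0 + (-x).exp())
}

/// Current shape: silu(gate) and the element-wise product are rounded to f32.
fn swiglu_f32(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu_f32(g) * u).collect()
}

/// Candidate shape: widen the f32 inputs and keep the intermediate at f64,
/// so the down projection would consume unrounded products.
fn swiglu_f64(gate: &[f32], up: &[f32]) -> Vec<f64> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| silu_f64(f64::from(g)) * f64::from(u))
        .collect()
}
```

Per element the two variants differ only by f32 rounding error, but that per-element rounding feeds an 8192-wide down projection, which is where the elimination argument locates the remaining drift.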

…GN (#1583) (#1745)

PR-3g of the M-GPU-MOE-3 cascade. Adds the canonical "is L47 actually user-visible" falsifier, runs 4 canonical prompts through both CPU and GPU full forwards, and reports argmax agreement.

## Result (lambda-vector RTX 4090, 2026-05-17)

    PROMPT          | CPU argmax (val) | GPU argmax (val)
    canonical_3tok  | 944 ( 13.7270)   | 944 ( 14.4133) ✓
    single_tok_785  | 220 ( 15.5523)   |  25 ( 18.5098) ✗ MISMATCH
    multi_tok_short | 315 ( 26.2279)   | 315 ( 25.5230) ✓
    multi_tok_code  | 198 ( 17.7453)   | 198 ( 17.8433) ✓

**3/4 prompts agree, 1 disagrees.** L47 cliff is NOT benign — the expert-set divergence DOES flip the top-1 predicted token for some prompts (~25% in this small sample). Option E (Accept) is off the table; must pursue Option C (fp64 in per-expert SwiGLU).

## What this PR adds

crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs:
+ new test `falsify_qw3_moe_gpu_argmax_agreement` — multi-prompt probe that builds CPU + GPU models once, runs 4 canonical prompts through both full forwards, and prints argmax agreement table + verdict. PROBE not hard-assert; prints "BENIGN" if all agree or "NOT BENIGN" + disagreeing prompts otherwise.

## Cascade context

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3 ✅ hardware verify — 47/48 PASS, L47 surfaces
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 #1743 ✅ H(ii) CONFIRMED (2-of-8 expert swap)
- PR-3f1 ❌ falsified (fp64 softmax) — dropped
- PR-3f2 ❌ falsified (f64 weighted-sum) — dropped
- PR-3g ✅ **THIS PR** — L47 NOT BENIGN, must pursue fix
- PR-3h pending — Option C fp64 in per-expert SwiGLU intermediates

## Why the cascade kept eliminating candidates

The 3-falsifier sequence ruled out the "easy" fix locations:

1. PR-3f1 (gate softmax precision) — drift upstream of softmax
2. PR-3f2 (weighted-sum precision) — drift upstream of weighted-sum
3. **Remaining**: drift inside each per-expert SwiGLU's intermediate chain (silu × up at f32, down-proj at f32 except its q6k_gemv acc which PR-2 already promoted to fp64)

PR-3h must promote the silu(gate) × up element-wise multiply and the hidden-dim×4 intermediate state to f64. ~30-50 LOC across both CPU and CUDA expert_swiglu helpers.

## Reproduction

    cargo test --release --features cuda \
      -p aprender-serve --test qwen3_moe_gpu_parity \
      falsify_qw3_moe_gpu_argmax_agreement \
      -- --ignored --nocapture

~25s on RTX 4090.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
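The agreement probe described above reduces to comparing the argmax of two logit vectors per prompt. A minimal sketch, assuming logits are available as plain `f32` slices with no NaN values; `argmax` and `disagreeing_prompts` are hypothetical helpers and not the functions in the in-tree test.

```rust
/// Index of the largest logit; the first index wins on ties.
/// Returns None for an empty slice. Assumes no NaN values.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        match best {
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}

/// Names of prompts whose CPU and GPU top-1 tokens differ.
/// Each entry is (prompt name, CPU logits, GPU logits).
fn disagreeing_prompts<'a>(runs: &[(&'a str, Vec<f32>, Vec<f32>)]) -> Vec<&'a str> {
    runs.iter()
        .filter(|(_, cpu, gpu)| argmax(cpu) != argmax(gpu))
        .map(|(name, _, _)| *name)
        .collect()
}
```

A probe in this style prints "BENIGN" when the returned list is empty and "NOT BENIGN" plus the listed prompt names otherwise.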

…rward_qwen3_moe (#1749) (#1751)

Closes #1749.

Pre-fix, `apr bench` against any MoE GGUF (Qwen3-Coder-30B-A3B-Instruct etc.) routed through the dense `forward_single_with_cache` path which calls `matmul_fused.rs:211` on tensor names that don't exist on MoE models (the 3D `*_exps` tensors are stored at different names than the 2D dense `ffn_{gate,up,down}.weight` the dense path looks up). Result: hundreds of parallel thread panics — `index out of bounds: len=0 but index ≈ 91M`.

This PR adds MoE detection via `gguf.expert_count().is_some()` and routes to the MoE-aware forward path:

- CPU: realizar::gguf::OwnedQuantizedModel::forward_qwen3_moe
- CUDA: realizar::gguf::OwnedQuantizedModelCuda::forward_qwen3_moe_cuda

Both helpers do not currently expose a KV cache, so the bench runs them **autoregressively with re-prefill** — each iteration runs full forward over `prompt + previously-generated tokens` and appends the argmax to the prompt for the next iter. O(N²) in N tokens but bounded by `--max-tokens` (default 32).

This is intentionally a stop-gap to unblock M-GPU-MOE-3 PR-4 throughput measurement. True KV-cache MoE decoding is the actual PR-4 work; this PR makes `apr bench` produce a real (if pessimistic) tok/s number for MoE GGUFs instead of panicking.

## Empirical (lambda-vector RTX 4090, 2026-05-17)

    apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
      --max-tokens 8 --warmup 1 --iterations 4 --json

→ total_time_ms: 87085 ; total_tokens: 4
→ 0.046 tok/s effective (auto-regressive re-prefill cost dominates)
→ 22.8s ttft, 22.2s p50, 32.5s p99

The 0.046 tok/s is the upper bound on what `apr bench` can currently measure for MoE without KV cache. PR-4's job is to add the cache and push this to ≥ 150 tok/s.

## What's in this PR

crates/apr-cli/src/commands/bench_moe.rs (new):
- `is_moe_gguf(&GGUFModel)` predicate
- `run_gguf_moe_benchmark` — loads MappedGGUFModel + N Qwen3MoeQuantizedLayer descriptors + (optionally) wraps in OwnedQuantizedModelCuda, then dispatches to the CUDA or CPU bench helper.
- `run_cuda_moe_benchmark` — autoregressive forward_qwen3_moe_cuda + greedy argmax decode.
- `run_cpu_moe_benchmark` — autoregressive forward_qwen3_moe.

crates/apr-cli/src/commands/bench.rs:
+ `include!("bench_moe.rs")` after the existing `include!("bench_safetensors.rs")` (same pattern as the other bench sub-files).

crates/apr-cli/src/commands/benchmark.rs:
+ In `run_gguf_benchmark`, after parsing the GGUF and tokenising the prompt, check `is_moe_gguf(&gguf)`. If true, log the detection (`expert_count` + top-k) and tail-call `run_gguf_moe_benchmark`. Otherwise fall through to the existing dense path.

## What's NOT in this PR

- True KV-cache MoE decoding (= M-GPU-MOE-3 PR-4 throughput target)
- Streaming/per-token JSON output for MoE (existing JSON output works; just reflects the autoregressive re-prefill cost)
- MoE bench for SafeTensors / APR formats (only GGUF MoE supported today; the other formats don't have MoE production paths in the realizar inference engine)

## Cross-refs

- #1583 M-GPU-MOE-3 — PR-4 throughput unblocks on this
- #1747 contract qwen3-moe-forward-gpu-v1 v1.7.2 (just merged) — L47 known-divergence + cascade pause point
- The MoE bench helpers reuse `forward_qwen3_moe[_cuda]` directly, which means PR-2 #1737's fp64 q6k_gemv acc is in effect; this bench measures *post-fp64-acc* throughput, not the pre-fix path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
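The re-prefill loop described above can be sketched with the model abstracted behind a closure. This is an illustration only: `generate_with_reprefill` and its `forward` parameter are hypothetical and do not reflect the realizar API. The sketch also counts processed positions to make the O(N²) growth visible.

```rust
/// Greedy decode without a KV cache: every step re-runs the full forward
/// pass over the prompt plus all previously generated tokens.
/// Returns the final token sequence and the total positions processed.
fn generate_with_reprefill<F>(
    prompt: &[u32],
    max_tokens: usize,
    mut forward: F,
) -> (Vec<u32>, usize)
where
    F: FnMut(&[u32]) -> Vec<f32>, // hypothetical: tokens in, last-position logits out
{
    let mut tokens = prompt.to_vec();
    let mut positions_processed = 0usize;
    for _ in 0..max_tokens {
        let logits = forward(&tokens);
        positions_processed += tokens.len(); // full re-prefill each step
        // Greedy argmax; the first maximum wins on ties.
        let next = logits
            .iter()
            .enumerate()
            .fold((0usize, f32::NEG_INFINITY), |best, (i, &v)| {
                if v > best.1 { (i, v) } else { best }
            })
            .0;
        tokens.push(next as u32);
    }
    (tokens, positions_processed)
}
```

For a 3-token prompt and 3 generated tokens the loop processes 3 + 4 + 5 = 12 positions, whereas a KV-cache decoder would process each position once. The reported throughput follows directly from the bench numbers: 4 tokens over 87.085 s is about 0.046 tok/s.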

…sifier — H(ii) CONFIRMED (#1583) (#1743)

* feat(m-gpu-moe-3): PR-3e2 MoeRouterIndices stage + L47 expert-set falsifier — H(ii) CONFIRMED (#1583)

PR-3e2 of the M-GPU-MOE-3 cascade. Adds `SaveTensorStage::MoeRouterIndices` to definitively confirm or falsify H(ii) expert-set divergence at L47.

L47 sorted top-8:

    cpu = [ 2, 20, 36, 57, 60,  73, 111, 120 ]
    gpu = [ 2, 12, 36, 57, 60, 103, 111, 120 ]
               ^^              ^^^
    cpu-only={20, 73}; gpu-only={12, 103}

CPU and GPU agree on 6 of 8 experts at L47 but disagree on 2 (mild H(ii) confirmation). All other 47 layers produce IDENTICAL expert SETS between CPU and GPU.

Root cause: by L47 the accumulated post-routing drift from per-expert q6k_gemv fp64 accumulation through 47 layers of MoeFfnOut has perturbed the gate input enough that two boundary expert scores swap. The resulting FFN output diverges by O(1) because the disjoint experts produce unrelated outputs.

- **Deterministic tie-breaking**: sort top-k by (-prob, +index)
- **fp64 gate softmax**: W_gate @ x → softmax → renormalize at fp64
- **Reorder-stable top-k**: stable partial sort + ε-tolerance on the (k+1)-th vs k-th score boundary

inference_trace/save_tensor_stage.rs:
+ `MoeRouterIndices` enum variant + "moe_router_indices" name
+ `is_index_payload(&self)` helper
+ `ALL` array 22 → 23; per_layer count 20 → 21; tests renamed

gguf/qwen3_moe_load.rs + gguf/cuda/moe_ffn_forward_layer_cuda.rs:
+ traced `_with_router` helpers now return `(output, weights, indices)` instead of `(output, weights)`

gguf/inference/forward/forward_qwen3_moe_traced.rs (CPU)
gguf/cuda/forward_qwen3_moe_cuda_traced.rs (CUDA):
+ capture `last_router_top_k_indices` from helper
+ emit `MoeRouterIndices` stage (indices cast to f32, lossless for num_experts ≤ 2^24)

tests/qwen3_moe_per_layer_gpu_parity.rs:
+ helpers `make_router_indices_plan` + `read_indices_stage_file`
+ new test `falsify_qw3_moe_l47_router_indices` — definitive H(ii) falsifier; captures top-k INDICES at every layer for both CPU and GPU, sorts each, asserts set equality, prints L47-specific verdict

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3 ✅ hardware verify (47/48 PASS, L47 surfaces)
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 ✅ **THIS PR** — H(ii) CONFIRMED
- PR-3f+ pending — apply one of the 3 candidate fixes

    cargo test --release --features cuda \
      -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
      falsify_qw3_moe_l47_router_indices \
      -- --ignored --nocapture

29.5s on RTX 4090.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(m-gpu-moe-3): update save_tensor_plan tests for 23 stages (PR-3e2 #1583)

PR-3e2 added `SaveTensorStage::MoeRouterIndices` (22 → 23 stages) but missed updating the parallel tests in `save_tensor_plan.rs` that asserted on the constant `22`. Workspace-test CI surfaced this:

    test inference_trace::save_tensor_plan::tests::all_keyword_expands_to_twenty_two_stages ... FAILED
    test inference_trace::save_tensor_plan::tests::all_keyword_case_insensitive ... FAILED

Two fixes:

1. Rename `all_keyword_expands_to_twenty_two_stages` → `all_keyword_expands_to_all_stages` and assert against `SaveTensorStage::ALL.len()` (currently 23) instead of the hardcoded `22`. Future stage additions won't require touching this test.
2. Same change in `all_keyword_case_insensitive` — assert against `SaveTensorStage::ALL.len()` instead of `22`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
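The expert-set comparison and the first candidate fix (sorting by `(-prob, +index)`) can be sketched as follows. Both helpers are hypothetical illustrations rather than the in-tree code, and `top_k_deterministic` uses a full sort for clarity where a real router would more likely use a partial selection.

```rust
use std::collections::BTreeSet;

/// Top-k expert indices ordered by descending score, with ties broken by
/// ascending index so that equal scores always select the same experts.
fn top_k_deterministic(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]).then(a.cmp(&b)));
    idx.truncate(k);
    idx
}

/// Elements of `a` that do not appear in `b`, in `a`'s order.
fn only_in_first(a: &[usize], b: &[usize]) -> Vec<usize> {
    let b: BTreeSet<usize> = b.iter().copied().collect();
    a.iter().copied().filter(|i| !b.contains(i)).collect()
}
```

Applied to the L47 index lists above, `only_in_first(cpu, gpu)` yields `[20, 73]` and `only_in_first(gpu, cpu)` yields `[12, 103]`. Note that deterministic tie-breaking only helps for exactly equal scores; it does not stabilise a boundary that two slightly different score vectors cross.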

noahgift enabled auto-merge (squash) May 17, 2026 07:30

noahgift added 2 commits May 17, 2026 09:31

Merge branch 'main' into feat/m-gpu-moe-3-pr2-q6k-fp64-acc

a8d9899

Merge branch 'main' into feat/m-gpu-moe-3-pr2-q6k-fp64-acc

4dbcacd

noahgift merged commit 88ce47f into main May 17, 2026
10 checks passed

noahgift deleted the feat/m-gpu-moe-3-pr2-q6k-fp64-acc branch May 17, 2026 08:18

noahgift mentioned this pull request May 17, 2026

test(m-gpu-moe-3): PR-3e router-weight probe — L47 first layer where router weights diverge (#1583) #1741

Merged

4 tasks

noahgift mentioned this pull request May 17, 2026

feat(m-gpu-moe-3): PR-3e2 MoeRouterIndices stage + L47 expert-set falsifier — H(ii) CONFIRMED (#1583) #1743

Merged

4 tasks

noahgift mentioned this pull request May 17, 2026

test(m-gpu-moe-3): PR-3g multi-prompt argmax agreement — L47 NOT BENIGN (#1583) #1745

Merged

5 tasks

noahgift mentioned this pull request May 17, 2026

docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2 — M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN #1747

Merged

2 tasks

noahgift mentioned this pull request May 17, 2026

defect: apr bench panics on MoE GGUF models (matmul_fused.rs:211 index OOB) #1749

Closed

noahgift merged 3 commits into main from feat/m-gpu-moe-3-pr2-q6k-fp64-acc