feat(aprender-serve): M-GPU-MOE-1.3 — preload_weights_gpu MoE-aware (partial discharge) by noahgift · Pull Request #1491 · paiml/aprender

noahgift · 2026-05-04T22:08:51Z

Summary

Implements M-GPU-MOE-1.3 per qwen3-moe-forward-gpu-v1 v1.3.0 (PR #1490). Two-commit fix.

What this PR fixes

| File | Change |
| --- | --- |
| `gguf/config.rs` | Add `is_moe: bool` field to `ArchConstraints` |
| `gguf/arch_constraints_fallback.rs` | Set `is_moe: true` for the qwen3_moe arm; add raw `qwen3moe`/`qwen3_5moe` aliases |
| `cuda/executor/weights.rs` | `build_indexed_weights` skips ffn_gate/up/down lookups when `arch.is_moe` |
| `cuda/types.rs` | `ValidatedLayerWeights::validate` skips FfnGate/FfnUp/FfnDown role checks when `arch.is_moe` |
| `gguf/cuda/mod.rs` | Skip `parity_gate` (Jidoka load-time gate) for MoE, because the gate runs the dense forward against zero-byte placeholders |
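The gating in `build_indexed_weights` can be pictured roughly as follows. This is a minimal sketch, not the code in `cuda/executor/weights.rs`: the trimmed-down `ArchConstraints`, the `WeightRef` alias, the `HashMap` cache, and `lookup_ffn_weight` are all hypothetical stand-ins. Only the `is_moe` flag, the `(0u64, 0usize)` sentinel, and the `blk.N.ffn_*.weight` naming come from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down stand-in for the real `ArchConstraints`.
#[derive(Debug, Clone, Copy)]
pub struct ArchConstraints {
    pub is_moe: bool,
}

/// Assumed shape of a cached quantized weight: (device pointer, byte size).
pub type WeightRef = (u64, usize);

/// Sentinel stored for dense-FFN slots that do not exist in an MoE model.
pub const MOE_FFN_SENTINEL: WeightRef = (0u64, 0usize);

/// Resolve one dense-FFN tensor for `layer`.
/// `role` is expected to be "ffn_gate", "ffn_up", or "ffn_down".
pub fn lookup_ffn_weight(
    arch: &ArchConstraints,
    cache: &HashMap<String, WeightRef>,
    layer: usize,
    role: &str,
) -> Result<WeightRef, String> {
    // MoE layers carry no dense FFN tensors, so skip the lookup entirely.
    if arch.is_moe {
        return Ok(MOE_FFN_SENTINEL);
    }
    let name = format!("blk.{layer}.{role}.weight");
    cache
        .get(&name)
        .copied()
        .ok_or_else(|| format!("Quantized weight '{name}' not cached"))
}
```

With an empty cache, a dense architecture still reports the missing tensor, while an MoE architecture returns the sentinel without touching the cache.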

Test progression on lambda-vector RTX 4090

BEFORE this PR:
  panic at OwnedQuantizedModelCuda::new
  → build_indexed_weights demands `blk.0.ffn_gate.weight` (doesn't exist for MoE)

AFTER commit 1 (build_indexed_weights + ValidatedLayerWeights gates):
  panic at parity_gate (matmul_fused.rs:211)
  → dense forward indexes `layer.ffn_up_weight.data` (placeholder, byte_size=0)

AFTER commit 2 (parity_gate MoE guard):
  ✅ CPU forward succeeds (forward_qwen3_moe LAZY-FUSED-MATVEC)
  ✅ OwnedQuantizedModelCuda construction succeeds  
  ✅ GPU forward executes (forward_qwen3_moe_cuda)
  ❌ Asserts at gpu_logits.iter().all(|v| v.is_finite())
     → GPU produces NaN/Inf (separate downstream numerical bug)
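The final assertion only says that some logit is non-finite. A small helper like the one below would make the failure message more informative. It is a sketch: `FinitenessReport` and `finiteness_report` are hypothetical names, not part of the test file.

```rust
/// Summary of the non-finite values found in a logits vector.
#[derive(Debug, PartialEq, Eq)]
pub struct FinitenessReport {
    pub nan_count: usize,
    pub inf_count: usize,
    pub first_bad_index: Option<usize>,
}

/// Count NaN and +/-Inf entries and remember where the first one occurs.
pub fn finiteness_report(logits: &[f32]) -> FinitenessReport {
    let mut report = FinitenessReport {
        nan_count: 0,
        inf_count: 0,
        first_bad_index: None,
    };
    for (i, v) in logits.iter().enumerate() {
        if v.is_finite() {
            continue;
        }
        if v.is_nan() {
            report.nan_count += 1;
        } else {
            report.inf_count += 1;
        }
        if report.first_bad_index.is_none() {
            report.first_bad_index = Some(i);
        }
    }
    report
}
```

A report with `first_bad_index == None` corresponds to the existing `gpu_logits.iter().all(|v| v.is_finite())` check passing.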

What this PR partially discharges

FALSIFY-QW3-MOE-GPU-PRELOAD-001 ✅ — wrapper construction succeeds (was the original bug)
FALSIFY-QW3-MOE-GPU-INVARIANTS-001 ⚠️ — partial (output length OK; finiteness FAILS due to NaN/Inf)
FALSIFY-QW3-MOE-GPU-PARITY-001 ❌ — blocked by downstream NaN/Inf bug

New downstream bug (next iteration)

GPU forward produces NaN/Inf logits. Likely M-GPU-MOE-1.5 candidates:

Q4K matmul accumulator overflow in expert_swiglu_cuda
SwiGLU silu producing Inf for large inputs
Top-k router weight renormalization div-by-zero
Missing per-head Q/K RMSNorm in MoE GPU path
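To make one of these candidates concrete, here is how a top-k renormalization can produce NaN when the selected weights sum to zero. This is an illustration only, not the router code in this repository. Both function names and the uniform fallback in the guarded variant are assumptions.

```rust
/// Naive renormalization: divide each selected weight by the sum.
/// If the sum is 0.0, every output is 0.0 / 0.0 = NaN.
pub fn renormalize_naive(topk: &[f32]) -> Vec<f32> {
    let sum: f32 = topk.iter().sum();
    topk.iter().map(|w| w / sum).collect()
}

/// Guarded variant: fall back to uniform weights when the sum is not a
/// positive finite number. The uniform fallback is an assumed policy.
pub fn renormalize_guarded(topk: &[f32]) -> Vec<f32> {
    let sum: f32 = topk.iter().sum();
    if sum.is_finite() && sum > 0.0 {
        topk.iter().map(|w| w / sum).collect()
    } else {
        let uniform = 1.0 / topk.len().max(1) as f32;
        vec![uniform; topk.len()]
    }
}
```

Whether the GPU path actually hits this case is exactly what the bisection is meant to establish.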

Bisection via `apr trace --json --payload`, per the M32d Step 2 methodology.

Stacking

This PR depends on PR #1490 (contract v1.2.0 → v1.3.0 amendment + evidence).

Test plan

All compile combos clean
3 cosine-helper unit tests pass
Heavy test progresses to GPU forward execution (was: failed at construction)
CI ci/gate green
PR contract+evidence(qwen3-moe-forward-gpu-v1): v1.3.0 — preload-bug fix plan #1490 lands first

🤖 Generated with Claude Code

…partial discharge)

Per qwen3-moe-forward-gpu-v1 v1.3.0 amendment (PR #1490).

WHAT THIS PR FIXES:

ArchConstraints + build_indexed_weights + ValidatedLayerWeights all made MoE-aware via new `is_moe: bool` field on ArchConstraints.

(1) `crates/aprender-serve/src/gguf/config.rs`: adds `is_moe: bool` field to `ArchConstraints` struct.

(2) `crates/aprender-serve/src/gguf/arch_constraints_fallback.rs`: sets `is_moe: false` on all 19 dense arch entries; sets `is_moe: true` on the qwen3_moe arm. Also adds the raw GGUF arch string `qwen3moe` (no underscore) and `qwen3_5moe` to the same arm. These reach `from_architecture` from `ValidatedModelConfig::from_apr` without going through `normalize_architecture`.

(3) `crates/aprender-serve/src/cuda/executor/weights.rs`: `build_indexed_weights` gates the 3 FFN-related quant lookups (ffn_gate.weight, ffn_up.weight, ffn_down.weight) on `arch.is_moe`; uses (0u64, 0usize) sentinels for MoE. Same gating for the 3 qtype resolutions.

(4) `crates/aprender-serve/src/cuda/types.rs`: `ValidatedLayerWeights::validate` skips the FfnGate/FfnUp/FfnDown role checks when `arch.is_moe`. The MoE forward path (`forward_qwen3_moe_cuda`) routes FFN through the `moe_layers` parameter, never reading these from the indexed weights.

WHAT THIS PR PARTIALLY DISCHARGES:

FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new in v1.3.0): wrapper construction now succeeds for qwen3_moe GGUFs. Before this PR, `OwnedQuantizedModelCuda::new(model, 0)` panicked at:

  UnsupportedOperation { operation: "preload_weights_gpu", reason: "PAR-043: Failed to build indexed weights: Invalid launch config: Quantized weight 'blk.0.ffn_gate.weight' not cached" }

After this PR, that specific path no longer fails. Verified by re-running the M-GPU-MOE-1.2 heavy test, which now progresses past `OwnedQuantizedModelCuda::new`.

NEW DOWNSTREAM BUG (not blocking this PR):

After the wrapper construction fix, the heavy test now panics in CPU forward `matmul_fused.rs:211` with `index out of bounds: the len is 0 but the index is N`. This is a separate bug class: someone in the CPU forward path is dereferencing `layer.ffn_up_weight.data` (or similar), which is the `dense_ffn_placeholder` (byte_size=0) for MoE layers per `transformer.rs:348-353`.

Root cause likely: the CPU `forward_qwen3_moe` does NOT touch the dense placeholders directly, but some preload/validation/init step does. Needs a follow-up PR (M-GPU-MOE-1.4) to either (a) skip dense-FFN-data access for MoE layers, or (b) replace the placeholder with a proper sentinel.

This PR DOES NOT regress the previous behaviour: the previous state was "wrapper construction fails", which masked the downstream bug. M-GPU-MOE-1.4 will surface and fix it.

VERIFICATION:

  cargo check -p aprender-serve → 0 errors
  cargo check -p aprender-serve --features cuda → 0 errors
  cargo test -p aprender-serve --test qwen3_moe_gpu_parity \
    --features cuda → 3 helpers pass

Heavy test on lambda-vector RTX 4090:

  BEFORE this PR: panic at OwnedQuantizedModelCuda::new (preload_weights_gpu / build_indexed_weights)
  AFTER this PR: panic moved to CPU forward matmul_fused.rs:211 (downstream bug, separate PR scope)

Net: progress one bug class. M-GPU-MOE-1.3 stage is FUNCTIONALLY DISCHARGED as defined; M-GPU-MOE-1.4 follow-up needed for full PARITY-001 discharge.

NOTE ON PR STACKING:

This PR depends on PR #1490 (contract v1.2.0 → v1.3.0 amendment + evidence file) being on aprender main first. The contract pinned the architectural decision; this PR implements it.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0, FALSIFY-QW3-MOE-GPU-PRELOAD-001 (partial discharge)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
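The alias handling described in item (2) of this commit message amounts to matching several architecture strings onto one MoE arm. The sketch below is a simplification under stated assumptions: `from_architecture_sketch` is a hypothetical name, the struct is trimmed to the one field that matters here, and the wildcard arm stands in for the 19 explicit dense entries of the real fallback table.

```rust
/// Hypothetical, trimmed-down stand-in for the real `ArchConstraints`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ArchConstraints {
    pub is_moe: bool,
}

/// Map an architecture string to constraints.
/// The normalized name and the raw GGUF spellings share one arm, so a
/// string that bypasses normalization still resolves to the MoE entry.
pub fn from_architecture_sketch(arch: &str) -> ArchConstraints {
    match arch {
        "qwen3_moe" | "qwen3moe" | "qwen3_5moe" => ArchConstraints { is_moe: true },
        // Simplification: the real table lists each dense arch explicitly.
        _ => ArchConstraints { is_moe: false },
    }
}
```

For example, `from_architecture_sketch("qwen3moe").is_moe` is `true`, while an unlisted name such as `"llama"` yields `false` in this sketch.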

Followup to the previous M-GPU-MOE-1.3 commit.

The parity_gate (Jidoka stop-the-line in `OwnedQuantizedModelCuda::with_max_seq_len`) also runs the dense forward paths (`forward_single_with_cache` CPU + `forward_gpu_resident` GPU) on construction. For MoE these dispatch to `fused_matmul_f32` against the `dense_ffn_placeholder` (byte_size=0), causing rayon-parallel panics in `matmul_fused.rs:211`.

Fix: skip parity_gate when `arch.is_moe`, mirroring the rationale already in v1.3.0's amendment_history block.

- The parity gate's purpose is "stop the line if GPU diverges from CPU". For dense models, it's load-time safety.
- For MoE, the equivalent gate is FALSIFY-QW3-MOE-GPU-PARITY-001 (qwen3_moe_gpu_parity.rs), which exercises the MoE-specific forward paths and bypasses the dense path the gate runs.
- Net: MoE models lose load-time parity but gain test-time parity via the qwen3_moe_gpu_parity test.

VERIFICATION ON LAMBDA-VECTOR RTX 4090:

Test progresses much further now:

  BEFORE: panic at OwnedQuantizedModelCuda::new build_indexed_weights (FALSIFY-QW3-MOE-GPU-PRELOAD-001 falsifier)
  AFTER previous commit: panic at parity_gate matmul_fused.rs:211 (downstream bug, exposed but not yet fixed)
  AFTER this commit: CPU forward succeeds, GPU forward executes, then asserts at gpu_logits.iter().all(|v| v.is_finite()) because the GPU produces NaN/Inf logits.

Test output:

  [GH-129] Early kernel preload: 49 modules compiled
  [PMAT-082] cuBLASLt FP8 JIT warmed (2048x16x2048)
  [PMAT-053] FP8 weight cache: 193 matrices cached (728.8 MB)
  FALSIFY-QW3-MOE-GPU-PARITY-001: running GPU forward...
  panicked at qwen3_moe_gpu_parity.rs:168: all GPU logits must be finite (no NaN/Inf)

PARTIAL DISCHARGE:

  FALSIFY-QW3-MOE-GPU-PRELOAD-001: wrapper construction succeeds.
  FALSIFY-QW3-MOE-GPU-INVARIANTS-001: partial (output length OK implicitly; finiteness FAILS).
  FALSIFY-QW3-MOE-GPU-PARITY-001: blocked by NaN/Inf bug.

NEW DOWNSTREAM BUG:

GPU forward (forward_qwen3_moe_cuda body, M-GPU-MOE-1.1.2 PR #1477) produces NaN/Inf for at least the canonical 3-token Qwen3-Coder prompt. This is the NEXT bug to investigate (M-GPU-MOE-1.5 follow-up). Likely candidates:

- Q4K matmul accumulator overflow in expert_swiglu_cuda
- Per-expert SwiGLU silu activation produces Inf for large inputs
- Top-k router weight renormalization division by zero
- Missing per-head Q/K RMSNorm path for MoE (qk_norm tensors loaded but not applied)

Bisection via `apr trace --json --payload` per the M32d Step 2 surface methodology (per qwen3-moe-forward-gpu-v1 v1.1.0 PARITY-001 if_fails).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
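The shape of the guard described in this commit can be sketched as below. It is an assumption-laden illustration, not the code in `gguf/cuda/mod.rs`: `parity_gate_sketch`, `GateOutcome`, the closure-based forward passes, and the max-abs-diff tolerance are all hypothetical. The only point taken from the commit is that the dense forward passes must never run when `arch.is_moe` is set.

```rust
/// Hypothetical, trimmed-down stand-in for the real `ArchConstraints`.
pub struct ArchConstraints {
    pub is_moe: bool,
}

#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    Skipped,
    Passed,
    Diverged { max_abs_diff: f32 },
}

/// Load-time gate: compare dense CPU and GPU outputs unless the model is MoE.
/// The forward passes are closures so that skipping the gate provably
/// avoids touching the zero-byte dense placeholders.
pub fn parity_gate_sketch(
    arch: &ArchConstraints,
    cpu_forward: impl FnOnce() -> Vec<f32>,
    gpu_forward: impl FnOnce() -> Vec<f32>,
    tol: f32,
) -> GateOutcome {
    if arch.is_moe {
        return GateOutcome::Skipped;
    }
    let (cpu, gpu) = (cpu_forward(), gpu_forward());
    if cpu.len() != gpu.len() {
        return GateOutcome::Diverged { max_abs_diff: f32::INFINITY };
    }
    // Keep NaN sticky so a NaN difference can never be reported as a pass.
    let max_abs_diff = cpu
        .iter()
        .zip(&gpu)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, |acc, d| if d.is_nan() || d > acc { d } else { acc });
    if max_abs_diff <= tol {
        GateOutcome::Passed
    } else {
        GateOutcome::Diverged { max_abs_diff }
    }
}
```

Passing closures that panic for the MoE case is a cheap way to assert in a unit test that the dense path is truly bypassed.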

…on plan (#1492)

Records the next-bug-class finding from the M-GPU-MOE-1.3 partial discharge (PR #1491 squash f0cbe37, MERGED 2026-05-04 on aprender main).

WHAT 1.3 DISCHARGED:

FALSIFY-QW3-MOE-GPU-PRELOAD-001: wrapper construction now succeeds for qwen3_moe GGUFs.

WHAT 1.3 EXPOSED (NOT YET FIXED):

Heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF runs end-to-end but fails:

  assert!(gpu_logits.iter().all(|v| v.is_finite()), "all GPU logits must be finite (no NaN/Inf)");

Blocks FALSIFY-QW3-MOE-GPU-PARITY-001 + PARTIAL-discharges FALSIFY-QW3-MOE-GPU-INVARIANTS-001.

BISECTION PLAN (M-GPU-MOE-1.4):

(a) Instrumentation: extend apr trace --json --payload to capture per-stage tensors on the MoE GPU path
(b) Bisection: diff CPU vs GPU per-stage to find the first NaN/Inf-producing stage. Candidates:
    * Q4K matvec accumulator overflow
    * SwiGLU silu Inf
    * top-k router renorm div-by-zero
    * missing per-head Q/K RMSNorm in MoE GPU path
(c) Fix: apply at bisected stage; class TBD by result

THIS PR ADDS:

* v1.3.0 → v1.4.0 amendment_history block (~110 lines)
* NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
* M-GPU-MOE-1.3 status updated PENDING → PARTIALLY_DISCHARGED
* Top-level version + status block updated

VALIDATION: pv validate → 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract": this PR pins the bisection-and-fix plan BEFORE code. Code follows in M-GPU-MOE-1.4 fix PR (separate scope).

Refs: M52, M53, M54, M55, M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0, FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
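Step (b) of the plan reduces to scanning per-stage tensors in forward order and reporting the first stage that contains a non-finite value. A minimal sketch, assuming the trace has already been decoded into `(stage_name, values)` pairs; the function name and that in-memory shape are hypothetical and do not reflect the actual `apr trace` payload format:

```rust
/// Return the name of the first stage (in forward order) whose tensor
/// contains a NaN or +/-Inf, or `None` if every stage is finite.
pub fn first_non_finite_stage<'a>(stages: &'a [(&'a str, Vec<f32>)]) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, tensor)| tensor.iter().any(|v| !v.is_finite()))
        .map(|(name, _)| *name)
}
```

Because the scan stops at the first offending stage, later stages that merely propagate the NaN are not blamed.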

Sibling contract to trace-attn-sub-stages-v1; pins the SaveTensorStage extensions needed for M-GPU-MOE-1.4 NaN/Inf bisection per qwen3-moe-forward-gpu-v1 v1.4.0 amendment.

WHY THIS CONTRACT:

After M-GPU-MOE-1.3 partial fix (PR #1491 squash f0cbe37 MERGED), heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 produces 100% NaN logits. Steps 1-9 are CPU-shared with finite output; step 10 (GPU MoE FFN: moe_ffn_forward_layer_cuda → expert_swiglu_cuda → q4k_matvec/q6k_gemv) is the only candidate. Per qwen3-moe-forward-gpu-v1 v1.4.0 bisection plan step (a), apr trace needs new SaveTensorStage variants for MoE GPU stages.

WHAT IT DEFINES:

* 2 mandatory new SaveTensorStage variants:
  - MoeRouter: top-k weights post-softmax/renormalize
  - MoeFfnOut: aggregated MoE FFN output (Σ w_e * expert_out_e)
* 4 optional per-expert variants (with expert_id qualifier):
  - MoeExpertGate, MoeExpertUp, MoeExpertSwigl, MoeExpertOut
  - Promoted to mandatory only if 2-stage bisection isn't precise enough.
* Bisection chain CPU-vs-GPU:

    cos_sequence = [
      cos(CPU.ffn_norm, GPU.ffn_norm),        # parent enum
      cos(CPU.moe_router, GPU.moe_router),    # NEW
      cos(CPU.moe_ffn_out, GPU.moe_ffn_out),  # NEW
    ]

FALSIFICATION TESTS (4):

  FALSIFY-MOE-SUB-001: New variants exist + parse correctly
  FALSIFY-MOE-SUB-002: Existing 20-stage byte-identity preserved
  FALSIFY-MOE-SUB-003: Bisection identifies first NaN-producing stage
  FALSIFY-MOE-SUB-004: Fix PR cites bisected stage by name

IMPLEMENTATION_STAGES (4):

  M-MOE-SUB-0: This contract scaffold (SHIPPED)
  M-MOE-SUB-1: Add MoeRouter + MoeFfnOut variants (PENDING)
  M-MOE-SUB-2: Wire MoeRouter into both CPU + GPU forward (PENDING)
  M-MOE-SUB-3: Wire MoeFfnOut + run heavy bisection (PENDING)
  M-MOE-SUB-4: OPTIONAL per-expert promotion (PENDING)

VALIDATION: pv validate exits 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract": this PR pins the trace-stage architecture before code lands. M-MOE-SUB-1 follows in a separate PR.

Refs: M-GPU-MOE-1.4, R10, qwen3-moe-forward-gpu-v1 v1.4.0, trace-attn-sub-stages-v1 (sibling pattern).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
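The `cos_sequence` chain in this contract can be evaluated with a plain cosine similarity and a scan for the first stage that drops below a threshold. The sketch below is illustrative only: `cosine`, `StagePair`, `first_divergent_stage`, and the `min_cos` threshold are hypothetical, and the repository's own cosine helpers may differ.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns NaN if either input contains NaN or has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// One link of the bisection chain: the same stage captured on CPU and GPU.
pub struct StagePair<'a> {
    pub name: &'a str,
    pub cpu: &'a [f32],
    pub gpu: &'a [f32],
}

/// Walk the chain in forward order and return the first stage whose
/// CPU-vs-GPU cosine is NaN or below `min_cos`.
pub fn first_divergent_stage<'a>(chain: &[StagePair<'a>], min_cos: f32) -> Option<&'a str> {
    chain
        .iter()
        .find(|s| {
            let c = cosine(s.cpu, s.gpu);
            c.is_nan() || c < min_cos
        })
        .map(|s| s.name)
}
```

Treating a NaN cosine as divergence matters here, since the failure under investigation is non-finite output rather than a small numerical drift.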


noahgift enabled auto-merge (squash) May 4, 2026 22:08

noahgift force-pushed the feat/m-gpu-moe-1-3-preload-bug-fix branch from eba93d0 to 8b07755 Compare May 4, 2026 22:28

noahgift and others added 2 commits May 5, 2026 00:50

noahgift force-pushed the feat/m-gpu-moe-1-3-preload-bug-fix branch from 8b07755 to 2ebacaf Compare May 4, 2026 22:50

noahgift merged commit f0cbe37 into main May 4, 2026
10 checks passed

noahgift deleted the feat/m-gpu-moe-1-3-preload-bug-fix branch May 4, 2026 23:09

noahgift mentioned this pull request May 4, 2026

contract(qwen3-moe-forward-gpu-v1): v1.3.0 → v1.4.0 — NaN/Inf bisection plan #1492

Merged

2 tasks

noahgift mentioned this pull request May 5, 2026

contract(trace-moe-gpu-sub-stages-v1): v1.0.0 PROPOSED scaffold for M-GPU-MOE-1.4 bisection #1498

Merged

2 tasks

noahgift mentioned this pull request May 6, 2026

contract(qwen3-moe-forward-gpu-v1): v1.6.0 → v1.7.0 — DRAFT → ACTIVE_ALGORITHM_LEVEL #1530

Merged

3 tasks

noahgift merged 2 commits into main from feat/m-gpu-moe-1-3-preload-bug-fix
