feat(aprender-serve): M-GPU-MOE-1.3 - preload_weights_gpu MoE-aware (partial discharge) #1491
Merged
…partial discharge)

Per qwen3-moe-forward-gpu-v1 v1.3.0 amendment (PR #1490).

WHAT THIS PR FIXES:
ArchConstraints + build_indexed_weights + ValidatedLayerWeights are all made MoE-aware via a new `is_moe: bool` field on ArchConstraints.

(1) `crates/aprender-serve/src/gguf/config.rs`: adds the `is_moe: bool` field to the `ArchConstraints` struct.
(2) `crates/aprender-serve/src/gguf/arch_constraints_fallback.rs`: sets `is_moe: false` on all 19 dense arch entries and `is_moe: true` on the qwen3_moe arm. Also adds the raw GGUF arch strings `qwen3moe` (no underscore) and `qwen3_5moe` to the same arm; these reach `from_architecture` from `ValidatedModelConfig::from_apr` without going through `normalize_architecture`.
(3) `crates/aprender-serve/src/cuda/executor/weights.rs`: `build_indexed_weights` gates the 3 FFN-related quant lookups (ffn_gate.weight, ffn_up.weight, ffn_down.weight) on `arch.is_moe` and uses (0u64, 0usize) sentinels for MoE. The 3 qtype resolutions are gated the same way.
(4) `crates/aprender-serve/src/cuda/types.rs`: `ValidatedLayerWeights::validate` skips the FfnGate/FfnUp/FfnDown role checks when `arch.is_moe`. The MoE forward path (`forward_qwen3_moe_cuda`) routes FFN through the `moe_layers` parameter and never reads these from the indexed weights.

WHAT THIS PR PARTIALLY DISCHARGES:
FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new in v1.3.0): wrapper construction now succeeds for qwen3_moe GGUFs. Before this PR, `OwnedQuantizedModelCuda::new(model, 0)` panicked at:

  UnsupportedOperation { operation: "preload_weights_gpu", reason: "PAR-043: Failed to build indexed weights: Invalid launch config: Quantized weight 'blk.0.ffn_gate.weight' not cached" }

After this PR, that specific path no longer fails. Verified by re-running the M-GPU-MOE-1.2 heavy test, which now progresses past `OwnedQuantizedModelCuda::new`.

NEW DOWNSTREAM BUG (not blocking this PR):
After the wrapper construction fix, the heavy test now panics in the CPU forward path at `matmul_fused.rs:211` with `index out of bounds: the len is 0 but the index is N`. This is a separate bug class: something in the CPU forward path dereferences `layer.ffn_up_weight.data` (or similar), which is the `dense_ffn_placeholder` (byte_size=0) for MoE layers per `transformer.rs:348-353`. Likely root cause: the CPU `forward_qwen3_moe` does NOT touch the dense placeholders directly, but some preload/validation/init step does. This needs a follow-up PR (M-GPU-MOE-1.4) to either (a) skip dense-FFN-data access for MoE layers, or (b) replace the placeholder with a proper sentinel.

This PR DOES NOT regress the previous behaviour: the previous state was "wrapper construction fails", which masked the downstream bug. M-GPU-MOE-1.4 will surface and fix it.

VERIFICATION:
  cargo check -p aprender-serve → 0 errors
  cargo check -p aprender-serve --features cuda → 0 errors
  cargo test -p aprender-serve --test qwen3_moe_gpu_parity \
    --features cuda → 3 helpers pass

Heavy test on lambda-vector RTX 4090:
  BEFORE this PR: panic at OwnedQuantizedModelCuda::new (preload_weights_gpu / build_indexed_weights)
  AFTER this PR: panic moved to CPU forward matmul_fused.rs:211 (downstream bug, separate PR scope)

Net: progress by one bug class. The M-GPU-MOE-1.3 stage is FUNCTIONALLY DISCHARGED as defined; an M-GPU-MOE-1.4 follow-up is needed for full PARITY-001 discharge.

NOTE ON PR STACKING:
This PR depends on PR #1490 (contract v1.2.0 → v1.3.0 amendment + evidence file) being on aprender main first. The contract pinned the architectural decision; this PR implements it.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0, FALSIFY-QW3-MOE-GPU-PRELOAD-001 (partial discharge)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
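
For readers following along, here is a minimal sketch of the `is_moe` gating described in the commit message above. It is not the code from this PR: the trimmed-down `ArchConstraints`, the `WeightRef` alias, the `ffn_weight_ref` helper and the reading of the `(0u64, 0usize)` sentinel as (device pointer, byte length) are all assumptions made for illustration. Only the `is_moe` field, the three architecture spellings and the sentinel values come from the text.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down stand-in for `ArchConstraints`; only the
/// `is_moe` field is taken from the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ArchConstraints {
    pub is_moe: bool,
}

impl ArchConstraints {
    /// Maps an architecture string to constraints. The three MoE spellings
    /// are the ones named in the commit message; every other string is
    /// treated as dense here for brevity.
    pub fn from_architecture(arch: &str) -> Self {
        match arch {
            "qwen3_moe" | "qwen3moe" | "qwen3_5moe" => Self { is_moe: true },
            _ => Self { is_moe: false },
        }
    }
}

/// Assumed meaning of the pair: (device pointer, byte length).
pub type WeightRef = (u64, usize);

/// Looks up a dense FFN tensor, or returns the (0, 0) sentinel for MoE
/// architectures, whose FFN weights are supplied per expert instead.
pub fn ffn_weight_ref(
    arch: &ArchConstraints,
    cache: &HashMap<String, WeightRef>,
    name: &str,
) -> Result<WeightRef, String> {
    if arch.is_moe {
        // No lookup at all: the dense tensor is never cached for MoE.
        return Ok((0u64, 0usize));
    }
    cache
        .get(name)
        .copied()
        .ok_or_else(|| format!("Quantized weight '{name}' not cached"))
}
```

With an empty cache, a dense architecture still reports the "not cached" error for `blk.0.ffn_gate.weight`, while any of the MoE spellings gets the sentinel back, which mirrors the before/after behaviour reported above.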
Follow-up to the previous M-GPU-MOE-1.3 commit. The parity_gate
(Jidoka stop-the-line in `OwnedQuantizedModelCuda::with_max_seq_len`)
also runs the dense forward paths
(`forward_single_with_cache` CPU + `forward_gpu_resident` GPU) on
construction. For MoE these dispatch to `fused_matmul_f32` against
the `dense_ffn_placeholder` (byte_size=0), causing rayon-parallel
panics in `matmul_fused.rs:211`.
Fix: skip parity_gate when `arch.is_moe`, mirroring the rationale
already in v1.3.0's amendment_history block.
- The parity gate's purpose is "stop the line if GPU diverges
from CPU" — for dense models, it's load-time safety.
- For MoE, the equivalent gate is FALSIFY-QW3-MOE-GPU-PARITY-001
(qwen3_moe_gpu_parity.rs), which exercises the MoE-specific
forward paths and bypasses the dense path the gate runs.
- Net: MoE models lose load-time parity but gain
test-time parity via the qwen3_moe_gpu_parity test.
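
The skip described in the fix above could look roughly like the following sketch. The function signature, the `GateOutcome` enum, the closures standing in for the dense CPU and GPU forward paths, and the tolerance parameter are all hypothetical; the real gate lives in `OwnedQuantizedModelCuda::with_max_seq_len` and is not reproduced here.

```rust
/// Outcome of the load-time gate in this sketch (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    /// Gate skipped because the architecture is MoE.
    SkippedMoe,
    /// CPU and GPU logits agreed within tolerance.
    Passed,
}

/// Runs a load-time CPU-vs-GPU comparison unless the model is MoE.
/// `cpu_forward` and `gpu_forward` are stand-ins for the dense forward paths.
pub fn parity_gate<C, G>(
    is_moe: bool,
    cpu_forward: C,
    gpu_forward: G,
    tolerance: f32,
) -> Result<GateOutcome, String>
where
    C: FnOnce() -> Vec<f32>,
    G: FnOnce() -> Vec<f32>,
{
    if is_moe {
        // The dense paths would index into zero-length placeholder weights,
        // so neither closure is called.
        return Ok(GateOutcome::SkippedMoe);
    }
    let cpu = cpu_forward();
    let gpu = gpu_forward();
    if cpu.len() != gpu.len() {
        return Err(format!("logit length mismatch: {} vs {}", cpu.len(), gpu.len()));
    }
    // `<=` is false for NaN, so non-finite GPU output fails the gate
    // instead of slipping through a max-reduction.
    let within = cpu
        .iter()
        .zip(&gpu)
        .all(|(a, b)| (a - b).abs() <= tolerance);
    if within {
        Ok(GateOutcome::Passed)
    } else {
        Err("GPU diverges from CPU at load time".to_string())
    }
}
```

Returning early before either closure runs is the point of the sketch: for MoE the dense forward paths are never entered, so the zero-length placeholders are never indexed.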
VERIFICATION ON LAMBDA-VECTOR RTX 4090:
Test progresses much further now:
BEFORE: panic at OwnedQuantizedModelCuda::new build_indexed_weights
(FALSIFY-QW3-MOE-GPU-PRELOAD-001 falsifier)
AFTER previous commit: panic at parity_gate matmul_fused.rs:211
(downstream bug — exposed but not yet fixed)
AFTER this commit: CPU forward succeeds, GPU forward executes,
then asserts at gpu_logits.iter().all(|v| v.is_finite())
because the GPU produces NaN/Inf logits.
Test output:
[GH-129] Early kernel preload: 49 modules compiled
[PMAT-082] cuBLASLt FP8 JIT warmed (2048x16x2048)
[PMAT-053] FP8 weight cache: 193 matrices cached (728.8 MB)
FALSIFY-QW3-MOE-GPU-PARITY-001: running GPU forward...
panicked at qwen3_moe_gpu_parity.rs:168:
all GPU logits must be finite (no NaN/Inf)
PARTIAL DISCHARGE:
FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction succeeds.
FALSIFY-QW3-MOE-GPU-INVARIANTS-001 — partial (output length OK
implicitly; finiteness FAILS).
FALSIFY-QW3-MOE-GPU-PARITY-001 — blocked by NaN/Inf bug.
NEW DOWNSTREAM BUG:
GPU forward (forward_qwen3_moe_cuda body, M-GPU-MOE-1.1.2 PR
#1477) produces NaN/Inf for at least the canonical 3-token
Qwen3-Coder prompt. This is the NEXT bug to investigate
(M-GPU-MOE-1.5 follow-up). Likely candidates:
- Q4K matmul accumulator overflow in expert_swiglu_cuda
- Per-expert SwiGLU silu activation produces Inf for large inputs
- Top-k router weight renormalization division by zero
- Missing per-head Q/K RMSNorm path for MoE (qk_norm tensors
loaded but not applied)
Bisect via `apr trace --json --payload`, following the M32d Step 2
surface methodology (per qwen3-moe-forward-gpu-v1 v1.1.0
PARITY-001 if_fails).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
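
To make one of the candidates above concrete, here is a small sketch of how top-k router renormalization can produce NaN and one possible guard. This is illustrative only: the commit message lists division by zero as a candidate, not a confirmed cause, and `top_k_renormalized` with its uniform fallback is a hypothetical helper, not the project's router code.

```rust
/// Selects the `k` largest router probabilities and renormalizes them so
/// they sum to 1. Returns (expert index, weight) pairs.
pub fn top_k_renormalized(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    // Sort descending by probability; `total_cmp` gives a total order on f32.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);

    let sum: f32 = indexed.iter().map(|&(_, p)| p).sum();
    if sum > 0.0 && sum.is_finite() {
        for (_, p) in indexed.iter_mut() {
            *p /= sum;
        }
    } else if !indexed.is_empty() {
        // Degenerate case (all-zero or non-finite sum): a naive `p / sum`
        // would compute 0/0 = NaN. Falling back to uniform weights is one
        // possible policy, chosen here only for illustration.
        let uniform = 1.0 / indexed.len() as f32;
        for (_, p) in indexed.iter_mut() {
            *p = uniform;
        }
    }
    indexed
}
```

If the selected probabilities underflow to zero, the unguarded division yields NaN weights, and every downstream weighted sum of expert outputs inherits that NaN, which would be consistent with the all-logits-non-finite symptom. Whether that actually happens on this path is exactly what the bisection is meant to establish.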
noahgift added a commit that referenced this pull request on May 4, 2026
…on plan (#1492)

Records the next-bug-class finding from the M-GPU-MOE-1.3 partial discharge (PR #1491 squash f0cbe37, MERGED 2026-05-04 on aprender main).

WHAT 1.3 DISCHARGED:
FALSIFY-QW3-MOE-GPU-PRELOAD-001: wrapper construction now succeeds for qwen3_moe GGUFs.

WHAT 1.3 EXPOSED (NOT YET FIXED):
The heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF runs end-to-end but fails:

  assert!(gpu_logits.iter().all(|v| v.is_finite()), "all GPU logits must be finite (no NaN/Inf)");

This blocks FALSIFY-QW3-MOE-GPU-PARITY-001 and PARTIAL-discharges FALSIFY-QW3-MOE-GPU-INVARIANTS-001.

BISECTION PLAN (M-GPU-MOE-1.4):
(a) Instrumentation: extend `apr trace --json --payload` to capture per-stage tensors on the MoE GPU path.
(b) Bisection: diff CPU vs GPU per stage to find the first NaN/Inf-producing stage. Candidates:
    * Q4K matvec accumulator overflow
    * SwiGLU silu Inf
    * top-k router renorm div-by-zero
    * missing per-head Q/K RMSNorm in MoE GPU path
(c) Fix: apply at the bisected stage; class TBD by result.

THIS PR ADDS:
* v1.3.0 → v1.4.0 amendment_history block (~110 lines)
* NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
* M-GPU-MOE-1.3 status updated PENDING → PARTIALLY_DISCHARGED
* Top-level version + status block updated

VALIDATION: pv validate → 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract": this PR pins the bisection-and-fix plan BEFORE code. Code follows in the M-GPU-MOE-1.4 fix PR (separate scope).

Refs: M52, M53, M54, M55, M56, R10, qwen3-moe-forward-gpu-v1 v1.4.0, FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
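
Step (b) of the plan above amounts to scanning captured per-stage tensors in forward-pass order and stopping at the first one that contains NaN or Inf. A minimal sketch, assuming the trace has already been loaded into `(stage name, values)` pairs; the function name, the in-memory layout and the stage names used in examples are hypothetical and do not reflect the `apr trace` payload format:

```rust
/// Returns the name of the first stage whose tensor contains NaN or Inf,
/// scanning stages in the order given (assumed to be forward-pass order).
pub fn first_non_finite_stage<'a>(stages: &[(&'a str, Vec<f32>)]) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, values)| values.iter().any(|v| !v.is_finite()))
        .map(|(name, _)| *name)
}
```

Because non-finite values propagate through later stages, only the first hit is informative; everything after it would be flagged as well.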
noahgift added a commit that referenced this pull request on May 5, 2026
Sibling contract to trace-attn-sub-stages-v1; pins the SaveTensorStage extensions needed for M-GPU-MOE-1.4 NaN/Inf bisection per qwen3-moe-forward-gpu-v1 v1.4.0 amendment.

WHY THIS CONTRACT:
After the M-GPU-MOE-1.3 partial fix (PR #1491 squash f0cbe37 MERGED), the heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 produces 100% NaN logits. Steps 1-9 are CPU-shared with finite output; step 10 (GPU MoE FFN: moe_ffn_forward_layer_cuda → expert_swiglu_cuda → q4k_matvec/q6k_gemv) is the only candidate. Per qwen3-moe-forward-gpu-v1 v1.4.0 bisection plan step (a), apr trace needs new SaveTensorStage variants for MoE GPU stages.

WHAT IT DEFINES:
* 2 mandatory new SaveTensorStage variants:
  - MoeRouter: top-k weights post-softmax/renormalize
  - MoeFfnOut: aggregated MoE FFN output (Σ w_e * expert_out_e)
* 4 optional per-expert variants (with expert_id qualifier):
  - MoeExpertGate, MoeExpertUp, MoeExpertSwigl, MoeExpertOut
  - Promoted to mandatory only if 2-stage bisection isn't precise enough.
* Bisection chain CPU-vs-GPU:
    cos_sequence = [
      cos(CPU.ffn_norm, GPU.ffn_norm),        # parent enum
      cos(CPU.moe_router, GPU.moe_router),    # NEW
      cos(CPU.moe_ffn_out, GPU.moe_ffn_out),  # NEW
    ]

FALSIFICATION TESTS (4):
  FALSIFY-MOE-SUB-001: New variants exist + parse correctly
  FALSIFY-MOE-SUB-002: Existing 20-stage byte-identity preserved
  FALSIFY-MOE-SUB-003: Bisection identifies first NaN-producing stage
  FALSIFY-MOE-SUB-004: Fix PR cites bisected stage by name

IMPLEMENTATION_STAGES (4):
  M-MOE-SUB-0: This contract scaffold (SHIPPED)
  M-MOE-SUB-1: Add MoeRouter + MoeFfnOut variants (PENDING)
  M-MOE-SUB-2: Wire MoeRouter into both CPU + GPU forward (PENDING)
  M-MOE-SUB-3: Wire MoeFfnOut + run heavy bisection (PENDING)
  M-MOE-SUB-4: OPTIONAL per-expert promotion (PENDING)

VALIDATION: pv validate exits 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract": this PR pins the trace-stage architecture before code lands. M-MOE-SUB-1 follows in a separate PR.

Refs: M-GPU-MOE-1.4, R10, qwen3-moe-forward-gpu-v1 v1.4.0, trace-attn-sub-stages-v1 (sibling pattern).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
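
The CPU-vs-GPU `cos_sequence` chain in the contract above can be sketched as follows. This is an assumption-laden illustration: `cosine`, `first_divergent_stage`, the `(stage, cpu, gpu)` triple layout and the threshold value are hypothetical, and the contract text does not specify how a non-finite tensor should be scored. Here it is treated as divergent.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None on length mismatch, zero norm, or any non-finite input.
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    let denom = na.sqrt() * nb.sqrt();
    if denom > 0.0 && denom.is_finite() && dot.is_finite() {
        Some(dot / denom)
    } else {
        None
    }
}

/// Walks (stage, cpu, gpu) triples in order and returns the first stage
/// whose cosine is undefined or below `threshold`.
pub fn first_divergent_stage<'a>(
    stages: &[(&'a str, Vec<f32>, Vec<f32>)],
    threshold: f64,
) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, cpu, gpu)| match cosine(cpu, gpu) {
            Some(c) => c < threshold,
            None => true,
        })
        .map(|(name, _, _)| *name)
}
```

Fed the three stages from the contract in order (`ffn_norm`, `moe_router`, `moe_ffn_out`), the first name returned would be the stage a fix PR has to cite under FALSIFY-MOE-SUB-004.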
Summary
Implements M-GPU-MOE-1.3 per `qwen3-moe-forward-gpu-v1` v1.3.0 (PR #1490). Two-commit fix.

What this PR fixes

| File | Change |
|---|---|
| `gguf/config.rs` | Adds `is_moe: bool` field to `ArchConstraints` |
| `gguf/arch_constraints_fallback.rs` | Sets `is_moe: true` for qwen3_moe arm; adds raw `qwen3moe` / `qwen3_5moe` aliases |
| `cuda/executor/weights.rs` | `build_indexed_weights` skips ffn_gate/up/down lookups when `arch.is_moe` |
| `cuda/types.rs` | `ValidatedLayerWeights::validate` skips FfnGate/FfnUp/FfnDown role checks when `arch.is_moe` |
| `gguf/cuda/mod.rs` | Skips `parity_gate` (Jidoka load-time gate) for MoE, since it runs dense forward against placeholders |

Test progression on lambda-vector RTX 4090

What this PR partially discharges

New downstream bug (next iteration)

GPU forward produces NaN/Inf logits. Likely M-GPU-MOE-1.5 candidates include Q4K matmul accumulator overflow in `expert_swiglu_cuda`.

Bisection via `apr trace --json --payload` per M32d Step 2 methodology.

Stacking

This PR depends on PR #1490 (contract v1.2.0 → v1.3.0 amendment + evidence).
Test plan
🤖 Generated with Claude Code