feat(aprender-serve): forward_qwen3_moe_cuda full integration - M-GPU-MOE-1.1.2 (#1477)
Merged
Replaces the M-GPU-MOE-1.0-redo stub body with the full forward
integration. forward_qwen3_moe_cuda now mirrors the CPU sibling
OwnedQuantizedModel::forward_qwen3_moe (forward_qwen3_moe.rs)
line-for-line, with one difference: the per-layer FFN section
routes through moe_ffn_forward_layer_cuda which dispatches per-
expert matmuls to self.executor (CudaExecutor) via the
expert_swiglu_cuda helper.
Per qwen3-moe-forward-gpu-v1 v1.1.0 option D — extends the existing
OwnedQuantizedModelCuda CPU-attention + CUDA-FFN pattern (forward_cuda
in cuda.rs:18). The attention path stays on CPU; only the FFN matmuls
go to the GPU. M-GPU-MOE-3 (pending) is planned to fuse the per-expert
dispatch into a single sparse-expert kernel, targeting ~5× throughput.
Signature changes
=================
- &self → &mut self (the executor needs mutable access for its kernel cache)
- _data → data (passed to moe_ffn_forward_layer_cuda for
expert_byte_slice)
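The receiver change can be illustrated with a minimal sketch. `Executor`, `ModelCuda`, and `get_or_compile` below are hypothetical names, not the crate's API; the sketch only shows why a lazily populated kernel cache pushes `&mut self` up to every caller.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a compiled GPU kernel handle.
#[derive(Debug, Clone, PartialEq)]
struct Kernel {
    name: String,
}

/// Hypothetical executor that compiles kernels lazily and caches them.
#[derive(Default)]
struct Executor {
    cache: HashMap<String, Kernel>,
    compiles: usize,
}

impl Executor {
    /// Inserting into the cache mutates `self`, so this needs `&mut self`.
    fn get_or_compile(&mut self, name: &str) -> &Kernel {
        if !self.cache.contains_key(name) {
            self.compiles += 1;
            let kernel = Kernel { name: name.to_string() };
            self.cache.insert(name.to_string(), kernel);
        }
        &self.cache[name]
    }
}

/// Hypothetical wrapper that owns the executor.
#[derive(Default)]
struct ModelCuda {
    executor: Executor,
}

impl ModelCuda {
    /// Any method that reaches the cache inherits the `&mut self` requirement.
    fn forward(&mut self, kernels: &[&str]) -> usize {
        for k in kernels {
            self.executor.get_or_compile(k);
        }
        self.executor.compiles
    }
}

fn main() {
    let mut model = ModelCuda::default();
    let compiled = model.forward(&["q4k", "q6k", "q4k"]);
    println!("kernels compiled: {compiled}");
}
```

Interior mutability (for example a `RefCell` or `Mutex` around the cache) would keep `&self`, at the cost of runtime borrow checks; the PR description indicates the plain `&mut self` route was taken.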
Forward body structure (mirrors CPU sibling step-for-step):
1. Embed (CPU) — self.model.embed
2. Per-layer:
2a. Attention norm (CPU) — ops::rms_norm
2b. QKV projection (CPU) — self.model.qkv_matmul
2c. Per-head Q/K RMSNorm + RoPE (M32d Step 5/5b) — ops::apply_per_head_rms_norm
2d. Causal attention + output proj (CPU) — self.model.causal_attention
2e. Residual — element-wise CPU
2f. Pre-FFN norm (CPU) — ops::rms_norm
2g. **MoE FFN on GPU** — moe_ffn_forward_layer_cuda
→ expert_swiglu_cuda
→ self.executor.q4k_matvec
.q6k_gemv
2h. Residual — element-wise CPU
3. Final norm (CPU)
4. LM head — last token (CPU)
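Step 2g can be sketched on the CPU with unquantized weights. This is an illustration under stated assumptions, not the crate's code: `Expert`, `route`, and `moe_ffn` are hypothetical names, the weights are plain `f32` rather than Q4_K/Q6_K blocks, and routing is assumed to be softmax followed by top-k with renormalization of the kept weights.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Dense row-major matvec; `w` holds `rows * cols` values.
fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    (0..rows)
        .map(|r| {
            let row = &w[r * cols..(r + 1) * cols];
            row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>()
        })
        .collect()
}

/// Hypothetical unquantized expert (hidden -> inter -> hidden).
struct Expert {
    gate: Vec<f32>,
    up: Vec<f32>,
    down: Vec<f32>,
    hidden: usize,
    inter: usize,
}

impl Expert {
    /// SwiGLU: down(silu(gate x) * (up x)), element-wise product in the middle.
    fn swiglu(&self, x: &[f32]) -> Vec<f32> {
        let g = matvec(&self.gate, self.inter, self.hidden, x);
        let u = matvec(&self.up, self.inter, self.hidden, x);
        let act: Vec<f32> = g.iter().zip(&u).map(|(g, u)| silu(*g) * u).collect();
        matvec(&self.down, self.hidden, self.inter, &act)
    }
}

/// Softmax over router logits, keep the top-k, renormalise them to sum to 1.
fn route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| exps[b].total_cmp(&exps[a]));
    idx.truncate(k);
    let sum: f32 = idx.iter().map(|&i| exps[i]).sum();
    idx.into_iter().map(|i| (i, exps[i] / sum)).collect()
}

/// Weighted sum of the selected experts' SwiGLU outputs for one token.
fn moe_ffn(x: &[f32], router: &[f32], experts: &[Expert], k: usize) -> Vec<f32> {
    let logits = matvec(router, experts.len(), x.len(), x);
    let mut out = vec![0.0; x.len()];
    for (e, w) in route(&logits, k) {
        for (o, y) in out.iter_mut().zip(experts[e].swiglu(x)) {
            *o += w * y;
        }
    }
    out
}

fn main() {
    // One expert with 2x2 identity weights, so only the activation shapes the output.
    let id = vec![1.0, 0.0, 0.0, 1.0];
    let expert = Expert { gate: id.clone(), up: id.clone(), down: id, hidden: 2, inter: 2 };
    let router = vec![1.0, 1.0]; // 1 expert x 2 hidden
    let out = moe_ffn(&[1.0, 0.0], &router, &[expert], 1);
    println!("{out:?}");
}
```

In the PR's design only the three matvecs inside the expert move to the GPU; routing, residuals, and norms stay on the CPU as listed above.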
Implementation stages updated
=============================
M-GPU-MOE-0         Contract scaffold v1.0.0              SHIPPED ✓
M-GPU-MOE-0.5       v1.1.0 option D amendment             SHIPPED ✓
M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda       SHIPPED ✓ (#1464)
M-GPU-MOE-1.1.0     expert_swiglu_cuda helper             SHIPPED ✓ (via #1469 squash)
M-GPU-MOE-1.1.1     moe_ffn_forward_layer_cuda            SHIPPED ✓ (#1469)
M-GPU-MOE-1.1.2     forward_qwen3_moe_cuda full integ     SHIPPED ✓ (THIS PR)
M-GPU-MOE-1.2       Cosine-vs-CPU parity gate ≥0.99       PENDING
                    (FALSIFY-QW3-MOE-GPU-PARITY-001)
M-GPU-MOE-2         wgpu fallback                         PENDING
M-GPU-MOE-3         Throughput ≥150 tok/s + VRAM ≤ 95%    PENDING
Verification
============
$ cargo check -p aprender-serve --features cuda
✓ Compiles
$ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
test ... ok. 1 passed
Refs PR #1469 squash 77b9f0d (helpers landed)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 4, 2026
…99 falsifier
Authors the FALSIFY-QW3-MOE-GPU-PARITY-001 test scaffold from contract
qwen3-moe-forward-gpu-v1 v1.1.0 implementation_stages M-GPU-MOE-1.2.
WHAT THE TEST DOES (when run with `--include-ignored` against the cached
17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf on RTX 4090):
1. Loads the GGUF once (single mmap).
2. Builds moe_layers: Vec<Qwen3MoeQuantizedLayer> once.
3. Builds CPU OwnedQuantizedModel #1 → runs forward_qwen3_moe on a
   fixed prompt → cpu_logits (the LAZY-FUSED-MATVEC ground truth).
4. Builds CPU OwnedQuantizedModel #2 → wraps into
   OwnedQuantizedModelCuda → runs forward_qwen3_moe_cuda on the same
   prompt → gpu_logits.
5. Computes cosine_similarity(cpu_logits, gpu_logits) over the full
   151936-dim vocab.
6. Asserts cos_sim ≥ 0.99 per the contract's formal bound.
The test follows the qwen3_moe_parity.rs (M32d.2 CPU-vs-HF-FP16)
template line-for-line: same canonical GGUF paths array, same
fixture-skip pattern, same cosine_similarity helper. The only
difference is the second forward pass dispatches to
forward_qwen3_moe_cuda instead of treating an FP32 fixture as truth.
CI WIRING:
- #[cfg(feature = "cuda")] gates the entire file (no GPU host = no
  compile)
- #[ignore] on the heavy test (CI default skips; explicit
  `--include-ignored` runs it)
- 3 helper unit tests (cosine_similarity_unit_vectors / handles_zero /
  within_threshold) DO run by default; they cover the cosine helper
  itself
WHEN THE TEST PASSES:
- The aprender PR #1477 (M-GPU-MOE-1.1.2 full forward integration) must
  be on main first. Currently main has the v1.0-redo stub; running this
  test against the stub returns UnsupportedOperation error and the test
  panics (correct behaviour for a falsifier against an incomplete impl).
- Once #1477 lands, run the test on lambda-vector with:
    cargo test -p aprender-serve --test qwen3_moe_gpu_parity \
      --features cuda -- --include-ignored
- On PASS, the contract's M-GPU-MOE-1.2 stage flips PENDING → SHIPPED
  and (with PARITY-002 from the v1 sibling) the gate discharges
  qwen3-moe-forward-gpu-v1 v1.1.0 DRAFT → ACTIVE_ALGORITHM_LEVEL.
PR #1477 changes forward_qwen3_moe_cuda's receiver from `&self` to
`&mut self` (kernel cache mutation). The `mut gpu_model` binding here
carries a forward-looking #[allow(unused_mut)] note for that reason.
Refs: qwen3-moe-forward-gpu-v1 v1.1.0 :: M-GPU-MOE-1.2 +
FALSIFY-QW3-MOE-GPU-PARITY-001 + companion-spec M51 + R10.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
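The cosine gate described in this commit can be sketched as follows. This is not the test file's actual helper: the zero-norm convention (returning 0.0), the `check_parity` wrapper, and the finiteness check ordering are assumptions made for illustration.

```rust
/// Cosine similarity accumulated in f64.
/// Assumption: a zero-norm input yields 0.0 rather than NaN.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "logit vectors must have equal length");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Hypothetical gate: reject non-finite GPU logits first, then compare
/// the cosine against the threshold (0.99 in the contract).
fn check_parity(cpu: &[f32], gpu: &[f32], threshold: f64) -> Result<f64, String> {
    if let Some(i) = gpu.iter().position(|v| !v.is_finite()) {
        return Err(format!("GPU logit {i} is not finite"));
    }
    let cos = cosine_similarity(cpu, gpu);
    if cos >= threshold {
        Ok(cos)
    } else {
        Err(format!("cosine {cos:.6} below threshold {threshold}"))
    }
}

fn main() {
    let cpu = [0.5f32, -1.25, 3.0];
    let gpu = [0.5f32, -1.25, 3.001];
    println!("{:?}", check_parity(&cpu, &gpu, 0.99));
}
```

Checking finiteness before the cosine matters because a single NaN poisons the dot product, and the resulting NaN compares false against any threshold, which hides the real cause.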
noahgift added a commit that referenced this pull request on May 4, 2026
…QuantizedModelWgpu)
Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the CUDA
substrate before M-GPU-MOE-1.0 implementation; this one pins the wgpu
substrate before any wgpu code lands.
Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484). Choosing
the wgpu seam early prevents the wrong-type-stub waste that bit
M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redo'd it on
OwnedQuantizedModelCuda, i.e. option D).
FOUR options considered:
(I)   OwnedQuantizedModelWgpu wrapper type (analog of v1.1.0 option D): CHOSEN
(II)  GpuExecutor trait abstracting CUDA + wgpu: REJECTED (over-engineered)
(III) Backend enum inside renamed OwnedQuantizedModelGpu: REJECTED (invasive)
(IV)  Defer wgpu indefinitely: REJECTED (violates CLAUDE.md
      backend-agnostic mandate)
Option I picks wgpu by code-path symmetry, not by trait abstraction:
new file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. Maintenance-mode
reviewer can verify a parity bug by diff, not by elaborate test
infrastructure.
M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
M-GPU-MOE-2.0 stub on OwnedQuantizedModelWgpu
M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu,
              moe_ffn_forward_layer_wgpu)
M-GPU-MOE-2.2 full forward integration (replaces 2.0 stub body)
M-GPU-MOE-2.3 cosine-vs-CPU parity test on hardware with wgpu
Two new blockers documented:
- wgpu adapter selection probe for non-NVIDIA hardware
- trueno-gpu Q6_K QuantizeKernel coverage check before 2.1
Companion-spec records this as M52 (no companion contract bump).
Validation: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
→ 0 error(s), 0 warning(s). Contract is valid.
Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 4, 2026
…99 falsifier (#1484)
noahgift added a commit that referenced this pull request on May 4, 2026
…arity test (#1485)
* contract(qwen3-moe-forward-gpu-v1): v1.1.0 → v1.2.0 - option I (OwnedQuantizedModelWgpu)
Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the CUDA
substrate before M-GPU-MOE-1.0 implementation; this one pins the wgpu
substrate before any wgpu code lands.
Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484). Choosing
the wgpu seam early prevents the wrong-type-stub waste that bit
M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redo'd it on
OwnedQuantizedModelCuda, i.e. option D).
FOUR options considered:
(I)   OwnedQuantizedModelWgpu wrapper type (analog of v1.1.0 option D): CHOSEN
(II)  GpuExecutor trait abstracting CUDA + wgpu: REJECTED (over-engineered)
(III) Backend enum inside renamed OwnedQuantizedModelGpu: REJECTED (invasive)
(IV)  Defer wgpu indefinitely: REJECTED (violates CLAUDE.md
      backend-agnostic mandate)
Option I picks wgpu by code-path symmetry, not by trait abstraction:
new file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. Maintenance-mode
reviewer can verify a parity bug by diff, not by elaborate test
infrastructure.
M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
M-GPU-MOE-2.0 stub on OwnedQuantizedModelWgpu
M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu,
              moe_ffn_forward_layer_wgpu)
M-GPU-MOE-2.2 full forward integration (replaces 2.0 stub body)
M-GPU-MOE-2.3 cosine-vs-CPU parity test on hardware with wgpu
Two new blockers documented:
- wgpu adapter selection probe for non-NVIDIA hardware
- trueno-gpu Q6_K QuantizeKernel coverage check before 2.1
Companion-spec records this as M52 (no companion contract bump).
Validation: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
→ 0 error(s), 0 warning(s). Contract is valid.
Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(aprender-serve): OwnedQuantizedModelWgpu stub - M-GPU-MOE-2.0 (#1487)
Implements M-GPU-MOE-2.0 per qwen3-moe-forward-gpu-v1 v1.2.0 option I
(see PR #1485 amendment). Analog of M-GPU-MOE-1.0-redo (PR #1464) for
the wgpu backend.
WHAT THIS PR ADDS:
- crates/aprender-serve/src/gguf/wgpu_backend/mod.rs: new module with
  OwnedQuantizedModelWgpu struct + new() + stub method
  forward_qwen3_moe_wgpu(). Mirrors cuda/mod.rs structure.
- crates/aprender-serve/src/gguf/wgpu_model.rs: re-export shim
  `pub use super::wgpu_backend::OwnedQuantizedModelWgpu`. Mirrors
  cuda_model.rs.
- crates/aprender-serve/src/gguf/mod.rs: adds the two new modules
  behind `#[cfg(feature = "gpu")]` (the existing wgpu feature flag,
  `gpu = ["trueno/gpu"]` per Cargo.toml line 208).
WHY MODULE NAMED `wgpu_backend`:
The Rust ecosystem already has a `wgpu` crate. A module named `wgpu`
inside the same crate would shadow it inside the file's body. The
public re-export still presents `OwnedQuantizedModelWgpu` (no ugly
suffix) thanks to wgpu_model.rs.
WHY THIS IS A STUB:
Same staging discipline as M-GPU-MOE-1.0-redo: contract first,
scaffold second, implementation third. The body of
forward_qwen3_moe_wgpu validates preconditions (mirroring the cuda
sibling's boundary) then returns RealizarError::UnsupportedOperation
whose reason points at the v1.2.0 amendment block for the M-GPU-MOE-2
staging plan. Until M-GPU-MOE-2.2 lands, callers on non-CUDA hardware
fall back to OwnedQuantizedModel::forward_qwen3_moe (CPU
LAZY-FUSED-MATVEC, ~30 tok/s).
VERIFICATION:
cargo check -p aprender-serve                  → 0 errors (default)
cargo check -p aprender-serve --features cuda  → 0 errors (cuda)
cargo check -p aprender-serve --features gpu   → 0 errors (wgpu)
cargo test -p aprender-serve --lib --features gpu \
  owned_quantized_model_wgpu_tests             → 1 passed
Lib unit test asserts the function signature exists and matches the
cuda sibling step-for-step (compile-time checks via fn pointer
coercion; no runtime model construction needed at the stub stage).
DEPENDS ON: PR #1485 (qwen3-moe-forward-gpu-v1 v1.2.0 option I
amendment). Branch is stacked on the v1.2.0 contract branch; once
#1485 lands on main, this PR rebases onto main directly.
NEXT STAGES per v1.2.0:
M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu,
              moe_ffn_forward_layer_wgpu)
M-GPU-MOE-2.2 full forward integration mirror of cuda sibling
M-GPU-MOE-2.3 cosine-vs-CPU parity test on wgpu hardware
Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.0.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* test(aprender-serve): qwen3_moe_wgpu_parity - M-GPU-MOE-2.3 cosine ≥0.99 falsifier (wgpu) (#1488)
wgpu sibling of `qwen3_moe_gpu_parity.rs` (M-GPU-MOE-1.2, PR #1484).
Asserts cosine ≥ 0.99 between APR's CPU `forward_qwen3_moe` reference
and the wgpu `OwnedQuantizedModelWgpu::forward_qwen3_moe_wgpu`
integration on the same prompt.
Same falsifier ID as the cuda sibling (FALSIFY-QW3-MOE-GPU-PARITY-001):
wgpu is a SECOND backend implementing the same contract gate, not a
different gate. Same threshold (≥ 0.99), same canonical 17.3 GB
Qwen3-Coder GGUF, same 3-token canonical prompt as the cuda test.
CI WIRING:
- #[cfg(feature = "gpu")] gates the file (matches the gate on
  OwnedQuantizedModelWgpu in gguf/mod.rs)
- #[ignore] on the heavy test (CI default skips; explicit
  `--include-ignored` runs it on a wgpu-capable adapter: Apple Silicon
  Metal, AMD Vulkan, Intel ARC Vulkan)
- 2 helper unit tests (cosine_similarity sanity coverage) DO run by
  default
WHEN THE TEST PASSES:
- M-GPU-MOE-2.0 stub returns UnsupportedOperation, so this test
  currently panics at the wgpu forward call (correct behaviour for a
  falsifier against an incomplete impl).
- M-GPU-MOE-2.1 (per-expert wgpu helpers via trueno-gpu QuantizeKernel
  + GemmKernel compute pipelines) + M-GPU-MOE-2.2 (full forward
  integration analog of forward_qwen3_moe_cuda) must both land before
  this test passes on hardware.
- On hardware with wgpu support, run with --include-ignored to
  exercise. PASS discharges FALSIFY-QW3-MOE-GPU-PARITY-001 for the
  wgpu backend (cuda backend discharged by sibling test).
DEPENDS ON: PR #1485 (v1.2.0 amendment + M-GPU-MOE-2.0 stub). Branch is
stacked on the v1.2.0 contract branch; once #1485 lands on main, this
PR's base flips to main automatically.
Refs: M52, M53, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.3 +
FALSIFY-QW3-MOE-GPU-PARITY-001 (wgpu).
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
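The stub-plus-signature-check pattern described for M-GPU-MOE-2.0 could look roughly like the sketch below. All names here (`ServeError`, `CudaModel`, `WgpuModel`, `ForwardFn`) are hypothetical simplifications; the real method takes more parameters and returns the crate's own error type.

```rust
/// Hypothetical error type, loosely modelled on the UnsupportedOperation
/// variant quoted in this thread.
#[derive(Debug, PartialEq)]
enum ServeError {
    UnsupportedOperation { operation: String, reason: String },
    InvalidInput(String),
}

struct CudaModel;
struct WgpuModel;

impl CudaModel {
    /// Stand-in for the implemented sibling; returns dummy logits.
    fn forward(&mut self, tokens: &[u32]) -> Result<Vec<f32>, ServeError> {
        if tokens.is_empty() {
            return Err(ServeError::InvalidInput("empty prompt".into()));
        }
        Ok(vec![0.0; 4])
    }
}

impl WgpuModel {
    /// Stub: validate preconditions like the sibling, then refuse.
    fn forward(&mut self, tokens: &[u32]) -> Result<Vec<f32>, ServeError> {
        if tokens.is_empty() {
            return Err(ServeError::InvalidInput("empty prompt".into()));
        }
        Err(ServeError::UnsupportedOperation {
            operation: "forward_wgpu".into(),
            reason: "stub: full integration not yet implemented".into(),
        })
    }
}

/// Shared shape both backends must satisfy.
type ForwardFn<M> = fn(&mut M, &[u32]) -> Result<Vec<f32>, ServeError>;

// Compile-time drift check: if either signature changes, these
// fn-item-to-fn-pointer coercions stop compiling.
const _CUDA_SIG: ForwardFn<CudaModel> = CudaModel::forward;
const _WGPU_SIG: ForwardFn<WgpuModel> = WgpuModel::forward;

fn main() {
    let mut wgpu = WgpuModel;
    match wgpu.forward(&[1, 2, 3]) {
        Err(e) => println!("stub returned: {e:?}"),
        Ok(logits) => println!("{} logits", logits.len()),
    }
}
```

Because the check is a coercion rather than a call, it needs no model file and no GPU, which fits the "lib-only" test described above. It verifies only the signature, not the behaviour.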
noahgift added a commit that referenced this pull request on May 4, 2026
…d-bug fix plan
Live-dogfood finding 2026-05-04 on lambda-vector RTX 4090: the
M-GPU-MOE-1.2 heavy `qwen3_moe_gpu_parity` test (FALSIFY-QW3-MOE-
GPU-PARITY-001) cannot run on the cached 17.3 GB Qwen3-Coder GGUF
because `OwnedQuantizedModelCuda::new` itself fails:
UnsupportedOperation { operation: "preload_weights_gpu",
reason: "PAR-043: Failed to build indexed weights:
Invalid launch config: Quantized weight
'blk.0.ffn_gate.weight' not cached" }
ROOT CAUSE (5-whys in evidence file):
`executor.build_indexed_weights` at
`crates/aprender-serve/src/cuda/executor/weights.rs:325-373`
unconditionally requires `blk.{i}.ffn_gate.weight`,
`.ffn_up.weight`, `.ffn_down.weight` to be cached for every
layer. For MoE these names DO NOT EXIST — MoE has 128 expert
gates per layer (`blk.{i}.ffn_gate_exps.weight`) loaded into
the `moe_layers` parameter at forward-time.
M-GPU-MOE-1.1.2 (PR #1477)'s forward body sidesteps the indexed
weights for FFN, but the wrapper construction goes through
`preload_weights_gpu` BEFORE forward is ever called. Wrapper
construction fails first.
WHY DEFAULT CI DIDN'T CATCH IT:
Lib-only stub test (PR #1464) only checks signature at compile
time. Heavy `qwen3_moe_gpu_parity.rs` (PR #1484) is `#[ignore]`d
+ needs RTX 4090 + 17.3 GB GGUF. First `--include-ignored`
dogfood on lambda-vector found this 2026-05-04.
THIS PR ADDS:
(1) Evidence file
`evidence/m-gpu-moe-1-2-blocked-by-preload-bug-2026-05-04/findings.md`
documenting the live failure + 5-whys + fix architecture.
(2) Contract `qwen3-moe-forward-gpu-v1` v1.2.0 → v1.3.0:
* New v1.3.0 amendment_history block (~110 lines) describing
the bug, root cause, and three-step fix architecture
* New implementation_stage `M-GPU-MOE-1.3` between 1.2 and 2
with status PENDING
* New falsification_test FALSIFY-QW3-MOE-GPU-PRELOAD-001
(hardware test + lib-only sibling)
* Top-level version "1.2.0" → "1.3.0"
* Status comment expanded to mention M-GPU-MOE-1.3 as a
precondition for ACTIVE_ALGORITHM_LEVEL flip
VALIDATION: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
→ 0 errors, 0 warnings. Contract is valid.
WHAT THIS PR DOES NOT DO:
Does NOT implement the fix. Per CLAUDE.md "NEVER write code
before writing a provable contract", this PR pins the contract
first. The fix lands in a separate PR (M-GPU-MOE-1.3 stage):
~30 LOC in weights.rs + 1-2 callers + ArchConstraints field +
drift-prevention test.
Does NOT block PR #1485's already-shipped 3-commit cascade
(M52/M54). The cascade is correct; M-GPU-MOE-1.3 is a sibling
bug-fix.
Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0,
FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
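The root cause above suggests making the required-weight list depend on the architecture. The sketch below is one possible shape for that, not the planned fix itself: `Arch`, `required_ffn_weights`, and `check_cached` are hypothetical names, and only the `is_moe` flag and the tensor names come from this thread.

```rust
use std::collections::HashSet;

/// Hypothetical subset of the architecture constraints.
#[derive(Clone, Copy)]
struct Arch {
    num_layers: usize,
    is_moe: bool,
}

/// Dense FFN tensors that must be cached before building indexed weights.
/// For MoE the list is empty: expert tensors arrive at forward time instead.
fn required_ffn_weights(arch: Arch) -> Vec<String> {
    let mut names = Vec::new();
    if arch.is_moe {
        return names;
    }
    for i in 0..arch.num_layers {
        for part in ["ffn_gate", "ffn_up", "ffn_down"] {
            names.push(format!("blk.{i}.{part}.weight"));
        }
    }
    names
}

/// Fails with the first missing name, mirroring the error text in the log.
fn check_cached(arch: Arch, cached: &HashSet<String>) -> Result<(), String> {
    for name in required_ffn_weights(arch) {
        if !cached.contains(&name) {
            return Err(format!("Quantized weight '{name}' not cached"));
        }
    }
    Ok(())
}

fn main() {
    let cached = HashSet::new();
    let dense = Arch { num_layers: 2, is_moe: false };
    let moe = Arch { num_layers: 2, is_moe: true };
    println!("dense: {:?}", check_cached(dense, &cached));
    println!("moe:   {:?}", check_cached(moe, &cached));
}
```

A lib-only test over a function like `required_ffn_weights` would run in default CI without the 17.3 GB model, which is the gap the "WHY DEFAULT CI DIDN'T CATCH IT" section describes.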
noahgift added a commit that referenced this pull request on May 4, 2026
Followup to the previous M-GPU-MOE-1.3 commit. The parity_gate
(Jidoka stop-the-line in `OwnedQuantizedModelCuda::with_max_seq_len`)
also runs the dense forward paths
(`forward_single_with_cache` CPU + `forward_gpu_resident` GPU) on
construction. For MoE these dispatch to `fused_matmul_f32` against
the `dense_ffn_placeholder` (byte_size=0), causing rayon-parallel
panics in `matmul_fused.rs:211`.
Fix: skip parity_gate when `arch.is_moe`, mirroring the rationale
already in v1.3.0's amendment_history block.
- The parity gate's purpose is "stop the line if GPU diverges
from CPU" — for dense models, it's load-time safety.
- For MoE, the equivalent gate is FALSIFY-QW3-MOE-GPU-PARITY-001
(qwen3_moe_gpu_parity.rs), which exercises the MoE-specific
forward paths and bypasses the dense path the gate runs.
- Net: MoE models lose load-time parity but gain
test-time parity via the qwen3_moe_gpu_parity test.
VERIFICATION ON LAMBDA-VECTOR RTX 4090:
Test progresses much further now:
BEFORE: panic at OwnedQuantizedModelCuda::new build_indexed_weights
(FALSIFY-QW3-MOE-GPU-PRELOAD-001 falsifier)
AFTER previous commit: panic at parity_gate matmul_fused.rs:211
(downstream bug — exposed but not yet fixed)
AFTER this commit: CPU forward succeeds, GPU forward executes,
then asserts at gpu_logits.iter().all(|v| v.is_finite())
because the GPU produces NaN/Inf logits.
Test output:
[GH-129] Early kernel preload: 49 modules compiled
[PMAT-082] cuBLASLt FP8 JIT warmed (2048x16x2048)
[PMAT-053] FP8 weight cache: 193 matrices cached (728.8 MB)
FALSIFY-QW3-MOE-GPU-PARITY-001: running GPU forward...
panicked at qwen3_moe_gpu_parity.rs:168:
all GPU logits must be finite (no NaN/Inf)
PARTIAL DISCHARGE:
FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction succeeds.
FALSIFY-QW3-MOE-GPU-INVARIANTS-001 — partial (output length OK
implicitly; finiteness FAILS).
FALSIFY-QW3-MOE-GPU-PARITY-001 — blocked by NaN/Inf bug.
NEW DOWNSTREAM BUG:
GPU forward (forward_qwen3_moe_cuda body, M-GPU-MOE-1.1.2 PR
#1477) produces NaN/Inf for at least the canonical 3-token
Qwen3-Coder prompt. This is the NEXT bug to investigate
(M-GPU-MOE-1.5 follow-up). Likely candidates:
- Q4K matmul accumulator overflow in expert_swiglu_cuda
- Per-expert SwiGLU silu activation produces Inf for large inputs
- Top-k router weight renormalization division by zero
- Missing per-head Q/K RMSNorm path for MoE (qk_norm tensors
loaded but not applied)
Bisection via `apr trace --json --payload` per the M32d Step 2
surface methodology (per qwen3-moe-forward-gpu-v1 v1.1.0
PARITY-001 if_fails).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
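A bisection over intermediate activations, plus a guard for the renormalization candidate, might be sketched as below. This does not reproduce `apr trace`; `first_non_finite`, `first_bad_stage`, and `renormalize` are hypothetical helpers, and the stage names in `main` are illustrative.

```rust
/// Index and value of the first NaN/Inf element, if any.
fn first_non_finite(v: &[f32]) -> Option<(usize, f32)> {
    v.iter().copied().enumerate().find(|(_, x)| !x.is_finite())
}

/// Given named activations in execution order, report the first stage
/// whose output contains a non-finite value.
fn first_bad_stage<'a>(stages: &[(&'a str, Vec<f32>)]) -> Option<(&'a str, usize, f32)> {
    stages
        .iter()
        .find_map(|(name, v)| first_non_finite(v).map(|(i, x)| (*name, i, x)))
}

/// Renormalise top-k router weights, refusing a zero or non-finite sum
/// instead of silently producing NaN via 0/0.
fn renormalize(weights: &mut [f32]) -> Result<(), String> {
    let sum: f32 = weights.iter().sum();
    if !sum.is_finite() || sum <= 0.0 {
        return Err(format!("router weight sum is {sum}"));
    }
    weights.iter_mut().for_each(|w| *w /= sum);
    Ok(())
}

fn main() {
    let stages = vec![
        ("attn_out", vec![0.1, -0.2]),
        ("moe_ffn", vec![0.3, f32::NAN]),
        ("lm_head", vec![f32::INFINITY, 1.0]),
    ];
    if let Some((name, i, x)) = first_bad_stage(&stages) {
        println!("first non-finite value {x} at {name}[{i}]");
    }
    let mut w = [0.0f32, 0.0];
    println!("{:?}", renormalize(&mut w));
}
```

Reporting the earliest bad stage matters because NaN propagates: once one layer's FFN output is non-finite, every later stage is too, so the final logits say nothing about where the fault began.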
noahgift added a commit that referenced this pull request on May 4, 2026
…d-bug fix plan (#1490)
noahgift added a commit that referenced this pull request on May 4, 2026
Followup to the previous M-GPU-MOE-1.3 commit. The parity_gate
(Jidoka stop-the-line in `OwnedQuantizedModelCuda::with_max_seq_len`)
also runs the dense forward paths
(`forward_single_with_cache` CPU + `forward_gpu_resident` GPU) on
construction. For MoE these dispatch to `fused_matmul_f32` against
the `dense_ffn_placeholder` (byte_size=0), causing rayon-parallel
panics in `matmul_fused.rs:211`.
Fix: skip parity_gate when `arch.is_moe`, mirroring the rationale
already in v1.3.0's amendment_history block.
- The parity gate's purpose is "stop the line if GPU diverges
from CPU" — for dense models, it's load-time safety.
- For MoE, the equivalent gate is FALSIFY-QW3-MOE-GPU-PARITY-001
(qwen3_moe_gpu_parity.rs), which exercises the MoE-specific
forward paths and bypasses the dense path the gate runs.
- Net: MoE models lose the load-time parity check; parity is
instead verified at test time by the qwen3_moe_gpu_parity test.
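The skip can be pictured with a small sketch. Everything here is a stand-in: `Arch`, `dense_probe`, and `maybe_run_parity_gate` are hypothetical names, and only the `is_moe` flag and the zero-length placeholder come from the commit text. The probe indexes into the weight buffer the way a dense matmul would, so calling it on an empty placeholder would panic.

```rust
/// Minimal stand-in for the architecture flags.
/// (Hypothetical type; only `is_moe` is taken from the source.)
#[derive(Clone, Copy)]
struct Arch {
    is_moe: bool,
}

/// Stand-in for a dense forward pass over FFN weight bytes. It indexes
/// into the buffer, so an empty placeholder would panic here.
fn dense_probe(ffn_bytes: &[u8]) -> u8 {
    ffn_bytes[0]
}

/// Runs the load-time gate only for dense models; returns whether it ran.
fn maybe_run_parity_gate(arch: Arch, ffn_bytes: &[u8]) -> bool {
    if arch.is_moe {
        // MoE: the dense FFN placeholder has byte_size = 0, so the
        // dense probe must not be reached.
        return false;
    }
    let _ = dense_probe(ffn_bytes);
    true
}
```

Returning early keeps the dense path untouched for dense models, at the cost of moving MoE parity checking out of load time.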
VERIFICATION ON LAMBDA-VECTOR RTX 4090:
Test progresses much further now:
BEFORE: panic at OwnedQuantizedModelCuda::new build_indexed_weights
(FALSIFY-QW3-MOE-GPU-PRELOAD-001 falsifier)
AFTER previous commit: panic at parity_gate matmul_fused.rs:211
(downstream bug — exposed but not yet fixed)
AFTER this commit: CPU forward succeeds, GPU forward executes,
then asserts at gpu_logits.iter().all(|v| v.is_finite())
because the GPU produces NaN/Inf logits.
Test output:
[GH-129] Early kernel preload: 49 modules compiled
[PMAT-082] cuBLASLt FP8 JIT warmed (2048x16x2048)
[PMAT-053] FP8 weight cache: 193 matrices cached (728.8 MB)
FALSIFY-QW3-MOE-GPU-PARITY-001: running GPU forward...
panicked at qwen3_moe_gpu_parity.rs:168:
all GPU logits must be finite (no NaN/Inf)
PARTIAL DISCHARGE:
FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction succeeds.
FALSIFY-QW3-MOE-GPU-INVARIANTS-001 — partial (output length OK
implicitly; finiteness FAILS).
FALSIFY-QW3-MOE-GPU-PARITY-001 — blocked by NaN/Inf bug.
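The finiteness assertion only says that some logit is bad. A slightly more informative check, sketched here with a hypothetical helper `first_non_finite`, reports where the first NaN/Inf appears, which can help when narrowing down the failing position.

```rust
/// Index and value of the first non-finite logit, or `None` if all
/// logits are finite. (Hypothetical helper, not part of the test file.)
fn first_non_finite(logits: &[f32]) -> Option<(usize, f32)> {
    logits
        .iter()
        .copied()
        .enumerate()
        .find(|(_, v)| !v.is_finite())
}
```

`f32::is_finite` returns false for both NaN and the infinities, so one predicate covers both failure modes named in the panic message.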
NEW DOWNSTREAM BUG:
GPU forward (forward_qwen3_moe_cuda body, M-GPU-MOE-1.1.2 PR
#1477) produces NaN/Inf for at least the canonical 3-token
Qwen3-Coder prompt. This is the NEXT bug to investigate
(M-GPU-MOE-1.5 follow-up). Likely candidates:
- Q4K matmul accumulator overflow in expert_swiglu_cuda
- Per-expert SwiGLU silu activation produces Inf for large inputs
- Top-k router weight renormalization division by zero
- Missing per-head Q/K RMSNorm path for MoE (qk_norm tensors
loaded but not applied)
Bisect via `apr trace --json --payload`, following the M32d
Step 2 surface methodology (see the PARITY-001 if_fails entry in
qwen3-moe-forward-gpu-v1 v1.1.0).
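One of the candidates, division by zero during top-k weight renormalization, is easy to reproduce in isolation. The sketch below is illustrative only: `renormalize_naive` and `renormalize_guarded` are hypothetical functions, and the uniform fallback is an assumption about one possible guard, not what the router does.

```rust
/// Naive renormalization: divides each weight by the sum.
/// A zero sum produces 0.0 / 0.0 = NaN for every entry.
fn renormalize_naive(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    weights.iter().map(|w| w / sum).collect()
}

/// Guarded variant: falls back to uniform weights when the sum is
/// zero, negative, or non-finite. (The fallback choice is an assumption.)
fn renormalize_guarded(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    if sum.is_finite() && sum > 0.0 {
        weights.iter().map(|w| w / sum).collect()
    } else {
        vec![1.0 / weights.len() as f32; weights.len()]
    }
}
```

If the selected router weights underflow to zero, the naive form turns every expert contribution into NaN, and that NaN then propagates through the residual stream to the logits.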
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 4, 2026
…partial discharge) (#1491)
* feat(aprender-serve): M-GPU-MOE-1.3 — preload_weights_gpu MoE-aware (partial discharge)
Per qwen3-moe-forward-gpu-v1 v1.3.0 amendment (PR #1490).
WHAT THIS PR FIXES:
ArchConstraints + build_indexed_weights + ValidatedLayerWeights
all made MoE-aware via new `is_moe: bool` field on ArchConstraints.
(1) `crates/aprender-serve/src/gguf/config.rs` — adds
`is_moe: bool` field to `ArchConstraints` struct.
(2) `crates/aprender-serve/src/gguf/arch_constraints_fallback.rs` —
sets `is_moe: false` on all 19 dense arch entries; sets
`is_moe: true` on the qwen3_moe arm. Also adds the raw GGUF arch
string `qwen3moe` (no underscore) and `qwen3_5moe` to the same
arm — these reach `from_architecture` from
`ValidatedModelConfig::from_apr` without going through
`normalize_architecture`.
(3) `crates/aprender-serve/src/cuda/executor/weights.rs` —
`build_indexed_weights` gates the 3 FFN-related quant lookups
(ffn_gate.weight, ffn_up.weight, ffn_down.weight) on
`arch.is_moe`; uses (0u64, 0usize) sentinels for MoE. Same
gating for the 3 qtype resolutions.
(4) `crates/aprender-serve/src/cuda/types.rs` —
`ValidatedLayerWeights::validate` skips the FfnGate/FfnUp/FfnDown
role checks when `arch.is_moe`. The MoE forward path
(`forward_qwen3_moe_cuda`) routes FFN through `moe_layers`
parameter, never reading these from the indexed weights.
WHAT THIS PR PARTIALLY DISCHARGES:
FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new in v1.3.0) — wrapper
construction now succeeds for qwen3_moe GGUFs. Before this PR,
`OwnedQuantizedModelCuda::new(model, 0)` panicked at:
UnsupportedOperation { operation: "preload_weights_gpu",
reason: "PAR-043: Failed to build indexed weights: Invalid
launch config: Quantized weight 'blk.0.ffn_gate.weight'
not cached" }
After this PR, that specific path no longer fails. Verified by
re-running M-GPU-MOE-1.2 heavy test — it now progresses past
`OwnedQuantizedModelCuda::new`.
NEW DOWNSTREAM BUG (not blocking this PR):
After the wrapper construction fix, the heavy test now panics in
CPU forward `matmul_fused.rs:211` with `index out of bounds: the
len is 0 but the index is N`. This is a separate bug class:
someone in the CPU forward path is dereferencing
`layer.ffn_up_weight.data` (or similar) which is the
`dense_ffn_placeholder` (byte_size=0) for MoE layers per
`transformer.rs:348-353`.
Root cause likely: the CPU `forward_qwen3_moe` does NOT touch the
dense placeholders directly, but some preload/validation/init
step does. Needs a follow-up PR (M-GPU-MOE-1.4) to either
(a) skip dense-FFN-data access for MoE layers, or
(b) replace the placeholder with proper sentinel.
This PR DOES NOT regress the previous behaviour: the previous
state was "wrapper construction fails", which masked the
downstream bug. M-GPU-MOE-1.4 will surface and fix it.
VERIFICATION:
cargo check -p aprender-serve → 0 errors
cargo check -p aprender-serve --features cuda → 0 errors
cargo test -p aprender-serve --test qwen3_moe_gpu_parity \
--features cuda → 3 helpers pass
Heavy test on lambda-vector RTX 4090:
BEFORE this PR: panic at OwnedQuantizedModelCuda::new
(preload_weights_gpu / build_indexed_weights)
AFTER this PR: panic moved to CPU forward matmul_fused.rs:211
(downstream bug, separate PR scope)
Net: progress one bug class. M-GPU-MOE-1.3 stage is FUNCTIONALLY
DISCHARGED as defined; M-GPU-MOE-1.4 follow-up needed for full
PARITY-001 discharge.
NOTE ON PR STACKING:
This PR depends on PR #1490 (contract v1.2.0 → v1.3.0 amendment +
evidence file) being on aprender main first. The contract pinned
the architectural decision; this PR implements it.
Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0,
FALSIFY-QW3-MOE-GPU-PRELOAD-001 (partial discharge)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(aprender-serve): M-GPU-MOE-1.3 — also skip parity_gate for MoE
Followup to the previous M-GPU-MOE-1.3 commit.
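The `is_moe` gating described in item (3) of the #1491 commit message can be sketched as follows. This is a simplified model, not the code in `weights.rs`: `ffn_weight_ref` is a hypothetical helper, the cache is a plain `HashMap`, and reading the `(0u64, 0usize)` sentinel as a (device pointer, byte length) pair is an assumption.

```rust
use std::collections::HashMap;

/// Assumed meaning: (device pointer, byte length). `(0, 0)` is the
/// sentinel the commit message describes for MoE layers.
type WeightRef = (u64, usize);

/// Looks up a dense FFN weight, or returns the sentinel when the
/// architecture is MoE. (Hypothetical helper and cache model.)
fn ffn_weight_ref(
    cache: &HashMap<String, WeightRef>,
    name: &str,
    is_moe: bool,
) -> Result<WeightRef, String> {
    if is_moe {
        // MoE layers carry no dense FFN tensors; skip the lookup.
        return Ok((0u64, 0usize));
    }
    cache
        .get(name)
        .copied()
        .ok_or_else(|| format!("Quantized weight '{name}' not cached"))
}
```

Dense models still get the hard error when a tensor is missing, so the gate does not weaken validation for them.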
Summary
Signature changes
Per qwen3-moe-forward-gpu-v1 v1.1.0 option D
Extends OwnedQuantizedModelCuda's CPU-attention + CUDA-FFN pattern. The actual GPU compute happens at the per-expert SwiGLU dispatch (q4k_matvec × 2 + q6k_gemv per top-k expert per token).
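For orientation, the per-expert computation can be written as an unquantized f32 reference. This is a sketch under assumptions, not the `expert_swiglu_cuda` helper: `matvec`, `silu`, and `expert_swiglu_ref` are hypothetical names, weights are plain row-major `f32` slices rather than Q4K/Q6K blocks, and mapping the two `q4k_matvec` calls to the gate and up projections and `q6k_gemv` to the down projection is inferred rather than stated in the source.

```rust
/// Row-major matrix-vector product; `w` holds `rows * x.len()` entries.
fn matvec(w: &[f32], rows: usize, x: &[f32]) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(w.len(), rows * cols);
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// SiLU activation: v * sigmoid(v).
fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// One expert: down * (silu(gate * x) elementwise-times (up * x)).
/// `gate` and `up` are `hidden x dim`; `down` is `dim x hidden`.
fn expert_swiglu_ref(
    gate: &[f32],
    up: &[f32],
    down: &[f32],
    hidden: usize,
    x: &[f32],
) -> Vec<f32> {
    let g = matvec(gate, hidden, x);
    let u = matvec(up, hidden, x);
    let act: Vec<f32> = g.iter().zip(&u).map(|(g, u)| silu(*g) * u).collect();
    matvec(down, x.len(), &act)
}
```

A reference like this is mainly useful for comparing intermediate values (gate, up, activation, down) against the GPU path one stage at a time.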
Test plan
🤖 Generated with Claude Code