feat(aprender-serve): forward_qwen3_moe_cuda — M-GPU-MOE-1.0-redo on correct type by noahgift · Pull Request #1464 · paiml/aprender

noahgift · 2026-05-04T09:52:06Z

Summary

Adds the `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda` method, per the qwen3-moe-forward-gpu-v1 v1.1.0 option D amendment (PR #1462, the v1.0.0 → v1.1.0 contract bump, M-GPU-MOE-0.5).
Companion to the wrong-type stub from PR #1460 (`forward_qwen3_moe_gpu`, M-GPU-MOE-1.0): places the GPU MoE function on the correct type.
Same precondition validation and same `UnsupportedOperation` stub semantics as #1460.
1 unit test (signature drift gate).

Why this PR

PR #1460 placed `forward_qwen3_moe_gpu` on `OwnedQuantizedModel` (the CPU-only type). Code archaeology after #1460 landed showed `OwnedQuantizedModelCuda` already exists with `forward_cuda` doing CPU-attention + CUDA-FFN — the established pattern this contract should extend. The v1.1.0 amendment (#1462) recorded option D as the architectural decision. This PR ships the implementation choice.

Test plan

`cargo check -p aprender-serve --features cuda`: compiles
`cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda`: passes
`pv validate contracts/qwen3-moe-forward-gpu-v1.yaml`: 0 error(s), 0 warning(s)
Follow-up (separate PR, M-GPU-MOE-1.1): per-expert CUDA dispatch via `self.executor`

🤖 Generated with Claude Code

…correct type

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D amendment (PR #1462 squash
4495407), the GPU MoE forward path lives on OwnedQuantizedModelCuda, NOT
OwnedQuantizedModel. The first-cut M-GPU-MOE-1.0 stub from PR #1460
(4d9e5ae) was placed on the wrong type; this PR ships the redo on the
correct type.

What this PR ships
==================

crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda.rs  NEW

    impl OwnedQuantizedModelCuda {
        pub fn forward_qwen3_moe_cuda(
            &self,
            token_ids: &[u32],
            moe_layers: &[Qwen3MoeQuantizedLayer],
            num_experts: usize,
            num_experts_per_tok: usize,
            moe_intermediate: usize,
            _data: &[u8],
        ) -> Result<Vec<f32>>
    }

Behavior at M-GPU-MOE-1.0-redo:
  1. Validate preconditions (token_ids non-empty, moe_layers length
     matches self.model.layers.len(), num_experts/num_experts_per_tok/
     moe_intermediate > 0, num_experts_per_tok ≤ num_experts).
  2. Return RealizarError::UnsupportedOperation pointing at
     qwen3-moe-forward-gpu-v1 v1.1.0 + listing pending stages
     M-GPU-MOE-1.1+.

+ 1 unit test (signature drift gate)
+ uses.rs gets `include!("forward_qwen3_moe_cuda.rs");`

Why on OwnedQuantizedModelCuda (not OwnedQuantizedModel)
=========================================================

Per the v1.1.0 amendment's option D decision: this method must extend the
existing OwnedQuantizedModelCuda CPU-attention + CUDA-FFN pattern
(forward_cuda in cuda.rs at line 18), not invent a new substrate.
OwnedQuantizedModelCuda already wraps OwnedQuantizedModel + holds
CudaExecutor + GPU buffers (embed_buf, prefix_cache).

Naming follows existing precedent: `forward_cuda` is the existing method
on this type, so `forward_qwen3_moe_cuda` slots in cleanly.

Wrong-type stub (#1460) status
==============================

The OwnedQuantizedModel::forward_qwen3_moe_gpu function from #1460 remains
on main. It returns the same UnsupportedOperation but on the wrong type.
A separate cleanup PR can either delete it or update its doc-comment to
point at this new variant. Not blocking.

Implementation stages updated
=============================

  M-GPU-MOE-0         Contract scaffold v1.0.0              SHIPPED ✓
  M-GPU-MOE-0.5       v1.1.0 option D amendment             SHIPPED ✓
  M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda       SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.1       Per-expert CUDA dispatch via          PENDING
                      self.executor (gemm_q4k for gate/up,
                      gemm_q6k for down)
  M-GPU-MOE-1.2       Cosine-vs-CPU parity gate ≥0.99       PENDING
  M-GPU-MOE-2         wgpu fallback                         PENDING
  M-GPU-MOE-3         Throughput ≥150 + VRAM ≤ 95%          PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles

  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
  test ... ok. 1 passed; 0 failed

  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs PR #1460 squash 4d9e5ae (wrong-type stub on OwnedQuantizedModel)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 (P0 elevation)
Refs claude-code-parity-apr POC M50 (M-GPU-MOE-0 SHIPPED)
Refs M32b precedent (CPU sibling staging: stub → forward impl)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
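
The stub behaviour described in this commit (validate the preconditions, then return an unsupported-operation error) can be sketched in isolation. This is a minimal illustration, not the code from the PR: `StubError`, `forward_stub`, and `signature_drift_gate` are hypothetical names, the layer counts are passed as plain integers instead of real model types, and the actual method returns `RealizarError::UnsupportedOperation`, whose fields are not shown here.

```rust
/// Hypothetical stand-in for the crate's error type (assumption: the real
/// stub uses `RealizarError`, whose exact shape is not shown in this PR).
#[derive(Debug, PartialEq)]
enum StubError {
    InvalidShape(String),
    UnsupportedOperation(String),
}

/// Checks the preconditions listed in the commit message, then reports
/// that the real forward pass has not landed yet.
fn forward_stub(
    token_ids: &[u32],
    moe_layer_count: usize,
    model_layer_count: usize,
    num_experts: usize,
    num_experts_per_tok: usize,
    moe_intermediate: usize,
) -> Result<Vec<f32>, StubError> {
    if token_ids.is_empty() {
        return Err(StubError::InvalidShape("token_ids must be non-empty".into()));
    }
    if moe_layer_count != model_layer_count {
        return Err(StubError::InvalidShape(format!(
            "moe_layers has {moe_layer_count} entries, model has {model_layer_count} layers"
        )));
    }
    if num_experts == 0 || num_experts_per_tok == 0 || moe_intermediate == 0 {
        return Err(StubError::InvalidShape("MoE dimensions must be > 0".into()));
    }
    if num_experts_per_tok > num_experts {
        return Err(StubError::InvalidShape(
            "num_experts_per_tok must not exceed num_experts".into(),
        ));
    }
    // All inputs are well-formed; the compute path is simply not there yet.
    Err(StubError::UnsupportedOperation(
        "GPU MoE forward pending later stages".into(),
    ))
}

/// One way to write a signature drift gate: the fn-pointer coercion below
/// stops compiling if the parameter list or return type ever changes.
fn signature_drift_gate() {
    let _gate: fn(&[u32], usize, usize, usize, usize, usize) -> Result<Vec<f32>, StubError> =
        forward_stub;
}
```

Because the gate is checked by the compiler, it needs no model file or GPU; that is presumably why a stub-stage PR can ship it as its only unit test.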

…1469)

* feat(aprender-serve): expert_swiglu_cuda - M-GPU-MOE-1.1.0 per-expert GPU helper

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: first concrete GPU compute
for the contract. Mirrors the M32c.2.2.* CPU staging where per-expert byte
slicer + per-expert SwiGLU helper landed BEFORE the full
moe_ffn_forward_layer integration.

What this PR ships
==================

crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs  NEW

    pub(crate) fn expert_swiglu_cuda(
        executor: &mut crate::cuda::CudaExecutor,
        gate_bytes: &[u8],   // Q4_K, [intermediate, hidden_dim]
        up_bytes: &[u8],     // Q4_K, [intermediate, hidden_dim]
        down_bytes: &[u8],   // Q6_K, [hidden_dim, intermediate]
        hidden: &[f32],
        hidden_dim: usize,
        intermediate: usize,
    ) -> Result<Vec<f32>>

Body (mirrors CPU sibling moe_ffn_forward_layer per-expert loop):
  1. gate_out = executor.q4k_matvec(gate_bytes, hidden, ...,
                                    m=intermediate, k=hidden_dim)
  2. up_out = executor.q4k_matvec(up_bytes, hidden, ...,
                                  m=intermediate, k=hidden_dim)
  3. ffn_inner[i] = silu(gate_out[i]) * up_out[i]   (CPU element-wise)
  4. expert_out = executor.q6k_gemv(down_bytes, ffn_inner, ...,
                                    n=hidden_dim, k=intermediate)

+ 2 unit tests (signature drift gate + InvalidShape rejection)

Why "naive per-expert dispatch" is the M-GPU-MOE-1.1.0 baseline
===============================================================

The fused dequant+matmul + sparse expert batching path is M-GPU-MOE-3.
The contract (qwen3-moe-forward-gpu-v1 implementation_stages) stages
correctness before performance:

  M-GPU-MOE-1.1.0 (this)  Per-expert via existing primitives    SHIPPED ✓
                            - silu via CPU elementwise (small)
                            - element-wise gate*up via CPU
                            - matmuls via existing q4k/q6k GPU kernels
  M-GPU-MOE-1.1.1         Full forward integration in           PENDING
                          OwnedQuantizedModelCuda::forward_qwen3_moe_cuda
                          (router + per-token loop + per-expert
                          dispatch + weighted aggregation)
  M-GPU-MOE-1.2           Cosine-vs-CPU parity gate ≥0.99       PENDING
                          (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2             wgpu fallback                         PENDING
  M-GPU-MOE-3             Fused kernels + sparse batching       PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles

  $ cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda
  test ... ok. 2 passed; 0 failed

  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)

Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs PR #1464 (M-GPU-MOE-1.0-redo stub on OwnedQuantizedModelCuda)
Refs M32c.2.2.0 + M32c.2.2.1 (CPU per-expert sub-milestone precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation + risk row)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): moe_ffn_forward_layer_cuda - M-GPU-MOE-1.1.1 single-layer GPU MoE FFN

Mirrors CPU sibling moe_ffn_forward_layer (qwen3_moe_load.rs:363)
step-for-step: F32 router on CPU, softmax + top-k + renormalize on CPU,
per-expert SwiGLU dispatched through expert_swiglu_cuda (M-GPU-MOE-1.1.0),
weighted aggregation on CPU.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: GPU MoE forward path on
OwnedQuantizedModelCuda, reusing existing CudaExecutor primitives
(q4k_matvec for gate/up, q6k_gemv for down) per expert.

Composes the M-GPU-MOE-1.1.0 helper into the layer-level structure that
the next stage M-GPU-MOE-1.1.2 (forward_qwen3_moe_cuda full integration)
will call once per token per layer.

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles

Refs PR #1465 (expert_swiglu_cuda M-GPU-MOE-1.1.0)
Refs M32c.2.2.2.0 (CPU sibling moe_ffn_forward_layer precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
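
The single-layer MoE FFN described above (router, softmax + top-k + renormalize, per-expert SwiGLU, weighted aggregation) can be written as a small CPU reference. This is a sketch under stated assumptions: it uses dense `f32` weights instead of Q4_K/Q6_K bytes, and `DenseExpert`, `matvec`, `route_top_k`, and `moe_ffn` are hypothetical names, not the crate's API.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Dense row-major matrix-vector product; `w` is [rows, cols].
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| w[r * cols..(r + 1) * cols].iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Hypothetical dense expert: gate/up are [intermediate, hidden_dim],
/// down is [hidden_dim, intermediate] (same shapes as the quantized tensors).
struct DenseExpert {
    gate: Vec<f32>,
    up: Vec<f32>,
    down: Vec<f32>,
}

/// Per-expert SwiGLU: down * (silu(gate * h) ⊙ (up * h)).
fn expert_swiglu(e: &DenseExpert, hidden: &[f32], hidden_dim: usize, intermediate: usize) -> Vec<f32> {
    let gate_out = matvec(&e.gate, hidden, intermediate, hidden_dim);
    let up_out = matvec(&e.up, hidden, intermediate, hidden_dim);
    let inner: Vec<f32> = gate_out.iter().zip(&up_out).map(|(g, u)| silu(*g) * u).collect();
    matvec(&e.down, &inner, hidden_dim, intermediate)
}

/// Softmax over router logits, keep the k largest, renormalize to sum to 1.
fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|e| e / sum).enumerate().collect();
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);
    let top_sum: f32 = probs.iter().map(|p| p.1).sum();
    probs.iter().map(|&(i, p)| (i, p / top_sum)).collect()
}

/// One token through one MoE FFN layer; `router` is [num_experts, hidden_dim].
fn moe_ffn(
    experts: &[DenseExpert],
    router: &[f32],
    hidden: &[f32],
    hidden_dim: usize,
    intermediate: usize,
    k: usize,
) -> Vec<f32> {
    let logits = matvec(router, hidden, experts.len(), hidden_dim);
    let mut out = vec![0.0f32; hidden_dim];
    for (idx, weight) in route_top_k(&logits, k) {
        let y = expert_swiglu(&experts[idx], hidden, hidden_dim, intermediate);
        for (o, v) in out.iter_mut().zip(&y) {
            *o += weight * v; // weighted aggregation
        }
    }
    out
}
```

In the GPU variant described by the commit, only the three `matvec` calls inside the per-expert helper would move to the executor; routing and aggregation stay on the CPU.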

…-MOE-1.1.2 (#1477)

Replaces the M-GPU-MOE-1.0-redo stub body with the full forward
integration. forward_qwen3_moe_cuda now mirrors the CPU sibling
OwnedQuantizedModel::forward_qwen3_moe (forward_qwen3_moe.rs)
line-for-line, with one difference: the per-layer FFN section routes
through moe_ffn_forward_layer_cuda, which dispatches per-expert matmuls to
self.executor (CudaExecutor) via the expert_swiglu_cuda helper.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: extends the existing
OwnedQuantizedModelCuda CPU-attention + CUDA-FFN pattern (forward_cuda in
cuda.rs:18). Attention path stays on CPU; only FFN matmuls go to GPU.
M-GPU-MOE-3 fuses dispatch into a single sparse-expert kernel for ~5×
throughput.

Signature changes
=================

  - &self → &mut self  (executor needs mutable for kernel cache)
  - _data → data       (passed to moe_ffn_forward_layer_cuda for
                        expert_byte_slice)

Forward body structure (mirrors CPU sibling step-for-step):

  1. Embed (CPU): self.model.embed
  2. Per-layer:
     2a. Attention norm (CPU): ops::rms_norm
     2b. QKV projection (CPU): self.model.qkv_matmul
     2c. Per-head Q/K RMSNorm + RoPE (M32d Step 5/5b):
         ops::apply_per_head_rms_norm
     2d. Causal attention + output proj (CPU): self.model.causal_attention
     2e. Residual: element-wise CPU
     2f. Pre-FFN norm (CPU): ops::rms_norm
     2g. **MoE FFN on GPU**: moe_ffn_forward_layer_cuda
           → expert_swiglu_cuda
           → self.executor.q4k_matvec / .q6k_gemv
     2h. Residual: element-wise CPU
  3. Final norm (CPU)
  4. LM head, last token (CPU)

Implementation stages updated
=============================

  M-GPU-MOE-0         Contract scaffold v1.0.0              SHIPPED ✓
  M-GPU-MOE-0.5       v1.1.0 option D amendment             SHIPPED ✓
  M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda       SHIPPED ✓ (#1464)
  M-GPU-MOE-1.1.0     expert_swiglu_cuda helper             SHIPPED ✓ (via #1469 squash)
  M-GPU-MOE-1.1.1     moe_ffn_forward_layer_cuda            SHIPPED ✓ (#1469)
  M-GPU-MOE-1.1.2     forward_qwen3_moe_cuda full integ     SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.2       Cosine-vs-CPU parity gate ≥0.99       PENDING
                      (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2         wgpu fallback                         PENDING
  M-GPU-MOE-3         Throughput ≥150 + VRAM ≤ 95%          PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles

  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
  test ... ok. 1 passed

Refs PR #1469 squash 77b9f0d (helpers landed)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
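
The `&self → &mut self` change noted above follows from Rust's borrowing rules: if the executor populates a kernel cache lazily, every call that may compile a kernel needs exclusive access, and that requirement propagates to the forward method. The sketch below illustrates only that point; `ToyExecutor` and `ToyModel` are hypothetical stand-ins and do not reflect how `CudaExecutor` is actually implemented.

```rust
use std::collections::HashMap;

/// Hypothetical executor that "compiles" a kernel on first use and caches it.
struct ToyExecutor {
    kernel_cache: HashMap<&'static str, u32>,
    compiles: u32,
}

impl ToyExecutor {
    fn new() -> Self {
        Self { kernel_cache: HashMap::new(), compiles: 0 }
    }

    /// Needs `&mut self`: a cache miss inserts into `kernel_cache`.
    fn matvec(&mut self, kernel: &'static str, w: &[f32], x: &[f32], rows: usize) -> Vec<f32> {
        let compiles = &mut self.compiles;
        self.kernel_cache.entry(kernel).or_insert_with(|| {
            *compiles += 1; // pretend this is an expensive compile
            *compiles
        });
        let cols = x.len();
        assert_eq!(w.len(), rows * cols);
        (0..rows)
            .map(|r| w[r * cols..(r + 1) * cols].iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
            .collect()
    }
}

/// Hypothetical wrapper owning the executor, like the CUDA model wrapper.
struct ToyModel {
    executor: ToyExecutor,
}

impl ToyModel {
    /// `&mut self` is forced by the executor call inside the layer loop.
    fn forward(&mut self, hidden: &[f32], layers: &[Vec<f32>]) -> Vec<f32> {
        let mut h = hidden.to_vec();
        for w in layers {
            let ffn = self.executor.matvec("ffn_matvec", w, &h, h.len());
            for (a, b) in h.iter_mut().zip(&ffn) {
                *a += b; // residual connection
            }
        }
        h
    }
}
```

An alternative would be interior mutability (for example a `RefCell` or `Mutex` around the cache), which keeps `&self` at the cost of runtime borrow checks; the commit message indicates the PR chose the plain `&mut self` route.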

…arity test (#1485) * contract(qwen3-moe-forward-gpu-v1): v1.1.0 → v1.2.0 — option I (OwnedQuantizedModelWgpu) Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu fallback). Mirrors the v1.1.0 option D amendment that pinned the CUDA substrate before M-GPU-MOE-1.0 implementation; this one pins the wgpu substrate before any wgpu code lands. Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED, 1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484). Choosing the wgpu seam early prevents the wrong-type-stub waste that bit M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on OwnedQuantizedModel; one cycle later #1464 redo'd it on OwnedQuantizedModelCuda — option D). FOUR options considered: (I) OwnedQuantizedModelWgpu wrapper type (analog of v1.1.0 option D) — CHOSEN (II) GpuExecutor trait abstracting CUDA + wgpu — REJECTED (over-engineered) (III) Backend enum inside renamed OwnedQuantizedModelGpu — REJECTED (invasive) (IV) Defer wgpu indefinitely — REJECTED (violates CLAUDE.md backend-agnostic mandate) Option I picks wgpu by code-path symmetry, not by trait abstraction: new file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors `crates/aprender-serve/src/gguf/cuda/` line-for-line. Maintenance-mode reviewer can verify a parity bug by diff, not by elaborate test infrastructure. M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x: M-GPU-MOE-2.0 stub on OwnedQuantizedModelWgpu M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu) M-GPU-MOE-2.2 full forward integration (replaces 2.0 stub body) M-GPU-MOE-2.3 cosine-vs-CPU parity test on hardware with wgpu Two new blockers documented: - wgpu adapter selection probe for non-NVIDIA hardware - trueno-gpu Q6_K QuantizeKernel coverage check before 2.1 Companion-spec records this as M52 (no companion contract bump). Validation: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml → 0 error(s), 0 warning(s). Contract is valid. 
Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): OwnedQuantizedModelWgpu stub — M-GPU-MOE-2.0 (#1487) Implements M-GPU-MOE-2.0 per qwen3-moe-forward-gpu-v1 v1.2.0 option I (see PR #1485 amendment). Analog of M-GPU-MOE-1.0-redo (PR #1464) for the wgpu backend. WHAT THIS PR ADDS: * crates/aprender-serve/src/gguf/wgpu_backend/mod.rs — new module with OwnedQuantizedModelWgpu struct + new() + stub method forward_qwen3_moe_wgpu(). Mirrors cuda/mod.rs structure. * crates/aprender-serve/src/gguf/wgpu_model.rs — re-export shim `pub use super::wgpu_backend::OwnedQuantizedModelWgpu`. Mirrors cuda_model.rs. * crates/aprender-serve/src/gguf/mod.rs — adds the two new modules behind `#[cfg(feature = \"gpu\")]` (the existing wgpu feature flag — `gpu = [\"trueno/gpu\"]` per Cargo.toml line 208). WHY MODULE NAMED `wgpu_backend`: The Rust ecosystem already has a `wgpu` crate. A module named `wgpu` inside the same crate would shadow it inside the file's body. The public re-export still presents `OwnedQuantizedModelWgpu` (no ugly suffix) thanks to wgpu_model.rs. WHY THIS IS A STUB: Same staging discipline as M-GPU-MOE-1.0-redo — contract first, scaffold second, implementation third. The body of forward_qwen3_moe_wgpu validates preconditions (mirroring the cuda sibling's boundary) then returns RealizarError::UnsupportedOperation whose reason points at the v1.2.0 amendment block for the M-GPU-MOE-2 staging plan. Until M-GPU-MOE-2.2 lands, callers on non-CUDA hardware fall back to OwnedQuantizedModel::forward_qwen3_moe (CPU LAZY-FUSED-MATVEC, ~30 tok/s). 
VERIFICATION: cargo check -p aprender-serve → 0 errors (default) cargo check -p aprender-serve --features cuda → 0 errors (cuda) cargo check -p aprender-serve --features gpu → 0 errors (wgpu) cargo test -p aprender-serve --lib --features gpu \ owned_quantized_model_wgpu_tests → 1 passed Lib unit test asserts the function signature exists and matches the cuda sibling step-for-step (compile-time checks via fn pointer coercion — no runtime model construction needed at the stub stage). DEPENDS ON: PR #1485 (qwen3-moe-forward-gpu-v1 v1.2.0 option I amendment). Branch is stacked on the v1.2.0 contract branch; once #1485 lands on main, this PR rebases onto main directly. NEXT STAGES per v1.2.0: M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu) M-GPU-MOE-2.2 full forward integration mirror of cuda sibling M-GPU-MOE-2.3 cosine-vs-CPU parity test on wgpu hardware Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.0. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * test(aprender-serve): qwen3_moe_wgpu_parity — M-GPU-MOE-2.3 cosine ≥0.99 falsifier (wgpu) (#1488) wgpu sibling of `qwen3_moe_gpu_parity.rs` (M-GPU-MOE-1.2, PR #1484). Asserts cosine ≥ 0.99 between APR's CPU `forward_qwen3_moe` reference and the wgpu `OwnedQuantizedModelWgpu::forward_qwen3_moe_wgpu` integration on the same prompt. Same falsifier ID as the cuda sibling (FALSIFY-QW3-MOE-GPU-PARITY-001) — wgpu is a SECOND backend implementing the same contract gate, not a different gate. Same threshold (≥ 0.99), same canonical 17.3 GB Qwen3-Coder GGUF, same 3-token canonical prompt as the cuda test. 
CI WIRING:

- #[cfg(feature = "gpu")] gates the file (matches the gate on OwnedQuantizedModelWgpu in gguf/mod.rs)
- #[ignore] on the heavy test (CI default skips; explicit `--include-ignored` runs it on a wgpu-capable adapter — Apple Silicon Metal, AMD Vulkan, Intel ARC Vulkan)
- 2 helper unit tests (cosine_similarity sanity coverage) DO run by default

WHEN THE TEST PASSES:

- M-GPU-MOE-2.0 stub returns UnsupportedOperation, so this test currently panics at the wgpu forward call (correct behaviour for a falsifier against an incomplete impl).
- M-GPU-MOE-2.1 (per-expert wgpu helpers via trueno-gpu QuantizeKernel + GemmKernel compute pipelines) + M-GPU-MOE-2.2 (full forward integration analog of forward_qwen3_moe_cuda) must both land before this test passes on hardware.
- On hardware with wgpu support, run with --include-ignored to exercise. PASS discharges FALSIFY-QW3-MOE-GPU-PARITY-001 for the wgpu backend (cuda backend discharged by sibling test).

DEPENDS ON:

PR #1485 (v1.2.0 amendment + M-GPU-MOE-2.0 stub). Branch is stacked on the v1.2.0 contract branch; once #1485 lands on main, this PR's base flips to main automatically.

Refs: M52, M53, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.3 + FALSIFY-QW3-MOE-GPU-PARITY-001 (wgpu).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
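The lib-only signature check mentioned in the commits above ("compile-time checks via fn pointer coercion") can be sketched as follows. This is an illustration under stated assumptions, not the test from the PR: `CudaSketch`, `WgpuSketch`, `LayerSketch`, `ErrSketch`, and the `ForwardSig` alias are hypothetical, and the stub bodies exist only so the example compiles. The parameter list matches the cuda signature quoted in the PR description.

```rust
/// Hypothetical stand-ins for the layer, error, and two model wrapper types.
struct LayerSketch;
#[derive(Debug)]
struct ErrSketch;
struct CudaSketch;
struct WgpuSketch;

impl CudaSketch {
    fn forward_qwen3_moe_cuda(
        &self, token_ids: &[u32], moe_layers: &[LayerSketch], num_experts: usize,
        num_experts_per_tok: usize, moe_intermediate: usize, _data: &[u8],
    ) -> Result<Vec<f32>, ErrSketch> {
        let _ = (token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate);
        Err(ErrSketch) // stub stage: always refuses
    }
}

impl WgpuSketch {
    fn forward_qwen3_moe_wgpu(
        &self, token_ids: &[u32], moe_layers: &[LayerSketch], num_experts: usize,
        num_experts_per_tok: usize, moe_intermediate: usize, _data: &[u8],
    ) -> Result<Vec<f32>, ErrSketch> {
        let _ = (token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate);
        Err(ErrSketch) // stub stage: always refuses
    }
}

/// The shape both backends must share, generic over the receiver type only.
type ForwardSig<M> =
    fn(&M, &[u32], &[LayerSketch], usize, usize, usize, &[u8]) -> Result<Vec<f32>, ErrSketch>;

/// Compiles only if each method coerces to the shared fn pointer type, so any
/// change to a parameter or return type on either side is a build error.
fn signature_drift_gate() {
    let _cuda: ForwardSig<CudaSketch> = CudaSketch::forward_qwen3_moe_cuda;
    let _wgpu: ForwardSig<WgpuSketch> = WgpuSketch::forward_qwen3_moe_wgpu;
}
```

Because the coercion is checked by the compiler, no model or GPU adapter has to be constructed, which is why a gate of this kind can run in default CI.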

…d-bug fix plan (#1490)

Live-dogfood finding 2026-05-04 on lambda-vector RTX 4090: the M-GPU-MOE-1.2 heavy `qwen3_moe_gpu_parity` test (FALSIFY-QW3-MOE-GPU-PARITY-001) cannot run on the cached 17.3 GB Qwen3-Coder GGUF because `OwnedQuantizedModelCuda::new` itself fails:

UnsupportedOperation { operation: "preload_weights_gpu", reason: "PAR-043: Failed to build indexed weights: Invalid launch config: Quantized weight 'blk.0.ffn_gate.weight' not cached" }

ROOT CAUSE (5-whys in evidence file):

`executor.build_indexed_weights` at `crates/aprender-serve/src/cuda/executor/weights.rs:325-373` unconditionally requires `blk.{i}.ffn_gate.weight`, `.ffn_up.weight`, `.ffn_down.weight` to be cached for every layer. For MoE these names DO NOT EXIST — MoE has 128 expert gates per layer (`blk.{i}.ffn_gate_exps.weight`) loaded into the `moe_layers` parameter at forward-time. M-GPU-MOE-1.1.2 (PR #1477)'s forward body sidesteps the indexed weights for FFN, but the wrapper construction goes through `preload_weights_gpu` BEFORE forward is ever called. Wrapper construction fails first.

WHY DEFAULT CI DIDN'T CATCH IT:

Lib-only stub test (PR #1464) only checks signature at compile time. Heavy `qwen3_moe_gpu_parity.rs` (PR #1484) is `#[ignore]`d + needs RTX 4090 + 17.3 GB GGUF. First `--include-ignored` dogfood on lambda-vector found this 2026-05-04.

THIS PR ADDS:

(1) Evidence file `evidence/m-gpu-moe-1-2-blocked-by-preload-bug-2026-05-04/findings.md` documenting the live failure + 5-whys + fix architecture.
(2) Contract `qwen3-moe-forward-gpu-v1` v1.2.0 → v1.3.0:
    * New v1.3.0 amendment_history block (~110 lines) describing the bug, root cause, and three-step fix architecture
    * New implementation_stage `M-GPU-MOE-1.3` between 1.2 and 2 with status PENDING
    * New falsification_test FALSIFY-QW3-MOE-GPU-PRELOAD-001 (hardware test + lib-only sibling)
    * Top-level version "1.2.0" → "1.3.0"
    * Status comment expanded to mention M-GPU-MOE-1.3 as a precondition for ACTIVE_ALGORITHM_LEVEL flip

VALIDATION:

pv validate contracts/qwen3-moe-forward-gpu-v1.yaml → 0 errors, 0 warnings. Contract is valid.

WHAT THIS PR DOES NOT DO:

Does NOT implement the fix. Per CLAUDE.md "NEVER write code before writing a provable contract", this PR pins the contract first. The fix lands in a separate PR (M-GPU-MOE-1.3 stage): ~30 LOC in weights.rs + 1-2 callers + ArchConstraints field + drift-prevention test.

Does NOT block PR #1485's already-shipped 3-commit cascade (M52/M54). The cascade is correct; M-GPU-MOE-1.3 is a sibling bug-fix.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0, FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
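To make the preload failure mode concrete, the sketch below reproduces it in miniature. It is not the code in `weights.rs` and not the planned M-GPU-MOE-1.3 fix, whose details live in the contract: the function names and the `skip_dense_ffn` flag are hypothetical, standing in for whatever the `ArchConstraints` field turns out to be. Only the tensor name patterns come from the commit message.

```rust
use std::collections::HashSet;

/// Dense FFN tensor names that the preload step is described as requiring per layer.
fn dense_ffn_weight_names(layer: usize) -> Vec<String> {
    vec![
        format!("blk.{}.ffn_gate.weight", layer),
        format!("blk.{}.ffn_up.weight", layer),
        format!("blk.{}.ffn_down.weight", layer),
    ]
}

/// Returns the first required dense FFN weight that is absent from the cache.
/// `skip_dense_ffn` is a hypothetical flag: when set (MoE model), the dense
/// names are not required at all, since experts arrive at forward time.
fn first_missing_ffn_weight(
    cached: &HashSet<String>,
    num_layers: usize,
    skip_dense_ffn: bool,
) -> Option<String> {
    if skip_dense_ffn {
        return None;
    }
    (0..num_layers)
        .flat_map(dense_ffn_weight_names)
        .find(|name| !cached.contains(name))
}
```

With a cache holding only `blk.0.ffn_gate_exps.weight`, the unconditional path reports `blk.0.ffn_gate.weight` as missing, which is the name in the live error, while the flagged path reports nothing.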

noahgift enabled auto-merge (squash) May 4, 2026 09:52

Merge branch 'main' into feat/qwen3-moe-forward-cuda-m-stage-1-0-redo

7436847

noahgift merged commit 9c721ec into main May 4, 2026
10 checks passed

noahgift deleted the feat/qwen3-moe-forward-cuda-m-stage-1-0-redo branch May 4, 2026 12:23

noahgift mentioned this pull request May 4, 2026

docs(M51): M-GPU-MOE-1.0 → 1.1.1 cascade SHIPPED + 1.1.2 OPEN paiml/claude-code-parity-apr#39

Merged

4 tasks

This was referenced May 4, 2026

contract+feat+test: v1.2.0 wgpu cascade — option I + 2.0 stub + 2.3 parity test #1485

Merged

feat(aprender-serve): OwnedQuantizedModelWgpu stub — M-GPU-MOE-2.0 #1487

Merged

noahgift mentioned this pull request May 6, 2026

contract(qwen3-moe-forward-gpu-v1): v1.6.0 → v1.7.0 — DRAFT → ACTIVE_ALGORITHM_LEVEL #1530

Merged

3 tasks

noahgift merged 2 commits into main from feat/qwen3-moe-forward-cuda-m-stage-1-0-redo
