feat(aprender-serve): forward_qwen3_moe_gpu M-GPU-MOE-1.0 stub #1460
Merged
First sub-stage of M-GPU-MOE-1 per qwen3-moe-forward-gpu-v1 v1.0.0 DRAFT
(landed 2026-05-04 squash cf08e91, M-GPU-MOE-0). Mirrors the
M32a → M32b → M32c.* CPU staging pattern.

What this PR ships
==================

crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_gpu.rs (NEW)

  pub fn OwnedQuantizedModel::forward_qwen3_moe_gpu(
      token_ids, moe_layers, num_experts, num_experts_per_tok,
      moe_intermediate, data,
  ) -> Result<Vec<f32>>

Behavior at M-GPU-MOE-1.0:
  1. Validate preconditions (token_ids non-empty, moe_layers length
     matches, num_experts/num_experts_per_tok/moe_intermediate > 0,
     num_experts_per_tok ≤ num_experts): SAME boundary as the CPU
     sibling forward_qwen3_moe.
  2. Return RealizarError::UnsupportedOperation {
       operation: "forward_qwen3_moe_gpu",
       reason: <points at qwen3-moe-forward-gpu-v1.yaml + lists pending
                stages M-GPU-MOE-1.1+ + tells caller to use
                forward_qwen3_moe (CPU LAZY-FUSED-MATVEC) for now>
     }
     Same precedent as M32b's RealizarError::UnsupportedOperation {
     operation: "moe_forward_dispatch" } from the CPU sibling staging.

+ 1 unit test (compilation gate on signature drift)
+ module wired in mod.rs

Why a stub is useful (even though it doesn't compute)
======================================================

  1. Establishes the function signature that downstream callers
     (run_qwen3_moe_generate_gpu, apr run --backend cuda) will use, so
     plumbing PRs can land in parallel with the kernel PR.
  2. Returns a structured error that names the contract, so any caller
     hitting it gets a precise pointer to the open work (mirror of
     M32b's discharge of FALSIFY-QW3-MOE-FORWARD-002).
  3. Pins the contract's M-GPU-MOE-1 stage status from PENDING to
     PARTIAL_ALGORITHM_LEVEL: the function exists, it just doesn't
     compute anything yet.

Staging plan (in the contract's implementation_stages)
=======================================================

  M-GPU-MOE-0    Contract scaffold                         SHIPPED ✓ (cf08e91)
  M-GPU-MOE-1.0  This stub                                 SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.1  Per-expert dispatch via existing dense    PENDING
                 GPU primitives (Q4_K cuBLAS for gate/up,
                 Q6_K cuBLAS for down)
  M-GPU-MOE-1.2  Cosine-vs-CPU parity gate ≥0.99           PENDING
                 (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2    wgpu fallback                             PENDING
  M-GPU-MOE-3    Fused dequant+matmul + sparse expert      PENDING
                 batching → ≥150 tok/s on RTX 4090

Verification
============

  $ cargo check -p aprender-serve --features cuda
    ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_gpu
    test ... ok. 1 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
    0 error(s), 0 warning(s)
    Contract is valid.

Refs claude-code-parity-apr POC M49 (P0 elevation)
Refs claude-code-parity-apr POC M50 (M-GPU-MOE-0 SHIPPED)
Refs qwen3-moe-forward-gpu-v1 v1.0.0 DRAFT (kernel contract)
Refs M32b precedent (CPU sibling staging: load-aware error → forward impl)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
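The "validate, then return a structured error" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `SketchError`, `validate_moe_preconditions`, `forward_stub`, and the `model_layer_count` parameter are hypothetical stand-ins, and only the `UnsupportedOperation { operation, reason }` shape and the list of checks come from the description.

```rust
/// Hypothetical stand-in for the crate's error type. Only the
/// `UnsupportedOperation { operation, reason }` shape is taken from the PR text.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    InvalidInput(String),
    UnsupportedOperation { operation: String, reason: String },
}

/// Checks the preconditions listed in the PR description.
/// `model_layer_count` is an assumed way of expressing "moe_layers length matches".
pub fn validate_moe_preconditions(
    token_ids: &[u32],
    moe_layer_count: usize,
    model_layer_count: usize,
    num_experts: usize,
    num_experts_per_tok: usize,
    moe_intermediate: usize,
) -> Result<(), SketchError> {
    if token_ids.is_empty() {
        return Err(SketchError::InvalidInput("token_ids must be non-empty".into()));
    }
    if moe_layer_count != model_layer_count {
        return Err(SketchError::InvalidInput("moe_layers length mismatch".into()));
    }
    if num_experts == 0 || num_experts_per_tok == 0 || moe_intermediate == 0 {
        return Err(SketchError::InvalidInput("expert dimensions must be > 0".into()));
    }
    if num_experts_per_tok > num_experts {
        return Err(SketchError::InvalidInput(
            "num_experts_per_tok must be <= num_experts".into(),
        ));
    }
    Ok(())
}

/// Stub forward: validates first, then reports that the GPU path is not implemented.
pub fn forward_stub(
    token_ids: &[u32],
    moe_layer_count: usize,
    model_layer_count: usize,
    num_experts: usize,
    num_experts_per_tok: usize,
    moe_intermediate: usize,
) -> Result<Vec<f32>, SketchError> {
    validate_moe_preconditions(
        token_ids,
        moe_layer_count,
        model_layer_count,
        num_experts,
        num_experts_per_tok,
        moe_intermediate,
    )?;
    Err(SketchError::UnsupportedOperation {
        operation: "forward_qwen3_moe_gpu".to_string(),
        // Illustrative wording only; the real reason string is not shown in the PR.
        reason: "see contracts/qwen3-moe-forward-gpu-v1.yaml; stages M-GPU-MOE-1.1+ \
                 are pending; use forward_qwen3_moe (CPU) for now"
            .to_string(),
    })
}
```

Validating before returning the "unsupported" error means a caller with bad arguments sees the same failure it would get from the CPU sibling, and only well-formed calls are told that the GPU path is pending.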
noahgift added a commit that referenced this pull request May 4, 2026
…ation architecture (#1462)

Records the architectural-seam decision that gates M-GPU-MOE-1.1
(per-expert CUDA dispatch). Mirrors the qwen3-moe-forward-v1 v1.2.0
amendment (M32c.2.2.2.1) which picked between three integration options
for the CPU path before any kernel work could land.

The v1.0.0 contract scaffold (M-GPU-MOE-0) was authored from outside the
code: it specified WHAT GPU MoE means but left WHERE in the type
hierarchy unspecified. The first-cut M-GPU-MOE-1.0 stub (PR #1460) made
an implicit choice (it placed the function on OwnedQuantizedModel) that
this amendment now overrides as wrong.

Four integration options enumerated
====================================

  (A) Add GPU state directly to OwnedQuantizedModel          REJECTED
      Invasive; touches every CPU-MoE call site.
  (B) Thread &HybridScheduler / &mut GpuModel into           REJECTED
      forward_qwen3_moe_gpu signature
      Breaks signature parity with CPU sibling; forces every caller
      to plumb scheduler state through.
  (C) Spawn transient GpuModel-like helper per call          REJECTED
      Resource thrash on every token; allocates GPU buffers in the
      hot path.
  (D) Mirror existing OwnedQuantizedModelCuda pattern        CHOSEN
      Add forward_qwen3_moe_cuda as a method on the existing CUDA
      wrapper type.

Why (D) is chosen
=================

  - OwnedQuantizedModelCuda already exists at
    crates/aprender-serve/src/gguf/cuda/mod.rs:106.
  - Wraps OwnedQuantizedModel + holds CudaExecutor + GPU buffers
    (embed_buf, prefix_cache).
  - Existing forward_cuda method (cuda.rs:18) already does "CPU
    attention + CUDA FFN matmul": the established pattern this contract
    should EXTEND, not invent a new substrate.
  - Pros: Zero new types; reuses CudaExecutor cache, memory-info
    tracking, prefix-cache; signature parity preserved (just on a
    different self type); follows the same precedent that made
    forward_cuda's incremental landing work.

Implementation stages updated
=============================

  M-GPU-MOE-0    Contract scaffold (v1.0.0)                SHIPPED ✓
  M-GPU-MOE-0.5  This decision amendment (v1.1.0)          SHIPPED (THIS PR)
  M-GPU-MOE-1.0  Stub on OwnedQuantizedModelCuda           PENDING
                 (relocates the wrong-type stub from #1460)
  M-GPU-MOE-1.1  Per-expert CUDA dispatch via              PENDING
                 self.executor (gemm_q4k for gate/up,
                 gemm_q6k for down)
  M-GPU-MOE-1.2  Cosine-vs-CPU parity gate ≥0.99           PENDING
                 (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2    wgpu fallback                             PENDING
  M-GPU-MOE-3    Throughput ≥150 tok/s + VRAM ≤ 95%        PENDING

Verification
============

  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
    0 error(s), 0 warning(s)
    Contract is valid.

Refs M32c.2.2.2.1 (CPU sibling integration-architecture amendment
     precedent in qwen3-moe-forward-v1 v1.2.0)
Refs PR #1460 (the v1.0.0-era M-GPU-MOE-1.0 stub on the wrong type;
     retired by this amendment)
Refs CLAUDE.md "NEVER write code before writing a provable contract"
Refs claude-code-parity-apr POC M49 (P0 elevation)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
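To make option (D) concrete, here is a rough sketch of the wrapper-type pattern: the GPU method lives on a wrapper that owns the CPU model plus an executor, so the CPU type is untouched and the argument list stays identical to the CPU sibling. All names here (`CpuModel`, `Executor`, `CudaModel`, `forward_moe`, `forward_moe_cuda`) are hypothetical stand-ins, not the crate's actual types, and the bodies are placeholders.

```rust
/// Hypothetical stand-in for the CPU-side model type.
pub struct CpuModel {
    pub vocab_size: usize,
}

impl CpuModel {
    /// CPU forward path; returns placeholder logits in this sketch.
    pub fn forward_moe(
        &self,
        token_ids: &[u32],
        num_experts: usize,
        num_experts_per_tok: usize,
    ) -> Result<Vec<f32>, String> {
        check_args(token_ids, num_experts, num_experts_per_tok)?;
        Ok(vec![0.0; self.vocab_size])
    }
}

/// Hypothetical stand-in for a GPU executor handle.
pub struct Executor {
    pub device_name: String,
}

/// Wrapper in the spirit of option (D): owns the CPU model and the GPU state.
pub struct CudaModel {
    model: CpuModel,
    executor: Executor,
}

impl CudaModel {
    pub fn new(model: CpuModel, executor: Executor) -> Self {
        Self { model, executor }
    }

    /// Access to the wrapped CPU model, e.g. for the CPU fallback path.
    pub fn model(&self) -> &CpuModel {
        &self.model
    }

    /// Same argument list as `CpuModel::forward_moe`; only the `self` type differs.
    pub fn forward_moe_cuda(
        &self,
        token_ids: &[u32],
        num_experts: usize,
        num_experts_per_tok: usize,
    ) -> Result<Vec<f32>, String> {
        check_args(token_ids, num_experts, num_experts_per_tok)?;
        Err(format!(
            "forward_moe_cuda is not implemented yet (device: {})",
            self.executor.device_name
        ))
    }
}

/// Shared argument checks so both paths reject the same inputs.
fn check_args(token_ids: &[u32], num_experts: usize, top_k: usize) -> Result<(), String> {
    if token_ids.is_empty() || top_k == 0 || top_k > num_experts {
        return Err("invalid arguments".to_string());
    }
    Ok(())
}
```

Because the wrapper holds the executor, no GPU state has to be threaded through the call signature (the concern with option B) and nothing is allocated per call (the concern with option C).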
noahgift added a commit that referenced this pull request May 4, 2026
…correct type (#1464)

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D amendment (PR #1462 squash
4495407), the GPU MoE forward path lives on OwnedQuantizedModelCuda, NOT
OwnedQuantizedModel. The first-cut M-GPU-MOE-1.0 stub from PR #1460
(4d9e5ae) was placed on the wrong type; this PR ships the redo on the
correct type.

What this PR ships
==================

crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda.rs NEW

  impl OwnedQuantizedModelCuda {
      pub fn forward_qwen3_moe_cuda(
          &self,
          token_ids: &[u32],
          moe_layers: &[Qwen3MoeQuantizedLayer],
          num_experts: usize,
          num_experts_per_tok: usize,
          moe_intermediate: usize,
          _data: &[u8],
      ) -> Result<Vec<f32>>
  }

Behavior at M-GPU-MOE-1.0-redo:
  1. Validate preconditions (token_ids non-empty, moe_layers length
     matches self.model.layers.len(), num_experts/num_experts_per_tok/
     moe_intermediate > 0, num_experts_per_tok ≤ num_experts).
  2. Return RealizarError::UnsupportedOperation pointing at
     qwen3-moe-forward-gpu-v1 v1.1.0 + listing pending stages
     M-GPU-MOE-1.1+.

+ 1 unit test (signature drift gate)
+ uses.rs gets `include!("forward_qwen3_moe_cuda.rs");`

Why on OwnedQuantizedModelCuda (not OwnedQuantizedModel)
=========================================================

Per the v1.1.0 amendment's option D decision: this method must extend
the existing OwnedQuantizedModelCuda CPU-attention + CUDA-FFN pattern
(forward_cuda in cuda.rs at line 18), not invent a new substrate.
OwnedQuantizedModelCuda already wraps OwnedQuantizedModel + holds
CudaExecutor + GPU buffers (embed_buf, prefix_cache).

Naming follows existing precedent: `forward_cuda` is the existing method
on this type, so `forward_qwen3_moe_cuda` slots in cleanly.

Wrong-type stub (#1460) status
==============================

The OwnedQuantizedModel::forward_qwen3_moe_gpu function from #1460
remains on main. It returns the same UnsupportedOperation but on the
wrong type. A separate cleanup PR can either delete it or update its
doc-comment to point at this new variant. Not blocking.

Implementation stages updated
=============================

  M-GPU-MOE-0         Contract scaffold v1.0.0             SHIPPED ✓
  M-GPU-MOE-0.5       v1.1.0 option D amendment            SHIPPED ✓
  M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda      SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.1       Per-expert CUDA dispatch via         PENDING
                      self.executor (gemm_q4k for gate/up,
                      gemm_q6k for down)
  M-GPU-MOE-1.2       Cosine-vs-CPU parity gate ≥0.99      PENDING
  M-GPU-MOE-2         wgpu fallback                        PENDING
  M-GPU-MOE-3         Throughput ≥150 + VRAM ≤ 95%         PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
    ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
    test ... ok. 1 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
    0 error(s), 0 warning(s)
    Contract is valid.

Refs PR #1460 squash 4d9e5ae (wrong-type stub on OwnedQuantizedModel)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 (P0 elevation)
Refs claude-code-parity-apr POC M50 (M-GPU-MOE-0 SHIPPED)
Refs M32b precedent (CPU sibling staging: stub → forward impl)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
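One way a "signature drift gate" like the unit test mentioned above can work is by coercing the method to an explicitly typed fn pointer: if any parameter or the return type changes, the coercion stops compiling. The sketch below shows that technique under assumed names; `Model`, `Layer`, `Logits`, `forward_stub`, and `signature_gate` are hypothetical and the actual test in the PR is not shown in this thread.

```rust
/// Hypothetical stand-ins for the wrapper model and the per-layer MoE weights.
pub struct Model;
pub struct Layer;

/// Assumed result alias for this sketch (the real code returns Result<Vec<f32>>).
pub type Logits = Result<Vec<f32>, String>;

impl Model {
    /// Stub with the same parameter order as the signature quoted in the PR text.
    pub fn forward_stub(
        &self,
        token_ids: &[u32],
        moe_layers: &[Layer],
        num_experts: usize,
        num_experts_per_tok: usize,
        moe_intermediate: usize,
        _data: &[u8],
    ) -> Logits {
        // The stub ignores its inputs; bind them so the sketch compiles without warnings.
        let _ = (token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate);
        Err("forward_stub: not implemented".to_string())
    }
}

/// The expected signature, spelled out once as a fn pointer type.
pub type ForwardFn = fn(&Model, &[u32], &[Layer], usize, usize, usize, &[u8]) -> Logits;

/// Compiles only while `Model::forward_stub` still matches `ForwardFn`.
/// Changing a parameter type, the arity, or the return type breaks this coercion.
pub fn signature_gate() -> ForwardFn {
    Model::forward_stub
}
```

The gate needs no model file or GPU: the check happens at compile time, and the returned pointer can still be called if a runtime assertion on the stub's error is wanted.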
noahgift added a commit that referenced this pull request May 4, 2026
…QuantizedModelWgpu)

Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the CUDA
substrate before M-GPU-MOE-1.0 implementation; this one pins the wgpu
substrate before any wgpu code lands.

Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484). Choosing
the wgpu seam early prevents the wrong-type-stub waste that bit
M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redid it on
OwnedQuantizedModelCuda, option D).

FOUR options considered:
  (I)   OwnedQuantizedModelWgpu wrapper type                  CHOSEN
        (analog of v1.1.0 option D)
  (II)  GpuExecutor trait abstracting CUDA + wgpu             REJECTED (over-engineered)
  (III) Backend enum inside renamed OwnedQuantizedModelGpu    REJECTED (invasive)
  (IV)  Defer wgpu indefinitely                               REJECTED (violates CLAUDE.md
                                                              backend-agnostic mandate)

Option I picks wgpu by code-path symmetry, not by trait abstraction: new
file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. Maintenance-mode
reviewer can verify a parity bug by diff, not by elaborate test
infrastructure.

M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
  M-GPU-MOE-2.0  stub on OwnedQuantizedModelWgpu
  M-GPU-MOE-2.1  per-expert wgpu dispatch helpers
                 (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2  full forward integration (replaces 2.0 stub body)
  M-GPU-MOE-2.3  cosine-vs-CPU parity test on hardware with wgpu

Two new blockers documented:
  - wgpu adapter selection probe for non-NVIDIA hardware
  - trueno-gpu Q6_K QuantizeKernel coverage check before 2.1

Companion-spec records this as M52 (no companion contract bump).

Validation: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s). Contract is valid.

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…arity test (#1485)

* contract(qwen3-moe-forward-gpu-v1): v1.1.0 → v1.2.0, option I (OwnedQuantizedModelWgpu)

Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the CUDA
substrate before M-GPU-MOE-1.0 implementation; this one pins the wgpu
substrate before any wgpu code lands.

Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484). Choosing
the wgpu seam early prevents the wrong-type-stub waste that bit
M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redid it on
OwnedQuantizedModelCuda, option D).

FOUR options considered:
  (I)   OwnedQuantizedModelWgpu wrapper type                  CHOSEN
        (analog of v1.1.0 option D)
  (II)  GpuExecutor trait abstracting CUDA + wgpu             REJECTED (over-engineered)
  (III) Backend enum inside renamed OwnedQuantizedModelGpu    REJECTED (invasive)
  (IV)  Defer wgpu indefinitely                               REJECTED (violates CLAUDE.md
                                                              backend-agnostic mandate)

Option I picks wgpu by code-path symmetry, not by trait abstraction: new
file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. Maintenance-mode
reviewer can verify a parity bug by diff, not by elaborate test
infrastructure.

M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
  M-GPU-MOE-2.0  stub on OwnedQuantizedModelWgpu
  M-GPU-MOE-2.1  per-expert wgpu dispatch helpers
                 (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2  full forward integration (replaces 2.0 stub body)
  M-GPU-MOE-2.3  cosine-vs-CPU parity test on hardware with wgpu

Two new blockers documented:
  - wgpu adapter selection probe for non-NVIDIA hardware
  - trueno-gpu Q6_K QuantizeKernel coverage check before 2.1

Companion-spec records this as M52 (no companion contract bump).

Validation: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s). Contract is valid.

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): OwnedQuantizedModelWgpu stub, M-GPU-MOE-2.0 (#1487)

Implements M-GPU-MOE-2.0 per qwen3-moe-forward-gpu-v1 v1.2.0 option I
(see PR #1485 amendment). Analog of M-GPU-MOE-1.0-redo (PR #1464) for
the wgpu backend.

WHAT THIS PR ADDS:
  * crates/aprender-serve/src/gguf/wgpu_backend/mod.rs: new module with
    OwnedQuantizedModelWgpu struct + new() + stub method
    forward_qwen3_moe_wgpu(). Mirrors cuda/mod.rs structure.
  * crates/aprender-serve/src/gguf/wgpu_model.rs: re-export shim
    `pub use super::wgpu_backend::OwnedQuantizedModelWgpu`. Mirrors
    cuda_model.rs.
  * crates/aprender-serve/src/gguf/mod.rs: adds the two new modules
    behind `#[cfg(feature = "gpu")]` (the existing wgpu feature flag,
    `gpu = ["trueno/gpu"]` per Cargo.toml line 208).

WHY MODULE NAMED `wgpu_backend`:
The Rust ecosystem already has a `wgpu` crate. A module named `wgpu`
inside the same crate would shadow it inside the file's body. The public
re-export still presents `OwnedQuantizedModelWgpu` (no ugly suffix)
thanks to wgpu_model.rs.

WHY THIS IS A STUB:
Same staging discipline as M-GPU-MOE-1.0-redo: contract first, scaffold
second, implementation third. The body of forward_qwen3_moe_wgpu
validates preconditions (mirroring the cuda sibling's boundary) then
returns RealizarError::UnsupportedOperation whose reason points at the
v1.2.0 amendment block for the M-GPU-MOE-2 staging plan. Until
M-GPU-MOE-2.2 lands, callers on non-CUDA hardware fall back to
OwnedQuantizedModel::forward_qwen3_moe (CPU LAZY-FUSED-MATVEC,
~30 tok/s).

VERIFICATION:
  cargo check -p aprender-serve                   → 0 errors (default)
  cargo check -p aprender-serve --features cuda   → 0 errors (cuda)
  cargo check -p aprender-serve --features gpu    → 0 errors (wgpu)
  cargo test -p aprender-serve --lib --features gpu \
    owned_quantized_model_wgpu_tests              → 1 passed

Lib unit test asserts the function signature exists and matches the cuda
sibling step-for-step (compile-time checks via fn pointer coercion; no
runtime model construction needed at the stub stage).

DEPENDS ON: PR #1485 (qwen3-moe-forward-gpu-v1 v1.2.0 option I
amendment). Branch is stacked on the v1.2.0 contract branch; once #1485
lands on main, this PR rebases onto main directly.

NEXT STAGES per v1.2.0:
  M-GPU-MOE-2.1  per-expert wgpu dispatch helpers
                 (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2  full forward integration mirror of cuda sibling
  M-GPU-MOE-2.3  cosine-vs-CPU parity test on wgpu hardware

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* test(aprender-serve): qwen3_moe_wgpu_parity, M-GPU-MOE-2.3 cosine ≥0.99 falsifier (wgpu) (#1488)

wgpu sibling of `qwen3_moe_gpu_parity.rs` (M-GPU-MOE-1.2, PR #1484).
Asserts cosine ≥ 0.99 between APR's CPU `forward_qwen3_moe` reference
and the wgpu `OwnedQuantizedModelWgpu::forward_qwen3_moe_wgpu`
integration on the same prompt.

Same falsifier ID as the cuda sibling (FALSIFY-QW3-MOE-GPU-PARITY-001):
wgpu is a SECOND backend implementing the same contract gate, not a
different gate. Same threshold (≥ 0.99), same canonical 17.3 GB
Qwen3-Coder GGUF, same 3-token canonical prompt as the cuda test.

CI WIRING:
  - #[cfg(feature = "gpu")] gates the file (matches the gate on
    OwnedQuantizedModelWgpu in gguf/mod.rs)
  - #[ignore] on the heavy test (CI default skips; explicit
    `--include-ignored` runs it on a wgpu-capable adapter: Apple
    Silicon Metal, AMD Vulkan, Intel ARC Vulkan)
  - 2 helper unit tests (cosine_similarity sanity coverage) DO run by
    default

WHEN THE TEST PASSES:
  - M-GPU-MOE-2.0 stub returns UnsupportedOperation, so this test
    currently panics at the wgpu forward call (correct behaviour for a
    falsifier against an incomplete impl).
  - M-GPU-MOE-2.1 (per-expert wgpu helpers via trueno-gpu
    QuantizeKernel + GemmKernel compute pipelines) + M-GPU-MOE-2.2
    (full forward integration analog of forward_qwen3_moe_cuda) must
    both land before this test passes on hardware.
  - On hardware with wgpu support, run with --include-ignored to
    exercise. PASS discharges FALSIFY-QW3-MOE-GPU-PARITY-001 for the
    wgpu backend (cuda backend discharged by sibling test).

DEPENDS ON: PR #1485 (v1.2.0 amendment + M-GPU-MOE-2.0 stub). Branch is
stacked on the v1.2.0 contract branch; once #1485 lands on main, this
PR's base flips to main automatically.

Refs: M52, M53, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.3 +
FALSIFY-QW3-MOE-GPU-PARITY-001 (wgpu).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
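The cosine ≥ 0.99 parity gate described above compares two logit vectors produced for the same prompt. A minimal sketch of such a check follows; `cosine_similarity` is named in the commit message, but its actual signature and edge-case handling are not shown, so the `Option` return and the `passes_parity_gate` helper are assumptions made for illustration.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None for empty input, mismatched lengths, or a zero-norm vector
/// (assumed handling; the real helper's behaviour is not shown in the thread).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    // Accumulate in f64 to limit rounding error over long logit vectors.
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

/// True when the backend output is within the parity threshold of the CPU reference.
/// An undefined cosine (None) counts as a failure.
pub fn passes_parity_gate(cpu_logits: &[f32], backend_logits: &[f32], threshold: f32) -> bool {
    matches!(cosine_similarity(cpu_logits, backend_logits), Some(c) if c >= threshold)
}
```

Treating an undefined cosine as a failed gate keeps the falsifier strict: an all-zero or wrongly sized output from a backend cannot pass by accident.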
Summary
`OwnedQuantizedModel::forward_qwen3_moe_gpu` function in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_gpu.rs`
Staging context
This is M-GPU-MOE-1.0 — first sub-stage of M-GPU-MOE-1 per the contract scaffold landed in #1453 (squash `cf08e910f`).
Why this is P0 (companion POC M49)
CPU LAZY-FUSED-MATVEC: ~30 tok/s. Dense GPU Q4_K: 225-440 tok/s on RTX 4090. MoE inference is therefore ~10× slower than dense, which makes the default model, Qwen3-Coder-30B-A3B-Instruct-Q4_K_M, production-infeasible at ~30 tok/s.
Test plan
🤖 Generated with Claude Code