feat(aprender-serve): forward_qwen3_moe_gpu M-GPU-MOE-1.0 stub#1460

Merged
noahgift merged 2 commits into main from feat/qwen3-moe-forward-gpu-m-stage-1-0
May 4, 2026

@noahgift noahgift commented May 4, 2026

Summary

  • New OwnedQuantizedModel::forward_qwen3_moe_gpu function in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_gpu.rs`
  • Validates same preconditions as CPU sibling `forward_qwen3_moe` (token_ids non-empty, moe_layers length match, MoE config > 0, etc.)
  • Returns `RealizarError::UnsupportedOperation { operation: "forward_qwen3_moe_gpu", .. }` with a reason that points at the contract and lists the pending stages
  • Same M32b precedent (CPU sibling staged through this exact pattern)
  • 1 unit test (signature drift gate)

Staging context

This is M-GPU-MOE-1.0 — first sub-stage of M-GPU-MOE-1 per the contract scaffold landed in #1453 (squash `cf08e910f`).

  Stage                                      Status
  M-GPU-MOE-0    contract scaffold           SHIPPED ✓ (#1453)
  M-GPU-MOE-1.0  stub                        THIS PR
  M-GPU-MOE-1.1  per-expert GPU dispatch     PENDING
  M-GPU-MOE-1.2  cosine-vs-CPU parity ≥0.99  PENDING
  M-GPU-MOE-2    wgpu fallback               PENDING
  M-GPU-MOE-3    throughput ≥150 tok/s       PENDING

Why this is P0 (companion POC M49)

CPU LAZY-FUSED-MATVEC runs at ~30 tok/s; dense GPU Q4_K runs at 225-440 tok/s on an RTX 4090. MoE inference is therefore roughly 10× slower than dense, which makes the default model, Qwen3-Coder-30B-A3B-Instruct-Q4_K_M, infeasible for production at ~30 tok/s.

Test plan

  • `cargo check -p aprender-serve --features cuda`: compiles
  • `cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_gpu`: passes
  • `pv validate contracts/qwen3-moe-forward-gpu-v1.yaml`: 0 error(s), 0 warning(s)
  • Follow-up (separate PR): M-GPU-MOE-1.1 implements per-expert dispatch via existing dense GPU primitives

🤖 Generated with Claude Code

First sub-stage of M-GPU-MOE-1 per qwen3-moe-forward-gpu-v1 v1.0.0
DRAFT (landed 2026-05-04 squash cf08e91, M-GPU-MOE-0). Mirrors the
M32a → M32b → M32c.* CPU staging pattern.

What this PR ships
==================

  crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_gpu.rs (NEW)
    pub fn OwnedQuantizedModel::forward_qwen3_moe_gpu(
        token_ids,
        moe_layers,
        num_experts,
        num_experts_per_tok,
        moe_intermediate,
        data,
    ) -> Result<Vec<f32>>

  Behavior at M-GPU-MOE-1.0:
    1. Validate preconditions (token_ids non-empty, moe_layers length
       matches, num_experts/num_experts_per_tok/moe_intermediate > 0,
       num_experts_per_tok ≤ num_experts) — SAME boundary as the CPU
       sibling forward_qwen3_moe.
    2. Return RealizarError::UnsupportedOperation {
         operation: "forward_qwen3_moe_gpu",
         reason: <points at qwen3-moe-forward-gpu-v1.yaml + lists
                  pending stages M-GPU-MOE-1.1+ + tells caller to use
                  forward_qwen3_moe (CPU LAZY-FUSED-MATVEC) for now>
       }

  Same precedent as M32b's RealizarError::UnsupportedOperation {
  operation: "moe_forward_dispatch" } from the CPU sibling staging.

  + 1 unit test (compilation gate on signature drift)
  + module wired in mod.rs
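
A minimal sketch of the validate-then-refuse behavior described above. The
error enum, the `InvalidInput` variant, and the plain layer counts are
hypothetical stand-ins for illustration; the real RealizarError and layer
types live in aprender-serve and may differ.

```rust
/// Hypothetical stand-in for RealizarError (only what the sketch needs).
#[derive(Debug, PartialEq)]
pub enum StubError {
    /// Assumed variant for a failed precondition.
    InvalidInput(String),
    UnsupportedOperation { operation: String, reason: String },
}

/// Checks the listed preconditions, then returns the structured
/// "not implemented yet" error instead of computing logits.
pub fn forward_qwen3_moe_gpu_stub(
    token_ids: &[u32],
    moe_layer_count: usize,
    model_layer_count: usize,
    num_experts: usize,
    num_experts_per_tok: usize,
    moe_intermediate: usize,
) -> Result<Vec<f32>, StubError> {
    if token_ids.is_empty() {
        return Err(StubError::InvalidInput("token_ids must be non-empty".into()));
    }
    if moe_layer_count != model_layer_count {
        return Err(StubError::InvalidInput("moe_layers length mismatch".into()));
    }
    if num_experts == 0 || num_experts_per_tok == 0 || moe_intermediate == 0 {
        return Err(StubError::InvalidInput("MoE config values must be > 0".into()));
    }
    if num_experts_per_tok > num_experts {
        return Err(StubError::InvalidInput(
            "num_experts_per_tok must be <= num_experts".into(),
        ));
    }
    // All preconditions hold: name the contract and the CPU path to use meanwhile.
    Err(StubError::UnsupportedOperation {
        operation: "forward_qwen3_moe_gpu".into(),
        reason: "see contracts/qwen3-moe-forward-gpu-v1.yaml; stages M-GPU-MOE-1.1+ \
                 pending; use forward_qwen3_moe (CPU) for now"
            .into(),
    })
}
```

Validating before refusing means the stub already rejects malformed input the
same way the finished kernel is expected to, so callers see no boundary change
when the body is later replaced.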

Why a stub is useful (even though it doesn't compute)
======================================================

  1. Establishes the function signature that downstream callers
     (run_qwen3_moe_generate_gpu, apr run --backend cuda) will use,
     so plumbing PRs can land in parallel with the kernel PR.
  2. Returns a structured error that names the contract, so any
     caller hitting it gets a precise pointer to the open work
     (mirror of M32b's discharge of FALSIFY-QW3-MOE-FORWARD-002).
  3. Advances the contract's M-GPU-MOE-1 stage status from PENDING to
     PARTIAL_ALGORITHM_LEVEL: the function exists but does not
     compute anything yet.

Staging plan (in the contract's implementation_stages)
=======================================================

  M-GPU-MOE-0    Contract scaffold                       SHIPPED ✓ (cf08e91)
  M-GPU-MOE-1.0  This stub                               SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.1  Per-expert dispatch via existing dense   PENDING
                 GPU primitives (Q4_K cuBLAS for gate/up,
                 Q6_K cuBLAS for down)
  M-GPU-MOE-1.2  Cosine-vs-CPU parity gate ≥0.99         PENDING
                 (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2    wgpu fallback                            PENDING
  M-GPU-MOE-3    Fused dequant+matmul + sparse expert     PENDING
                 batching → ≥150 tok/s on RTX 4090
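
For orientation, a CPU reference sketch of what the per-expert dispatch in
M-GPU-MOE-1.1 has to compute for one token. It uses plain f32 row-major
weights instead of Q4_K/Q6_K blocks, and it assumes softmax routing with the
top-k weights renormalized to sum to 1; both are assumptions of this sketch,
not statements about the shipped kernels. All names are hypothetical.

```rust
/// SiLU activation: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Row-major matrix-vector product; `w` holds `rows` rows of `x.len()` columns.
fn matvec(w: &[f32], rows: usize, x: &[f32]) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(w.len(), rows * cols);
    (0..rows)
        .map(|r| w[r * cols..(r + 1) * cols].iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// One expert: gate and up are [inter x hidden], down is [hidden x inter].
#[derive(Clone)]
pub struct Expert {
    pub gate: Vec<f32>,
    pub up: Vec<f32>,
    pub down: Vec<f32>,
}

/// SwiGLU FFN for one expert: down(silu(gate(x)) * up(x)).
pub fn expert_swiglu(e: &Expert, x: &[f32], inter: usize) -> Vec<f32> {
    let g = matvec(&e.gate, inter, x);
    let u = matvec(&e.up, inter, x);
    let h: Vec<f32> = g.iter().zip(&u).map(|(g, u)| silu(*g) * u).collect();
    matvec(&e.down, x.len(), &h)
}

/// Routes one token to its top-k experts and sums their weighted outputs.
pub fn moe_ffn(experts: &[Expert], router_logits: &[f32], k: usize, x: &[f32], inter: usize) -> Vec<f32> {
    // Numerically stable softmax over the router logits.
    let max = router_logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut ranked: Vec<(usize, f32)> = exps.iter().map(|e| e / sum).enumerate().collect();
    // Keep the k most probable experts.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);
    // Renormalize the kept weights (assumption, see lead-in).
    let norm: f32 = ranked.iter().map(|(_, p)| p).sum();
    let mut out = vec![0.0f32; x.len()];
    for (idx, p) in ranked {
        let y = expert_swiglu(&experts[idx], x, inter);
        for (o, v) in out.iter_mut().zip(y) {
            *o += (p / norm) * v;
        }
    }
    out
}
```

Only `k` of the experts run per token, which is why the staging plan treats
per-expert dispatch and sparse expert batching as separate steps.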

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_gpu
  test ... ok. 1 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs claude-code-parity-apr POC M49 (P0 elevation)
Refs claude-code-parity-apr POC M50 (M-GPU-MOE-0 SHIPPED)
Refs qwen3-moe-forward-gpu-v1 v1.0.0 DRAFT (kernel contract)
Refs M32b precedent (CPU sibling staging: load-aware error → forward impl)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 4, 2026 07:44
@noahgift noahgift merged commit 4d9e5ae into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the feat/qwen3-moe-forward-gpu-m-stage-1-0 branch May 4, 2026 09:00
noahgift added a commit that referenced this pull request May 4, 2026
…ation architecture (#1462)

Records the architectural-seam decision that gates M-GPU-MOE-1.1 (per-
expert CUDA dispatch). Mirrors the qwen3-moe-forward-v1 v1.2.0
amendment (M32c.2.2.2.1) which picked between three integration
options for the CPU path before any kernel work could land.

The v1.0.0 contract scaffold (M-GPU-MOE-0) was authored from outside
the code: it specified WHAT GPU MoE means but left WHERE in the type
hierarchy unspecified. The first-cut M-GPU-MOE-1.0 stub (PR #1460)
made an implicit choice — placed the function on OwnedQuantizedModel —
that this amendment now overrides as wrong.

Four integration options enumerated
====================================

  (A) Add GPU state directly to OwnedQuantizedModel  REJECTED
      Invasive; touches every CPU-MoE call site.

  (B) Thread &HybridScheduler / &mut GpuModel into
      forward_qwen3_moe_gpu signature                REJECTED
      Breaks signature parity with CPU sibling; forces
      every caller to plumb scheduler state through.

  (C) Spawn transient GpuModel-like helper per call  REJECTED
      Resource thrash on every token; allocates GPU
      buffers in the hot path.

  (D) Mirror existing OwnedQuantizedModelCuda pattern CHOSEN
      Add forward_qwen3_moe_cuda as a method on the
      existing CUDA wrapper type.

Why (D) is chosen
=================

  - OwnedQuantizedModelCuda already exists at
    crates/aprender-serve/src/gguf/cuda/mod.rs:106.
  - Wraps OwnedQuantizedModel + holds CudaExecutor + GPU buffers
    (embed_buf, prefix_cache).
  - Existing forward_cuda method (cuda.rs:18) already does
    "CPU attention + CUDA FFN matmul" — the established pattern
    this contract should EXTEND, not invent a new substrate.
  - Pros: Zero new types; reuses CudaExecutor cache, memory-info
    tracking, prefix-cache; signature parity preserved (just on a
    different self type); follows the same precedent that made
    forward_cuda's incremental landing work.
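
The shape of option (D) can be sketched as follows. The struct fields, the
constructor, and the shortened argument list are hypothetical and reduced for
illustration; the real OwnedQuantizedModelCuda holds more state (embed_buf,
prefix_cache) and the real method takes the full MoE argument list.

```rust
/// Hypothetical stand-in for the CPU model type.
pub struct OwnedQuantizedModel {
    pub num_layers: usize,
}

/// Hypothetical stand-in for the long-lived CUDA state.
pub struct CudaExecutor {
    pub device_ordinal: usize,
}

/// Wrapper in the spirit of option (D): owns the CPU model plus GPU state
/// that is created once at construction, not once per call.
pub struct OwnedQuantizedModelCuda {
    model: OwnedQuantizedModel,
    executor: CudaExecutor,
}

impl OwnedQuantizedModelCuda {
    pub fn new(model: OwnedQuantizedModel, device_ordinal: usize) -> Self {
        Self { model, executor: CudaExecutor { device_ordinal } }
    }

    /// The CPU model stays reachable, so existing CPU call sites are untouched.
    pub fn model(&self) -> &OwnedQuantizedModel {
        &self.model
    }

    /// Callers pass only model inputs; executor state comes from `self`.
    /// Placeholder body: validates, then reports that nothing is computed yet.
    pub fn forward_qwen3_moe_cuda(&self, token_ids: &[u32]) -> Result<Vec<f32>, String> {
        if token_ids.is_empty() {
            return Err("token_ids must be non-empty".to_string());
        }
        Err(format!(
            "forward_qwen3_moe_cuda: not implemented (device {}, {} layers)",
            self.executor.device_ordinal, self.model.num_layers
        ))
    }
}
```

Compared with options (B) and (C), no scheduler handle appears in the
argument list and no GPU resource is allocated inside the forward call.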

Implementation stages updated
=============================

  M-GPU-MOE-0    Contract scaffold (v1.0.0)              SHIPPED ✓
  M-GPU-MOE-0.5  This decision amendment (v1.1.0)        SHIPPED (THIS PR)
  M-GPU-MOE-1.0  Stub on OwnedQuantizedModelCuda         PENDING
                 (relocates the wrong-type stub from #1460)
  M-GPU-MOE-1.1  Per-expert CUDA dispatch via            PENDING
                 self.executor (gemm_q4k for gate/up,
                 gemm_q6k for down)
  M-GPU-MOE-1.2  Cosine-vs-CPU parity gate ≥0.99        PENDING
                 (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2    wgpu fallback                            PENDING
  M-GPU-MOE-3    Throughput ≥150 tok/s + VRAM ≤ 95%      PENDING

Verification
============

  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs M32c.2.2.2.1 (CPU sibling integration-architecture amendment
  precedent in qwen3-moe-forward-v1 v1.2.0)
Refs PR #1460 (the v1.0.0-era M-GPU-MOE-1.0 stub on the wrong type;
  retired by this amendment)
Refs CLAUDE.md "NEVER write code before writing a provable contract"
Refs claude-code-parity-apr POC M49 (P0 elevation)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…correct type (#1464)

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D amendment (PR #1462
squash 4495407), the GPU MoE forward path lives on
OwnedQuantizedModelCuda, NOT OwnedQuantizedModel. The first-cut
M-GPU-MOE-1.0 stub from PR #1460 (4d9e5ae) was placed on the
wrong type; this PR ships the redo on the correct type.

What this PR ships
==================

  crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda.rs  NEW

    impl OwnedQuantizedModelCuda {
        pub fn forward_qwen3_moe_cuda(
            &self,
            token_ids: &[u32],
            moe_layers: &[Qwen3MoeQuantizedLayer],
            num_experts: usize,
            num_experts_per_tok: usize,
            moe_intermediate: usize,
            _data: &[u8],
        ) -> Result<Vec<f32>>
    }

  Behavior at M-GPU-MOE-1.0-redo:
    1. Validate preconditions (token_ids non-empty, moe_layers length
       matches self.model.layers.len(), num_experts/num_experts_per_tok/
       moe_intermediate > 0, num_experts_per_tok ≤ num_experts).
    2. Return RealizarError::UnsupportedOperation pointing at
       qwen3-moe-forward-gpu-v1 v1.1.0 + listing pending stages
       M-GPU-MOE-1.1+.

  + 1 unit test (signature drift gate)
  + uses.rs gets `include!("forward_qwen3_moe_cuda.rs");`

Why on OwnedQuantizedModelCuda (not OwnedQuantizedModel)
=========================================================

Per the v1.1.0 amendment's option D decision: this method must
extend the existing OwnedQuantizedModelCuda CPU-attention + CUDA-FFN
pattern (forward_cuda in cuda.rs at line 18), not invent a new
substrate. OwnedQuantizedModelCuda already wraps OwnedQuantizedModel
+ holds CudaExecutor + GPU buffers (embed_buf, prefix_cache).

Naming follows existing precedent: `forward_cuda` is the existing
method on this type, so `forward_qwen3_moe_cuda` slots in cleanly.

Wrong-type stub (#1460) status
==============================

The OwnedQuantizedModel::forward_qwen3_moe_gpu function from #1460
remains on main. It returns the same UnsupportedOperation but on
the wrong type. A separate cleanup PR can either delete it or
update its doc-comment to point at this new variant. Not blocking.

Implementation stages updated
=============================

  M-GPU-MOE-0    Contract scaffold v1.0.0                SHIPPED ✓
  M-GPU-MOE-0.5  v1.1.0 option D amendment              SHIPPED ✓
  M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda   SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.1  Per-expert CUDA dispatch via            PENDING
                 self.executor (gemm_q4k for gate/up,
                 gemm_q6k for down)
  M-GPU-MOE-1.2  Cosine-vs-CPU parity gate ≥0.99        PENDING
  M-GPU-MOE-2    wgpu fallback                            PENDING
  M-GPU-MOE-3    Throughput ≥150 + VRAM ≤ 95%            PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
  test ... ok. 1 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs PR #1460 squash 4d9e5ae (wrong-type stub on OwnedQuantizedModel)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 (P0 elevation)
Refs claude-code-parity-apr POC M50 (M-GPU-MOE-0 SHIPPED)
Refs M32b precedent (CPU sibling staging: stub → forward impl)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…QuantizedModelWgpu)

Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the
CUDA substrate before M-GPU-MOE-1.0 implementation; this one pins
the wgpu substrate before any wgpu code lands.

Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484).
Choosing the wgpu seam early prevents the wrong-type-stub waste
that bit M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redid it on
OwnedQuantizedModelCuda, per option D).

FOUR options considered:
  (I)   OwnedQuantizedModelWgpu wrapper type (analog of v1.1.0 option D) — CHOSEN
  (II)  GpuExecutor trait abstracting CUDA + wgpu — REJECTED (over-engineered)
  (III) Backend enum inside renamed OwnedQuantizedModelGpu — REJECTED (invasive)
  (IV)  Defer wgpu indefinitely — REJECTED (violates CLAUDE.md backend-agnostic mandate)

Option I picks wgpu by code-path symmetry, not by trait abstraction:
new file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. A maintenance-mode
reviewer can then verify a parity bug by diff rather than by elaborate
test infrastructure.

M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
  M-GPU-MOE-2.0 stub on OwnedQuantizedModelWgpu
  M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu,
                moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2 full forward integration (replaces 2.0 stub body)
  M-GPU-MOE-2.3 cosine-vs-CPU parity test on hardware with wgpu

Two new blockers documented:
  - wgpu adapter selection probe for non-NVIDIA hardware
  - trueno-gpu Q6_K QuantizeKernel coverage check before 2.1

Companion-spec records this as M52 (no companion contract bump).

Validation:
  pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s). Contract is valid.

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…arity test (#1485)

* contract(qwen3-moe-forward-gpu-v1): v1.1.0 → v1.2.0 — option I (OwnedQuantizedModelWgpu)

Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the
CUDA substrate before M-GPU-MOE-1.0 implementation; this one pins
the wgpu substrate before any wgpu code lands.

Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484).
Choosing the wgpu seam early prevents the wrong-type-stub waste
that bit M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redid it on
OwnedQuantizedModelCuda, per option D).

FOUR options considered:
  (I)   OwnedQuantizedModelWgpu wrapper type (analog of v1.1.0 option D) — CHOSEN
  (II)  GpuExecutor trait abstracting CUDA + wgpu — REJECTED (over-engineered)
  (III) Backend enum inside renamed OwnedQuantizedModelGpu — REJECTED (invasive)
  (IV)  Defer wgpu indefinitely — REJECTED (violates CLAUDE.md backend-agnostic mandate)

Option I picks wgpu by code-path symmetry, not by trait abstraction:
new file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. A maintenance-mode
reviewer can then verify a parity bug by diff rather than by elaborate
test infrastructure.

M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
  M-GPU-MOE-2.0 stub on OwnedQuantizedModelWgpu
  M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu,
                moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2 full forward integration (replaces 2.0 stub body)
  M-GPU-MOE-2.3 cosine-vs-CPU parity test on hardware with wgpu

Two new blockers documented:
  - wgpu adapter selection probe for non-NVIDIA hardware
  - trueno-gpu Q6_K QuantizeKernel coverage check before 2.1

Companion-spec records this as M52 (no companion contract bump).

Validation:
  pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s). Contract is valid.

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): OwnedQuantizedModelWgpu stub — M-GPU-MOE-2.0 (#1487)

Implements M-GPU-MOE-2.0 per qwen3-moe-forward-gpu-v1 v1.2.0 option I
(see PR #1485 amendment). Analog of M-GPU-MOE-1.0-redo (PR #1464) for
the wgpu backend.

WHAT THIS PR ADDS:

  * crates/aprender-serve/src/gguf/wgpu_backend/mod.rs — new module
    with OwnedQuantizedModelWgpu struct + new() + stub method
    forward_qwen3_moe_wgpu(). Mirrors cuda/mod.rs structure.

  * crates/aprender-serve/src/gguf/wgpu_model.rs — re-export shim
    `pub use super::wgpu_backend::OwnedQuantizedModelWgpu`. Mirrors
    cuda_model.rs.

  * crates/aprender-serve/src/gguf/mod.rs — adds the two new modules
    behind `#[cfg(feature = "gpu")]` (the existing wgpu feature
    flag, `gpu = ["trueno/gpu"]` per Cargo.toml line 208).

WHY MODULE NAMED `wgpu_backend`:

The Rust ecosystem already has a `wgpu` crate. A module named `wgpu`
inside the same crate would shadow it inside the file's body. The
public re-export still presents `OwnedQuantizedModelWgpu` (no ugly
suffix) thanks to wgpu_model.rs.

WHY THIS IS A STUB:

Same staging discipline as M-GPU-MOE-1.0-redo — contract first,
scaffold second, implementation third. The body of
forward_qwen3_moe_wgpu validates preconditions (mirroring the cuda
sibling's boundary) then returns RealizarError::UnsupportedOperation
whose reason points at the v1.2.0 amendment block for the M-GPU-MOE-2
staging plan. Until M-GPU-MOE-2.2 lands, callers on non-CUDA
hardware fall back to OwnedQuantizedModel::forward_qwen3_moe (CPU
LAZY-FUSED-MATVEC, ~30 tok/s).
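
One way a caller might implement that fallback is to match on the structured
error and retry on the CPU path. The error enum and the helper below are
hypothetical; they only illustrate the control flow, not the crate's API.

```rust
/// Hypothetical stand-in for RealizarError (only what the sketch needs).
#[derive(Debug, PartialEq)]
pub enum ForwardError {
    UnsupportedOperation { operation: String, reason: String },
    Other(String),
}

/// Tries the GPU path first; falls back to the CPU path only when the GPU
/// path reports that it is not implemented. Other errors are propagated.
pub fn forward_with_cpu_fallback<G, C>(gpu: G, cpu: C) -> Result<Vec<f32>, ForwardError>
where
    G: FnOnce() -> Result<Vec<f32>, ForwardError>,
    C: FnOnce() -> Result<Vec<f32>, ForwardError>,
{
    match gpu() {
        // The stub stage: the backend exists but does not compute yet.
        Err(ForwardError::UnsupportedOperation { .. }) => cpu(),
        // Success or a genuine failure: return it unchanged.
        other => other,
    }
}
```

Falling back only on UnsupportedOperation keeps real GPU failures visible
instead of silently masking them behind the slower CPU path.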

VERIFICATION:

  cargo check -p aprender-serve                  → 0 errors (default)
  cargo check -p aprender-serve --features cuda  → 0 errors (cuda)
  cargo check -p aprender-serve --features gpu   → 0 errors (wgpu)
  cargo test -p aprender-serve --lib --features gpu \
      owned_quantized_model_wgpu_tests           → 1 passed

Lib unit test asserts the function signature exists and matches the
cuda sibling step-for-step (compile-time checks via fn pointer
coercion — no runtime model construction needed at the stub stage).
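
A signature gate of this kind can be sketched as below. The model and layer
types are hypothetical unit structs, and the argument list follows the
forward_qwen3_moe_cuda signature quoted earlier in this thread; the actual
test in the repository may be written differently.

```rust
/// Hypothetical stand-ins for the real model and layer types.
pub struct Qwen3MoeQuantizedLayer;
pub struct CudaModel;
pub struct WgpuModel;

pub type StubResult = Result<Vec<f32>, String>;

impl CudaModel {
    pub fn forward_qwen3_moe_cuda(
        &self,
        _token_ids: &[u32],
        _moe_layers: &[Qwen3MoeQuantizedLayer],
        _num_experts: usize,
        _num_experts_per_tok: usize,
        _moe_intermediate: usize,
        _data: &[u8],
    ) -> StubResult {
        Err("cuda stub".to_string())
    }
}

impl WgpuModel {
    pub fn forward_qwen3_moe_wgpu(
        &self,
        _token_ids: &[u32],
        _moe_layers: &[Qwen3MoeQuantizedLayer],
        _num_experts: usize,
        _num_experts_per_tok: usize,
        _moe_intermediate: usize,
        _data: &[u8],
    ) -> StubResult {
        Err("wgpu stub".to_string())
    }
}

/// The shape both backends must share, generic only over the receiver type.
pub type ForwardFn<M> =
    fn(&M, &[u32], &[Qwen3MoeQuantizedLayer], usize, usize, usize, &[u8]) -> StubResult;

/// Compiles only while both methods coerce to `ForwardFn`. Changing an
/// argument or the return type on either side becomes a build error.
pub fn signature_gate() {
    let _cuda: ForwardFn<CudaModel> = CudaModel::forward_qwen3_moe_cuda;
    let _wgpu: ForwardFn<WgpuModel> = WgpuModel::forward_qwen3_moe_wgpu;
}
```

Because the check happens at type-check time, the test needs no model file
and no GPU, which suits a stage where the method body is still a stub.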

DEPENDS ON: PR #1485 (qwen3-moe-forward-gpu-v1 v1.2.0 option I
amendment). Branch is stacked on the v1.2.0 contract branch; once
#1485 lands on main, this PR rebases onto main directly.

NEXT STAGES per v1.2.0:

  M-GPU-MOE-2.1  per-expert wgpu dispatch helpers
                 (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2  full forward integration mirror of cuda sibling
  M-GPU-MOE-2.3  cosine-vs-CPU parity test on wgpu hardware

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* test(aprender-serve): qwen3_moe_wgpu_parity — M-GPU-MOE-2.3 cosine ≥0.99 falsifier (wgpu) (#1488)

wgpu sibling of `qwen3_moe_gpu_parity.rs` (M-GPU-MOE-1.2, PR #1484).
Asserts cosine ≥ 0.99 between APR's CPU `forward_qwen3_moe` reference
and the wgpu `OwnedQuantizedModelWgpu::forward_qwen3_moe_wgpu`
integration on the same prompt.

Same falsifier ID as the cuda sibling
(FALSIFY-QW3-MOE-GPU-PARITY-001) — wgpu is a SECOND backend
implementing the same contract gate, not a different gate. Same
threshold (≥ 0.99), same canonical 17.3 GB Qwen3-Coder GGUF, same
3-token canonical prompt as the cuda test.
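
The gate itself reduces to a cosine similarity between two logit vectors
compared against 0.99. A minimal sketch, assuming f64 accumulation and
hypothetical helper names (the real test's helpers may differ):

```rust
/// Threshold taken from the parity gate described above.
pub const PARITY_THRESHOLD: f64 = 0.99;

/// Cosine similarity of two equally sized vectors, accumulated in f64.
/// Returns None for empty input, a length mismatch, or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// True when the backend logits agree with the CPU reference at the gate
/// threshold. An undefined similarity counts as a failure.
pub fn passes_parity_gate(cpu_logits: &[f32], backend_logits: &[f32]) -> bool {
    matches!(cosine_similarity(cpu_logits, backend_logits), Some(c) if c >= PARITY_THRESHOLD)
}
```

Treating an undefined similarity as a failure keeps the falsifier strict: an
all-zero or wrongly sized output cannot pass by accident.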

CI WIRING:

  - #[cfg(feature = "gpu")] gates the file (matches the gate on
    OwnedQuantizedModelWgpu in gguf/mod.rs)
  - #[ignore] on the heavy test (CI default skips; explicit
    `--include-ignored` runs it on a wgpu-capable adapter — Apple
    Silicon Metal, AMD Vulkan, Intel ARC Vulkan)
  - 2 helper unit tests (cosine_similarity sanity coverage) DO run
    by default

WHEN THE TEST PASSES:

  - M-GPU-MOE-2.0 stub returns UnsupportedOperation, so this test
    currently panics at the wgpu forward call (correct behaviour
    for a falsifier against an incomplete impl).
  - M-GPU-MOE-2.1 (per-expert wgpu helpers via trueno-gpu
    QuantizeKernel + GemmKernel compute pipelines) + M-GPU-MOE-2.2
    (full forward integration analog of forward_qwen3_moe_cuda)
    must both land before this test passes on hardware.
  - On hardware with wgpu support, run with --include-ignored to
    exercise. PASS discharges FALSIFY-QW3-MOE-GPU-PARITY-001 for
    the wgpu backend (cuda backend discharged by sibling test).

DEPENDS ON: PR #1485 (v1.2.0 amendment + M-GPU-MOE-2.0 stub).
Branch is stacked on the v1.2.0 contract branch; once #1485 lands
on main, this PR's base flips to main automatically.

Refs: M52, M53, R10, qwen3-moe-forward-gpu-v1 v1.2.0 ::
M-GPU-MOE-2.3 + FALSIFY-QW3-MOE-GPU-PARITY-001 (wgpu).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>