contract(qwen3-moe-forward-gpu-v1): v1.0.0 → v1.1.0 — option D integration architecture (M-GPU-MOE-0.5)#1462

Merged
noahgift merged 2 commits into main from contract/qwen3-moe-forward-gpu-v1-amendment-option-d on May 4, 2026
@noahgift noahgift commented May 4, 2026

Summary

  • Contract amendment v1.0.0 → v1.1.0 picking option D (OwnedQuantizedModelCuda) for M-GPU-MOE-1.1 integration architecture
  • Records 4 options enumerated (A/B/C/D) with rejection rationale
  • Mirrors qwen3-moe-forward-v1 v1.2.0 / M32c.2.2.2.1 amendment precedent
  • Retires the wrong-type stub from #1460 (feat(aprender-serve): forward_qwen3_moe_gpu M-GPU-MOE-1.0 stub)

Why this amendment

The v1.0.0 contract scaffold was authored from outside the code: it specified WHAT GPU MoE means but left WHERE in the type hierarchy unspecified. PR #1460 (M-GPU-MOE-1.0 first-cut stub) placed the function on `OwnedQuantizedModel`, which is the wrong type. Code archaeology showed that `OwnedQuantizedModelCuda` already exists and implements the established CPU-attention + CUDA-FFN pattern, so the right seam is to extend that type.

Test plan

  • `pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` → 0 errors, 0 warnings
  • Follow-up PR: relocate the stub from `OwnedQuantizedModel::forward_qwen3_moe_gpu` (added in #1460) to `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda` (M-GPU-MOE-1.0 redo)
  • M-GPU-MOE-1.1 PR: per-expert CUDA dispatch via `self.executor`

🤖 Generated with Claude Code

…ation architecture

Records the architectural-seam decision that gates M-GPU-MOE-1.1 (per-
expert CUDA dispatch). Mirrors the qwen3-moe-forward-v1 v1.2.0
amendment (M32c.2.2.2.1) which picked between three integration
options for the CPU path before any kernel work could land.

The v1.0.0 contract scaffold (M-GPU-MOE-0) was authored from outside
the code: it specified WHAT GPU MoE means but left WHERE in the type
hierarchy unspecified. The first-cut M-GPU-MOE-1.0 stub (PR #1460)
made an implicit choice — placed the function on OwnedQuantizedModel —
that this amendment now overrides as wrong.

Four integration options enumerated
====================================

  (A) Add GPU state directly to OwnedQuantizedModel  REJECTED
      Invasive; touches every CPU-MoE call site.

  (B) Thread &HybridScheduler / &mut GpuModel into
      forward_qwen3_moe_gpu signature                REJECTED
      Breaks signature parity with CPU sibling; forces
      every caller to plumb scheduler state through.

  (C) Spawn transient GpuModel-like helper per call  REJECTED
      Resource thrash on every token; allocates GPU
      buffers in the hot path.

  (D) Mirror existing OwnedQuantizedModelCuda pattern CHOSEN
      Add forward_qwen3_moe_cuda as a method on the
      existing CUDA wrapper type.

Why (D) is chosen
=================

  - OwnedQuantizedModelCuda already exists at
    crates/aprender-serve/src/gguf/cuda/mod.rs:106.
  - Wraps OwnedQuantizedModel + holds CudaExecutor + GPU buffers
    (embed_buf, prefix_cache).
  - Existing forward_cuda method (cuda.rs:18) already does
    "CPU attention + CUDA FFN matmul" — the established pattern
    this contract should EXTEND, not invent a new substrate.
  - Pros: Zero new types; reuses CudaExecutor cache, memory-info
    tracking, prefix-cache; signature parity preserved (just on a
    different self type); follows the same precedent that made
    forward_cuda's incremental landing work.
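
A minimal sketch of the wrapper shape that option D relies on, assuming nothing about the real field layout: every type and method below (`CpuModelSketch`, `ExecutorSketch`, `CudaWrapperSketch`, `forward_moe_cuda`) is a hypothetical stand-in, not the aprender-serve API. It only illustrates why adding a method to an existing wrapper leaves the CPU type and its call sites untouched.

```rust
// Hypothetical stand-in for the CPU model type. It carries no GPU state.
struct CpuModelSketch {
    num_layers: usize,
}

impl CpuModelSketch {
    // CPU-side work (attention, norms). The signature has no GPU parameters.
    fn cpu_part(&self, token_ids: &[u32]) -> Result<Vec<f32>, String> {
        if token_ids.is_empty() {
            return Err("token_ids must be non-empty".to_string());
        }
        Ok(vec![0.0; self.num_layers])
    }
}

// Hypothetical stand-in for the executor: it just counts launches.
struct ExecutorSketch {
    launches: usize,
}

// Wrapper that owns the CPU model plus the GPU-side state (option D shape).
struct CudaWrapperSketch {
    model: CpuModelSketch,
    executor: ExecutorSketch,
}

impl CudaWrapperSketch {
    // New GPU entry point lives on the wrapper, so the CPU type is unchanged
    // and callers do not plumb scheduler or executor state through arguments.
    fn forward_moe_cuda(&mut self, token_ids: &[u32]) -> Result<Vec<f32>, String> {
        let out = self.model.cpu_part(token_ids)?;
        // Pretend one GPU launch per layer for the FFN part.
        self.executor.launches += self.model.num_layers;
        Ok(out)
    }
}
```

The design point is that the argument list stays the same as the CPU sibling; only the `self` type differs.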

Implementation stages updated
=============================

  M-GPU-MOE-0    Contract scaffold (v1.0.0)              SHIPPED ✓
  M-GPU-MOE-0.5  This decision amendment (v1.1.0)        SHIPPED (THIS PR)
  M-GPU-MOE-1.0  Stub on OwnedQuantizedModelCuda         PENDING
                 (relocates the wrong-type stub from #1460)
  M-GPU-MOE-1.1  Per-expert CUDA dispatch via            PENDING
                 self.executor (gemm_q4k for gate/up,
                 gemm_q6k for down)
  M-GPU-MOE-1.2  Cosine-vs-CPU parity gate ≥0.99        PENDING
                 (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2    wgpu fallback                            PENDING
  M-GPU-MOE-3    Throughput ≥150 tok/s + VRAM ≤ 95%      PENDING
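
The M-GPU-MOE-1.2 gate above compares GPU output against CPU output by cosine similarity with a 0.99 threshold. A rough sketch of such a check follows; the function names are illustrative, and how FALSIFY-QW3-MOE-GPU-PARITY-001 actually computes or aggregates the metric is not stated in this PR.

```rust
// Cosine similarity of two equal-length vectors, accumulated in f64.
// Returns None for mismatched lengths, empty input, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as f64 * y as f64;
        norm_a += x as f64 * x as f64;
        norm_b += y as f64 * y as f64;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

// Hypothetical gate: pass only when similarity is defined and at least 0.99.
fn passes_parity_gate(cpu_logits: &[f32], gpu_logits: &[f32]) -> bool {
    matches!(cosine_similarity(cpu_logits, gpu_logits), Some(c) if c >= 0.99)
}
```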

Verification
============

  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs M32c.2.2.2.1 (CPU sibling integration-architecture amendment
  precedent in qwen3-moe-forward-v1 v1.2.0)
Refs PR #1460 (the v1.0.0-era M-GPU-MOE-1.0 stub on the wrong type;
  retired by this amendment)
Refs CLAUDE.md "NEVER write code before writing a provable contract"
Refs claude-code-parity-apr POC M49 (P0 elevation)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 4, 2026 09:02
@noahgift noahgift merged commit 4495407 into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the contract/qwen3-moe-forward-gpu-v1-amendment-option-d branch May 4, 2026 09:38
noahgift added a commit that referenced this pull request May 4, 2026
…correct type (#1464)

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D amendment (PR #1462
squash 4495407), the GPU MoE forward path lives on
OwnedQuantizedModelCuda, NOT OwnedQuantizedModel. The first-cut
M-GPU-MOE-1.0 stub from PR #1460 (4d9e5ae) was placed on the
wrong type; this PR ships the redo on the correct type.

What this PR ships
==================

  crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda.rs  NEW

    impl OwnedQuantizedModelCuda {
        pub fn forward_qwen3_moe_cuda(
            &self,
            token_ids: &[u32],
            moe_layers: &[Qwen3MoeQuantizedLayer],
            num_experts: usize,
            num_experts_per_tok: usize,
            moe_intermediate: usize,
            _data: &[u8],
        ) -> Result<Vec<f32>>
    }

  Behavior at M-GPU-MOE-1.0-redo:
    1. Validate preconditions (token_ids non-empty, moe_layers length
       matches self.model.layers.len(), num_experts/num_experts_per_tok/
       moe_intermediate > 0, num_experts_per_tok ≤ num_experts).
    2. Return RealizarError::UnsupportedOperation pointing at
       qwen3-moe-forward-gpu-v1 v1.1.0 + listing pending stages
       M-GPU-MOE-1.1+.

  + 1 unit test (signature drift gate)
  + uses.rs gets `include!("forward_qwen3_moe_cuda.rs");`
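
The precondition list in step 1 can be sketched as a standalone check. This is an illustration only: the error enum and function name are hypothetical (the real stub returns a RealizarError), and the arguments are plain counts rather than the real slices.

```rust
// Hypothetical error type for the sketch; not the crate's RealizarError.
#[derive(Debug, PartialEq)]
enum PreconditionError {
    EmptyTokens,
    LayerCountMismatch { expected: usize, got: usize },
    ZeroDimension(&'static str),
    TopKExceedsExperts { k: usize, num_experts: usize },
}

// Checks the conditions listed in the commit message, in the same order.
fn validate_moe_preconditions(
    token_count: usize,
    moe_layer_count: usize,
    model_layer_count: usize,
    num_experts: usize,
    num_experts_per_tok: usize,
    moe_intermediate: usize,
) -> Result<(), PreconditionError> {
    if token_count == 0 {
        return Err(PreconditionError::EmptyTokens);
    }
    if moe_layer_count != model_layer_count {
        return Err(PreconditionError::LayerCountMismatch {
            expected: model_layer_count,
            got: moe_layer_count,
        });
    }
    for (name, value) in [
        ("num_experts", num_experts),
        ("num_experts_per_tok", num_experts_per_tok),
        ("moe_intermediate", moe_intermediate),
    ] {
        if value == 0 {
            return Err(PreconditionError::ZeroDimension(name));
        }
    }
    if num_experts_per_tok > num_experts {
        return Err(PreconditionError::TopKExceedsExperts {
            k: num_experts_per_tok,
            num_experts,
        });
    }
    Ok(())
}
```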

Why on OwnedQuantizedModelCuda (not OwnedQuantizedModel)
=========================================================

Per the v1.1.0 amendment's option D decision: this method must
extend the existing OwnedQuantizedModelCuda CPU-attention + CUDA-FFN
pattern (forward_cuda in cuda.rs at line 18), not invent a new
substrate. OwnedQuantizedModelCuda already wraps OwnedQuantizedModel
+ holds CudaExecutor + GPU buffers (embed_buf, prefix_cache).

Naming follows existing precedent: `forward_cuda` is the existing
method on this type, so `forward_qwen3_moe_cuda` slots in cleanly.

Wrong-type stub (#1460) status
==============================

The OwnedQuantizedModel::forward_qwen3_moe_gpu function from #1460
remains on main. It returns the same UnsupportedOperation but on
the wrong type. A separate cleanup PR can either delete it or
update its doc-comment to point at this new variant. Not blocking.

Implementation stages updated
=============================

  M-GPU-MOE-0         Contract scaffold v1.0.0             SHIPPED ✓
  M-GPU-MOE-0.5       v1.1.0 option D amendment            SHIPPED ✓
  M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda      SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.1       Per-expert CUDA dispatch via         PENDING
                      self.executor (gemm_q4k for gate/up,
                      gemm_q6k for down)
  M-GPU-MOE-1.2       Cosine-vs-CPU parity gate ≥0.99      PENDING
  M-GPU-MOE-2         wgpu fallback                        PENDING
  M-GPU-MOE-3         Throughput ≥150 tok/s + VRAM ≤ 95%   PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
  test ... ok. 1 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs PR #1460 squash 4d9e5ae (wrong-type stub on OwnedQuantizedModel)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 (P0 elevation)
Refs claude-code-parity-apr POC M50 (M-GPU-MOE-0 SHIPPED)
Refs M32b precedent (CPU sibling staging: stub → forward impl)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
… GPU helper

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D — first concrete GPU
compute for the contract. Mirrors the M32c.2.2.* CPU staging where
per-expert byte slicer + per-expert SwiGLU helper landed BEFORE the
full moe_ffn_forward_layer integration.

What this PR ships
==================

  crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs  NEW

    pub(crate) fn expert_swiglu_cuda(
        executor: &mut crate::cuda::CudaExecutor,
        gate_bytes: &[u8],   // Q4_K, [intermediate, hidden_dim]
        up_bytes:   &[u8],   // Q4_K, [intermediate, hidden_dim]
        down_bytes: &[u8],   // Q6_K, [hidden_dim, intermediate]
        hidden: &[f32],
        hidden_dim: usize,
        intermediate: usize,
    ) -> Result<Vec<f32>>

  Body (mirrors CPU sibling moe_ffn_forward_layer per-expert loop):
    1. gate_out      = executor.q4k_matvec(gate_bytes, hidden, ..., m=intermediate, k=hidden_dim)
    2. up_out        = executor.q4k_matvec(up_bytes,   hidden, ..., m=intermediate, k=hidden_dim)
    3. ffn_inner[i]  = silu(gate_out[i]) * up_out[i]   (CPU element-wise)
    4. expert_out    = executor.q6k_gemv(down_bytes, ffn_inner, ..., n=hidden_dim, k=intermediate)

  + 2 unit tests (signature drift gate + InvalidShape rejection)
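
For orientation, the four-step body above can be written as a CPU reference over dense f32 weights. This is a sketch under a loud assumption: the real helper takes Q4_K/Q6_K quantized bytes and calls executor kernels, whereas the code below uses already-dequantized row-major matrices and a hypothetical `matvec` helper.

```rust
// Dense row-major matrix-vector product: w is [rows, cols], x has cols entries.
// Assumes cols > 0.
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    w.chunks_exact(cols)
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

// Reference SwiGLU for one expert, mirroring steps 1 to 4:
// gate/up are [intermediate, hidden_dim], down is [hidden_dim, intermediate].
fn expert_swiglu_reference(
    gate: &[f32],
    up: &[f32],
    down: &[f32],
    hidden: &[f32],
    hidden_dim: usize,
    intermediate: usize,
) -> Vec<f32> {
    let gate_out = matvec(gate, hidden, intermediate, hidden_dim);
    let up_out = matvec(up, hidden, intermediate, hidden_dim);
    let ffn_inner: Vec<f32> = gate_out
        .iter()
        .zip(&up_out)
        .map(|(&g, &u)| silu(g) * u)
        .collect();
    matvec(down, &ffn_inner, hidden_dim, intermediate)
}
```

A reference like this is the kind of thing a parity test could compare the GPU helper against, once weights are dequantized.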

Why "naive per-expert dispatch" is the M-GPU-MOE-1.1.0 baseline
===============================================================

The fused dequant+matmul + sparse expert batching path is M-GPU-MOE-3.
The contract (qwen3-moe-forward-gpu-v1 implementation_stages) stages
correctness before performance:

  M-GPU-MOE-1.1.0 (this)  Per-expert via existing primitives        SHIPPED ✓
                          - silu via CPU elementwise (small)
                          - element-wise gate*up via CPU
                          - matmuls via existing q4k/q6k GPU kernels
  M-GPU-MOE-1.1.1         Full forward integration in
                          OwnedQuantizedModelCuda::forward_qwen3_moe_cuda
                          (router + per-token loop + per-expert
                          dispatch + weighted aggregation)         PENDING
  M-GPU-MOE-1.2           Cosine-vs-CPU parity gate ≥0.99          PENDING
                          (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2             wgpu fallback                            PENDING
  M-GPU-MOE-3             Fused kernels + sparse batching          PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda
  test ... ok. 2 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)

Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs PR #1464 (M-GPU-MOE-1.0-redo stub on OwnedQuantizedModelCuda)
Refs M32c.2.2.0 + M32c.2.2.1 (CPU per-expert sub-milestone precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation + risk row)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…1469)

* feat(aprender-serve): expert_swiglu_cuda — M-GPU-MOE-1.1.0 per-expert GPU helper

* feat(aprender-serve): moe_ffn_forward_layer_cuda — M-GPU-MOE-1.1.1 single-layer GPU MoE FFN

Mirrors CPU sibling moe_ffn_forward_layer (qwen3_moe_load.rs:363)
step-for-step: F32 router on CPU, softmax + top-k + renormalize on
CPU, per-expert SwiGLU dispatched through expert_swiglu_cuda
(M-GPU-MOE-1.1.0), weighted aggregation on CPU.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: GPU MoE forward path
on OwnedQuantizedModelCuda, reusing existing CudaExecutor primitives
(q4k_matvec for gate/up, q6k_gemv for down) per expert.

Composes the M-GPU-MOE-1.1.0 helper into the layer-level structure
that the next stage M-GPU-MOE-1.1.2 (forward_qwen3_moe_cuda full
integration) will call once per token per layer.
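
The CPU-side routing steps named above (softmax, top-k, renormalize, weighted aggregation) can be sketched as follows. Function names are illustrative and the tie-breaking and numerical details of the real moe_ffn_forward_layer are assumptions; per-expert outputs are passed in as precomputed vectors instead of being dispatched to a GPU helper.

```rust
// Numerically stable softmax over router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

// Pick the k highest-probability experts and rescale their weights to sum to 1.
// Ties keep the lower expert index first because the sort is stable.
fn top_k_renormalized(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    idx.truncate(k);
    let sum: f32 = idx.iter().map(|&i| probs[i]).sum();
    idx.into_iter().map(|i| (i, probs[i] / sum)).collect()
}

// Weighted sum of the selected experts' outputs.
// expert_outputs[e] stands in for the result of running expert e.
fn aggregate(
    selected: &[(usize, f32)],
    expert_outputs: &[Vec<f32>],
    hidden_dim: usize,
) -> Vec<f32> {
    let mut out = vec![0.0f32; hidden_dim];
    for &(expert, weight) in selected {
        for (o, v) in out.iter_mut().zip(&expert_outputs[expert]) {
            *o += weight * v;
        }
    }
    out
}
```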

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles

Refs PR #1465 (expert_swiglu_cuda M-GPU-MOE-1.1.0)
Refs M32c.2.2.2.0 (CPU sibling moe_ffn_forward_layer precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…-MOE-1.1.2

Replaces the M-GPU-MOE-1.0-redo stub body with the full forward
integration. forward_qwen3_moe_cuda now mirrors the CPU sibling
OwnedQuantizedModel::forward_qwen3_moe (forward_qwen3_moe.rs)
line-for-line, with one difference: the per-layer FFN section
routes through moe_ffn_forward_layer_cuda which dispatches per-
expert matmuls to self.executor (CudaExecutor) via the
expert_swiglu_cuda helper.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D — extends the existing
OwnedQuantizedModelCuda CPU-attention + CUDA-FFN pattern (forward_cuda
in cuda.rs:18). Attention path stays on CPU; only FFN matmuls go to
GPU. M-GPU-MOE-3 fuses dispatch into a single sparse-expert kernel
for ~5× throughput.

Signature changes
=================

  - &self → &mut self  (executor needs mutable for kernel cache)
  - _data → data       (passed to moe_ffn_forward_layer_cuda for
                          expert_byte_slice)
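
A small sketch of why a kernel cache pushes the receiver to `&mut self`, assuming a cache that is filled lazily on first use. The types here (`KernelCacheSketch`, `WrapperSketch`) and the handle scheme are hypothetical; the actual CudaExecutor internals are not shown in this PR.

```rust
use std::collections::HashMap;

// Hypothetical executor: compiles a kernel on first use and caches the handle.
struct KernelCacheSketch {
    compiled: HashMap<String, u64>,
    compilations: usize,
}

impl KernelCacheSketch {
    // Inserting into the map mutates the executor, hence `&mut self`.
    fn get_or_compile(&mut self, name: &str) -> u64 {
        if let Some(&handle) = self.compiled.get(name) {
            return handle;
        }
        self.compilations += 1;
        let handle = self.compilations as u64;
        self.compiled.insert(name.to_string(), handle);
        handle
    }
}

struct WrapperSketch {
    executor: KernelCacheSketch,
}

impl WrapperSketch {
    // The mutable borrow propagates outward: a forward method that touches
    // the cache cannot take `&self` without interior mutability.
    fn forward(&mut self) -> u64 {
        self.executor.get_or_compile("q4k_matvec")
    }
}
```

Interior mutability (for example a `RefCell` or `Mutex` around the cache) would be the alternative; the commit chose the plain `&mut self` route.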

Forward body structure (mirrors CPU sibling step-for-step):
  1. Embed (CPU)                                              — self.model.embed
  2. Per-layer:
     2a. Attention norm (CPU)                                 — ops::rms_norm
     2b. QKV projection (CPU)                                 — self.model.qkv_matmul
     2c. Per-head Q/K RMSNorm + RoPE (M32d Step 5/5b)         — ops::apply_per_head_rms_norm
     2d. Causal attention + output proj (CPU)                 — self.model.causal_attention
     2e. Residual                                              — element-wise CPU
     2f. Pre-FFN norm (CPU)                                   — ops::rms_norm
     2g. **MoE FFN on GPU**                                   — moe_ffn_forward_layer_cuda
                                                                  → expert_swiglu_cuda
                                                                  → self.executor.q4k_matvec
                                                                                .q6k_gemv
     2h. Residual                                              — element-wise CPU
  3. Final norm (CPU)
  4. LM head — last token (CPU)
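
Steps 2a, 2e, 2f and 2h rely on two small CPU primitives: RMS normalization and a residual add. A minimal sketch of both follows, assuming the standard RMSNorm formula with a learned per-channel weight; the signatures are illustrative and need not match `ops::rms_norm` in the crate.

```rust
// RMSNorm: scale x by 1 / sqrt(mean(x^2) + eps), then by a per-channel weight.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty());
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

// Residual connection: add a sublayer's output back onto the hidden state.
fn add_residual(hidden: &mut [f32], delta: &[f32]) {
    assert_eq!(hidden.len(), delta.len());
    for (h, d) in hidden.iter_mut().zip(delta) {
        *h += d;
    }
}
```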

Implementation stages updated
=============================

  M-GPU-MOE-0         Contract scaffold v1.0.0             SHIPPED ✓
  M-GPU-MOE-0.5       v1.1.0 option D amendment            SHIPPED ✓
  M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda      SHIPPED ✓ (#1464)
  M-GPU-MOE-1.1.0     expert_swiglu_cuda helper            SHIPPED ✓ (via #1469 squash)
  M-GPU-MOE-1.1.1     moe_ffn_forward_layer_cuda           SHIPPED ✓ (#1469)
  M-GPU-MOE-1.1.2     forward_qwen3_moe_cuda full integ    SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.2       Cosine-vs-CPU parity gate ≥0.99      PENDING
                      (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2         wgpu fallback                        PENDING
  M-GPU-MOE-3         Throughput ≥150 tok/s + VRAM ≤ 95%   PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
  test ... ok. 1 passed

Refs PR #1469 squash 77b9f0d (helpers landed)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…-MOE-1.1.2 (#1477)