feat(aprender-serve): expert_swiglu_cuda — M-GPU-MOE-1.1.0 per-expert GPU helper #1465

Closed
noahgift wants to merge 2 commits into main from feat/expert-swiglu-cuda-m-stage-1-1-0

@noahgift noahgift commented May 4, 2026

Summary

  • New `expert_swiglu_cuda` helper — first concrete GPU compute for qwen3-moe-forward-gpu-v1
  • Mirrors CPU per-expert SwiGLU body in `moe_ffn_forward_layer` (qwen3_moe_load.rs:363)
  • Routes 3 matmuls per expert through existing CudaExecutor primitives (`q4k_matvec` × 2 + `q6k_gemv`)
  • 2 unit tests pass on RTX 4090

Why this stage

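Before integrating per-expert GPU dispatch into the full forward loop (M-GPU-MOE-1.1.1), this helper makes the kernel call site testable in isolation. This follows the same staging precedent as the M32c.2.2.* CPU milestones, where the per-expert helpers landed before the full layer integration.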

Implementation

For one expert, one token:

```
gate_out = q4k_matvec(gate_bytes, hidden, m=intermediate, k=hidden_dim)
up_out = q4k_matvec(up_bytes, hidden, m=intermediate, k=hidden_dim)
ffn_inner[i] = silu(gate_out[i]) * up_out[i] (CPU)
expert_out = q6k_gemv(down_bytes, ffn_inner, n=hidden_dim, k=intermediate)
```

This naive per-expert dispatch is the M-GPU-MOE-1.1.0 baseline; the fused dequant+matmul and sparse expert batching path is deferred to M-GPU-MOE-3. The contract stages correctness before performance.
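
For readers who want a dense reference to compare against, the four steps above can be sketched on the CPU with plain `f32` weights. This is a minimal sketch, not the code in this PR: `matvec` and `expert_swiglu_reference` are hypothetical names, and the weights are assumed to be already dequantized and stored row-major.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Dense row-major matrix-vector product (hypothetical helper).
/// `w` has shape [rows, cols]; `x` has `cols` entries.
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols, "weight length must be rows * cols");
    assert_eq!(x.len(), cols, "input length must equal cols");
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// Dense f32 stand-in for the per-expert SwiGLU body (assumed dequantized weights).
/// gate, up: [intermediate, hidden_dim]; down: [hidden_dim, intermediate].
fn expert_swiglu_reference(
    gate: &[f32],
    up: &[f32],
    down: &[f32],
    hidden: &[f32],
    hidden_dim: usize,
    intermediate: usize,
) -> Vec<f32> {
    let gate_out = matvec(gate, hidden, intermediate, hidden_dim);
    let up_out = matvec(up, hidden, intermediate, hidden_dim);
    // Element-wise silu(gate) * up, the step this PR keeps on the CPU.
    let ffn_inner: Vec<f32> = gate_out
        .iter()
        .zip(&up_out)
        .map(|(g, u)| silu(*g) * u)
        .collect();
    matvec(down, &ffn_inner, hidden_dim, intermediate)
}
```

With identity matrices for all three weights, the output reduces to `silu(hidden[i]) * hidden[i]`, which makes the sketch easy to sanity-check by hand.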

Test plan

  • `cargo check -p aprender-serve --features cuda` — compiles
  • `cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda` — 2/2 pass
  • `pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` — 0 error(s), 0 warning(s)
  • M-GPU-MOE-1.1.1 PR: full `forward_qwen3_moe_cuda` integration calling this helper
  • M-GPU-MOE-1.2 PR: cosine-vs-CPU parity gate against real GGUF

🤖 Generated with Claude Code

… GPU helper

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D — first concrete GPU
compute for the contract. Mirrors the M32c.2.2.* CPU staging where
per-expert byte slicer + per-expert SwiGLU helper landed BEFORE the
full moe_ffn_forward_layer integration.

What this PR ships
==================

  crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs  NEW

    pub(crate) fn expert_swiglu_cuda(
        executor: &mut crate::cuda::CudaExecutor,
        gate_bytes: &[u8],   // Q4_K, [intermediate, hidden_dim]
        up_bytes:   &[u8],   // Q4_K, [intermediate, hidden_dim]
        down_bytes: &[u8],   // Q6_K, [hidden_dim, intermediate]
        hidden: &[f32],
        hidden_dim: usize,
        intermediate: usize,
    ) -> Result<Vec<f32>>

  Body (mirrors CPU sibling moe_ffn_forward_layer per-expert loop):
    1. gate_out      = executor.q4k_matvec(gate_bytes, hidden, ..., m=intermediate, k=hidden_dim)
    2. up_out        = executor.q4k_matvec(up_bytes,   hidden, ..., m=intermediate, k=hidden_dim)
    3. ffn_inner[i]  = silu(gate_out[i]) * up_out[i]   (CPU element-wise)
    4. expert_out    = executor.q6k_gemv(down_bytes, ffn_inner, ..., n=hidden_dim, k=intermediate)

  + 2 unit tests (signature drift gate + InvalidShape rejection)
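
The kind of check behind an InvalidShape rejection can be sketched as below. This is an illustration under stated assumptions, not the helper's actual validation: it assumes the standard ggml/GGUF K-quant layout (256-element super-blocks, 144 bytes per Q4_K block, 210 bytes per Q6_K block) with each row quantized independently, and `ShapeError`, `quantized_len`, and `validate_expert_shapes` are hypothetical names.

```rust
/// Elements per K-quant super-block (assumed ggml/GGUF layout).
const QK_K: usize = 256;
/// Bytes per Q4_K super-block (assumed ggml/GGUF layout).
const Q4_K_BLOCK_BYTES: usize = 144;
/// Bytes per Q6_K super-block (assumed ggml/GGUF layout).
const Q6_K_BLOCK_BYTES: usize = 210;

/// Hypothetical error type standing in for the crate's own error.
#[derive(Debug, PartialEq)]
enum ShapeError {
    InvalidShape(String),
}

/// Expected byte length of a [rows, cols] K-quant tensor, or None if
/// `cols` is not a positive multiple of the super-block size or the size overflows.
fn quantized_len(rows: usize, cols: usize, block_bytes: usize) -> Option<usize> {
    if cols == 0 || cols % QK_K != 0 {
        return None;
    }
    rows.checked_mul(cols / QK_K)?.checked_mul(block_bytes)
}

fn check(name: &str, actual: usize, expected: Option<usize>) -> Result<(), ShapeError> {
    match expected {
        Some(n) if n == actual => Ok(()),
        _ => Err(ShapeError::InvalidShape(format!(
            "{name}: got {actual} bytes, expected {expected:?}"
        ))),
    }
}

/// Rejects inputs whose lengths do not match the declared dimensions.
fn validate_expert_shapes(
    gate_len: usize,
    up_len: usize,
    down_len: usize,
    hidden_len: usize,
    hidden_dim: usize,
    intermediate: usize,
) -> Result<(), ShapeError> {
    if hidden_len != hidden_dim {
        return Err(ShapeError::InvalidShape(format!(
            "hidden: got {hidden_len} values, expected {hidden_dim}"
        )));
    }
    check("gate", gate_len, quantized_len(intermediate, hidden_dim, Q4_K_BLOCK_BYTES))?;
    check("up", up_len, quantized_len(intermediate, hidden_dim, Q4_K_BLOCK_BYTES))?;
    check("down", down_len, quantized_len(hidden_dim, intermediate, Q6_K_BLOCK_BYTES))
}
```

Validating byte lengths before any kernel launch keeps a malformed expert slice from turning into an out-of-bounds device read.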

Why "naive per-expert dispatch" is the M-GPU-MOE-1.1.0 baseline
===============================================================

The fused dequant+matmul + sparse expert batching path is M-GPU-MOE-3.
The contract (qwen3-moe-forward-gpu-v1 implementation_stages) stages
correctness before performance:

  M-GPU-MOE-1.1.0 (this)  Per-expert via existing primitives        SHIPPED ✓
                          - silu via CPU elementwise (small)
                          - element-wise gate*up via CPU
                          - matmuls via existing q4k/q6k GPU kernels
  M-GPU-MOE-1.1.1         Full forward integration in
                          OwnedQuantizedModelCuda::forward_qwen3_moe_cuda
                          (router + per-token loop + per-expert
                          dispatch + weighted aggregation)         PENDING
  M-GPU-MOE-1.2           Cosine-vs-CPU parity gate ≥0.99          PENDING
                          (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2             wgpu fallback                            PENDING
  M-GPU-MOE-3             Fused kernels + sparse batching          PENDING
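
The cosine-vs-CPU parity gate planned for M-GPU-MOE-1.2 could be computed roughly as follows. This is a sketch only: `cosine_similarity` and `passes_parity_gate` are hypothetical names, the 0.99 threshold is taken from the stage table above, and accumulating in `f64` is a choice made here rather than something the contract specifies.

```rust
/// Cosine similarity between two equal-length vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// True when the GPU output is at least `threshold`-similar to the CPU output.
/// NaN similarities and undefined cases fail the gate.
fn passes_parity_gate(cpu: &[f32], gpu: &[f32], threshold: f64) -> bool {
    matches!(cosine_similarity(cpu, gpu), Some(c) if c >= threshold)
}
```

A call such as `passes_parity_gate(&cpu_out, &gpu_out, 0.99)` would then serve as the pass/fail condition for one layer or one full forward pass.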

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda
  test ... ok. 2 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)

Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs PR #1464 (M-GPU-MOE-1.0-redo stub on OwnedQuantizedModelCuda)
Refs M32c.2.2.0 + M32c.2.2.1 (CPU per-expert sub-milestone precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation + risk row)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@noahgift noahgift force-pushed the feat/expert-swiglu-cuda-m-stage-1-1-0 branch from c4d52bd to 1dffe93 May 4, 2026 14:08

noahgift added a commit that referenced this pull request May 4, 2026
…ngle-layer GPU MoE FFN

Mirrors CPU sibling moe_ffn_forward_layer (qwen3_moe_load.rs:363)
step-for-step: F32 router on CPU, softmax + top-k + renormalize on
CPU, per-expert SwiGLU dispatched through expert_swiglu_cuda
(M-GPU-MOE-1.1.0), weighted aggregation on CPU.
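
The CPU-side routing described above (softmax, top-k, renormalize, weighted aggregation) can be sketched as below. This is an illustrative sketch rather than the code in `moe_ffn_forward_layer`: the function names are hypothetical, and `run_expert` is a stand-in closure for the per-expert SwiGLU call.

```rust
/// Numerically stable softmax over router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Picks the `k` most probable experts and rescales their weights to sum to 1.
fn top_k_renormalized(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    // Sort by descending probability.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    let sum: f32 = indexed.iter().map(|(_, p)| p).sum();
    indexed.into_iter().map(|(i, p)| (i, p / sum)).collect()
}

/// Weighted sum of the selected experts' outputs.
/// `run_expert` is a hypothetical stand-in for the per-expert SwiGLU dispatch.
fn aggregate(
    selected: &[(usize, f32)],
    hidden_dim: usize,
    mut run_expert: impl FnMut(usize) -> Vec<f32>,
) -> Vec<f32> {
    let mut out = vec![0.0f32; hidden_dim];
    for &(expert, weight) in selected {
        let y = run_expert(expert);
        for (o, v) in out.iter_mut().zip(&y) {
            *o += weight * v;
        }
    }
    out
}
```

Passing the expert call as a closure keeps the routing logic identical whether the expert runs on the CPU or through a GPU helper.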

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: GPU MoE forward path
on OwnedQuantizedModelCuda, reusing existing CudaExecutor primitives
(q4k_matvec for gate/up, q6k_gemv for down) per expert.

Composes the M-GPU-MOE-1.1.0 helper into the layer-level structure
that the next stage M-GPU-MOE-1.1.2 (forward_qwen3_moe_cuda full
integration) will call once per token per layer.

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles

Refs PR #1465 (expert_swiglu_cuda M-GPU-MOE-1.1.0)
Refs M32c.2.2.2.0 (CPU sibling moe_ffn_forward_layer precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift added a commit that referenced this pull request May 4, 2026
…1469)

noahgift commented May 4, 2026

Closing: the content shipped via #1469 (squash 77b9f0d), which contained both M-GPU-MOE-1.1.0 (expert_swiglu_cuda) and M-GPU-MOE-1.1.1 (moe_ffn_forward_layer_cuda) in a single squash commit. Both helper files are now on main.

@noahgift noahgift closed this May 4, 2026