feat(aprender-serve): expert_swiglu_cuda — M-GPU-MOE-1.1.0 per-expert GPU helper by noahgift · Pull Request #1465 · paiml/aprender

noahgift · 2026-05-04T10:55:50Z

Summary

- New `expert_swiglu_cuda` helper: first concrete GPU compute for qwen3-moe-forward-gpu-v1
- Mirrors the CPU per-expert SwiGLU body in `moe_ffn_forward_layer` (qwen3_moe_load.rs:363)
- Routes 3 matmuls per expert through existing CudaExecutor primitives (`q4k_matvec` × 2 + `q6k_gemv`)
- 2 unit tests pass on RTX 4090

Why this stage

Before integrating per-expert GPU dispatch into the full forward loop (M-GPU-MOE-1.1.1), this helper makes the kernel call site testable in isolation. This follows the same precedent as the M32c.2.2.* CPU staging.

Implementation

For one expert, one token:

```
gate_out = q4k_matvec(gate_bytes, hidden, m=intermediate, k=hidden_dim)
up_out = q4k_matvec(up_bytes, hidden, m=intermediate, k=hidden_dim)
ffn_inner[i] = silu(gate_out[i]) * up_out[i] (CPU)
expert_out = q6k_gemv(down_bytes, ffn_inner, n=hidden_dim, k=intermediate)
```
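The four steps above can be sketched as a plain CPU reference. This is a minimal sketch, not the PR's code: it assumes dense row-major `f32` weights in place of the Q4_K/Q6_K byte buffers, and `matvec` and `expert_swiglu_ref` are hypothetical names standing in for the `q4k_matvec` / `q6k_gemv` kernels and the helper itself.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Dense row-major matrix-vector product; `w` is [rows, cols], `x` has `cols` entries.
/// Hypothetical stand-in for the quantized GPU kernels.
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// One expert, one token: down(silu(gate(h)) * up(h)).
fn expert_swiglu_ref(
    gate: &[f32], // [intermediate, hidden_dim]
    up: &[f32],   // [intermediate, hidden_dim]
    down: &[f32], // [hidden_dim, intermediate]
    hidden: &[f32],
    hidden_dim: usize,
    intermediate: usize,
) -> Vec<f32> {
    let gate_out = matvec(gate, hidden, intermediate, hidden_dim);
    let up_out = matvec(up, hidden, intermediate, hidden_dim);
    // Element-wise step that the PR keeps on the CPU.
    let ffn_inner: Vec<f32> = gate_out
        .iter()
        .zip(&up_out)
        .map(|(g, u)| silu(*g) * u)
        .collect();
    matvec(down, &ffn_inner, hidden_dim, intermediate)
}
```

A reference of this shape is also what a later parity check could compare GPU output against, once dequantized weights are available.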

This is naive per-expert dispatch, the M-GPU-MOE-1.1.0 baseline. The fused dequant+matmul and sparse expert batching path is deferred to M-GPU-MOE-3, because the contract stages correctness before performance.

Test plan

`cargo check -p aprender-serve --features cuda` — compiles
`cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda` — 2/2 pass
`pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` — 0 error(s), 0 warning(s)
M-GPU-MOE-1.1.1 PR: full `forward_qwen3_moe_cuda` integration calling this helper
M-GPU-MOE-1.2 PR: cosine-vs-CPU parity gate against real GGUF
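A cosine-vs-CPU parity gate of the kind planned for M-GPU-MOE-1.2 could look roughly like the sketch below. The function names are hypothetical and the threshold is a parameter. The contract's stage list gives ≥0.99, but how the real gate collects CPU and GPU outputs is not shown in this PR.

```rust
/// Cosine similarity between two equal-length vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty inputs, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Hypothetical gate: passes only if the similarity is defined and at least `threshold`.
fn passes_parity_gate(cpu: &[f32], gpu: &[f32], threshold: f64) -> bool {
    cosine_similarity(cpu, gpu).map_or(false, |c| c >= threshold)
}
```

Treating an undefined similarity (length mismatch, all-zero output) as a failure keeps a broken GPU path from passing the gate silently.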

🤖 Generated with Claude Code

… GPU helper

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: first concrete GPU compute for the contract. Mirrors the M32c.2.2.* CPU staging where per-expert byte slicer + per-expert SwiGLU helper landed BEFORE the full moe_ffn_forward_layer integration.

What this PR ships
==================

crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs NEW

    pub(crate) fn expert_swiglu_cuda(
        executor: &mut crate::cuda::CudaExecutor,
        gate_bytes: &[u8], // Q4_K, [intermediate, hidden_dim]
        up_bytes: &[u8],   // Q4_K, [intermediate, hidden_dim]
        down_bytes: &[u8], // Q6_K, [hidden_dim, intermediate]
        hidden: &[f32],
        hidden_dim: usize,
        intermediate: usize,
    ) -> Result<Vec<f32>>

Body (mirrors CPU sibling moe_ffn_forward_layer per-expert loop):

    1. gate_out = executor.q4k_matvec(gate_bytes, hidden, ..., m=intermediate, k=hidden_dim)
    2. up_out = executor.q4k_matvec(up_bytes, hidden, ..., m=intermediate, k=hidden_dim)
    3. ffn_inner[i] = silu(gate_out[i]) * up_out[i] (CPU element-wise)
    4. expert_out = executor.q6k_gemv(down_bytes, ffn_inner, ..., n=hidden_dim, k=intermediate)

+ 2 unit tests (signature drift gate + InvalidShape rejection)

Why "naive per-expert dispatch" is the M-GPU-MOE-1.1.0 baseline
===============================================================

The fused dequant+matmul + sparse expert batching path is M-GPU-MOE-3. The contract (qwen3-moe-forward-gpu-v1 implementation_stages) stages correctness before performance:

    M-GPU-MOE-1.1.0 (this)  Per-expert via existing primitives  SHIPPED ✓
                            - silu via CPU elementwise (small)
                            - element-wise gate*up via CPU
                            - matmuls via existing q4k/q6k GPU kernels
    M-GPU-MOE-1.1.1         Full forward integration in OwnedQuantizedModelCuda::forward_qwen3_moe_cuda
                            (router + per-token loop + per-expert dispatch + weighted aggregation)  PENDING
    M-GPU-MOE-1.2           Cosine-vs-CPU parity gate ≥0.99  PENDING (FALSIFY-QW3-MOE-GPU-PARITY-001)
    M-GPU-MOE-2             wgpu fallback  PENDING
    M-GPU-MOE-3             Fused kernels + sparse batching  PENDING

Verification
============

    $ cargo check -p aprender-serve --features cuda
    ✓ Compiles
    $ cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda
    test ... ok. 2 passed; 0 failed
    $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
    0 error(s), 0 warning(s)

Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs PR #1464 (M-GPU-MOE-1.0-redo stub on OwnedQuantizedModelCuda)
Refs M32c.2.2.0 + M32c.2.2.1 (CPU per-expert sub-milestone precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation + risk row)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
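The InvalidShape rejection test suggests that the helper validates buffer sizes before dispatching any kernel. The sketch below shows one way such a check might be written, assuming the standard GGUF K-quant layout (256-element super-blocks, 144 bytes per Q4_K block, 210 bytes per Q6_K block). The function names and the `String` error type are hypothetical, and the checks actually performed inside `expert_swiglu_cuda` may differ.

```rust
/// Elements per K-quant super-block (GGUF K-quant layout).
const QK_K: usize = 256;
/// Bytes per Q4_K super-block.
const Q4_K_BLOCK_BYTES: usize = 144;
/// Bytes per Q6_K super-block.
const Q6_K_BLOCK_BYTES: usize = 210;

/// Expected byte length of a [rows, cols] K-quant tensor.
/// Returns None if `cols` is not a positive multiple of QK_K or the size overflows.
fn quant_bytes(rows: usize, cols: usize, block_bytes: usize) -> Option<usize> {
    if cols == 0 || cols % QK_K != 0 {
        return None;
    }
    rows.checked_mul(cols / QK_K)?.checked_mul(block_bytes)
}

/// Hypothetical pre-dispatch validation for one expert's buffers.
fn check_expert_shapes(
    gate_len: usize,
    up_len: usize,
    down_len: usize,
    hidden_len: usize,
    hidden_dim: usize,
    intermediate: usize,
) -> Result<(), String> {
    if hidden_len != hidden_dim {
        return Err(format!("hidden has {} values, expected {}", hidden_len, hidden_dim));
    }
    // gate and up are Q4_K with logical shape [intermediate, hidden_dim].
    let gate_up = quant_bytes(intermediate, hidden_dim, Q4_K_BLOCK_BYTES)
        .ok_or_else(|| format!("invalid gate/up shape [{}, {}]", intermediate, hidden_dim))?;
    // down is Q6_K with logical shape [hidden_dim, intermediate].
    let down = quant_bytes(hidden_dim, intermediate, Q6_K_BLOCK_BYTES)
        .ok_or_else(|| format!("invalid down shape [{}, {}]", hidden_dim, intermediate))?;
    if gate_len != gate_up || up_len != gate_up {
        return Err(format!("gate/up must be {} bytes each", gate_up));
    }
    if down_len != down {
        return Err(format!("down must be {} bytes", down));
    }
    Ok(())
}
```

Checking lengths up front means a mis-sliced expert buffer fails with a clear error on the host instead of causing an out-of-bounds read inside a kernel.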

…ngle-layer GPU MoE FFN

Mirrors CPU sibling moe_ffn_forward_layer (qwen3_moe_load.rs:363) step-for-step: F32 router on CPU, softmax + top-k + renormalize on CPU, per-expert SwiGLU dispatched through expert_swiglu_cuda (M-GPU-MOE-1.1.0), weighted aggregation on CPU.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: GPU MoE forward path on OwnedQuantizedModelCuda, reusing existing CudaExecutor primitives (q4k_matvec for gate/up, q6k_gemv for down) per expert.

Composes the M-GPU-MOE-1.1.0 helper into the layer-level structure that the next stage M-GPU-MOE-1.1.2 (forward_qwen3_moe_cuda full integration) will call once per token per layer.

Verification
============

    $ cargo check -p aprender-serve --features cuda
    ✓ Compiles

Refs PR #1465 (expert_swiglu_cuda M-GPU-MOE-1.1.0)
Refs M32c.2.2.2.0 (CPU sibling moe_ffn_forward_layer precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
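The CPU-side routing that this commit describes (softmax, top-k, renormalize, weighted aggregation) can be sketched as follows. This is an illustrative sketch only: `route_top_k` and `aggregate` are hypothetical names, the router logits are assumed to be already computed, and the per-expert call is abstracted as a closure where the real layer would invoke `expert_swiglu_cuda`.

```rust
/// Softmax over router logits, keep the k most probable experts, then
/// renormalize the kept weights so they sum to 1. Returns (expert index, weight).
fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    if logits.is_empty() || k == 0 {
        return Vec::new();
    }
    // Subtract the max logit for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|&e| e / sum).enumerate().collect();
    // Stable sort, descending by probability.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);
    let top_sum: f32 = probs.iter().map(|p| p.1).sum();
    for p in &mut probs {
        p.1 /= top_sum;
    }
    probs
}

/// Weighted sum of expert outputs; `expert_fn` stands in for the per-expert SwiGLU call.
fn aggregate(
    routes: &[(usize, f32)],
    hidden_dim: usize,
    mut expert_fn: impl FnMut(usize) -> Vec<f32>,
) -> Vec<f32> {
    let mut out = vec![0.0f32; hidden_dim];
    for &(expert, weight) in routes {
        let y = expert_fn(expert);
        for (o, v) in out.iter_mut().zip(&y) {
            *o += weight * v;
        }
    }
    out
}
```

Passing the expert call as a closure keeps the routing logic testable without a GPU, which is in the same spirit as the staging this PR describes.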


noahgift · 2026-05-04T15:36:14Z

Closing: the content shipped via #1469 (squash 77b9f0d), which contained both M-GPU-MOE-1.1.0 (expert_swiglu_cuda) and M-GPU-MOE-1.1.1 (moe_ffn_forward_layer_cuda) as a single squash commit. Both helper files are now on main.

noahgift enabled auto-merge (squash) May 4, 2026 10:55

noahgift mentioned this pull request May 4, 2026

feat(aprender-serve): moe_ffn_forward_layer_cuda — M-GPU-MOE-1.1.1 #1469

Merged

2 tasks

noahgift force-pushed the feat/expert-swiglu-cuda-m-stage-1-1-0 branch from c4d52bd to 1dffe93 May 4, 2026 14:08

Merge branch 'main' into feat/expert-swiglu-cuda-m-stage-1-1-0

c9c9af3

noahgift closed this May 4, 2026

noahgift wants to merge 2 commits into main from feat/expert-swiglu-cuda-m-stage-1-1-0
