feat(aprender-serve): expert_swiglu_cuda — M-GPU-MOE-1.1.0 per-expert GPU helper #1465
Closed
noahgift wants to merge 2 commits into
Per qwen3-moe-forward-gpu-v1 v1.1.0 option D — first concrete GPU
compute for the contract. Mirrors the M32c.2.2.* CPU staging where
per-expert byte slicer + per-expert SwiGLU helper landed BEFORE the
full moe_ffn_forward_layer integration.
What this PR ships
==================
crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs NEW

  pub(crate) fn expert_swiglu_cuda(
      executor: &mut crate::cuda::CudaExecutor,
      gate_bytes: &[u8],   // Q4_K, [intermediate, hidden_dim]
      up_bytes: &[u8],     // Q4_K, [intermediate, hidden_dim]
      down_bytes: &[u8],   // Q6_K, [hidden_dim, intermediate]
      hidden: &[f32],
      hidden_dim: usize,
      intermediate: usize,
  ) -> Result<Vec<f32>>

  Body (mirrors CPU sibling moe_ffn_forward_layer per-expert loop):
    1. gate_out     = executor.q4k_matvec(gate_bytes, hidden, ..., m=intermediate, k=hidden_dim)
    2. up_out       = executor.q4k_matvec(up_bytes, hidden, ..., m=intermediate, k=hidden_dim)
    3. ffn_inner[i] = silu(gate_out[i]) * up_out[i]   (CPU element-wise)
    4. expert_out   = executor.q6k_gemv(down_bytes, ffn_inner, ..., n=hidden_dim, k=intermediate)

  + 2 unit tests (signature drift gate + InvalidShape rejection)
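The InvalidShape rejection mentioned above can be illustrated with a byte-length check. This is a minimal sketch, not the code from the PR: `ShapeError`, `quantized_len`, and `check_expert_shapes` are hypothetical names, and it assumes the standard ggml K-quant layout (256-element super-blocks, 144 bytes per Q4_K block, 210 bytes per Q6_K block) with each weight row quantized along its `k` dimension.

```rust
/// Elements per K-quant super-block in the ggml/GGUF layout.
const QK_K: usize = 256;
/// Bytes per Q4_K super-block: 2 (d) + 2 (dmin) + 12 (scales) + 128 (quants).
const Q4_K_BLOCK_BYTES: usize = 144;
/// Bytes per Q6_K super-block: 128 (ql) + 64 (qh) + 16 (scales) + 2 (d).
const Q6_K_BLOCK_BYTES: usize = 210;

/// Hypothetical stand-in for the crate's own InvalidShape error.
#[derive(Debug, PartialEq)]
pub enum ShapeError {
    InvalidShape(String),
}

/// Expected byte length of a [rows, cols] K-quant matrix.
fn quantized_len(rows: usize, cols: usize, block_bytes: usize) -> Result<usize, ShapeError> {
    if rows == 0 || cols == 0 || cols % QK_K != 0 {
        return Err(ShapeError::InvalidShape(format!(
            "rows={}, cols={}: cols must be a non-zero multiple of {}",
            rows, cols, QK_K
        )));
    }
    Ok(rows * (cols / QK_K) * block_bytes)
}

/// Validate all buffer lengths before any kernel would be launched.
pub fn check_expert_shapes(
    gate_len: usize,
    up_len: usize,
    down_len: usize,
    hidden_len: usize,
    hidden_dim: usize,
    intermediate: usize,
) -> Result<(), ShapeError> {
    // gate/up are [intermediate, hidden_dim] in Q4_K; down is [hidden_dim, intermediate] in Q6_K.
    let want_gate_up = quantized_len(intermediate, hidden_dim, Q4_K_BLOCK_BYTES)?;
    let want_down = quantized_len(hidden_dim, intermediate, Q6_K_BLOCK_BYTES)?;
    let checks = [
        ("gate_bytes", gate_len, want_gate_up),
        ("up_bytes", up_len, want_gate_up),
        ("down_bytes", down_len, want_down),
        ("hidden", hidden_len, hidden_dim),
    ];
    for (name, got, want) in checks {
        if got != want {
            return Err(ShapeError::InvalidShape(format!(
                "{}: got {} elements/bytes, expected {}",
                name, got, want
            )));
        }
    }
    Ok(())
}
```

Checking lengths up front keeps a malformed expert slice from reaching the GPU kernels, which is the kind of failure a unit test can exercise without a CUDA device.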
Why "naive per-expert dispatch" is the M-GPU-MOE-1.1.0 baseline
===============================================================
The fused dequant+matmul + sparse expert batching path is M-GPU-MOE-3.
The contract (qwen3-moe-forward-gpu-v1 implementation_stages) stages
correctness before performance:
  M-GPU-MOE-1.1.0 (this)  Per-expert via existing primitives           SHIPPED ✓
                          - silu via CPU elementwise (small)
                          - element-wise gate*up via CPU
                          - matmuls via existing q4k/q6k GPU kernels
  M-GPU-MOE-1.1.1         Full forward integration in                  PENDING
                          OwnedQuantizedModelCuda::forward_qwen3_moe_cuda
                          (router + per-token loop + per-expert
                          dispatch + weighted aggregation)
  M-GPU-MOE-1.2           Cosine-vs-CPU parity gate ≥0.99              PENDING
                          (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2             wgpu fallback                                PENDING
  M-GPU-MOE-3             Fused kernels + sparse batching              PENDING
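The cosine-vs-CPU parity gate planned for M-GPU-MOE-1.2 could look roughly like the sketch below. The function names are hypothetical and the real gate (FALSIFY-QW3-MOE-GPU-PARITY-001) is not part of this PR; the only detail taken from the stage table is the 0.99 threshold.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None for mismatched or empty inputs, or when either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Accumulate in f64 so long vectors do not lose precision.
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

/// True when the GPU output agrees with the CPU reference at or above `threshold`.
pub fn passes_parity_gate(cpu: &[f32], gpu: &[f32], threshold: f32) -> bool {
    matches!(cosine_similarity(cpu, gpu), Some(c) if c >= threshold)
}
```

A call such as `passes_parity_gate(&cpu_out, &gpu_out, 0.99)` treats undefined similarity (zero vectors, length mismatch) as a failure rather than a pass, which is the conservative choice for a falsification gate.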
Verification
============
$ cargo check -p aprender-serve --features cuda
✓ Compiles
$ cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda
test ... ok. 2 passed; 0 failed
$ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
0 error(s), 0 warning(s)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs PR #1464 (M-GPU-MOE-1.0-redo stub on OwnedQuantizedModelCuda)
Refs M32c.2.2.0 + M32c.2.2.1 (CPU per-expert sub-milestone precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation + risk row)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c4d52bd to 1dffe93
noahgift added a commit that referenced this pull request May 4, 2026
…ngle-layer GPU MoE FFN

Mirrors CPU sibling moe_ffn_forward_layer (qwen3_moe_load.rs:363)
step-for-step: F32 router on CPU, softmax + top-k + renormalize on CPU,
per-expert SwiGLU dispatched through expert_swiglu_cuda (M-GPU-MOE-1.1.0),
weighted aggregation on CPU.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: GPU MoE forward path on
OwnedQuantizedModelCuda, reusing existing CudaExecutor primitives
(q4k_matvec for gate/up, q6k_gemv for down) per expert.

Composes the M-GPU-MOE-1.1.0 helper into the layer-level structure that
the next stage M-GPU-MOE-1.1.2 (forward_qwen3_moe_cuda full integration)
will call once per token per layer.

Verification
============
$ cargo check -p aprender-serve --features cuda
✓ Compiles

Refs PR #1465 (expert_swiglu_cuda M-GPU-MOE-1.1.0)
Refs M32c.2.2.2.0 (CPU sibling moe_ffn_forward_layer precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
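The CPU-side routing described in this commit (softmax, top-k, renormalize, weighted aggregation) can be sketched as follows. This is an illustration under assumptions, not the body of moe_ffn_forward_layer_cuda: `route_top_k` and `aggregate` are hypothetical names, the router logits are taken as already computed, and the per-expert call is abstracted as a closure standing in for expert_swiglu_cuda.

```rust
/// Softmax over all router logits, keep the k largest, renormalize them to sum to 1.
/// Returns (expert_index, weight) pairs in descending weight order.
pub fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Subtract the max logit for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|&e| e / sum).enumerate().collect();
    // Stable sort: ties keep the lower expert index first.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);
    let top_sum: f32 = probs.iter().map(|p| p.1).sum();
    for p in &mut probs {
        p.1 /= top_sum;
    }
    probs
}

/// Weighted sum of the selected experts' outputs.
/// `expert_fn` is a placeholder for the per-expert SwiGLU dispatch.
pub fn aggregate<F>(routes: &[(usize, f32)], hidden_dim: usize, mut expert_fn: F) -> Vec<f32>
where
    F: FnMut(usize) -> Vec<f32>,
{
    let mut out = vec![0.0f32; hidden_dim];
    for &(expert, weight) in routes {
        let y = expert_fn(expert);
        for (o, v) in out.iter_mut().zip(&y) {
            *o += weight * v;
        }
    }
    out
}
```

Keeping routing and aggregation on the CPU means only the three matrix-vector products per selected expert touch the GPU, which matches the naive per-expert baseline this stage targets.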
Summary
Why this stage
Before integrating per-expert GPU dispatch into the full forward loop (M-GPU-MOE-1.1.1), this helper makes the kernel call site testable in isolation. This follows the same staging precedent as the M32c.2.2.* CPU work.
Implementation
For one expert, one token:
```
gate_out = q4k_matvec(gate_bytes, hidden, m=intermediate, k=hidden_dim)
up_out = q4k_matvec(up_bytes, hidden, m=intermediate, k=hidden_dim)
ffn_inner[i] = silu(gate_out[i]) * up_out[i] (CPU)
expert_out = q6k_gemv(down_bytes, ffn_inner, n=hidden_dim, k=intermediate)
```
This is naive per-expert dispatch, the M-GPU-MOE-1.1.0 baseline. The fused dequant+matmul and sparse expert batching path is deferred to M-GPU-MOE-3, because the contract stages correctness before performance.
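For reference, the same four steps can be written against dense f32 weights on the CPU. This is a minimal sketch for reasoning about shapes and expected values, not the PR's code: `expert_swiglu_ref` and `matvec` are hypothetical names, and the weights are assumed to be already dequantized into row-major f32 rather than Q4_K/Q6_K bytes.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Row-major matrix-vector product: w is [rows, cols], x is [cols].
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert!(cols > 0, "cols must be non-zero");
    assert_eq!(w.len(), rows * cols, "weight length mismatch");
    assert_eq!(x.len(), cols, "input length mismatch");
    w.chunks_exact(cols)
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Dense f32 reference for one expert and one token.
/// gate/up are [intermediate, hidden_dim]; down is [hidden_dim, intermediate].
pub fn expert_swiglu_ref(
    gate: &[f32],
    up: &[f32],
    down: &[f32],
    hidden: &[f32],
    hidden_dim: usize,
    intermediate: usize,
) -> Vec<f32> {
    let gate_out = matvec(gate, hidden, intermediate, hidden_dim);
    let up_out = matvec(up, hidden, intermediate, hidden_dim);
    // Element-wise SwiGLU: silu(gate) * up.
    let ffn_inner: Vec<f32> = gate_out
        .iter()
        .zip(&up_out)
        .map(|(&g, &u)| silu(g) * u)
        .collect();
    matvec(down, &ffn_inner, hidden_dim, intermediate)
}
```

A dense reference like this gives a device-independent expected value for small hand-built inputs, so a GPU result can be compared against it once quantization error is accounted for.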
Test plan
🤖 Generated with Claude Code