contract(qwen3-moe-forward-gpu-v1): v1.0.0 → v1.1.0 — option D integration architecture (M-GPU-MOE-0.5) by noahgift · Pull Request #1462 · paiml/aprender

noahgift · 2026-05-04T09:02:28Z

Summary

- Amends the contract from v1.0.0 to v1.1.0, picking option D (OwnedQuantizedModelCuda) as the M-GPU-MOE-1.1 integration architecture
- Records the 4 options enumerated (A/B/C/D), with the rejection rationale for A, B, and C
- Mirrors the qwen3-moe-forward-v1 v1.2.0 / M32c.2.2.2.1 amendment precedent
- Retires the wrong-type stub from #1460 (feat(aprender-serve): forward_qwen3_moe_gpu M-GPU-MOE-1.0 stub)

Why this amendment

The v1.0.0 contract scaffold was authored from outside the code: it specified WHAT GPU MoE means but left WHERE in the type hierarchy unspecified. PR #1460 (the M-GPU-MOE-1.0 first-cut stub) placed the function on `OwnedQuantizedModel`, which is the wrong type. Code archaeology showed that `OwnedQuantizedModelCuda` already exists and already implements the established CPU-attention + CUDA-FFN pattern, so the right seam is to extend that type.

Test plan

- `pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` → 0 errors, 0 warnings
- Follow-up PR: relocate the stub from `OwnedQuantizedModel::forward_qwen3_moe_gpu` (#1460) to `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda` (M-GPU-MOE-1.0 redo)
- M-GPU-MOE-1.1 PR: per-expert CUDA dispatch via `self.executor`

🤖 Generated with Claude Code

…ation architecture

Records the architectural-seam decision that gates M-GPU-MOE-1.1 (per-expert CUDA dispatch). Mirrors the qwen3-moe-forward-v1 v1.2.0 amendment (M32c.2.2.2.1), which picked between three integration options for the CPU path before any kernel work could land.

The v1.0.0 contract scaffold (M-GPU-MOE-0) was authored from outside the code: it specified WHAT GPU MoE means but left WHERE in the type hierarchy unspecified. The first-cut M-GPU-MOE-1.0 stub (PR #1460) made an implicit choice (it placed the function on OwnedQuantizedModel) that this amendment now overrides as wrong.

Four integration options enumerated
====================================

- (A) Add GPU state directly to OwnedQuantizedModel. REJECTED: invasive; touches every CPU-MoE call site.
- (B) Thread &HybridScheduler / &mut GpuModel into the forward_qwen3_moe_gpu signature. REJECTED: breaks signature parity with the CPU sibling; forces every caller to plumb scheduler state through.
- (C) Spawn a transient GpuModel-like helper per call. REJECTED: resource thrash on every token; allocates GPU buffers in the hot path.
- (D) Mirror the existing OwnedQuantizedModelCuda pattern. CHOSEN: add forward_qwen3_moe_cuda as a method on the existing CUDA wrapper type.

Why (D) is chosen
=================

- OwnedQuantizedModelCuda already exists at crates/aprender-serve/src/gguf/cuda/mod.rs:106.
- It wraps OwnedQuantizedModel and holds CudaExecutor plus GPU buffers (embed_buf, prefix_cache).
- The existing forward_cuda method (cuda.rs:18) already does "CPU attention + CUDA FFN matmul". This is the established pattern the contract should EXTEND, rather than inventing a new substrate.
- Pros: zero new types; reuses the CudaExecutor cache, memory-info tracking, and prefix cache; signature parity preserved (just on a different self type); follows the same precedent that made forward_cuda's incremental landing work.

Implementation stages updated
=============================

- M-GPU-MOE-0: Contract scaffold (v1.0.0). SHIPPED ✓
- M-GPU-MOE-0.5: This decision amendment (v1.1.0). SHIPPED (THIS PR)
- M-GPU-MOE-1.0: Stub on OwnedQuantizedModelCuda (relocates the wrong-type stub from #1460). PENDING
- M-GPU-MOE-1.1: Per-expert CUDA dispatch via self.executor (gemm_q4k for gate/up, gemm_q6k for down). PENDING
- M-GPU-MOE-1.2: Cosine-vs-CPU parity gate ≥0.99 (FALSIFY-QW3-MOE-GPU-PARITY-001). PENDING
- M-GPU-MOE-2: wgpu fallback. PENDING
- M-GPU-MOE-3: Throughput ≥150 tok/s + VRAM ≤ 95%. PENDING

Verification
============

    $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
    0 error(s), 0 warning(s)
    Contract is valid.

Refs M32c.2.2.2.1 (CPU sibling integration-architecture amendment precedent in qwen3-moe-forward-v1 v1.2.0)
Refs PR #1460 (the v1.0.0-era M-GPU-MOE-1.0 stub on the wrong type; retired by this amendment)
Refs CLAUDE.md "NEVER write code before writing a provable contract"
Refs claude-code-parity-apr POC M49 (P0 elevation)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
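The shape of option D can be sketched in a few lines. This is a minimal illustration, not the aprender-serve code: the struct fields, the `launch` counter, and the simplified `forward_qwen3_moe_cuda` argument list are all hypothetical stand-ins, and `&mut self` is assumed here only because the stand-in executor mutates a counter. The point it shows is that the GPU method lives on the wrapper, which already owns both the CPU model and the executor, so callers pass no extra scheduler state.

```rust
// Hypothetical stand-ins: the real types carry quantized weights,
// GPU buffers, and a prefix cache.
pub struct OwnedQuantizedModel {
    pub num_layers: usize,
}

/// Stand-in executor that counts kernel launches instead of running kernels.
#[derive(Default)]
pub struct CudaExecutor {
    pub launches: usize,
}

impl CudaExecutor {
    pub fn launch(&mut self) {
        self.launches += 1;
    }
}

/// Option D: the wrapper owns the CPU model and the CUDA executor.
pub struct OwnedQuantizedModelCuda {
    pub model: OwnedQuantizedModel,
    pub executor: CudaExecutor,
}

impl OwnedQuantizedModelCuda {
    /// Simplified signature (assumption): only the `self` type differs
    /// from what a CPU sibling would take.
    pub fn forward_qwen3_moe_cuda(&mut self, token_ids: &[u32]) -> Result<Vec<f32>, String> {
        if token_ids.is_empty() {
            return Err("token_ids must be non-empty".to_string());
        }
        for _layer in 0..self.model.num_layers {
            // Attention would run on the CPU through `self.model`;
            // only the FFN matmuls would go through `self.executor`.
            self.executor.launch();
        }
        Ok(vec![0.0; 4]) // placeholder logits
    }
}
```

Options A, B, and C would each change either `OwnedQuantizedModel` itself, the method signature, or the per-call allocation behavior; in this sketch none of those change.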

…option-d

…correct type (#1464)

Per the qwen3-moe-forward-gpu-v1 v1.1.0 option D amendment (PR #1462 squash 4495407), the GPU MoE forward path lives on OwnedQuantizedModelCuda, NOT OwnedQuantizedModel. The first-cut M-GPU-MOE-1.0 stub from PR #1460 (4d9e5ae) was placed on the wrong type; this PR ships the redo on the correct type.

What this PR ships
==================

crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda.rs (NEW)

    impl OwnedQuantizedModelCuda {
        pub fn forward_qwen3_moe_cuda(
            &self,
            token_ids: &[u32],
            moe_layers: &[Qwen3MoeQuantizedLayer],
            num_experts: usize,
            num_experts_per_tok: usize,
            moe_intermediate: usize,
            _data: &[u8],
        ) -> Result<Vec<f32>>
    }

Behavior at M-GPU-MOE-1.0-redo:

1. Validate preconditions (token_ids non-empty, moe_layers length matches self.model.layers.len(), num_experts/num_experts_per_tok/moe_intermediate > 0, num_experts_per_tok ≤ num_experts).
2. Return RealizarError::UnsupportedOperation pointing at qwen3-moe-forward-gpu-v1 v1.1.0 and listing the pending stages M-GPU-MOE-1.1+.

Also included:

- 1 unit test (signature drift gate)
- uses.rs gets `include!("forward_qwen3_moe_cuda.rs");`

Why on OwnedQuantizedModelCuda (not OwnedQuantizedModel)
=========================================================

Per the v1.1.0 amendment's option D decision, this method must extend the existing OwnedQuantizedModelCuda CPU-attention + CUDA-FFN pattern (forward_cuda in cuda.rs at line 18), not invent a new substrate. OwnedQuantizedModelCuda already wraps OwnedQuantizedModel and holds CudaExecutor plus GPU buffers (embed_buf, prefix_cache).

Naming follows existing precedent: `forward_cuda` is the existing method on this type, so `forward_qwen3_moe_cuda` slots in cleanly.

Wrong-type stub (#1460) status
==============================

The OwnedQuantizedModel::forward_qwen3_moe_gpu function from #1460 remains on main. It returns the same UnsupportedOperation, but on the wrong type. A separate cleanup PR can either delete it or update its doc-comment to point at this new variant. Not blocking.

Implementation stages updated
=============================

- M-GPU-MOE-0: Contract scaffold v1.0.0. SHIPPED ✓
- M-GPU-MOE-0.5: v1.1.0 option D amendment. SHIPPED ✓
- M-GPU-MOE-1.0-redo: Stub on OwnedQuantizedModelCuda. SHIPPED ✓ (THIS PR)
- M-GPU-MOE-1.1: Per-expert CUDA dispatch via self.executor (gemm_q4k for gate/up, gemm_q6k for down). PENDING
- M-GPU-MOE-1.2: Cosine-vs-CPU parity gate ≥0.99. PENDING
- M-GPU-MOE-2: wgpu fallback. PENDING
- M-GPU-MOE-3: Throughput ≥150 tok/s + VRAM ≤ 95%. PENDING

Verification
============

    $ cargo check -p aprender-serve --features cuda
    ✓ Compiles
    $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
    test ... ok. 1 passed; 0 failed
    $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
    0 error(s), 0 warning(s)
    Contract is valid.

Refs PR #1460 squash 4d9e5ae (wrong-type stub on OwnedQuantizedModel)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 (P0 elevation)
Refs claude-code-parity-apr POC M50 (M-GPU-MOE-0 SHIPPED)
Refs M32b precedent (CPU sibling staging: stub → forward impl)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
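The validate-then-refuse behavior described for the stub can be sketched as a free function. This is an illustration under assumptions, not the shipped code: `StubError` is a local stand-in for `RealizarError`, the `InvalidInput` variant is hypothetical (the commit message does not say which variant precondition failures use), and the layer slices are reduced to their lengths.

```rust
/// Local stand-in for the crate's error type (assumption).
#[derive(Debug, PartialEq)]
pub enum StubError {
    /// Hypothetical variant for failed preconditions.
    InvalidInput(String),
    /// Mirrors the UnsupportedOperation result named in the commit message.
    UnsupportedOperation(String),
}

pub fn forward_stub(
    token_ids: &[u32],
    moe_layers_len: usize,
    model_layers_len: usize,
    num_experts: usize,
    num_experts_per_tok: usize,
    moe_intermediate: usize,
) -> Result<Vec<f32>, StubError> {
    // Step 1: validate preconditions in the order the commit message lists them.
    if token_ids.is_empty() {
        return Err(StubError::InvalidInput("token_ids must be non-empty".into()));
    }
    if moe_layers_len != model_layers_len {
        return Err(StubError::InvalidInput(format!(
            "moe_layers has {moe_layers_len} entries but the model has {model_layers_len} layers"
        )));
    }
    if num_experts == 0 || num_experts_per_tok == 0 || moe_intermediate == 0 {
        return Err(StubError::InvalidInput(
            "num_experts, num_experts_per_tok, and moe_intermediate must be > 0".into(),
        ));
    }
    if num_experts_per_tok > num_experts {
        return Err(StubError::InvalidInput(
            "num_experts_per_tok must be <= num_experts".into(),
        ));
    }
    // Step 2: valid inputs still fail, because the compute path is pending.
    Err(StubError::UnsupportedOperation(
        "forward_qwen3_moe_cuda: pending M-GPU-MOE-1.1+ (qwen3-moe-forward-gpu-v1 v1.1.0)".into(),
    ))
}
```

Validating before refusing means callers with malformed inputs learn about that first, and the precondition checks can stay in place unchanged once the real body lands.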

… GPU helper

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: first concrete GPU compute for the contract. Mirrors the M32c.2.2.* CPU staging, where the per-expert byte slicer and per-expert SwiGLU helper landed BEFORE the full moe_ffn_forward_layer integration.

What this PR ships
==================

crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs (NEW)

    pub(crate) fn expert_swiglu_cuda(
        executor: &mut crate::cuda::CudaExecutor,
        gate_bytes: &[u8],   // Q4_K, [intermediate, hidden_dim]
        up_bytes: &[u8],     // Q4_K, [intermediate, hidden_dim]
        down_bytes: &[u8],   // Q6_K, [hidden_dim, intermediate]
        hidden: &[f32],
        hidden_dim: usize,
        intermediate: usize,
    ) -> Result<Vec<f32>>

Body (mirrors the per-expert loop of the CPU sibling moe_ffn_forward_layer):

1. gate_out = executor.q4k_matvec(gate_bytes, hidden, ..., m=intermediate, k=hidden_dim)
2. up_out = executor.q4k_matvec(up_bytes, hidden, ..., m=intermediate, k=hidden_dim)
3. ffn_inner[i] = silu(gate_out[i]) * up_out[i] (CPU element-wise)
4. expert_out = executor.q6k_gemv(down_bytes, ffn_inner, ..., n=hidden_dim, k=intermediate)

Also included: 2 unit tests (signature drift gate + InvalidShape rejection)

Why "naive per-expert dispatch" is the M-GPU-MOE-1.1.0 baseline
===============================================================

The fused dequant+matmul + sparse expert batching path is M-GPU-MOE-3. The contract (qwen3-moe-forward-gpu-v1 implementation_stages) stages correctness before performance:

- M-GPU-MOE-1.1.0 (this): Per-expert via existing primitives. SHIPPED ✓
  - silu via CPU elementwise (small)
  - element-wise gate*up via CPU
  - matmuls via existing q4k/q6k GPU kernels
- M-GPU-MOE-1.1.1: Full forward integration in OwnedQuantizedModelCuda::forward_qwen3_moe_cuda (router + per-token loop + per-expert dispatch + weighted aggregation). PENDING
- M-GPU-MOE-1.2: Cosine-vs-CPU parity gate ≥0.99 (FALSIFY-QW3-MOE-GPU-PARITY-001). PENDING
- M-GPU-MOE-2: wgpu fallback. PENDING
- M-GPU-MOE-3: Fused kernels + sparse batching. PENDING

Verification
============

    $ cargo check -p aprender-serve --features cuda
    ✓ Compiles
    $ cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda
    test ... ok. 2 passed; 0 failed
    $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
    0 error(s), 0 warning(s)

Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs PR #1464 (M-GPU-MOE-1.0-redo stub on OwnedQuantizedModelCuda)
Refs M32c.2.2.0 + M32c.2.2.1 (CPU per-expert sub-milestone precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation + risk row)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
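The four-step expert body above can be written as an unquantized CPU reference, which is also the kind of function a parity check would compare against. This is a sketch under assumptions: it works on plain row-major `f32` weights rather than Q4_K/Q6_K bytes, `matvec` is a hypothetical stand-in for the `q4k_matvec` / `q6k_gemv` executor calls, and errors are plain strings rather than the crate's `InvalidShape`.

```rust
/// Row-major matrix-vector product: `w` is [rows, cols], `x` has `cols` entries.
/// Stand-in (assumption) for the quantized GPU matvec kernels.
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Unquantized reference for one expert: down(silu(gate(h)) * up(h)).
pub fn expert_swiglu_ref(
    gate: &[f32], // [intermediate, hidden_dim]
    up: &[f32],   // [intermediate, hidden_dim]
    down: &[f32], // [hidden_dim, intermediate]
    hidden: &[f32],
    hidden_dim: usize,
    intermediate: usize,
) -> Result<Vec<f32>, String> {
    let n = hidden_dim * intermediate;
    if hidden.len() != hidden_dim || gate.len() != n || up.len() != n || down.len() != n {
        return Err("shape mismatch".to_string());
    }
    let gate_out = matvec(gate, hidden, intermediate, hidden_dim); // step 1
    let up_out = matvec(up, hidden, intermediate, hidden_dim); // step 2
    let ffn_inner: Vec<f32> = gate_out
        .iter()
        .zip(&up_out)
        .map(|(g, u)| silu(*g) * u) // step 3, element-wise
        .collect();
    Ok(matvec(down, &ffn_inner, hidden_dim, intermediate)) // step 4
}
```

With identity matrices for all three weights and `hidden = [0.0, 1.0]`, the output is `[0.0, silu(1.0)]`, which makes a convenient smoke test.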

…1469)

* feat(aprender-serve): expert_swiglu_cuda — M-GPU-MOE-1.1.0 per-expert GPU helper

* feat(aprender-serve): moe_ffn_forward_layer_cuda — M-GPU-MOE-1.1.1 single-layer GPU MoE FFN

Mirrors the CPU sibling moe_ffn_forward_layer (qwen3_moe_load.rs:363) step-for-step: F32 router on CPU, softmax + top-k + renormalize on CPU, per-expert SwiGLU dispatched through expert_swiglu_cuda (M-GPU-MOE-1.1.0), weighted aggregation on CPU.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: GPU MoE forward path on OwnedQuantizedModelCuda, reusing existing CudaExecutor primitives (q4k_matvec for gate/up, q6k_gemv for down) per expert.

Composes the M-GPU-MOE-1.1.0 helper into the layer-level structure that the next stage, M-GPU-MOE-1.1.2 (forward_qwen3_moe_cuda full integration), will call once per token per layer.

Verification
============

    $ cargo check -p aprender-serve --features cuda
    ✓ Compiles

Refs PR #1465 (expert_swiglu_cuda M-GPU-MOE-1.1.0)
Refs M32c.2.2.2.0 (CPU sibling moe_ffn_forward_layer precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
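The CPU-side routing steps named above (softmax, top-k, renormalize, weighted aggregation) can be sketched as follows. This is an illustration, not the repository's moe_ffn_forward_layer: the function names are hypothetical, the router logits are taken as already computed, and the per-expert SwiGLU call is abstracted as a closure so the sketch stays independent of any GPU code.

```rust
/// Softmax over router logits, keep the k largest, renormalize to sum to 1.
/// Returns (expert_index, weight) pairs in descending weight order.
pub fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Subtract the max before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|e| e / sum).enumerate().collect();
    // Stable sort: ties keep the lower expert index first.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);
    let top_sum: f32 = probs.iter().map(|(_, p)| p).sum();
    probs.iter().map(|&(i, p)| (i, p / top_sum)).collect()
}

/// Weighted sum of the selected experts' outputs.
/// `expert(i)` stands in (assumption) for the per-expert SwiGLU dispatch.
pub fn moe_ffn_layer_ref<F>(router_logits: &[f32], k: usize, hidden_dim: usize, mut expert: F) -> Vec<f32>
where
    F: FnMut(usize) -> Vec<f32>,
{
    let mut out = vec![0.0f32; hidden_dim];
    for (idx, weight) in route_top_k(router_logits, k) {
        let y = expert(idx);
        for (o, v) in out.iter_mut().zip(&y) {
            *o += weight * v;
        }
    }
    out
}
```

Only the closure body would differ between a CPU and a GPU layer in this arrangement, which is one way to read the "mirrors step-for-step" claim: routing and aggregation stay identical and the expert call is the single substitution point.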

…-MOE-1.1.2

Replaces the M-GPU-MOE-1.0-redo stub body with the full forward integration. forward_qwen3_moe_cuda now mirrors the CPU sibling OwnedQuantizedModel::forward_qwen3_moe (forward_qwen3_moe.rs) line-for-line, with one difference: the per-layer FFN section routes through moe_ffn_forward_layer_cuda, which dispatches per-expert matmuls to self.executor (CudaExecutor) via the expert_swiglu_cuda helper.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: extends the existing OwnedQuantizedModelCuda CPU-attention + CUDA-FFN pattern (forward_cuda in cuda.rs:18). The attention path stays on CPU; only FFN matmuls go to GPU. M-GPU-MOE-3 (still pending) is the stage that fuses dispatch into a single sparse-expert kernel, targeting ~5× throughput.

Signature changes
=================

- &self → &mut self (executor needs mutable for kernel cache)
- _data → data (passed to moe_ffn_forward_layer_cuda for expert_byte_slice)

Forward body structure (mirrors the CPU sibling step-for-step):

1. Embed (CPU): self.model.embed
2. Per-layer:
   - 2a. Attention norm (CPU): ops::rms_norm
   - 2b. QKV projection (CPU): self.model.qkv_matmul
   - 2c. Per-head Q/K RMSNorm + RoPE (M32d Step 5/5b): ops::apply_per_head_rms_norm
   - 2d. Causal attention + output proj (CPU): self.model.causal_attention
   - 2e. Residual: element-wise CPU
   - 2f. Pre-FFN norm (CPU): ops::rms_norm
   - 2g. **MoE FFN on GPU**: moe_ffn_forward_layer_cuda → expert_swiglu_cuda → self.executor.q4k_matvec / .q6k_gemv
   - 2h. Residual: element-wise CPU
3. Final norm (CPU)
4. LM head, last token (CPU)

Implementation stages updated
=============================

- M-GPU-MOE-0: Contract scaffold v1.0.0. SHIPPED ✓
- M-GPU-MOE-0.5: v1.1.0 option D amendment. SHIPPED ✓
- M-GPU-MOE-1.0-redo: Stub on OwnedQuantizedModelCuda. SHIPPED ✓ (#1464)
- M-GPU-MOE-1.1.0: expert_swiglu_cuda helper. SHIPPED ✓ (via #1469 squash)
- M-GPU-MOE-1.1.1: moe_ffn_forward_layer_cuda. SHIPPED ✓ (#1469)
- M-GPU-MOE-1.1.2: forward_qwen3_moe_cuda full integration. SHIPPED ✓ (THIS PR)
- M-GPU-MOE-1.2: Cosine-vs-CPU parity gate ≥0.99 (FALSIFY-QW3-MOE-GPU-PARITY-001). PENDING
- M-GPU-MOE-2: wgpu fallback. PENDING
- M-GPU-MOE-3: Throughput ≥150 tok/s + VRAM ≤ 95%. PENDING

Verification
============

    $ cargo check -p aprender-serve --features cuda
    ✓ Compiles
    $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
    test ... ok. 1 passed

Refs PR #1469 squash 77b9f0d (helpers landed)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
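The pending M-GPU-MOE-1.2 stage gates on cosine similarity ≥0.99 between CPU and GPU outputs. A minimal sketch of such a gate follows; the function names are hypothetical, and which vectors the real gate compares (logits or hidden states, per token or per layer) is not stated in this thread, so the sketch simply takes two `f32` slices.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Accumulate in f64 to limit rounding error on long vectors.
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        let (x, y) = (*x as f64, *y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

/// Parity gate: passes only when the similarity is defined and >= 0.99.
pub fn passes_parity_gate(cpu: &[f32], gpu: &[f32]) -> bool {
    matches!(cosine_similarity(cpu, gpu), Some(c) if c >= 0.99)
}
```

Treating undefined similarity (length mismatch or an all-zero vector) as a failure keeps the gate falsifiable: a GPU path that returns zeros cannot pass by accident.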

…-MOE-1.1.2 (#1477)

noahgift enabled auto-merge (squash) May 4, 2026 09:02

Merge branch 'main' into contract/qwen3-moe-forward-gpu-v1-amendment-…

c878fec

…option-d

noahgift merged commit 4495407 into main May 4, 2026
10 checks passed

noahgift deleted the contract/qwen3-moe-forward-gpu-v1-amendment-option-d branch May 4, 2026 09:38

noahgift mentioned this pull request May 4, 2026

feat(aprender-serve): forward_qwen3_moe_cuda — M-GPU-MOE-1.0-redo on correct type #1464

Merged

4 tasks

noahgift mentioned this pull request May 4, 2026

docs(M51): M-GPU-MOE-1.0 → 1.1.1 cascade SHIPPED + 1.1.2 OPEN paiml/claude-code-parity-apr#39

Merged

4 tasks

noahgift mentioned this pull request May 6, 2026

contract(qwen3-moe-forward-gpu-v1): v1.6.0 → v1.7.0 — DRAFT → ACTIVE_ALGORITHM_LEVEL #1530

Merged

3 tasks

noahgift merged 2 commits into main from contract/qwen3-moe-forward-gpu-v1-amendment-option-d
