contract(qwen3-moe-forward-gpu-v1): v1.0.0 DRAFT scaffold — P0 GPU MoE forward path by noahgift · Pull Request #1453 · paiml/aprender

noahgift · 2026-05-04T02:03:17Z

Summary

New kernel contract contracts/qwen3-moe-forward-gpu-v1.yaml v1.0.0 DRAFT
GPU sibling of qwen3-moe-forward-v1 (CPU LAZY-FUSED-MATVEC)
7 proof obligations + 7 falsification tests + 2 kani harnesses + qa_gate
Validates clean: `pv validate` 0/0
M-stage M-GPU-MOE-0 SHIPPED (this contract is the deliverable)

Why P0

Per claude-code-parity-apr POC M49 priority elevation 2026-05-04:

CPU LAZY-FUSED-MATVEC: ~30 tok/s on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s
MoE is ~10× slower than dense — production-infeasible at ~30 tok/s
The companion's CCPA-001..013 parity machinery is correct but cannot be exercised at production cadence
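
The "~10×" figure can be checked against the quoted numbers. The sketch below is illustrative arithmetic only; the function names are hypothetical and not part of the contract. Note that the two measurements come from different models (a 30B-A3B MoE on CPU versus a 7B dense model on GPU), so the ratio indicates the size of the gap rather than serving as a like-for-like benchmark.

```rust
/// Ratio of a dense-GPU throughput to the CPU MoE throughput (both in tok/s).
/// Hypothetical helper for illustration only.
fn speedup(dense_tok_s: f64, moe_tok_s: f64) -> f64 {
    dense_tok_s / moe_tok_s
}

/// Lower and upper bound of the dense-vs-MoE gap for a range of dense throughputs.
fn gap_range(dense_lo: f64, dense_hi: f64, moe_tok_s: f64) -> (f64, f64) {
    (speedup(dense_lo, moe_tok_s), speedup(dense_hi, moe_tok_s))
}
```

With the figures above, `gap_range(225.0, 440.0, 30.0)` gives roughly 7.5× to 14.7×, which brackets the ~10× claim.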

Test plan

`pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` → 0/0
M-GPU-MOE-1: CUDA kernel `forward_qwen3_moe_gpu` + FALSIFY-QW3-MOE-GPU-PARITY-001 (cosine ≥ 0.99 vs CPU) — separate PR
M-GPU-MOE-2: wgpu fallback — separate PR
M-GPU-MOE-3: throughput ≥150 tok/s + VRAM ≤ 95% — separate PR
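
The gates named in the test plan could be sketched as below. This is a minimal sketch, not the contract's implementation: every function name is hypothetical, and computing throughput as the reciprocal of the median per-token latency is an assumption about how the measurement is taken. Only the thresholds (cosine ≥ 0.99, ≥150 tok/s, VRAM ≤ 95%) come from this PR.

```rust
/// Cosine similarity between two equal-length vectors, accumulated in f64.
/// Returns None when lengths differ, the input is empty, or a norm is zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Hypothetical M-GPU-MOE-1 style gate: GPU output vs CPU reference, cosine >= 0.99.
/// A NaN similarity fails the comparison and therefore fails the gate.
fn parity_gate(cpu: &[f32], gpu: &[f32]) -> bool {
    matches!(cosine_similarity(cpu, gpu), Some(c) if c >= 0.99)
}

/// Throughput in tok/s from the median per-token latency (seconds).
/// Assumption: one latency sample per decoded token in the window.
fn median_tok_per_s(latencies_s: &[f64]) -> Option<f64> {
    if latencies_s.is_empty() {
        return None;
    }
    let mut v = latencies_s.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    let median = if v.len() % 2 == 0 { (v[mid - 1] + v[mid]) / 2.0 } else { v[mid] };
    if median > 0.0 { Some(1.0 / median) } else { None }
}

/// Hypothetical M-GPU-MOE-3 style gate: >= 150 tok/s and VRAM use <= 95% of total.
fn stage3_gate(tok_per_s: f64, vram_used_bytes: u64, vram_total_bytes: u64) -> bool {
    let vram_ok = (vram_used_bytes as f64) <= 0.95 * (vram_total_bytes as f64);
    tok_per_s >= 150.0 && vram_ok
}
```

For example, a window of 128 tokens at 5 ms each yields about 200 tok/s, which passes the throughput half of the gate. On a 24 GiB card the 95% budget is 22.8 GiB, so 22 GiB in use passes and 23 GiB fails.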

arXiv basis

arXiv:2307.08691 Dao, FlashAttention-2
arXiv:2201.05596 Rajbhandari et al., DeepSpeed-MoE
arXiv:2101.03961 Fedus et al., Switch Transformers

🤖 Generated with Claude Code

…U MoE forward path

Per claude-code-parity-apr POC M49 priority elevation 2026-05-04, the GPU MoE forward path is now P0 / HIGHEST PRIORITY. Per CLAUDE.md "NEVER write code before writing a provable contract", this is the contract scaffold (M-stage M-GPU-MOE-0 in the contract's implementation_stages).

Why P0
======

- CPU LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
- Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s.
- MoE inference is ~10× slower than dense, making the spec-prescribed default Qwen3-Coder model production-infeasible at ~30 tok/s.
- The companion's action-stream parity machinery (CCPA-001..013, all DISCHARGED) cannot be exercised at production cadence: every `apr code` invocation hits the 30 tok/s wall.

What this contract specifies
============================

metadata.kind: kernel
status: DRAFT
scope: crates/aprender-serve/src/gpu/{forward_qwen3_moe_gpu, scheduler/moe_dispatch}.rs + crates/aprender-compute/src/gpu/moe_kernels.rs (TBD)

equations:
- moe_forward_one_layer_gpu (mirrors v1 CPU equation, +cosine-vs-CPU invariant, +CudaExecutor::new(0).is_ok() precondition)
- gpu_throughput_target (≥150 tok/s on RTX 4090 over 128-tok median window, ≥5x CPU baseline)

proof_obligations: 7
  AC_GPU_MOE_001  cosine ≥ 0.99 vs CPU LAZY-FUSED-MATVEC
  AC_GPU_MOE_002  router weights sum to 1.0 ± 1e-6
  AC_GPU_MOE_003  output dimensions preserved
  AC_GPU_MOE_004  output finite (no NaN/Inf)
  AC_GPU_MOE_005  cosine ≥ 0.99 vs HF FP16 (inherits from v1)
  AC_GPU_MOE_006  ≥150 tok/s on RTX 4090
  AC_GPU_MOE_007  VRAM utilization ≤ 95% of 24 GB

falsification_tests: 7
  FALSIFY-QW3-MOE-GPU-001              baseline (no GPU symbol)
  FALSIFY-QW3-MOE-GPU-PARITY-001       M-GPU-MOE-1 cosine vs CPU
  FALSIFY-QW3-MOE-GPU-PARITY-002       M-GPU-MOE-1 cosine vs HF FP16
  FALSIFY-QW3-MOE-GPU-INVARIANTS-001   router/shape/finite
  FALSIFY-QW3-MOE-GPU-DETERMINISM-001  byte-identical reruns same seed
  FALSIFY-QW3-MOE-GPU-THROUGHPUT-001   ≥150 tok/s
  FALSIFY-QW3-MOE-GPU-MEMORY-001       ≤ 95% VRAM

kani_harnesses: 2
  KANI-QW3-MOE-GPU-001  router weights sum (AC_GPU_MOE_002)
  KANI-QW3-MOE-GPU-002  output shape preservation (AC_GPU_MOE_003)

qa_gate: F-QW3-MOE-GPU-001 (5 named checks, falsification = swap quantized for EAGER FP32 → guaranteed OOM on 24 GB VRAM)

Implementation stages
=====================

M-GPU-MOE-0  This contract scaffold                       SHIPPED
M-GPU-MOE-1  CUDA kernel + cosine-vs-CPU parity gate      PENDING
M-GPU-MOE-2  wgpu fallback (CLAUDE.md backend-agnostic)   PENDING
M-GPU-MOE-3  Throughput ≥150 tok/s + VRAM ≤ 95%           PENDING

When all 3 PENDING stages discharge, status flips DRAFT → ACTIVE_RUNTIME (matches qwen3-moe-forward-v1 v1 convention).

Verification
============

$ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.

Refs claude-code-parity-apr POC M49 (P0 elevation, 2026-05-04)
Refs claude-code-parity-apr POC R10 (risk row mirror)
Refs qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL (CPU sibling)
Refs apr-cpu-vs-gpu-output-parity-v1 (CPU↔GPU parity discipline)
Refs arXiv:2307.08691 Dao FlashAttention-2
Refs arXiv:2201.05596 Rajbhandari DeepSpeed-MoE
Refs arXiv:2101.03961 Fedus Switch Transformers

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
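
The router, shape, and finiteness obligations (AC_GPU_MOE_002 to AC_GPU_MOE_004) lend themselves to a small host-side check. The sketch below assumes a softmax, top-k, then renormalize routing scheme, which this PR does not spell out; `route_top_k` and `check_invariants` are hypothetical names, not symbols from the aprender crates.

```rust
/// Softmax over router logits, keep the k largest, renormalize the kept weights.
/// Assumption: this mirrors the routing scheme; the contract text does not state it.
fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Subtract the max logit for numerical stability before exponentiating.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, (l - max).exp()))
        .collect();
    // Sort descending by weight and keep the top k experts.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k.min(probs.len()));
    let sum: f32 = probs.iter().map(|p| p.1).sum();
    for p in &mut probs {
        p.1 /= sum;
    }
    probs
}

/// Checks in the spirit of AC_GPU_MOE_002 (weights sum to 1.0 within 1e-6),
/// AC_GPU_MOE_003 (output length equals input length) and
/// AC_GPU_MOE_004 (no NaN or Inf in the output).
fn check_invariants(
    weights: &[(usize, f32)],
    input: &[f32],
    output: &[f32],
) -> Result<(), String> {
    let sum: f64 = weights.iter().map(|&(_, w)| f64::from(w)).sum();
    if (sum - 1.0).abs() > 1e-6 {
        return Err(format!("router weights sum to {}", sum));
    }
    if output.len() != input.len() {
        return Err(format!("shape changed: {} -> {}", input.len(), output.len()));
    }
    if !output.iter().all(|x| x.is_finite()) {
        return Err("output contains NaN or Inf".to_string());
    }
    Ok(())
}
```

Summing the weights in `f64` keeps the tolerance check from being dominated by `f32` accumulation error. A Kani harness for the same property would need a bounded, symbolic formulation instead of this runtime check.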

noahgift enabled auto-merge (squash) May 4, 2026 02:03

noahgift added 3 commits May 4, 2026 04:34

Merge branch 'main' into contract/qwen3-moe-forward-gpu-v1-scaffold

7433933

Merge branch 'main' into contract/qwen3-moe-forward-gpu-v1-scaffold

179fdb4

Merge branch 'main' into contract/qwen3-moe-forward-gpu-v1-scaffold

332a1e7

noahgift merged commit cf08e91 into main May 4, 2026
10 checks passed

noahgift deleted the contract/qwen3-moe-forward-gpu-v1-scaffold branch May 4, 2026 04:56

noahgift mentioned this pull request May 4, 2026

feat(aprender-serve): forward_qwen3_moe_gpu M-GPU-MOE-1.0 stub #1460

Merged

4 tasks

noahgift merged 4 commits into main from contract/qwen3-moe-forward-gpu-v1-scaffold
