contract(qwen3-moe-forward-gpu-v1): v1.0.0 DRAFT scaffold, P0 GPU MoE forward path #1453
Merged
Following the claude-code-parity-apr POC M49 priority elevation on
2026-05-04, the GPU MoE forward path is now P0 / HIGHEST PRIORITY.
CLAUDE.md requires "NEVER write code before writing a provable
contract", so this change ships only the contract scaffold (stage
M-GPU-MOE-0 in the contract's implementation_stages).
Why P0
======
- CPU LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s on
Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
- Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s.
- CPU MoE inference is therefore roughly 7.5× to 15× slower than dense
GPU inference, which makes the spec-prescribed default Qwen3-Coder
model production-infeasible.
- The companion's action-stream parity machinery (CCPA-001..013, all
DISCHARGED) cannot be exercised at production cadence — every
`apr code` invocation hits the 30 tok/s wall.
What this contract specifies
============================
metadata.kind: kernel
status: DRAFT
scope: crates/aprender-serve/src/gpu/{forward_qwen3_moe_gpu,
       scheduler/moe_dispatch}.rs +
       crates/aprender-compute/src/gpu/moe_kernels.rs (TBD)
equations:
  - moe_forward_one_layer_gpu (mirrors v1 CPU equation, +cosine-vs-CPU
    invariant, +CudaExecutor::new(0).is_ok() precondition)
  - gpu_throughput_target (≥150 tok/s on RTX 4090 over 128-tok
    median window, ≥5x CPU baseline)
proof_obligations: 7
  AC_GPU_MOE_001  cosine ≥ 0.99 vs CPU LAZY-FUSED-MATVEC
  AC_GPU_MOE_002  router weights sum to 1.0 ± 1e-6
  AC_GPU_MOE_003  output dimensions preserved
  AC_GPU_MOE_004  output finite (no NaN/Inf)
  AC_GPU_MOE_005  cosine ≥ 0.99 vs HF FP16 (inherits from v1)
  AC_GPU_MOE_006  ≥150 tok/s on RTX 4090
  AC_GPU_MOE_007  VRAM utilization ≤ 95% of 24 GB
falsification_tests: 7
  FALSIFY-QW3-MOE-GPU-001              baseline (no GPU symbol)
  FALSIFY-QW3-MOE-GPU-PARITY-001       M-GPU-MOE-1 cosine vs CPU
  FALSIFY-QW3-MOE-GPU-PARITY-002       M-GPU-MOE-1 cosine vs HF FP16
  FALSIFY-QW3-MOE-GPU-INVARIANTS-001   router/shape/finite
  FALSIFY-QW3-MOE-GPU-DETERMINISM-001  byte-identical reruns same seed
  FALSIFY-QW3-MOE-GPU-THROUGHPUT-001   ≥150 tok/s
  FALSIFY-QW3-MOE-GPU-MEMORY-001       ≤ 95% VRAM
kani_harnesses: 2
  KANI-QW3-MOE-GPU-001  router weights sum (AC_GPU_MOE_002)
  KANI-QW3-MOE-GPU-002  output shape preservation (AC_GPU_MOE_003)
qa_gate: F-QW3-MOE-GPU-001 (5 named checks, falsification = swap
         quantized for EAGER FP32 → guaranteed OOM on 24 GB VRAM)
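
Illustrative sketch (not part of the contract or of this PR): one way
the per-layer obligations AC_GPU_MOE_001..004 could be checked on host
buffers. All function names here are hypothetical, and the top-k softmax
is an assumption about how router weights are normalized. Softmax over
only the selected logits is mathematically the same as a full softmax
renormalized over the selected experts.

```rust
/// Cosine similarity of two activation vectors, accumulated in f64.
/// Returns None on length mismatch, empty input, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Assumed normalization: softmax over the selected top-k router logits.
/// Computed in f64 and rounded to f32 once at the end.
fn normalize_router_weights(topk_logits: &[f32]) -> Vec<f32> {
    let max = topk_logits
        .iter()
        .map(|&l| f64::from(l))
        .fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = topk_logits
        .iter()
        .map(|&l| (f64::from(l) - max).exp())
        .collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| (e / sum) as f32).collect()
}

/// Hypothetical gate for one layer output, covering AC_GPU_MOE_001..004.
fn check_layer_invariants(
    cpu_out: &[f32],
    gpu_out: &[f32],
    router_weights: &[f32],
    hidden_dim: usize,
) -> Result<(), String> {
    // AC_GPU_MOE_003: output dimensions preserved.
    if cpu_out.len() != hidden_dim || gpu_out.len() != hidden_dim {
        return Err(format!(
            "shape mismatch: cpu={}, gpu={}, expected={}",
            cpu_out.len(),
            gpu_out.len(),
            hidden_dim
        ));
    }
    // AC_GPU_MOE_004: output finite (no NaN/Inf).
    if !gpu_out.iter().all(|v| v.is_finite()) {
        return Err("GPU output contains NaN or Inf".to_string());
    }
    // AC_GPU_MOE_002: router weights sum to 1.0 within 1e-6.
    let sum: f64 = router_weights.iter().map(|&w| f64::from(w)).sum();
    if (sum - 1.0).abs() > 1e-6 {
        return Err(format!("router weights sum to {sum}, expected 1.0"));
    }
    // AC_GPU_MOE_001: cosine >= 0.99 against the CPU reference output.
    match cosine_similarity(cpu_out, gpu_out) {
        Some(c) if c >= 0.99 => Ok(()),
        Some(c) => Err(format!("cosine {c:.6} is below 0.99")),
        None => Err("cosine undefined (zero-norm output)".to_string()),
    }
}
```

Checking shape and finiteness before the cosine keeps the failure message
specific: a NaN in the GPU output would otherwise surface only as a
failed comparison.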
Implementation stages
=====================
M-GPU-MOE-0  This contract scaffold                       SHIPPED
M-GPU-MOE-1  CUDA kernel + cosine-vs-CPU parity gate      PENDING
M-GPU-MOE-2  wgpu fallback (CLAUDE.md backend-agnostic)   PENDING
M-GPU-MOE-3  Throughput ≥150 tok/s + VRAM ≤ 95%           PENDING
Once all three PENDING stages are discharged, the status flips from
DRAFT to ACTIVE_RUNTIME, matching the qwen3-moe-forward-v1 convention.
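
Illustrative sketch (not part of the contract): how the M-GPU-MOE-3
gates could be evaluated from measurements. It assumes "128-tok median
window" means the median per-token latency over the last 128 decoded
tokens, and that "24 GB" means 24 GiB. Both readings, and every name
below, are assumptions.

```rust
use std::time::Duration;

const WINDOW_TOKENS: usize = 128;
const TARGET_TOK_PER_SEC: f64 = 150.0;
const MIN_SPEEDUP_VS_CPU: f64 = 5.0;
/// Assumed reading of "24 GB": 24 GiB of device memory.
const VRAM_TOTAL_BYTES: u64 = 24 * 1024 * 1024 * 1024;

/// Tokens per second derived from the median per-token latency.
fn median_tok_per_sec(latencies: &[Duration]) -> Option<f64> {
    if latencies.is_empty() {
        return None;
    }
    let mut secs: Vec<f64> = latencies.iter().map(Duration::as_secs_f64).collect();
    secs.sort_by(|a, b| a.total_cmp(b));
    let mid = secs.len() / 2;
    let median = if secs.len() % 2 == 0 {
        (secs[mid - 1] + secs[mid]) / 2.0
    } else {
        secs[mid]
    };
    if median > 0.0 {
        Some(1.0 / median)
    } else {
        None
    }
}

/// Hypothetical check for AC_GPU_MOE_006 and the >=5x CPU baseline clause.
/// Returns the measured tok/s when both thresholds are met.
fn throughput_gate(
    latencies: &[Duration],
    cpu_baseline_tok_per_sec: f64,
) -> Result<f64, String> {
    if latencies.len() < WINDOW_TOKENS {
        return Err(format!(
            "need at least {WINDOW_TOKENS} tokens, got {}",
            latencies.len()
        ));
    }
    // Use only the most recent 128 tokens.
    let window = &latencies[latencies.len() - WINDOW_TOKENS..];
    let tps = median_tok_per_sec(window)
        .ok_or_else(|| "median latency is zero".to_string())?;
    if tps < TARGET_TOK_PER_SEC {
        return Err(format!("{tps:.1} tok/s is below {TARGET_TOK_PER_SEC}"));
    }
    if tps < MIN_SPEEDUP_VS_CPU * cpu_baseline_tok_per_sec {
        return Err(format!("{tps:.1} tok/s is below 5x the CPU baseline"));
    }
    Ok(tps)
}

/// Hypothetical check for AC_GPU_MOE_007: peak usage <= 95% of total VRAM.
/// Integer arithmetic avoids floating-point rounding at the boundary.
fn vram_within_budget(peak_bytes: u64) -> bool {
    peak_bytes * 100 <= VRAM_TOTAL_BYTES * 95
}
```

With the ~30 tok/s CPU figure quoted above, the two throughput clauses
coincide: 5 × 30 = 150 tok/s.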
Verification
============
$ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.
Refs claude-code-parity-apr POC M49 (P0 elevation, 2026-05-04)
Refs claude-code-parity-apr POC R10 (risk row mirror)
Refs qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL (CPU sibling)
Refs apr-cpu-vs-gpu-output-parity-v1 (CPU↔GPU parity discipline)
Refs arXiv:2307.08691 Dao FlashAttention-2
Refs arXiv:2201.05596 Rajbhandari DeepSpeed-MoE
Refs arXiv:2101.03961 Fedus Switch Transformers
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
contracts/qwen3-moe-forward-gpu-v1.yaml v1.0.0 DRAFT qwen3-moe-forward-v1 (CPU LAZY-FUSED-MATVEC)
Why P0
Per claude-code-parity-apr POC M49 priority elevation 2026-05-04:
Test plan
arXiv basis
🤖 Generated with Claude Code