contract(qwen3-moe-forward-gpu-v1): v1.0.0 DRAFT scaffold — P0 GPU MoE forward path#1453

Merged
noahgift merged 4 commits into main from contract/qwen3-moe-forward-gpu-v1-scaffold on May 4, 2026

noahgift commented May 4, 2026

Summary

  • New kernel contract contracts/qwen3-moe-forward-gpu-v1.yaml v1.0.0 DRAFT
  • GPU sibling of qwen3-moe-forward-v1 (CPU LAZY-FUSED-MATVEC)
  • 7 proof obligations + 7 falsification tests + 2 kani harnesses + qa_gate
  • Validates clean: `pv validate` 0/0
  • M-stage M-GPU-MOE-0 SHIPPED (this contract is the deliverable)

Why P0

Per claude-code-parity-apr POC M49 priority elevation 2026-05-04:

  • CPU LAZY-FUSED-MATVEC: ~30 tok/s on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
  • Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s
  • MoE on CPU is roughly 7.5-15× slower than dense on GPU (~30 vs 225-440 tok/s), which is production-infeasible
  • The companion's CCPA-001..013 parity machinery is correct but cannot be exercised at production cadence

Test plan

  • `pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` → 0/0
  • M-GPU-MOE-1: CUDA kernel `forward_qwen3_moe_gpu` + FALSIFY-QW3-MOE-GPU-PARITY-001 (cosine ≥ 0.99 vs CPU) — separate PR
  • M-GPU-MOE-2: wgpu fallback — separate PR
  • M-GPU-MOE-3: throughput ≥150 tok/s + VRAM ≤ 95% — separate PR
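
The cosine gate named in M-GPU-MOE-1 could be checked along the following lines. This is a minimal sketch, not the project's implementation: `cosine_similarity` and `parity_gate` are hypothetical names, and the inputs are assumed to be the flattened CPU and GPU outputs for the same prompt.

```rust
/// Cosine similarity between two equal-length activation vectors.
/// Returns None on length mismatch, empty input, or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Accumulate in f64 so rounding in the check itself stays far below 0.01.
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Hypothetical gate: passes only when the GPU output tracks the CPU reference.
/// A NaN anywhere in either input yields a non-finite cosine and fails the gate.
pub fn parity_gate(cpu: &[f32], gpu: &[f32], threshold: f64) -> bool {
    matches!(cosine_similarity(cpu, gpu), Some(c) if c.is_finite() && c >= threshold)
}
```

Calling `parity_gate(cpu_out, gpu_out, 0.99)` mirrors the threshold stated above; returning `None` for degenerate inputs keeps a mismatched or empty tensor from silently passing.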

arXiv basis

  • arXiv:2307.08691 Dao, FlashAttention-2
  • arXiv:2201.05596 Rajbhandari et al., DeepSpeed-MoE
  • arXiv:2101.03961 Fedus et al., Switch Transformers

🤖 Generated with Claude Code

…U MoE forward path

Per claude-code-parity-apr POC M49 priority elevation 2026-05-04, the
GPU MoE forward path is now P0 / HIGHEST PRIORITY. CLAUDE.md requires
"NEVER write code before writing a provable contract", so this PR ships
only the contract scaffold (M-stage M-GPU-MOE-0 in the contract's
implementation_stages).

Why P0
======

  - CPU LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s on
    Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
  - Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s.
  - MoE inference on CPU is roughly 7.5-15× slower than dense inference
    on GPU, making the spec-prescribed default Qwen3-Coder model
    production-infeasible at ~30 tok/s.
  - The companion's action-stream parity machinery (CCPA-001..013, all
    DISCHARGED) cannot be exercised at production cadence — every
    `apr code` invocation hits the 30 tok/s wall.

What this contract specifies
============================

  metadata.kind: kernel
  status:        DRAFT
  scope:         crates/aprender-serve/src/gpu/{forward_qwen3_moe_gpu,
                   scheduler/moe_dispatch}.rs +
                 crates/aprender-compute/src/gpu/moe_kernels.rs (TBD)

  equations:
    - moe_forward_one_layer_gpu  (mirrors v1 CPU equation, +cosine-vs-CPU
                                   invariant, +CudaExecutor::new(0).is_ok()
                                   precondition)
    - gpu_throughput_target      (≥150 tok/s on RTX 4090 over 128-tok
                                   median window, ≥5x CPU baseline)
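
One way to read gpu_throughput_target is sketched below. It assumes the "128-tok median window" means the median of 128 per-token decode latencies; the PR text does not spell that out. The function names are hypothetical.

```rust
/// Median-based throughput (tokens/second) from per-token latencies in seconds.
/// Returns None for empty input or any non-finite / non-positive latency.
pub fn median_tok_per_sec(latencies_s: &[f64]) -> Option<f64> {
    if latencies_s.is_empty() || latencies_s.iter().any(|t| !t.is_finite() || *t <= 0.0) {
        return None;
    }
    let mut sorted = latencies_s.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        0.5 * (sorted[n / 2 - 1] + sorted[n / 2])
    };
    Some(1.0 / median)
}

/// Assumed reading of the target: exactly 128 samples, at least 150 tok/s,
/// and at least 5x the measured CPU baseline.
pub fn meets_throughput_target(latencies_s: &[f64], cpu_baseline_tok_s: f64) -> bool {
    const WINDOW: usize = 128;
    const MIN_TOK_S: f64 = 150.0;
    const MIN_SPEEDUP: f64 = 5.0;
    if latencies_s.len() != WINDOW {
        return false;
    }
    match median_tok_per_sec(latencies_s) {
        Some(tps) => tps >= MIN_TOK_S && tps >= MIN_SPEEDUP * cpu_baseline_tok_s,
        None => false,
    }
}
```

With the ~30 tok/s CPU baseline quoted above, the two conditions coincide (5 × 30 = 150); the median makes the check insensitive to a few slow outlier tokens.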

  proof_obligations: 7
    AC_GPU_MOE_001  cosine ≥ 0.99 vs CPU LAZY-FUSED-MATVEC
    AC_GPU_MOE_002  router weights sum to 1.0 ± 1e-6
    AC_GPU_MOE_003  output dimensions preserved
    AC_GPU_MOE_004  output finite (no NaN/Inf)
    AC_GPU_MOE_005  cosine ≥ 0.99 vs HF FP16 (inherits from v1)
    AC_GPU_MOE_006  ≥150 tok/s on RTX 4090
    AC_GPU_MOE_007  VRAM utilization ≤ 95% of 24 GB
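
The router, shape, and finiteness obligations (AC_GPU_MOE_002 to 004) can be illustrated with a small host-side sketch. It assumes top-k routing with a softmax over the selected logits, which this PR does not confirm; every name here is hypothetical.

```rust
/// Select the top-k experts by logit and softmax-normalize over the selected set.
/// Returns (expert_index, weight) pairs, highest logit first.
pub fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut order: Vec<usize> = (0..logits.len()).collect();
    // Stable sort, descending by logit; ties keep the lower expert index first.
    order.sort_by(|&i, &j| logits[j].total_cmp(&logits[i]));
    order.truncate(k);
    if order.is_empty() {
        return Vec::new();
    }
    // Subtract the max logit before exponentiating for numerical stability.
    let max = logits[order[0]];
    let exps: Vec<f32> = order.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    order.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

/// AC_GPU_MOE_002-style check: weights sum to 1.0 within 1e-6.
pub fn weights_sum_to_one(weights: &[(usize, f32)]) -> bool {
    let sum: f64 = weights.iter().map(|&(_, w)| f64::from(w)).sum();
    (sum - 1.0).abs() <= 1e-6
}

/// AC_GPU_MOE_003-style check: output has the same length as the input.
pub fn shape_preserved(input: &[f32], output: &[f32]) -> bool {
    input.len() == output.len()
}

/// AC_GPU_MOE_004-style check: no NaN or Inf in the output.
pub fn all_finite(x: &[f32]) -> bool {
    x.iter().all(|v| v.is_finite())
}
```

The sum is accumulated in f64 so that the check's own rounding does not consume the 1e-6 tolerance.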

  falsification_tests: 7
    FALSIFY-QW3-MOE-GPU-001              baseline (no GPU symbol)
    FALSIFY-QW3-MOE-GPU-PARITY-001       M-GPU-MOE-1 cosine vs CPU
    FALSIFY-QW3-MOE-GPU-PARITY-002       M-GPU-MOE-1 cosine vs HF FP16
    FALSIFY-QW3-MOE-GPU-INVARIANTS-001   router/shape/finite
    FALSIFY-QW3-MOE-GPU-DETERMINISM-001  byte-identical reruns, same seed
    FALSIFY-QW3-MOE-GPU-THROUGHPUT-001   ≥150 tok/s
    FALSIFY-QW3-MOE-GPU-MEMORY-001       ≤ 95% VRAM
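
"Byte-identical" in the determinism test is stricter than float equality. A minimal sketch of such a comparison follows, assuming outputs are available as host-side `f32` slices; `byte_identical` is a hypothetical helper.

```rust
/// True when two runs produced exactly the same bit patterns.
/// Unlike `==` on floats, this treats identical NaN payloads as equal
/// and distinguishes +0.0 from -0.0.
pub fn byte_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```

Comparing bit patterns avoids two float-equality pitfalls: NaN never compares equal to itself, and the two signed zeros compare equal despite different bytes.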

  kani_harnesses: 2
    KANI-QW3-MOE-GPU-001  router weights sum (AC_GPU_MOE_002)
    KANI-QW3-MOE-GPU-002  output shape preservation (AC_GPU_MOE_003)
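
A shape-preservation harness in the spirit of KANI-QW3-MOE-GPU-002 might look like the sketch below. `combine_experts` is a hypothetical stand-in for the weighted expert combine, not the contract's kernel, and the bounds on `hidden` and `k` are arbitrary small values chosen to keep the proof tractable. The harness is compiled only under `cargo kani`.

```rust
/// Hypothetical per-token combine: weighted sum of expert outputs.
/// Returns None if the expert count or any expert's width is inconsistent.
pub fn combine_experts(expert_out: &[Vec<f32>], weights: &[f32], hidden: usize) -> Option<Vec<f32>> {
    if expert_out.len() != weights.len() || expert_out.iter().any(|e| e.len() != hidden) {
        return None;
    }
    let mut y = vec![0.0f32; hidden];
    for (e, &w) in expert_out.iter().zip(weights) {
        for (acc, &v) in y.iter_mut().zip(e) {
            *acc += w * v;
        }
    }
    Some(y)
}

// Compiled only when running under Kani; a normal `cargo build` skips it.
#[cfg(kani)]
#[kani::proof]
#[kani::unwind(5)]
fn kani_output_shape_preserved() {
    // Symbolic but bounded sizes (assumed bounds, for tractability only).
    let hidden: usize = kani::any();
    kani::assume(hidden <= 4);
    let k: usize = kani::any();
    kani::assume(k <= 2);

    let experts: Vec<Vec<f32>> = (0..k).map(|_| vec![0.0f32; hidden]).collect();
    let weights = vec![0.5f32; k];

    let y = combine_experts(&experts, &weights, hidden).unwrap();
    assert!(y.len() == hidden);
}
```

The property is deliberately structural: it constrains only the output length, leaving numerical claims to the cosine and finiteness tests.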

  qa_gate: F-QW3-MOE-GPU-001 (5 named checks, falsification = swap
    quantized for EAGER FP32 → guaranteed OOM on 24 GB VRAM)
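
The arithmetic behind the qa_gate falsification can be sketched as follows. The figures are assumptions rather than contract values: roughly 30e9 parameters (read off the "30B" model name), 4.5 bits per weight as the nominal Q4_K block rate, and weights only (no KV cache or activations).

```rust
const GIB: f64 = 1024.0 * 1024.0 * 1024.0;

/// Approximate weight footprint in GiB for `params` weights at `bits_per_weight`.
pub fn weight_gib(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / GIB
}

/// True if the footprint stays within `max_fraction` of the device's VRAM.
pub fn fits_vram_budget(footprint_gib: f64, vram_gib: f64, max_fraction: f64) -> bool {
    footprint_gib <= vram_gib * max_fraction
}
```

Under these assumptions, FP32 weights need about 30e9 × 4 bytes ≈ 112 GiB, several times the 24 GB card, while the 4.5-bit estimate is about 16 GiB and fits under the 95% ceiling (22.8 GB). Real Q4_K_M files mix quantization types and are somewhat larger than this estimate.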

Implementation stages
=====================

  M-GPU-MOE-0  This contract scaffold                        SHIPPED
  M-GPU-MOE-1  CUDA kernel + cosine-vs-CPU parity gate       PENDING
  M-GPU-MOE-2  wgpu fallback (CLAUDE.md backend-agnostic)    PENDING
  M-GPU-MOE-3  Throughput ≥150 tok/s + VRAM ≤ 95%            PENDING

When all 3 PENDING stages discharge, status flips DRAFT →
ACTIVE_RUNTIME (matches the qwen3-moe-forward-v1 convention).
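
The promotion rule can be stated as a tiny function. This is an illustrative sketch that treats SHIPPED as the discharged state; the enum and function names are hypothetical and do not reflect how `pv` models contract status.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StageStatus {
    Shipped,
    Pending,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContractStatus {
    Draft,
    ActiveRuntime,
}

/// The contract stays DRAFT until every implementation stage has shipped.
pub fn contract_status(stages: &[StageStatus]) -> ContractStatus {
    if !stages.is_empty() && stages.iter().all(|s| *s == StageStatus::Shipped) {
        ContractStatus::ActiveRuntime
    } else {
        ContractStatus::Draft
    }
}
```

With the table above (one SHIPPED stage and three PENDING), this returns `Draft`.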

Verification
============

  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs claude-code-parity-apr POC M49 (P0 elevation, 2026-05-04)
Refs claude-code-parity-apr POC R10 (risk row mirror)
Refs qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL (CPU sibling)
Refs apr-cpu-vs-gpu-output-parity-v1 (CPU↔GPU parity discipline)
Refs arXiv:2307.08691 Dao FlashAttention-2
Refs arXiv:2201.05596 Rajbhandari DeepSpeed-MoE
Refs arXiv:2101.03961 Fedus Switch Transformers

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 4, 2026 02:03
noahgift merged commit cf08e91 into main May 4, 2026
10 checks passed
noahgift deleted the contract/qwen3-moe-forward-gpu-v1-scaffold branch May 4, 2026 04:56