spec(ship-two-models): v2.92.0 - §47 SHIP-007 layer-0 attention bisection cascade STARTED #1454
Merged
…U MoE forward path
Per claude-code-parity-apr POC M49 priority elevation 2026-05-04, the
GPU MoE forward path is now P0 / HIGHEST PRIORITY. Per CLAUDE.md
"NEVER write code before writing a provable contract" — this is the
contract scaffold (M-stage M-GPU-MOE-0 in the contract's
implementation_stages).
Why P0
======
- CPU LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s on
Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
- Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s.
- MoE inference is ~10× slower than dense, making the spec-prescribed
default Qwen3-Coder model production-infeasible at ~30 tok/s.
- The companion's action-stream parity machinery (CCPA-001..013, all
DISCHARGED) cannot be exercised at production cadence — every
`apr code` invocation hits the 30 tok/s wall.
What this contract specifies
============================
metadata.kind: kernel
status: DRAFT
scope: crates/aprender-serve/src/gpu/{forward_qwen3_moe_gpu,
scheduler/moe_dispatch}.rs +
crates/aprender-compute/src/gpu/moe_kernels.rs (TBD)
equations:
- moe_forward_one_layer_gpu (mirrors v1 CPU equation, +cosine-vs-CPU
invariant, +CudaExecutor::new(0).is_ok()
precondition)
- gpu_throughput_target (≥150 tok/s on RTX 4090 over 128-tok
median window, ≥5x CPU baseline)
proof_obligations: 7
AC_GPU_MOE_001 cosine ≥ 0.99 vs CPU LAZY-FUSED-MATVEC
AC_GPU_MOE_002 router weights sum to 1.0 ± 1e-6
AC_GPU_MOE_003 output dimensions preserved
AC_GPU_MOE_004 output finite (no NaN/Inf)
AC_GPU_MOE_005 cosine ≥ 0.99 vs HF FP16 (inherits from v1)
AC_GPU_MOE_006 ≥150 tok/s on RTX 4090
AC_GPU_MOE_007 VRAM utilization ≤ 95% of 24 GB
falsification_tests: 7
FALSIFY-QW3-MOE-GPU-001 baseline (no GPU symbol)
FALSIFY-QW3-MOE-GPU-PARITY-001 M-GPU-MOE-1 cosine vs CPU
FALSIFY-QW3-MOE-GPU-PARITY-002 M-GPU-MOE-1 cosine vs HF FP16
FALSIFY-QW3-MOE-GPU-INVARIANTS-001 router/shape/finite
FALSIFY-QW3-MOE-GPU-DETERMINISM-001 byte-identical reruns same seed
FALSIFY-QW3-MOE-GPU-THROUGHPUT-001 ≥150 tok/s
FALSIFY-QW3-MOE-GPU-MEMORY-001 ≤ 95% VRAM
kani_harnesses: 2
KANI-QW3-MOE-GPU-001 router weights sum (AC_GPU_MOE_002)
KANI-QW3-MOE-GPU-002 output shape preservation (AC_GPU_MOE_003)
qa_gate: F-QW3-MOE-GPU-001 (5 named checks, falsification = swap
quantized for EAGER FP32 → guaranteed OOM on 24 GB VRAM)
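
The four correctness obligations above (AC_GPU_MOE_001 through AC_GPU_MOE_004) can be illustrated with a minimal host-side sketch. This is not the contract's harness: the function names, the flat `&[f32]` layout, and the f64 accumulation are assumptions made for illustration; only the thresholds (cosine ≥ 0.99, router sum 1.0 ± 1e-6) come from the contract text.

```rust
/// Cosine similarity between two activation vectors, accumulated in f64.
/// Returns None when lengths differ, inputs are empty, or a norm is zero.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// AC_GPU_MOE_002: router weights sum to 1.0 within 1e-6.
pub fn router_weights_normalized(weights: &[f32]) -> bool {
    let sum: f64 = weights.iter().map(|&w| f64::from(w)).sum();
    (sum - 1.0).abs() <= 1e-6
}

/// AC_GPU_MOE_004: every output element is finite (no NaN/Inf).
pub fn all_finite(out: &[f32]) -> bool {
    out.iter().all(|v| v.is_finite())
}

/// Hypothetical combined gate: checks shape, finiteness, router sum,
/// then cosine parity of the GPU output against the CPU reference.
pub fn check_layer_output(
    gpu: &[f32],
    cpu: &[f32],
    router: &[f32],
) -> Result<(), &'static str> {
    if gpu.len() != cpu.len() {
        return Err("AC_GPU_MOE_003: output dimensions differ");
    }
    if !all_finite(gpu) {
        return Err("AC_GPU_MOE_004: non-finite output");
    }
    if !router_weights_normalized(router) {
        return Err("AC_GPU_MOE_002: router weights do not sum to 1");
    }
    match cosine_similarity(gpu, cpu) {
        Some(c) if c >= 0.99 => Ok(()),
        _ => Err("AC_GPU_MOE_001: cosine below 0.99 vs CPU"),
    }
}
```

Checking shape and finiteness before the cosine keeps a NaN in the GPU output from being reported as a parity failure, so each failure maps to one obligation ID.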
Implementation stages
=====================
M-GPU-MOE-0 This contract scaffold SHIPPED
M-GPU-MOE-1 CUDA kernel + cosine-vs-CPU parity gate PENDING
M-GPU-MOE-2 wgpu fallback (CLAUDE.md backend-agnostic) PENDING
M-GPU-MOE-3 Throughput ≥150 tok/s + VRAM ≤ 95% PENDING
When all 3 PENDING stages discharge, status flips DRAFT →
ACTIVE_RUNTIME (matches qwen3-moe-forward-v1 v1 convention).
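
The M-GPU-MOE-3 gate can be sketched as follows. The contract text gives only the thresholds (≥150 tok/s over a 128-token median window, ≥5x the CPU baseline, VRAM ≤ 95%); how the median is taken is an assumption here: this sketch uses the median per-token latency over the last 128 decoded tokens and inverts it. All names are hypothetical.

```rust
use std::time::Duration;

pub const WINDOW_TOKENS: usize = 128;
pub const MIN_TOK_PER_SEC: f64 = 150.0;
pub const MIN_SPEEDUP_VS_CPU: f64 = 5.0;

/// Tokens per second derived from the median per-token latency of the
/// last WINDOW_TOKENS samples. Returns None if fewer samples are given
/// or the median latency is zero.
pub fn median_tok_per_sec(latencies: &[Duration]) -> Option<f64> {
    if latencies.len() < WINDOW_TOKENS {
        return None;
    }
    let mut window: Vec<f64> = latencies[latencies.len() - WINDOW_TOKENS..]
        .iter()
        .map(Duration::as_secs_f64)
        .collect();
    window.sort_by(|a, b| a.total_cmp(b));
    // Even window size: average the two middle samples.
    let mid = WINDOW_TOKENS / 2;
    let median = (window[mid - 1] + window[mid]) / 2.0;
    if median > 0.0 {
        Some(1.0 / median)
    } else {
        None
    }
}

/// Passes only if both the absolute floor and the speedup over the
/// measured CPU baseline hold.
pub fn throughput_gate(latencies: &[Duration], cpu_tok_per_sec: f64) -> bool {
    match median_tok_per_sec(latencies) {
        Some(tps) => {
            tps >= MIN_TOK_PER_SEC && tps >= MIN_SPEEDUP_VS_CPU * cpu_tok_per_sec
        }
        None => false,
    }
}

/// VRAM gate: used / total <= 95%, in integer arithmetic to avoid rounding.
pub fn vram_gate(used_bytes: u64, total_bytes: u64) -> bool {
    total_bytes > 0 && u128::from(used_bytes) * 100 <= u128::from(total_bytes) * 95
}
```

With the ~30 tok/s CPU baseline quoted earlier, the 5x speedup condition and the 150 tok/s floor coincide; keeping them separate means the gate still holds if the CPU baseline later improves.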
Verification
============
$ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.
Refs claude-code-parity-apr POC M49 (P0 elevation, 2026-05-04)
Refs claude-code-parity-apr POC R10 (risk row mirror)
Refs qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL (CPU sibling)
Refs apr-cpu-vs-gpu-output-parity-v1 (CPU↔GPU parity discipline)
Refs arXiv:2307.08691 Dao FlashAttention-2
Refs arXiv:2201.05596 Rajbhandari et al. DeepSpeed-MoE
Refs arXiv:2101.03961 Fedus Switch Transformers
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion bisection cascade STARTED

After §46 declared the v0.32.0 cut HOLD-gated on SHIP-007 layer-0 attention, the §46.7(a) follow-up cascade kicked off with three PRs in flight (#1450 + #1451 + #1452).

## What §47 records

| Subsection | Content |
|---|---|
| 47.1 | 8-step cascade roadmap (this amendment captures steps 1-3) |
| 47.2 | What landed in PRs #1450 + #1451 + #1452 |
| 47.3 | **Toyota Way correction in detail**: v1.0.0 → v1.1.0 mid-cascade |
| 47.4 | Pre-existing parent contract drift (QPostRope/KPostRope unwired) |
| 47.5 | Net effects (ship %, coverage tally, pending merges) |
| 47.6 | Open follow-ups (5-step ranked priority list) |
| 47.7 | Five Whys (why amend at 3 PRs, why split §47/§48, etc.) |
| 47.8 | Spec amendment cadence preserved (§41 → §47, 7 amendments) |

## Cascade roadmap

| # | PR | What | Discharge status |
|---|----|------|-------|
| 1 | #1450 | Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 PROPOSED | 5 falsifiers algorithm-bound |
| 2 | #1451 | Enum extension: 2 new SaveTensorStage variants | FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL |
| 3 | #1452 | Research evidence note | No falsifier flip |
| 4 | (next) | forward_traced_with_plan wires 4 sub-stages | FALSIFY-ATTN-SUB-002 + drift fix |
| 5 | (next) | apr diff --values recognizes new stages | FALSIFY-ATTN-SUB-003 |
| 6 | (next) | HF FP16 oracle script extension | unblocks FALSIFY-ATTN-SUB-004 |
| 7 | (next) | Live RTX 4090 bisection | FALSIFY-ATTN-SUB-004 → DISCHARGED |
| 8 | (next) | SHIP-007 root-cause fix | unblocks MODEL-1 GPU |

§47 captures the first 3 (scaffold). §48+ will capture later steps.

## Toyota Way correction (mid-cascade)

v1.0.0 of `trace-attn-sub-stages-v1.yaml` was the day's first defect. It claimed 5 new SaveTensorStage variants needed; live source inspection (per `feedback_no_guessing.md`) showed 3 already existed. v1.1.0 corrected to 2 truly-new variants + added the 9-element `bisection_chain_layer_0` equation.

Cost-of-defect paid at the contract layer (cheapest place); no code rolled back.

## Pre-existing parent contract drift

Researching the wire-plan for FALSIFY-ATTN-SUB-002 surfaced a drift in `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL): `QPostRope` + `KPostRope` are in the enum but have NO `emit()` calls in `forward_traced_with_plan`. A user passing `--save-tensor q_post_rope` gets a clean exit with no file written: a silent failure.

Per `feedback_toyota_way_all_defects.md`: all defects are mine. The next-cycle FALSIFY-ATTN-SUB-002 PR will close this drift as a free side-effect by wiring the 2 missing stages alongside the 2 new ones.

## Net effects

- Spec v2.91.0 → **v2.92.0**
- Coverage tally: unchanged this cycle (5 new PARTIAL_ALGORITHM_LEVEL slots will increment when PR #1450 lands the YAML)
- MODEL-1 ship %: unchanged at 91% (cascade is scaffold; ship % moves at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle)
- MODEL-2 ship %: unchanged at 57%

## Five Whys (compressed)

1. Why amend at 3 PRs? §41-§46 cadence is "one amendment per ≥3-PR cycle"
2. Why split §47/§48? Toyota Way correction is worth pinning
3. Why pin parent drift here, not amend the parent contract? Drift fix lands in next-cycle implementation PR; §47 just records it
4. Why no FALSIFY-ATTN-SUB-002 in this cycle? Single-piece flow; stacked PRs slow merge throughput
5. Why no parent-contract bump now? Bump requires wire fix landing first (FUNCTIONAL claim); cleaner to bump in next-cycle PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Spec v2.91.0 → v2.92.0 records that the §46.7(a) follow-up cascade kicked off the same day with three PRs (#1450 + #1451 + #1452).
What §47 records
Net effects
Cascade roadmap captured
Pre-existing drift discovered + recorded
Researching the wire-plan surfaced that `QPostRope` + `KPostRope` are in the parent enum but have NO `emit()` calls in `forward_traced_with_plan`. The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstates coverage. Next-cycle FALSIFY-ATTN-SUB-002 PR closes the drift as a free side-effect.
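
A drift of this kind (a variant present in the enum but never emitted) can be caught mechanically. The sketch below is an illustration under stated assumptions, not the project's code: the `Recorder` type, the `ALL` table, the toy `forward_with_drift` function, and the `QPreRope`/`AttnOut` variants are hypothetical; only the `SaveTensorStage`, `QPostRope`, and `KPostRope` names and the `emit()` call come from the text above.

```rust
use std::collections::BTreeSet;

/// Stand-in for the real enum; QPreRope and AttnOut are hypothetical
/// variants added only so the example has wired stages to contrast with.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SaveTensorStage {
    QPreRope,
    QPostRope,
    KPostRope,
    AttnOut,
}

impl SaveTensorStage {
    /// Every variant, in declaration order. Must be kept in sync by hand.
    pub const ALL: [SaveTensorStage; 4] = [
        SaveTensorStage::QPreRope,
        SaveTensorStage::QPostRope,
        SaveTensorStage::KPostRope,
        SaveTensorStage::AttnOut,
    ];
}

/// Records which stages had emit() called during one traced forward pass.
#[derive(Default)]
pub struct Recorder {
    seen: BTreeSet<SaveTensorStage>,
}

impl Recorder {
    pub fn emit(&mut self, stage: SaveTensorStage) {
        self.seen.insert(stage);
    }

    /// Stages declared in the enum that were never emitted.
    pub fn unwired(&self) -> Vec<SaveTensorStage> {
        SaveTensorStage::ALL
            .iter()
            .copied()
            .filter(|s| !self.seen.contains(s))
            .collect()
    }
}

/// Toy forward pass reproducing the drift: the two RoPE stages are
/// declared but have no emit() call.
pub fn forward_with_drift(rec: &mut Recorder) {
    rec.emit(SaveTensorStage::QPreRope);
    rec.emit(SaveTensorStage::AttnOut);
}
```

A falsifier built this way would assert that `unwired()` is empty after a traced forward pass, turning the silent no-file-written case into a test failure.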
Test plan