spec(ship-two-models): v2.92.0 — §47 SHIP-007 layer-0 attention bisection cascade STARTED by noahgift · Pull Request #1454 · paiml/aprender

noahgift · 2026-05-04T02:21:37Z

Summary

Spec v2.91.0 → v2.92.0 records that the §46.7(a) follow-up cascade kicked off the same day with three PRs (#1450 + #1451 + #1452).

What §47 records

| Subsection | Content |
|---|---|
| 47.1 | 8-step cascade roadmap (this amendment captures steps 1-3) |
| 47.2 | What landed in PRs #1450 + #1451 + #1452 |
| 47.3 | Toyota Way correction in detail: v1.0.0 → v1.1.0 mid-cascade |
| 47.4 | Pre-existing parent contract drift (QPostRope/KPostRope unwired) |
| 47.5 | Net effects |
| 47.6 | Open follow-ups (5-step ranked priority list) |
| 47.7 | Five Whys |
| 47.8 | Spec amendment cadence preserved |

Net effects

Spec v2.91.0 → v2.92.0
MODEL-1 ship %: unchanged at 91% (cascade is scaffold)
MODEL-2 ship %: unchanged at 57%
Coverage tally: unchanged this cycle (increments when PR #1450, which carries the `trace-attn-sub-stages-v1` v1.1.0 PROPOSED contract YAML, lands)

Cascade roadmap captured

| # | PR | Discharge status |
|---|---|---|
| 1 | #1450 | 5 falsifiers algorithm-bound |
| 2 | #1451 | FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL |
| 3 | #1452 | Research evidence (no falsifier flip) |
| 4-8 | (next) | FALSIFY-ATTN-SUB-002..004 + SHIP-007 root-cause fix |
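For orientation, the localisation that the later steps build toward can be pictured as a walk over an ordered stage chain: compare each dumped tensor against the oracle and report the first stage that diverges. The Rust below is only a sketch under stated assumptions. `cosine`, `first_divergent_stage`, the `attn_scores` stage name, and the 0.99 threshold are illustrative and not taken from the aprender source; `q_post_rope` appears in this PR as a `--save-tensor` value, and `k_post_rope` is assumed by analogy.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None when lengths differ, the input is empty, or a norm is zero.
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        dot += (*x as f64) * (*y as f64);
        na += (*x as f64).powi(2);
        nb += (*y as f64).powi(2);
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some((dot / (na.sqrt() * nb.sqrt())) as f32)
}

/// Walk an ordered chain of (stage name, our tensor, oracle tensor) and
/// return the first stage whose cosine falls below `threshold`.
/// A stage that cannot be compared at all is treated as divergent.
/// Hypothetical helper: the real cascade may localise differently.
fn first_divergent_stage<'a>(
    chain: &[(&'a str, Vec<f32>, Vec<f32>)],
    threshold: f32,
) -> Option<&'a str> {
    chain
        .iter()
        .find(|(_, ours, oracle)| match cosine(ours, oracle) {
            Some(c) => c < threshold,
            None => true,
        })
        .map(|(name, _, _)| *name)
}
```

Everything upstream of the returned stage matched the oracle, so the defect search narrows to the computation between the last matching stage and the first divergent one.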

Pre-existing drift discovered + recorded

Researching the wire-plan surfaced that `QPostRope` + `KPostRope` are in the parent enum but have NO `emit()` calls in `forward_traced_with_plan`. The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstates coverage. Next-cycle FALSIFY-ATTN-SUB-002 PR closes the drift as a free side-effect.
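The class of drift described above (a variant exists in the enum but nothing ever emits it) can be caught mechanically by comparing the stages a caller requested with the stages a traced run actually emitted. The following is a minimal Rust sketch, not the aprender implementation: the enum is a trimmed stand-in, `AttnOut` is an invented variant, and `unwired_stages` / `check_all_emitted` are hypothetical helpers.

```rust
/// Trimmed-down stand-in for the real `SaveTensorStage` enum.
/// `QPostRope` and `KPostRope` are named in this PR; `AttnOut` is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SaveTensorStage {
    QPostRope,
    KPostRope,
    AttnOut,
}

impl SaveTensorStage {
    /// Every variant, so a test can sweep the whole enum.
    const ALL: [SaveTensorStage; 3] = [Self::QPostRope, Self::KPostRope, Self::AttnOut];
}

/// Requested stages that never showed up in the emitted set.
fn unwired_stages(
    requested: &[SaveTensorStage],
    emitted: &[SaveTensorStage],
) -> Vec<SaveTensorStage> {
    requested
        .iter()
        .copied()
        .filter(|s| !emitted.contains(s))
        .collect()
}

/// Turn the silent no-op into a hard error (hypothetical policy).
fn check_all_emitted(
    requested: &[SaveTensorStage],
    emitted: &[SaveTensorStage],
) -> Result<(), String> {
    let missing = unwired_stages(requested, emitted);
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!("requested stages never emitted: {:?}", missing))
    }
}
```

Sweeping `SaveTensorStage::ALL` through a check like this in a test would flag any variant that lacks an emit site, rather than leaving the gap to be found by reading the source.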

Test plan

CI green on required gates

🤖 Generated with Claude Code

…U MoE forward path

Per claude-code-parity-apr POC M49 priority elevation 2026-05-04, the GPU MoE forward path is now P0 / HIGHEST PRIORITY. Per CLAUDE.md "NEVER write code before writing a provable contract", this is the contract scaffold (M-stage M-GPU-MOE-0 in the contract's implementation_stages).

Why P0
======

- CPU LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
- Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s.
- MoE inference is ~10× slower than dense, making the spec-prescribed default Qwen3-Coder model production-infeasible at ~30 tok/s.
- The companion's action-stream parity machinery (CCPA-001..013, all DISCHARGED) cannot be exercised at production cadence: every `apr code` invocation hits the 30 tok/s wall.

What this contract specifies
============================

metadata.kind: kernel

status: DRAFT

scope: crates/aprender-serve/src/gpu/{forward_qwen3_moe_gpu, scheduler/moe_dispatch}.rs + crates/aprender-compute/src/gpu/moe_kernels.rs (TBD)

equations:

- moe_forward_one_layer_gpu (mirrors v1 CPU equation, +cosine-vs-CPU invariant, +CudaExecutor::new(0).is_ok() precondition)
- gpu_throughput_target (≥150 tok/s on RTX 4090 over 128-tok median window, ≥5x CPU baseline)

proof_obligations: 7

- AC_GPU_MOE_001: cosine ≥ 0.99 vs CPU LAZY-FUSED-MATVEC
- AC_GPU_MOE_002: router weights sum to 1.0 ± 1e-6
- AC_GPU_MOE_003: output dimensions preserved
- AC_GPU_MOE_004: output finite (no NaN/Inf)
- AC_GPU_MOE_005: cosine ≥ 0.99 vs HF FP16 (inherits from v1)
- AC_GPU_MOE_006: ≥150 tok/s on RTX 4090
- AC_GPU_MOE_007: VRAM utilization ≤ 95% of 24 GB

falsification_tests: 7

- FALSIFY-QW3-MOE-GPU-001: baseline (no GPU symbol)
- FALSIFY-QW3-MOE-GPU-PARITY-001: M-GPU-MOE-1 cosine vs CPU
- FALSIFY-QW3-MOE-GPU-PARITY-002: M-GPU-MOE-1 cosine vs HF FP16
- FALSIFY-QW3-MOE-GPU-INVARIANTS-001: router/shape/finite
- FALSIFY-QW3-MOE-GPU-DETERMINISM-001: byte-identical reruns same seed
- FALSIFY-QW3-MOE-GPU-THROUGHPUT-001: ≥150 tok/s
- FALSIFY-QW3-MOE-GPU-MEMORY-001: ≤ 95% VRAM

kani_harnesses: 2

- KANI-QW3-MOE-GPU-001: router weights sum (AC_GPU_MOE_002)
- KANI-QW3-MOE-GPU-002: output shape preservation (AC_GPU_MOE_003)

qa_gate: F-QW3-MOE-GPU-001 (5 named checks, falsification = swap quantized for EAGER FP32 → guaranteed OOM on 24 GB VRAM)

Implementation stages
=====================

- M-GPU-MOE-0: This contract scaffold (SHIPPED)
- M-GPU-MOE-1: CUDA kernel + cosine-vs-CPU parity gate (PENDING)
- M-GPU-MOE-2: wgpu fallback (CLAUDE.md backend-agnostic) (PENDING)
- M-GPU-MOE-3: Throughput ≥150 tok/s + VRAM ≤ 95% (PENDING)

When all 3 PENDING stages discharge, status flips DRAFT → ACTIVE_RUNTIME (matches qwen3-moe-forward-v1 v1 convention).

Verification
============

    $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
    0 error(s), 0 warning(s)
    Contract is valid.

- Refs claude-code-parity-apr POC M49 (P0 elevation, 2026-05-04)
- Refs claude-code-parity-apr POC R10 (risk row mirror)
- Refs qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL (CPU sibling)
- Refs apr-cpu-vs-gpu-output-parity-v1 (CPU↔GPU parity discipline)
- Refs arXiv:2305.18398 Dao FlashAttention-2
- Refs arXiv:2305.05176 Aminabadi DeepSpeed-MoE
- Refs arXiv:2101.03961 Fedus Switch Transformers

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
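Several of the proof obligations listed in this commit message are simple numeric predicates, and it can help to see them written out. The Rust below is a hedged sketch only: the function names are invented here, the inputs are assumed to be measured elsewhere (for example, the throughput value is taken to be the already-computed median over the 128-token window), and nothing in it reflects how the contract tooling or the aprender crates actually evaluate these obligations.

```rust
/// AC_GPU_MOE_002: router weights sum to 1.0 within ± 1e-6.
/// f64 is an assumption made here so the tolerance is meaningful.
fn router_weights_normalized(weights: &[f64]) -> bool {
    (weights.iter().sum::<f64>() - 1.0).abs() <= 1e-6
}

/// AC_GPU_MOE_004: every output element is finite (no NaN/Inf).
fn all_finite(output: &[f32]) -> bool {
    output.iter().all(|v| v.is_finite())
}

/// AC_GPU_MOE_006 plus the `gpu_throughput_target` equation:
/// at least 150 tok/s and at least 5x the CPU baseline.
fn meets_throughput_target(gpu_tok_s: f64, cpu_tok_s: f64) -> bool {
    gpu_tok_s >= 150.0 && gpu_tok_s >= 5.0 * cpu_tok_s
}

/// AC_GPU_MOE_007: VRAM use at most 95% of the 24 GB card.
fn within_vram_budget(used_gb: f64) -> bool {
    used_gb <= 0.95 * 24.0
}
```

With the ~30 tok/s CPU figure quoted above, the two throughput conditions coincide (5 × 30 = 150 tok/s), and the VRAM ceiling works out to 22.8 GB.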

…tion bisection cascade STARTED

After §46 declared the v0.32.0 cut HOLD-gated on SHIP-007 layer-0 attention, the §46.7(a) follow-up cascade kicked off with three PRs in flight (#1450 + #1451 + #1452).

## What §47 records

| Subsection | Content |
|---|---|
| 47.1 | 8-step cascade roadmap (this amendment captures steps 1-3) |
| 47.2 | What landed in PRs #1450 + #1451 + #1452 |
| 47.3 | **Toyota Way correction in detail**: v1.0.0 → v1.1.0 mid-cascade |
| 47.4 | Pre-existing parent contract drift (QPostRope/KPostRope unwired) |
| 47.5 | Net effects (ship %, coverage tally, pending merges) |
| 47.6 | Open follow-ups (5-step ranked priority list) |
| 47.7 | Five Whys (why amend at 3 PRs, why split §47/§48, etc.) |
| 47.8 | Spec amendment cadence preserved (§41 → §47, 7 amendments) |

## Cascade roadmap

| # | PR | What | Discharge status |
|---|----|------|-------|
| 1 | #1450 | Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 PROPOSED | 5 falsifiers algorithm-bound |
| 2 | #1451 | Enum extension: 2 new SaveTensorStage variants | FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL |
| 3 | #1452 | Research evidence note | No falsifier flip |
| 4 | (next) | forward_traced_with_plan wires 4 sub-stages | FALSIFY-ATTN-SUB-002 + drift fix |
| 5 | (next) | apr diff --values recognizes new stages | FALSIFY-ATTN-SUB-003 |
| 6 | (next) | HF FP16 oracle script extension | unblocks FALSIFY-ATTN-SUB-004 |
| 7 | (next) | Live RTX 4090 bisection | FALSIFY-ATTN-SUB-004 → DISCHARGED |
| 8 | (next) | SHIP-007 root-cause fix | unblocks MODEL-1 GPU |

§47 captures the first 3 (scaffold). §48+ will capture later steps.

## Toyota Way correction (mid-cascade)

v1.0.0 of `trace-attn-sub-stages-v1.yaml` was the day's first defect. It claimed 5 new SaveTensorStage variants needed; live source inspection (per `feedback_no_guessing.md`) showed 3 already existed. v1.1.0 corrected to 2 truly-new variants + added the 9-element `bisection_chain_layer_0` equation. Cost-of-defect paid at the contract layer (cheapest place); no code rolled back.

## Pre-existing parent contract drift

Researching the wire-plan for FALSIFY-ATTN-SUB-002 surfaced a drift in `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL): `QPostRope` + `KPostRope` are in the enum but have NO `emit()` calls in `forward_traced_with_plan`. A user passing `--save-tensor q_post_rope` gets a clean exit with no file written, which is a silent failure. Per `feedback_toyota_way_all_defects.md`: all defects are mine. The next-cycle FALSIFY-ATTN-SUB-002 PR will close this drift as a free side-effect by wiring the 2 missing stages alongside the 2 new ones.

## Net effects

- Spec v2.91.0 → **v2.92.0**
- Coverage tally: unchanged this cycle (5 new PARTIAL_ALGORITHM_LEVEL slots will increment when PR #1450 lands the YAML)
- MODEL-1 ship %: unchanged at 91% (cascade is scaffold; ship % moves at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle)
- MODEL-2 ship %: unchanged at 57%

## Five Whys (compressed)

1. Why amend at 3 PRs? §41-§46 cadence is "one amendment per ≥3-PR cycle"
2. Why split §47/§48? Toyota Way correction is worth pinning
3. Why pin parent drift here, not amend the parent contract? Drift fix lands in next-cycle implementation PR; §47 just records it
4. Why no FALSIFY-ATTN-SUB-002 in this cycle? Single-piece flow; stacked PRs slow merge throughput
5. Why no parent-contract bump now? Bump requires wire fix landing first (FUNCTIONAL claim); cleaner to bump in next-cycle PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift and others added 3 commits May 4, 2026 04:02

Merge branch 'main' into spec/v2-92-ship-007-cascade-started

efcd678

noahgift enabled auto-merge (squash) May 4, 2026 02:50

Merge branch 'main' into spec/v2-92-ship-007-cascade-started

75baf9d

noahgift merged commit 13f48d4 into main May 4, 2026
10 checks passed

noahgift deleted the spec/v2-92-ship-007-cascade-started branch May 4, 2026 03:33

noahgift merged 4 commits into main from spec/v2-92-ship-007-cascade-started
