
spec(ship-two-models): v2.92.0 — §47 SHIP-007 layer-0 attention bisection cascade STARTED #1454

Merged
noahgift merged 4 commits into main from spec/v2-92-ship-007-cascade-started
May 4, 2026

@noahgift noahgift commented May 4, 2026

Summary

Spec v2.91.0 → v2.92.0 records that the §46.7(a) follow-up cascade kicked off the same day with three PRs (#1450 + #1451 + #1452).

What §47 records

| Subsection | Content |
|---|---|
| 47.1 | 8-step cascade roadmap (this amendment captures steps 1-3) |
| 47.2 | What landed in PRs #1450 + #1451 + #1452 |
| 47.3 | Toyota Way correction in detail — v1.0.0 → v1.1.0 mid-cascade |
| 47.4 | Pre-existing parent contract drift (QPostRope/KPostRope unwired) |
| 47.5 | Net effects |
| 47.6 | Open follow-ups (5-step ranked priority list) |
| 47.7 | Five Whys |
| 47.8 | Spec amendment cadence preserved |

Net effects

Cascade roadmap captured

| # | PR | Discharge status |
|---|----|------|
| 1 | #1450 | 5 falsifiers algorithm-bound |
| 2 | #1451 | FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL |
| 3 | #1452 | Research evidence (no falsifier flip) |
| 4-8 | (next) | FALSIFY-ATTN-SUB-002..004 + SHIP-007 root-cause fix |

Pre-existing drift discovered + recorded

Researching the wire-plan surfaced that `QPostRope` + `KPostRope` are in the parent enum but have NO `emit()` calls in `forward_traced_with_plan`. The parent contract `apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL) silently overstates coverage. Next-cycle FALSIFY-ATTN-SUB-002 PR closes the drift as a free side-effect.

Test plan

  • CI green on required gates

🤖 Generated with Claude Code

noahgift and others added 3 commits May 4, 2026 04:02
…U MoE forward path

Per claude-code-parity-apr POC M49 priority elevation 2026-05-04, the
GPU MoE forward path is now P0 / HIGHEST PRIORITY. Per CLAUDE.md
"NEVER write code before writing a provable contract" — this is the
contract scaffold (M-stage M-GPU-MOE-0 in the contract's
implementation_stages).

Why P0
======

  - CPU LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s on
    Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
  - Dense GPU Q4_K (Qwen2.5-Coder-7B) on RTX 4090 cuBLAS: 225-440 tok/s.
  - CPU MoE inference is therefore roughly 7.5-15× slower than the dense
    GPU path, making the spec-prescribed default Qwen3-Coder model
    production-infeasible.
  - The companion's action-stream parity machinery (CCPA-001..013, all
    DISCHARGED) cannot be exercised at production cadence — every
    `apr code` invocation hits the 30 tok/s wall.

What this contract specifies
============================

  metadata.kind: kernel
  status:        DRAFT
  scope:         crates/aprender-serve/src/gpu/{forward_qwen3_moe_gpu,
                   scheduler/moe_dispatch}.rs +
                 crates/aprender-compute/src/gpu/moe_kernels.rs (TBD)

  equations:
    - moe_forward_one_layer_gpu  (mirrors v1 CPU equation, +cosine-vs-CPU
                                   invariant, +CudaExecutor::new(0).is_ok()
                                   precondition)
    - gpu_throughput_target      (≥150 tok/s on RTX 4090 over 128-tok
                                   median window, ≥5x CPU baseline)

  proof_obligations: 7
    AC_GPU_MOE_001  cosine ≥ 0.99 vs CPU LAZY-FUSED-MATVEC
    AC_GPU_MOE_002  router weights sum to 1.0 ± 1e-6
    AC_GPU_MOE_003  output dimensions preserved
    AC_GPU_MOE_004  output finite (no NaN/Inf)
    AC_GPU_MOE_005  cosine ≥ 0.99 vs HF FP16 (inherits from v1)
    AC_GPU_MOE_006  ≥150 tok/s on RTX 4090
    AC_GPU_MOE_007  VRAM utilization ≤ 95% of 24 GB

  falsification_tests: 7
    FALSIFY-QW3-MOE-GPU-001          baseline (no GPU symbol)
    FALSIFY-QW3-MOE-GPU-PARITY-001   M-GPU-MOE-1 cosine vs CPU
    FALSIFY-QW3-MOE-GPU-PARITY-002   M-GPU-MOE-1 cosine vs HF FP16
    FALSIFY-QW3-MOE-GPU-INVARIANTS-001 router/shape/finite
    FALSIFY-QW3-MOE-GPU-DETERMINISM-001 byte-identical reruns same seed
    FALSIFY-QW3-MOE-GPU-THROUGHPUT-001 ≥150 tok/s
    FALSIFY-QW3-MOE-GPU-MEMORY-001     ≤ 95% VRAM

  kani_harnesses: 2
    KANI-QW3-MOE-GPU-001  router weights sum (AC_GPU_MOE_002)
    KANI-QW3-MOE-GPU-002  output shape preservation (AC_GPU_MOE_003)

  qa_gate: F-QW3-MOE-GPU-001 (5 named checks, falsification = swap
    quantized for EAGER FP32 → guaranteed OOM on 24 GB VRAM)
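
The parity and invariant obligations above (AC_GPU_MOE_001..004) can be illustrated with a minimal sketch. This is not code from the repository: the function names `cosine_similarity`, `router_weights_normalized`, and `check_layer_output` are hypothetical, and the sketch assumes the CPU and GPU layer outputs are available as flat `f32` slices.

```rust
/// Cosine similarity between two equal-length activation vectors.
/// Returns None on length mismatch, empty input, or a zero-norm vector.
/// (Hypothetical helper, not an API from the aprender crates.)
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        // Accumulate in f64 to limit rounding error on long vectors.
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// AC_GPU_MOE_002: router weights sum to 1.0 within 1e-6.
fn router_weights_normalized(weights: &[f32]) -> bool {
    let sum: f64 = weights.iter().map(|&w| f64::from(w)).sum();
    (sum - 1.0).abs() <= 1e-6
}

/// Runs the shape, finiteness, router, and cosine checks in that order
/// and reports the first obligation that fails.
fn check_layer_output(cpu: &[f32], gpu: &[f32], router_weights: &[f32]) -> Result<(), String> {
    if cpu.len() != gpu.len() {
        return Err("AC_GPU_MOE_003: output dimensions differ".to_string());
    }
    if !gpu.iter().all(|v| v.is_finite()) {
        return Err("AC_GPU_MOE_004: GPU output contains NaN/Inf".to_string());
    }
    if !router_weights_normalized(router_weights) {
        return Err("AC_GPU_MOE_002: router weights do not sum to 1.0".to_string());
    }
    match cosine_similarity(cpu, gpu) {
        Some(c) if c >= 0.99 => Ok(()),
        Some(c) => Err(format!("AC_GPU_MOE_001: cosine {c:.4} < 0.99")),
        None => Err("AC_GPU_MOE_001: cosine undefined".to_string()),
    }
}
```

Ordering the cheap structural checks before the cosine check means a NaN in the GPU output is reported as a finiteness failure rather than as an undefined cosine.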

Implementation stages
=====================

  M-GPU-MOE-0  This contract scaffold                        SHIPPED
  M-GPU-MOE-1  CUDA kernel + cosine-vs-CPU parity gate       PENDING
  M-GPU-MOE-2  wgpu fallback (CLAUDE.md backend-agnostic)    PENDING
  M-GPU-MOE-3  Throughput ≥150 tok/s + VRAM ≤ 95%            PENDING

When all 3 PENDING stages discharge, status flips DRAFT →
ACTIVE_RUNTIME (matches qwen3-moe-forward-v1 v1 convention).
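
For the M-GPU-MOE-3 stage, the `gpu_throughput_target` equation (≥150 tok/s over a 128-token median window, ≥5x CPU baseline) could be gated along the following lines. This is a sketch under assumptions, not the contract's implementation: it assumes the window is the last 128 per-token decode latencies and that throughput is the reciprocal of the median latency. The names `median_tok_per_sec` and `throughput_gate` are hypothetical.

```rust
use std::time::Duration;

/// Median tokens/sec over a slice of per-token decode latencies.
/// Assumption: throughput is taken as 1 / median latency.
fn median_tok_per_sec(latencies: &[Duration]) -> Option<f64> {
    if latencies.is_empty() {
        return None;
    }
    let mut secs: Vec<f64> = latencies.iter().map(Duration::as_secs_f64).collect();
    secs.sort_by(|a, b| a.total_cmp(b));
    let mid = secs.len() / 2;
    let median = if secs.len() % 2 == 0 {
        (secs[mid - 1] + secs[mid]) / 2.0
    } else {
        secs[mid]
    };
    if median > 0.0 {
        Some(1.0 / median)
    } else {
        None
    }
}

/// Passes only if the last 128 tokens meet both the absolute floor
/// (150 tok/s) and the relative floor (5x the CPU baseline).
fn throughput_gate(latencies: &[Duration], cpu_baseline_tok_s: f64) -> bool {
    const WINDOW: usize = 128;
    const MIN_TOK_S: f64 = 150.0;
    const MIN_SPEEDUP: f64 = 5.0;
    if latencies.len() < WINDOW {
        return false; // not enough samples to fill the window
    }
    let window = &latencies[latencies.len() - WINDOW..];
    match median_tok_per_sec(window) {
        Some(tok_s) => tok_s >= MIN_TOK_S && tok_s >= MIN_SPEEDUP * cpu_baseline_tok_s,
        None => false,
    }
}
```

With the ~30 tok/s CPU baseline quoted above, the two floors coincide at 150 tok/s; the relative floor only becomes the binding constraint if the CPU path gets faster.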

Verification
============

  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs claude-code-parity-apr POC M49 (P0 elevation, 2026-05-04)
Refs claude-code-parity-apr POC R10 (risk row mirror)
Refs qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL (CPU sibling)
Refs apr-cpu-vs-gpu-output-parity-v1 (CPU↔GPU parity discipline)
Refs arXiv:2307.08691 Dao FlashAttention-2
Refs arXiv:2201.05596 Rajbhandari et al. DeepSpeed-MoE
Refs arXiv:2101.03961 Fedus Switch Transformers

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion bisection cascade STARTED

After §46 declared the v0.32.0 cut HOLD-gated on SHIP-007 layer-0 attention,
the §46.7(a) follow-up cascade kicked off with three PRs in flight (#1450 +
#1451 + #1452).

## What §47 records

| Subsection | Content |
|---|---|
| 47.1 | 8-step cascade roadmap (this amendment captures steps 1-3) |
| 47.2 | What landed in PRs #1450 + #1451 + #1452 |
| 47.3 | **Toyota Way correction in detail** — v1.0.0 → v1.1.0 mid-cascade |
| 47.4 | Pre-existing parent contract drift (QPostRope/KPostRope unwired) |
| 47.5 | Net effects (ship %, coverage tally, pending merges) |
| 47.6 | Open follow-ups (5-step ranked priority list) |
| 47.7 | Five Whys (why amend at 3 PRs, why split §47/§48, etc.) |
| 47.8 | Spec amendment cadence preserved (§41 → §47, 7 amendments) |

## Cascade roadmap

| # | PR | What | Discharge status |
|---|----|------|-------|
| 1 | #1450 | Contract `trace-attn-sub-stages-v1.yaml` v1.1.0 PROPOSED | 5 falsifiers algorithm-bound |
| 2 | #1451 | Enum extension: 2 new SaveTensorStage variants | FALSIFY-ATTN-SUB-001 PARTIAL_ALGORITHM_LEVEL |
| 3 | #1452 | Research evidence note | No falsifier flip |
| 4 | (next) | forward_traced_with_plan wires 4 sub-stages | FALSIFY-ATTN-SUB-002 + drift fix |
| 5 | (next) | apr diff --values recognizes new stages | FALSIFY-ATTN-SUB-003 |
| 6 | (next) | HF FP16 oracle script extension | unblocks FALSIFY-ATTN-SUB-004 |
| 7 | (next) | Live RTX 4090 bisection | FALSIFY-ATTN-SUB-004 → DISCHARGED |
| 8 | (next) | SHIP-007 root-cause fix | unblocks MODEL-1 GPU |

§47 captures the first 3 (scaffold). §48+ will capture later steps.

## Toyota Way correction (mid-cascade)

v1.0.0 of `trace-attn-sub-stages-v1.yaml` was the day's first defect. It
claimed that 5 new SaveTensorStage variants were needed; live source
inspection (per `feedback_no_guessing.md`) showed that 3 already existed.
v1.1.0 corrected this to 2 truly new variants and added the 9-element
`bisection_chain_layer_0` equation. The cost of the defect was paid at the
contract layer (the cheapest place); no code was rolled back.
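
To illustrate how an ordered chain such as `bisection_chain_layer_0` can localize a divergence, here is a minimal sketch. The contract's actual 9 elements are not listed in this PR, so the stage names in the usage note are placeholders, and `StageResult` and `first_divergent_stage` are hypothetical names. The sketch assumes each stage already has a cosine score against the reference output.

```rust
/// One entry in an ordered bisection chain: a stage name plus the
/// cosine similarity of that stage's tensor against the reference.
struct StageResult<'a> {
    name: &'a str,
    cosine: f64,
}

/// Returns the first stage in chain order whose cosine falls below
/// `threshold`. A NaN cosine is treated as divergent. A linear scan is
/// used because it does not assume that divergence, once introduced,
/// persists through every later stage (a true binary search would).
fn first_divergent_stage<'a>(chain: &[StageResult<'a>], threshold: f64) -> Option<&'a str> {
    chain
        .iter()
        .find(|s| s.cosine.is_nan() || s.cosine < threshold)
        .map(|s| s.name)
}
```

For a chain scored as `stage_a` 0.9999, `stage_b` 0.998, `stage_c` 0.42, `stage_d` 0.40 and a threshold of 0.99, the function returns `stage_c`, which identifies the operation between `stage_b` and `stage_c` as the place to look.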

## Pre-existing parent contract drift

Researching the wire-plan for FALSIFY-ATTN-SUB-002 surfaced a drift in
`apr-cli-trace-save-tensor-v1.yaml` v1.4.0 (FUNCTIONAL): `QPostRope` +
`KPostRope` are in the enum but have NO `emit()` calls in
`forward_traced_with_plan`. A user passing `--save-tensor q_post_rope`
gets a clean exit with no file written — silent failure.

Per `feedback_toyota_way_all_defects.md`: all defects are mine. The
next-cycle FALSIFY-ATTN-SUB-002 PR will close this drift as a free
side-effect by wiring the 2 missing stages alongside the 2 new ones.
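
The class of drift described here (a variant declared in the enum but never emitted) can be caught mechanically. The following is a sketch only, not the repository's code: `SaveTensorStage`, `QPostRope`, and `KPostRope` are named in the contract, while the `Embedding` and `AttnOut` variants, the `ALL` table, and both functions are hypothetical stand-ins.

```rust
use std::collections::HashSet;

/// Reduced stand-in for the real enum. Only QPostRope and KPostRope
/// are taken from the PR text; the other variants are placeholders.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SaveTensorStage {
    Embedding,
    QPostRope,
    KPostRope,
    AttnOut,
}

impl SaveTensorStage {
    /// Every declared variant, in declaration order.
    const ALL: [SaveTensorStage; 4] = [
        SaveTensorStage::Embedding,
        SaveTensorStage::QPostRope,
        SaveTensorStage::KPostRope,
        SaveTensorStage::AttnOut,
    ];
}

/// Declared stages that a traced forward pass never emitted.
/// A non-empty result means the contract overstates coverage.
fn unwired_stages(emitted: &HashSet<SaveTensorStage>) -> Vec<SaveTensorStage> {
    SaveTensorStage::ALL
        .iter()
        .copied()
        .filter(|stage| !emitted.contains(stage))
        .collect()
}

/// Turns the silent failure into a hard error: a requested stage that
/// was never emitted is reported instead of exiting cleanly.
fn require_wired(
    requested: SaveTensorStage,
    emitted: &HashSet<SaveTensorStage>,
) -> Result<(), String> {
    if emitted.contains(&requested) {
        Ok(())
    } else {
        Err(format!("stage {requested:?} was requested but never emitted"))
    }
}
```

A test that records which stages one traced forward pass emits and asserts that `unwired_stages` returns an empty list would fail on exactly the `QPostRope`/`KPostRope` gap described above.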

## Net effects

- Spec v2.91.0 → **v2.92.0**
- Coverage tally: unchanged this cycle (5 new PARTIAL_ALGORITHM_LEVEL
  slots will increment when PR #1450 lands the YAML)
- MODEL-1 ship %: unchanged at 91% (cascade is scaffold; ship % moves
  at FALSIFY-ATTN-SUB-004 LIVE DISCHARGE in a future cycle)
- MODEL-2 ship %: unchanged at 57%

## Five Whys (compressed)

1. Why amend at 3 PRs? §41-§46 cadence is "one amendment per ≥3-PR cycle"
2. Why split §47/§48? Toyota Way correction is worth pinning
3. Why pin parent drift here, not amend the parent contract? Drift fix
   lands in next-cycle implementation PR; §47 just records it
4. Why no FALSIFY-ATTN-SUB-002 in this cycle? Single-piece flow;
   stacked PRs slow merge throughput
5. Why no parent-contract bump now? Bump requires wire fix landing first
   (FUNCTIONAL claim) — cleaner to bump in next-cycle PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 4, 2026 02:50
@noahgift noahgift merged commit 13f48d4 into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the spec/v2-92-ship-007-cascade-started branch May 4, 2026 03:33