
spec(M32d): KV cache for qwen3_moe inference path — scope + operator decision doc #1826

Merged
noahgift merged 4 commits into main from spec/m32d-moe-kv-cache-scope on May 19, 2026


@noahgift

Contributor

Summary

Scope doc + operator decision matrix for M32d (KV cache on the qwen3_moe inference path). This is the upstream blocker for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` v1.1.1.

Not an implementation PR. It documents the work and estimates effort and risk so the operator can make the go/no-go call without re-deriving the analysis.

Why now

The empirical evidence chain (5 Phase 6 dispatches across the post-#1789 fix series — #1806, #1812, #1814, #1819) shows that the 30B-MoE student fails uniformly at the per-turn timeout, regardless of how the timeouts are tuned. Root cause: full-prefill-per-token at ~0.5 tok/s. Symptom-class progression documented in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-2026-05-19.md.

The dense path already has `OwnedQuantizedKVCache` + `forward_single_with_cache`. The MoE path has neither.
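To make the cost gap concrete, here is a minimal sketch that counts how many token positions pass through the model when generating a reply, with and without a KV cache. The function names are hypothetical and are not part of the codebase. The model is deliberately simplified: it counts positions only and ignores attention's own growth with sequence length and MoE expert routing.

```rust
/// Token positions pushed through the model to generate `new_tokens` tokens
/// when every decode step re-runs the whole sequence (no KV cache).
/// Hypothetical helper for illustration only.
fn positions_without_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    // Step i (0-based) re-processes the prompt plus the i tokens generated so far.
    (0..new_tokens).map(|i| prompt_len + i).sum()
}

/// Same count when the prompt is prefilled once and each later step
/// feeds only the newest token, reusing cached keys and values.
/// Hypothetical helper for illustration only.
fn positions_with_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    if new_tokens == 0 {
        return 0;
    }
    // One prefill over the prompt produces the first token;
    // each of the remaining steps processes exactly one position.
    prompt_len + (new_tokens - 1)
}
```

Under this model, a 1000-token prompt with 100 generated tokens costs 104,950 positions without a cache and 1,099 with one. The prompt length and token count here are illustrative, not measurements from the Phase 6 runs.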

What this scope covers

  • Dense-path reference: file refs + line numbers for the existing dense KV cache integration
  • API inventory: `OwnedQuantizedKVCache` is sufficient as-is; no struct changes needed
  • 5 implementation steps: function skeleton → attention helper lift → MoE FFN helper lift → generate-loop wire → tests
  • 6 risk surfaces: numerical equivalence, dense path regression, RoPE position offset, GQA shapes, expert routing under cache, streaming SSE
  • Effort estimate: 8 focused engineering hours total
  • 3 operator decisions: greenlight in-session vs engineer-driven follow-up vs skip
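Two of the risk surfaces above, numerical equivalence and RoPE position offset, can be illustrated with a toy single-head attention. This is a sketch under stated assumptions, not the project's implementation: `ToyKvCache`, `rope`, `attend`, and the other names are hypothetical, there are no learned projections (each token vector serves as query, key, and value), and nothing is quantized. The point is that a cached step must rotate the new token at its absolute position, which here equals the number of entries already cached.

```rust
/// Toy RoPE: rotate consecutive pairs of `x` by an angle derived from the
/// absolute position `pos`.
fn rope(x: &[f32], pos: usize) -> Vec<f32> {
    let mut out = x.to_vec();
    for (i, pair) in out.chunks_exact_mut(2).enumerate() {
        let theta = pos as f32 / 10000.0_f32.powf(2.0 * i as f32 / x.len() as f32);
        let (s, c) = theta.sin_cos();
        let (a, b) = (pair[0], pair[1]);
        pair[0] = a * c - b * s;
        pair[1] = a * s + b * c;
    }
    out
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Softmax attention of one query over all keys and values.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) / scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let total: f32 = weights.iter().sum();
    let mut out = vec![0.0; q.len()];
    for (w, v) in weights.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w / total * x;
        }
    }
    out
}

/// Reference path: recompute every key from scratch, return the last position's output.
fn last_output_full(tokens: &[Vec<f32>]) -> Vec<f32> {
    let keys: Vec<Vec<f32>> = tokens.iter().enumerate().map(|(p, t)| rope(t, p)).collect();
    let last = tokens.len() - 1;
    attend(&rope(&tokens[last], last), &keys, tokens)
}

/// Hypothetical stand-in for a per-layer KV cache.
#[derive(Default)]
struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl ToyKvCache {
    /// Process one new token, reusing the cached keys and values.
    fn step(&mut self, token: &[f32]) -> Vec<f32> {
        // The absolute position is the number of tokens already cached.
        let pos = self.keys.len();
        self.keys.push(rope(token, pos));
        self.values.push(token.to_vec());
        attend(&rope(token, pos), &self.keys, &self.values)
    }
}

/// Largest element-wise difference between two equally sized vectors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

An equivalence check in this style feeds the same tokens through `ToyKvCache::step` one at a time and compares the final output against `last_output_full` with `max_abs_diff`. If `step` used a position other than the cache length, the two paths would diverge, which is the failure mode the RoPE offset risk refers to.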

What this does NOT cover

  • Perf tuning beyond the 5-15 tok/s baseline
  • Streaming SSE delivery (natural follow-up; one-line addition once KV cache lands)
  • GPU MoE acceleration (separate `qwen3-moe-forward-gpu-v1` contract + M-GPU-MOE-2.x track)

Test plan

  • doc-only change; no code touched
  • CI: doc/spec markdown lint (if configured)

🤖 Generated with Claude Code

noahgift and others added 4 commits May 19, 2026 22:50
…decision doc

Scopes the M32d work that's currently blocking
contracts/qwen3-moe-serve-dispatch-v1.yaml V1_004 (CCPA Phase 6 bench
non-zero student pass rate against Qwen3-Coder-30B-A3B).

Empirical finding (paiml/claude-code-parity-apr Phase 6, 5 dispatches
across the post-#1789 fix chain): 30B-MoE full-prefill-per-token at
~0.5 tok/s cannot fit any reasonable per-turn budget. The dense path
already has `OwnedQuantizedKVCache` + `forward_single_with_cache`; the
MoE path has neither and re-runs the whole prompt on every token.

This scope doc:
- Surveys the dense KV cache code path (file refs + line numbers)
- Inventories the OwnedQuantizedKVCache API (sufficient as-is)
- Lays out 5 implementation steps (function skeleton → attention helper
  lift → MoE FFN helper lift → generate-loop wire → tests)
- Identifies 6 risk surfaces (numerical equivalence, dense regression,
  RoPE offset, GQA shapes, expert routing under cache, streaming SSE)
- Estimates 8 focused engineering hours total
- Presents three operator decisions (greenlight in-session vs
  engineer-driven follow-up vs skip)

NOT an implementation PR. Documents the work so operator can choose
go/no-go without re-deriving the analysis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 33aee24 into main May 19, 2026
10 checks passed
@noahgift noahgift deleted the spec/m32d-moe-kv-cache-scope branch May 19, 2026 23:18